Reinforcement learning project

Discussion of chess software programming and technical issues.

Moderators: hgm, Dann Corbit, Harvey Williamson

hgm
Posts: 26134
Joined: Fri Mar 10, 2006 9:06 am
Location: Amsterdam
Full name: H G Muller

Reinforcement learning project

Post by hgm » Sun Jan 31, 2021 10:37 pm

I am new to reinforcement learning, but I decided to give it a try. My plan is roughly this:

1) I will make a simple alpha-beta engine to play Janggi (Korean Chess), based on the move generator I have already written. (This currently exists as a CECP engine that moves randomly.)
2) I will give it a hand-crafted evaluation that takes all its parameters from an ini file (starting with piece values, PST, mobility).
3) I will equip it with a self-play command, which will write the position at the end of the PV of every move to a file.
4) I will have it play a huge number of super-fast games (e.g. a few thousand nodes per move). The first N moves (N=20?) will be played purely randomly, to guarantee game diversity and very unbalanced positions.
5) I will use these positions for Texel-tuning the evaluation.
6) I will then loop back to (4), using the tuned evaluation parameters.
7) When the evaluation parameters do not significantly change anymore, I am done tuning.
8) I will then play a few long-TC games, and have a Janggi expert comment on what strategic errors the engine makes. This should produce ideas for new evaluation terms to add, which would recognize the involved patterns.
9) With this extended evaluation, I then go back to (4), using the already optimized values for the old parameters, to see if beneficial weights for the new terms can be found.
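Steps 5-7 above form a standard Texel-tuning loop: minimize the mean squared difference between the game result and a sigmoid of the static evaluation, over all collected positions. A minimal sketch of the objective (the scaling constant K and the data layout are my assumptions, not part of the plan):

```python
import math

def texel_error(positions, evaluate, K=1.0 / 400.0):
    """Mean squared error between game results and predicted win
    probabilities, over a list of (position, result) pairs.
    result is 1.0 (win), 0.5 (draw) or 0.0 (loss) from the point
    of view of the side whose eval sign convention we use."""
    total = 0.0
    for pos, result in positions:
        score = evaluate(pos)                            # centipawn-like eval
        pred = 1.0 / (1.0 + math.pow(10.0, -K * score))  # logistic win prob.
        total += (result - pred) ** 2
    return total / len(positions)
```

The tuner then nudges each evaluation parameter up or down and keeps the change whenever this error decreases, until no parameter change helps anymore (step 7).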

I am curious what this will lead to!

Ferdy
Posts: 4602
Joined: Sun Aug 10, 2008 1:15 pm
Location: Philippines

Re: Reinforcement learning project

Post by Ferdy » Sun Jan 31, 2021 11:14 pm

That is interesting. Looking forward to the outcome.

xr_a_y
Posts: 1501
Joined: Sat Nov 25, 2017 1:28 pm
Location: France

Re: Reinforcement learning project

Post by xr_a_y » Mon Feb 01, 2021 7:35 pm

hgm wrote:
Sun Jan 31, 2021 10:37 pm
4) I will have it play a huge number of super-fast games (e.g. a few thousand nodes per move). The first N moves (N=20?) will be played purely randomly, to guarantee game diversity and very unbalanced positions.
If you want to learn from game results, this won't be good. If you just add a small random perturbation to the eval in the first 20 moves, it will be OK.

Also, you can do this with more than a few thousand nodes per move, I think. A hundred thousand will probably be fine.
For standard chess, it takes less than a day to generate 100M good data points, more than enough for Texel tuning, where you only need a few million.

hgm

Re: Reinforcement learning project

Post by hgm » Mon Feb 01, 2021 8:44 pm

xr_a_y wrote:
Mon Feb 01, 2021 7:35 pm
hgm wrote:
Sun Jan 31, 2021 10:37 pm
4) I will have it play a huge number of super-fast games (e.g. a few thousand nodes per move). The first N moves (N=20?) will be played purely randomly, to guarantee game diversity and very unbalanced positions.
If you want to learn from game results, this won't be good. If you just add a small random perturbation to the eval in the first 20 moves, it will be OK.
That would be good for game diversity, but it would create very few badly won / lost positions, or positions with awful positional characteristics. Perhaps 20 moves is overdoing it (that will have to turn out in practice), but random movers are not that efficient at doing damage to each other.

Of course I will not include the positions from the random phase in the training set. (Since I wanted to take the positions at the end of the PV rather than those at the root, to ensure quiescence, that would happen automatically, as in the random phase there would not be a search or a PV.) Perhaps I should omit the first few positions after the randomization too, because they could be too tense tactically.

Why do you think this won't be good?
Also you can do this with more than a few thousand nodes per move I think. A hundred thousand will probably be fine.
For standard chess, it takes less than a day to generate 100M good data, more than enough for Texel tuning where you only need some millions.
Well, one day would be doable, but one hour would be better. Have people tried to really push this to the limit?

The fewer nodes I need, the more complex evaluations I will be able to afford without the slowdown becoming unacceptable. It also seems a waste of time to do very accurate tactics when the evaluation is still crappy; the games will be full of blunders anyway. Perhaps when I approach convergence it could help to increase the node count.

The 'few thousand' I mentioned was a rough guess; a perfect 4-ply alpha-beta tree with a branching factor of 40 would have about 3200 leaf nodes. (But of course there will be QS, and move ordering won't be perfect; OTOH, there will be hash cuts.) And 4 ply should give reasonable play. Except perhaps in the late end-game.
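That rough guess matches the Knuth-Moore leaf count for a perfectly ordered alpha-beta tree, b^ceil(d/2) + b^floor(d/2) - 1, which can be checked directly:

```python
def minimal_tree_leaves(b, d):
    """Leaf count of a perfectly ordered (minimal) alpha-beta tree
    with branching factor b and depth d (Knuth & Moore):
    b^ceil(d/2) + b^floor(d/2) - 1."""
    return b ** ((d + 1) // 2) + b ** (d // 2) - 1

print(minimal_tree_leaves(40, 4))  # 3199, i.e. about 3200
```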

It might be an idea to adjudicate games at that stage in favor of a player that has a huge material advantage, even when the engine could not actually force the checkmate at the depth it is running. (Imagine that in Chess you would have to determine whether KBNK is a win...) At least as long as the evaluation is not yet equipped with the knowledge of what exactly is a win and what a draw. It seems reasonable to reinforce play that would lead to, say, KNNK (to use a FIDE example); it was mostly good. I don't need the engine to learn that KNNK is a draw, as I can easily add that knowledge by hand, and once it was added the engine would no longer make the error of converting a winning position to KNNK. First of all I want to train the evaluation to get good at reaching a large material advantage during the middle-game.
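Such an adjudication rule could look something like this (the threshold and minimum game length here are hypothetical placeholders, to be tuned in practice):

```python
def adjudicate(material_balance, ply, threshold=1000, min_ply=60):
    """Declare a winner once one side has a huge material advantage
    (material_balance in centipawn-like units, positive for side A),
    even if the mate is beyond the engine's search depth.
    Returns +1 (side A wins), -1 (side B wins), or None to play on."""
    if ply < min_ply:
        return None          # don't adjudicate during the random/opening phase
    if material_balance >= threshold:
        return +1
    if material_balance <= -threshold:
        return -1
    return None
```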

xr_a_y

Re: Reinforcement learning project

Post by xr_a_y » Tue Feb 02, 2021 3:33 pm

I was thinking that 20 fully random moves might indeed lead to positions with too much material imbalance already (and thus not much to learn apart from piece values), or to positions with material loss coming from some immediate tactic (something handled by search more than by evaluation).

But as you pointed out, maybe random movers are kind to each other, and the starting positions won't be that bad.

If you are learning on game results and not on position search scores, I guess it is important to search a little and not use too small a depth.

derjack
Posts: 13
Joined: Fri Dec 27, 2019 7:47 pm
Full name: Jacek Dermont

Re: Reinforcement learning project

Post by derjack » Tue Feb 02, 2021 5:49 pm

Nice, I'd like to see the result of this experiment. Do the parameters have some 'sane' starting values, or are they simply random/zeroes?

hgm

Re: Reinforcement learning project

Post by hgm » Tue Feb 02, 2021 9:37 pm

derjack wrote:
Tue Feb 02, 2021 5:49 pm
Nice, I'd like to see the result of this experiment. Do the parameters have some 'sane' starting values, or are they simply random/zeroes?
The piece values will probably get some sane values. I suppose most positional terms can start at zero, except that perhaps the initial PST must draw pieces to the center or to the enemy Palace. With that, the initial play should already be reasonable.

I am still thinking about what type of material evaluation to use, because just adding piece values is not a good model for Janggi or Xiangqi. The problem is that some of the pieces (including Kings) are confined to their own board half (or even a subset thereof), and can thus only be used for defense. One cannot really assign a value to those, as their contribution to the total depends on whether the opponent has something to defend against. I will probably start by assigning separate attack and defense values to the pieces (initially the same, except for those that cannot reach the opponent King, which get attack value 0, and the Pawns, which get defense value 0). A player's total attack power minus the opponent's defense power should be a measure of how easy it would be to checkmate the opponent, and negative values can be set to 0 (or at least strongly scaled in that direction) because they represent redundant defense. The total evaluation would then be the difference between the thus clipped values of the individual players.
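The attack/defense scheme described here can be sketched as follows (the function and table names are mine, and the per-piece values in the test below are placeholders, not proposed Janggi values):

```python
def material_eval(my_pieces, opp_pieces, attack, defense):
    """Material term from separate attack and defense piece values.
    A side's effective strength is its total attack power minus the
    opponent's total defense power, clipped at zero so that redundant
    defense does not count. Returns the difference of the two clipped
    strengths (positive = good for the 'my' side)."""
    my_attack   = sum(attack[p]  for p in my_pieces)
    my_defense  = sum(defense[p] for p in my_pieces)
    opp_attack  = sum(attack[p]  for p in opp_pieces)
    opp_defense = sum(defense[p] for p in opp_pieces)
    return (max(my_attack - opp_defense, 0)
            - max(opp_attack - my_defense, 0))
```

For example, with a palace-bound piece given attack value 0 and a Pawn given defense value 0, a side consisting only of defenders contributes nothing to its own clipped attack term, exactly as described above.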

Henk
Posts: 6838
Joined: Mon May 27, 2013 8:31 am

Re: Reinforcement learning project

Post by Henk » Wed Feb 03, 2021 12:31 pm

How are you going to prevent a chess engine from playing ugly moves if it is automatically tuned?
That is, moves that are valid but make the game unpleasant to watch or to play against as a human.
Maybe ugly is not the right word.

hgm

Re: Reinforcement learning project

Post by hgm » Wed Feb 03, 2021 2:20 pm

Strong Chess seems to be ugly / boring Chess. Blame the inventor of the game, not the engine! And then start playing Shogi. :wink:

Henk

Re: Reinforcement learning project

Post by Henk » Wed Feb 03, 2021 3:22 pm

You could use all the games of, for instance, Tartakower to tune a chess engine, but that would be supervised learning.
