Reinforcement learning project

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

Ferdy
Posts: 4833
Joined: Sun Aug 10, 2008 3:15 pm
Location: Philippines

Re: Reinforcement learning project

Post by Ferdy »

hgm wrote: Sun Mar 07, 2021 10:49 pm I finally seem to have the tuning process working correctly. There were lots of stupid errors, and because the whole process happened 'under the hood', I had a hard time identifying them. It also revealed a bad bug in the engine: the evaluation forgot to add the piece value of the Pawns. This could happen because Pawns need special treatment: because Pawns do not promote in Janggi they essentially become worthless when they reach the last two ranks (where the King can sneak behind them), and, on second thought, extra Pawns on the 7th rank could also be useless.

I made a small program to generate synthetic training positions, which calculates the win probability of each position from some standard piece values, and then determines a win/loss outcome with that probability. On a file of 10,000 such positions the Texel tuner was able to extract piece values that were close to those used in the generation of the results (starting from all-zero piece values). Since the tuner currently works by random trial and error, keeping any improvement, it is excruciatingly slow. But at least I get a sensible result now. (No more negative Pawn values!)
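
In outline such a generator is only a few lines; the sketch below assumes a logistic mapping from material difference to win probability, and all names and constants in it are made up for illustration, not taken from the real program.

Code: Select all

/* Sketch of a synthetic-position generator: random material for both sides,
   a logistic win probability based on "standard" piece values, and a 0/1
   outcome sampled with that probability.  Everything here is illustrative. */
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

static const int refValue[5] = { 100, 300, 320, 500, 900 }; /* reference piece values */

int main(void)
{
    srand(12345);
    for (int n = 0; n < 10000; n++) {
        int count[2][5], material = 0;
        for (int side = 0; side < 2; side++)            /* random piece counts */
            for (int p = 0; p < 5; p++) {
                count[side][p] = rand() % 3;
                material += (side ? -1 : 1) * count[side][p] * refValue[p];
            }
        /* win probability from a logistic curve; the 200 cp scale is arbitrary */
        double winProb = 1.0 / (1.0 + exp(-material / 200.0));
        int result = (double)rand() / RAND_MAX < winProb;  /* sample the outcome */
        printf("%d %d %d %d %d  %d %d %d %d %d  %d\n",
               count[0][0], count[0][1], count[0][2], count[0][3], count[0][4],
               count[1][0], count[1][1], count[1][2], count[1][3], count[1][4],
               result);
    }
    return 0;
}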

I am now having the engine, with the Pawn value fixed, self-play 3000 games, at a nominal thinking 'time' of 100K nodes/move. This should produce about 250K quiet positions in a few hours. The plan is to first use this training set to optimize piece values and their dependence on game phase, keeping the other eval parameters fixed at the values they had during generation of the games. If the resulting piece values look reasonable, I will use those as a starting point for tuning the complete set of parameters with a smaller step. The current evaluation has 40 parameters (including the opening and end-game piece values). I suppose I will be able to use the same set of training positions also for tuning the weights of eval terms I still have to add, such as mobility. Many of the eval terms had been set to 0 for the self-play games anyway, to make sure the various values for those terms would be sampled equally often. For terms that are known to strongly affect the ability to win I did not do that, though; otherwise too many games might have ended in a draw even in the presence of a winning advantage. (The comparable issue in orthodox Chess would be not encouraging Pawn pushes, so that at the fast TC used to generate the games the engine would never realize it could promote them.)
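
The tuning loop itself is nothing more than random hill-climbing on the squared prediction error; a sketch is below (the linear feature evaluation, the data layout and the sigmoid scale are assumptions for the example, not the engine's real code).

Code: Select all

/* Sketch of trial-and-error Texel tuning: perturb one parameter at a time
   and keep the change whenever the total squared error of the predicted
   results drops.  Feature extraction and data loading are left out.        */
#include <stdlib.h>
#include <math.h>

#define NPARAM 40
#define NPOS   250000

static double param[NPARAM];            /* values being tuned               */
static int    feature[NPOS][NPARAM];    /* feature counts per position      */
static double result[NPOS];             /* game result: 0, 0.5 or 1         */

static double predict(int i)            /* eval -> expected result          */
{
    double score = 0.0;
    for (int k = 0; k < NPARAM; k++) score += param[k] * feature[i][k];
    return 1.0 / (1.0 + exp(-score / 400.0));   /* sigmoid, arbitrary scale */
}

static double totalSquaredError(void)
{
    double e = 0.0;
    for (int i = 0; i < NPOS; i++) {
        double d = result[i] - predict(i);
        e += d * d;
    }
    return e;
}

void tune(int iterations, double step)
{
    double best = totalSquaredError();
    for (int it = 0; it < iterations; it++) {
        int k = rand() % NPARAM;
        double delta = (rand() & 1) ? step : -step;
        param[k] += delta;                       /* random trial step       */
        double e = totalSquaredError();
        if (e < best) best = e;                  /* keep any improvement    */
        else          param[k] -= delta;         /* otherwise undo          */
    }
}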

I hope to have some results tomorrow.
I thought about engine1 vs engine2: they are the same, except that their piece values, PST, mobility etc. are taken randomly as starting values, or engine1 takes the random values and engine2 takes your estimated best values. Let them play from random start positions and save the positions from this match for tuning later. If engine1 wins, use it in the tuning. After the tuning, play the tuned values against engine1 and save the positions from this match for tuning. If the tuned values win, use them in the next tuning session. However, if engine1 wins, then either the tuner fails to optimize the values, or the optimal values have already been reached for the engine's current capability and the given parameters under tuning. Perhaps generate more positions for tuning and repeat the tuning process: create more random start positions, use them in the match, and save the positions from the match for tuning. If engine1 still wins after that tuning, the tuning ends; the values in engine1 are the best.

Tuning can be done again when new features are added, or when features are newly exposed for tuning.
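
In pseudo-code the loop would be something like this; playMatch(), tuneOnSavedPositions() and randomParams() are only placeholders for whole programs (the match runner and the tuner), not real functions.

Code: Select all

/* Outline of the match-and-tune cycle described above; C syntax, but purely
   hypothetical pseudo-code.  The helpers stand for external tools.         */
typedef struct { double w[40]; } Params;

Params randomParams(void);                 /* random starting values           */
Params tuneOnSavedPositions(Params base);  /* Texel-tune on all saved positions */
int    playMatch(Params a, Params b);      /* >0 if a outscores b; positions
                                              from the match are saved to disk */

Params reinforcementLoop(void)
{
    Params best = randomParams();          /* engine1's starting values         */
    for (;;) {
        Params tuned = tuneOnSavedPositions(best);
        if (playMatch(tuned, best) <= 0)   /* tuned values fail to beat engine1 */
            break;                         /* -> keep engine1's values as best  */
        best = tuned;                      /* otherwise adopt them and repeat   */
    }
    return best;
}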
chrisw
Posts: 4313
Joined: Tue Apr 03, 2012 4:28 pm

Re: Reinforcement learning project

Post by chrisw »

Joost Buijs wrote: Mon Mar 15, 2021 5:03 pm This looks very much like what I was doing 4 years ago when I first started with Texel tuning. It did the job; the only drawback was that it was slow: tuning about 800 weights from scratch (including PSTs) with 4 million examples took almost a day on my old 10-core machine. When I tuned from scratch I kept the pawn value in the opening fixed at 70 cp; after the tuning process had finished, the other piece values always looked very reasonable to me.

I used to keep the PSTs averaged at zero; in fact, piece values are not needed at all if you incorporate them into the PSTs.
This is what they look like; the pawn value is now 69 cp because I usually re-tuned without keeping the pawn value fixed.

Code: Select all

// opening
int valuePawnMG = 69;
int valueKnightMG = 324;
int valueBishopMG = 340;
int valueRookMG = 514;
int valueQueenMG = 1053;

// endgame
int valuePawnEG = 117;
int valueKnightEG = 337;
int valueBishopEG = 368;
int valueRookEG = 624;
int valueQueenEG = 1198;
It needs good sample-data preparation though; this is the most time-consuming part.
Exactly. How do you define good and prepare good sample positions? That’s the problem.
hgm
Posts: 27790
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Reinforcement learning project

Post by hgm »

Well, I will definitely have to test the tuned values at some point. At the moment I am still in the stage of streamlining the tuning process, and improving the evaluation by adding new terms (or replacing them) that reduce the MSE on tuning. I had a bug in the generation of the positions: I sampled only one out of 9 positions from the games, but it still saved all the others as empty boards. (Well, actually full boards with all pieces on a0, as I save the piece lists.) So I did not really have 1M positions to tune on, but only 105K. Anyway, the tuner now ignores these.

What was also worrisome is that some of the positions in there had no King. No idea how that could happen, as the engine never generates moves that capture a King; it aborts the node during move generation with a +INF score when it sees such a capture. For now I assume there are too few of those to affect the tuning.
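
A filter of this kind could catch both problems when the training file is read; the piece-list layout below is just an illustration, not the engine's real data structure.

Code: Select all

/* Illustrative filter: skip saved positions that are really empty boards
   (every piece parked on a0, i.e. square 0 here) or that lack a King.
   The TrainingPos layout is an assumption made for this example.           */
#define KING 1      /* illustrative piece code */

typedef struct {
    int piece[32];      /* piece type per piece-list slot                   */
    int square[32];     /* 0 stands for a0 in this illustration             */
    int nPieces;
    double result;      /* game result attached to the position             */
} TrainingPos;

static int usablePosition(const TrainingPos *p)
{
    int kings = 0, offA0 = 0;
    for (int i = 0; i < p->nPieces; i++) {
        if (p->piece[i] == KING) kings++;
        if (p->square[i] != 0)   offA0++;
    }
    return kings == 2 && offA0 > 0;   /* both Kings present, not an empty board */
}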

One thing I did is replace the PST for King and Advisors by a 'Palace Constellation Table', which takes account of their coordinated locations. There are 333 such Palace constellations in Janggi (with 0, 1 or 2 Advisors), but because of left-right symmetry only 177 are essentially different. So I tuned values for each of those, simultaneously with the PST for the other pieces. (105K training positions might be a little low to get good statistics on each of those. But the training games were generated with a neutral evaluation for the Palace constellation, so one can hope they are all used about equally often.) This gave the following scores:

Code: Select all

 ...  ...  A..  A..  ...  ...  A..  ...  ...  A..  ...  .A.  .A.  AA.  .A.    2
 ...  A..  ...  A..  ...  A..  ...  .A.  AA.  .A.  .A.  ...  A..  ...  ...    1
 K..  K..  K..  K..  KA.  KA.  KA.  K..  K..  K..  KA.  K..  K..  K..  KA.    0
  12  -23   28   24   78   71  -16  114  108   86   26   82   79   17   -4

 .A.  ..A  ..A  A.A  ..A  ..A  .AA  ...  ...  A..  ...  ...  .A.  ..A  ...
 .A.  ...  A..  ...  ...  .A.  ...  ..A  A.A  ..A  ..A  .AA  ..A  ..A  ...
 K..  K..  K..  K..  KA.  K..  K..  K..  K..  K..  KA.  K..  K..  K..  K.A
  44   13   33  -14   37  120  101   29   27  -19   74  100  -20   32   66

 ...  A..  ...  ...  .A.  ..A  ...  ...  ...  A..  A..  ...  ...  A..  ...
 A..  ...  ...  .A.  ...  ...  ..A  K..  K..  K..  K..  K..  K..  K..  KA.
 K.A  K.A  KAA  K.A  K.A  K.A  K.A  ...  A..  ...  A..  .A.  AA.  .A.  ...
  55    3    9   80   92  120  -16   17    4  -81  -51   -8  -44   48   35

 ...  A..  ...  .A.  .A.  AA.  .A.  .A.  ..A  ..A  A.A  ..A  ..A  .AA  ...
 KA.  KA.  KA.  K..  K..  K..  K..  KA.  K..  K..  K..  K..  KA.  K..  K.A
 A..  ...  .A.  ...  A..  ...  .A.  ...  ...  A..  ...  .A.  ...  ...  ...
 -28   65   29   39   -8   35    4  -40  -37  -23  -26  -47   93  -10   77

 ...  A..  ...  ...  .A.  ..A  ...  ...  A..  ...  ...  .A.  ..A  ...  K..
 K.A  K.A  K.A  KAA  K.A  K.A  K..  K..  K..  K..  KA.  K..  K..  K.A  ...
 A..  ...  .A.  ...  ...  ...  ..A  A.A  ..A  .AA  ..A  ..A  ..A  ..A  ...
-115   32   10  -25   89  -13   63  -60  -80   13   -7   23   48   16   18

 K..  K..  K..  K..  K..  K..  K..  K..  K..  K..  KA.  KA.  KA.  KA.  KA.
 ...  A..  A..  ...  ...  A..  .A.  .A.  AA.  .A.  ...  ...  A..  ...  .A.
 A..  ...  A..  .A.  AA.  .A.  ...  A..  ...  .A.  ...  A..  ...  .A.  ...
  -9   14  -42  -43  -74   51  -20 -119 -128   56  -11    4   -9   26   88

 K.A  K.A  K.A  K.A  K.A  KAA  K..  K..  K..  K..  K..  KA.  K.A  K..  K..
 ...  ...  A..  ...  .A.  ...  ..A  ..A  A.A  ..A  .AA  ..A  ..A  ...  ...
 ...  A..  ...  .A.  ...  ...  ...  A..  ...  .A.  ...  ...  ...  ..A  A.A
 -68   -1   14   11  -36    8  -35  -66   -4  -23  -94   50    3   -8  -73

 K..  K..  K..  KA.  K.A  K..  ...  ...  ...  ...  A..  A..  A..  ...  ...
 A..  ...  .A.  ...  ...  ..A  ...  ...  A..  A..  ...  ...  A..  .A.  .A.
 ..A  .AA  ..A  ..A  ..A  ..A  .K.  AK.  .K.  AK.  .K.  AK.  .K.  .K.  AK.
 -43  -94 -116 -111  -45  -69   82   50  -25   83  -42  108   75   94   80

 ...  A..  .A.  .A.  .A.  AA.  .A.  ..A  ..A  A.A  ...  ...  ...  ...  ...
 AA.  .A.  ...  ...  A..  ...  .A.  ...  A..  ...  ..A  A.A  ...  .K.  .K.
 .K.  .K.  .K.  AK.  .K.  .K.  .K.  AK.  .K.  .K.  AK.  .K.  AKA  ...  A..
  76   48  104   81  -24   36   45   50   57   -5   88   26   68   55   47

 ...  ...  A..  A..  A..  ...  ...  ...  A..  .A.  .A.  .A.  AA.  .A.  ..A
 AK.  AK.  .K.  .K.  AK.  .K.  .K.  AK.  .K.  .K.  .K.  AK.  .K.  .K.  .K.
 ...  A..  ...  A..  ...  .A.  AA.  .A.  .A.  ...  A..  ...  ...  .A.  A..
  53  -13   29    4   34    9   -7   57   83   23   50   76   56   94   47

 ..A  A.A  ...  ...  ...  .K.  .K.  .K.  .K.  AK.  AK.  AK.  .K.  .K.  .K.
 AK.  .K.  .KA  AKA  .K.  ...  ...  A..  A..  ...  ...  A..  ...  ...  A..
 ...  ...  A..  ...  A.A  ...  A..  ...  A..  ...  A..  ...  .A.  AA.  .A.
   4   39   44   34   12  -16  -89   78   28  -26  -83   19  -12 -102  -19

 AK.  .K.  .K.  .K.  AK.  .K.  .KA  .KA  AKA  .K.  .K.  .K.  ...  ...  ...
 ...  .A.  .A.  AA.  .A.  .A.  ...  A..  ...  ..A  A.A  ...  ...  ...  ...
 .A.  ...  A..  ...  ...  .A.  A..  ...  ...  A..  ...  A.A  ...  ...  ...
 -16 -101  -69 -125  -55   16   31   75  -53  -58   14  -70
Anyway, constellations that are obviously bad (King in front, Advisors behind it) in general get a negative score, while those that look good (King in the back, behind the Advisors) typically get a high score, so it seems the method is sound, even if the statistics are poor. The numbers are in centi-Pawn, btw.
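
For reference, one way to index such a constellation table with the left-right symmetry folded in is sketched below; the square numbering, the code layout and the enumeration order are my own choices for the example, not necessarily how the engine does it.

Code: Select all

/* Sketch: index a Palace-constellation table with left-right symmetry folded
   in.  The 3x3 Palace squares are numbered 0..8; a constellation is encoded
   as (King square, Advisor bitmask), mirrored, and the smaller code is used
   as the canonical one.                                                     */
#include <string.h>

static int canonIndex[9 << 9];   /* code -> 0..176, built once at startup    */
static int nCanon;               /* ends up at 177                           */

static int mirrorSq(int s) { return s - s % 3 + 2 - s % 3; }  /* flip the file */

static int mirrorMask(int mask)
{
    int m = 0;
    for (int s = 0; s < 9; s++)
        if (mask & 1 << s) m |= 1 << mirrorSq(s);
    return m;
}

void initPalaceTable(void)
{
    memset(canonIndex, -1, sizeof canonIndex);
    nCanon = 0;
    for (int k = 0; k < 9; k++)                   /* King square              */
        for (int a = 0; a < 512; a++) {           /* Advisor bitmask          */
            int bits = 0;
            for (int s = 0; s < 9; s++) bits += a >> s & 1;
            if (a & 1 << k || bits > 2) continue; /* 0, 1 or 2 Advisors only  */
            int c     = k << 9 | a;
            int m     = mirrorSq(k) << 9 | mirrorMask(a);
            int canon = c < m ? c : m;
            if (canonIndex[canon] < 0) canonIndex[canon] = nCanon++;
            canonIndex[c] = canonIndex[canon];    /* mirror images share a slot */
        }
    /* nCanon is now 177: the 333 constellations folded by mirror symmetry   */
}
The tuned scores would then live in a 177-entry array, looked up through canonIndex[kingSq << 9 | advisorMask].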
Joost Buijs
Posts: 1563
Joined: Thu Jul 16, 2009 10:47 am
Location: Almere, The Netherlands

Re: Reinforcement learning project

Post by Joost Buijs »

chrisw wrote: Tue Mar 16, 2021 10:34 am Exactly. How do you define good and prepare good sample positions? That’s the problem.
It all boils down to trial and error; that's what makes it very time-consuming. I have been busy with it for several years now and I still don't fully understand what will work and what won't. Sometimes it drives me crazy and I have to let it rest for a few days, but I always continue in good spirits later on.
hgm
Posts: 27790
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Reinforcement learning project

Post by hgm »

I diversified the attack and protect bonuses in the evaluation, as a table mobilityBonus[attacker][victim]. Tuning that table improved the total squared error from 8753.52 to 8744.86. (This was for captures only; the victim could not be EMPTY.) I suppose that could be significant. One cannot expect perfect result prediction, as results can only be 0, 0.5 or 1, while the expected result R must vary continuously between 0 and 1. If I use a draw model that has P(draw) = 2*R*(1-R), and assume the test positions are homogeneously distributed over R, this would mean 1/3 of all positions are draws (which is about right). The average squared error per position for perfect prediction of R would in that case be 1/12. As I have about 105K training positions, that would give a total squared prediction error of about 8750 for perfect prediction.
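
Spelling that estimate out (assuming, on top of the draw model above, win and loss probabilities R^2 and (1-R)^2 so that the expected result is exactly R):

Code: Select all

% Expected squared error of a perfect predictor R under
% P(win) = R^2, P(draw) = 2R(1-R), P(loss) = (1-R)^2:
\begin{align*}
E\big[(r-R)^2 \mid R\big]
  &= R^2 (1-R)^2 + 2R(1-R)\big(\tfrac12 - R\big)^2 + (1-R)^2 R^2
   = \tfrac12 R(1-R), \\
E\big[(r-R)^2\big]
  &= \int_0^1 \tfrac12 R(1-R)\, dR = \tfrac{1}{12},
\end{align*}
so that $105\,000$ positions give a minimum attainable TSE of about $105\,000 / 12 \approx 8750$.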

So it seems the eval does about as well as can be expected, and a reduction of the TSE by 9 could actually correspond to a significant fraction of the remaining imperfection.
hgm
Posts: 27790
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Reinforcement learning project

Post by hgm »

Ferdy wrote: Tue Mar 16, 2021 2:46 am I thought about engine1 vs engine2: they are the same, except that their piece values, PST, mobility etc. are taken randomly as starting values, or engine1 takes the random values and engine2 takes your estimated best values. Let them play from random start positions and save the positions from this match for tuning later. If engine1 wins, use it in the tuning. After the tuning, play the tuned values against engine1 and save the positions from this match for tuning. If the tuned values win, use them in the next tuning session. However, if engine1 wins, then either the tuner fails to optimize the values, or the optimal values have already been reached for the engine's current capability and the given parameters under tuning. Perhaps generate more positions for tuning and repeat the tuning process: create more random start positions, use them in the match, and save the positions from the match for tuning. If engine1 still wins after that tuning, the tuning ends; the values in engine1 are the best.

Tuning can be done again when new features are added, or when features are newly exposed for tuning.
I don't think the positions from games between engines with different settings can be used for tuning, because the WDL statistics would be biased in favor of the stronger settings, while the tuner would try to explain that bias through eval patterns it recognizes in the positions. One could try to compensate for the measured Elo difference, but it would not be obvious how to do that for positions halfway through the game.

Suppose the tuned version is indeed better. (If not, it seems this project is pretty much stillborn.) I wonder if I should then use that tuned version to play the games for the next iteration. What I am afraid of is that the things that are tuned to be considered very bad will never occur anymore, so that the newly generated training set will no longer contain the information that these are bad, and the proper tuning for them will get lost. Perhaps this can be solved by keeping the old training positions, where the engine still did bad things, in the set forever.
Ferdy
Posts: 4833
Joined: Sun Aug 10, 2008 3:15 pm
Location: Philippines

Re: Reinforcement learning project

Post by Ferdy »

hgm wrote: Tue Mar 16, 2021 5:05 pm
Ferdy wrote: Tue Mar 16, 2021 2:46 am I thought about engine1 vs engine2: they are the same, except that their piece values, PST, mobility etc. are taken randomly as starting values, or engine1 takes the random values and engine2 takes your estimated best values. Let them play from random start positions and save the positions from this match for tuning later. If engine1 wins, use it in the tuning. After the tuning, play the tuned values against engine1 and save the positions from this match for tuning. If the tuned values win, use them in the next tuning session. However, if engine1 wins, then either the tuner fails to optimize the values, or the optimal values have already been reached for the engine's current capability and the given parameters under tuning. Perhaps generate more positions for tuning and repeat the tuning process: create more random start positions, use them in the match, and save the positions from the match for tuning. If engine1 still wins after that tuning, the tuning ends; the values in engine1 are the best.

Tuning can be done again when new features are added, or when features are newly exposed for tuning.
I don't think the positions from games between engines with different settings can be used for tuning, because the WDL statistics would be biased in favor of the stronger settings, while the tuner would try to explain that bias through eval patterns it recognizes in the positions. One could try to compensate for the measured Elo difference, but it would not be obvious how to do that for positions halfway through the game.
This is reinforcement: let the engine be trained on the positions it has encountered in the match. There can be positions where it drew or lost. The positions it won are its experience; the weights are there to guide it on how to win the game. Also, the weights of things it previously learned as wins may change because of the draws and losses. In most cases, when the tuned values are matched against the current best, there are losses and draws. This is also the time to check the effectiveness of the tuner: is it able to improve from the positions the engine experienced in the past?
hgm
Posts: 27790
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Reinforcement learning project

Post by hgm »

I found an evaluation term that reduced the TSE quite a lot: I make a 9-bit hash key from all pieces that have a move into a certain Palace, plus the number of Advisors in that Palace, and use it as an index into a lookup table for a 'King-Safety' bonus. There are 52 combinations of 0-3 pieces, and 50 combinations of 4 pieces. I could find a hashing scheme that maps all 0-to-3-piece combinations plus 0 or 1 Advisor, and all 0-to-4-piece combinations plus 2 Advisors, uniquely into that table. Combinations with more attackers hardly occur in my training set, and would map to the remaining 306 table elements in a non-unique way (or overlap with the mentioned entries). All these 306 entries could just hold some large bonus.
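
The general shape of such a lookup, leaving the actual collision-free hashing scheme aside, could look like the sketch below; the piece enumeration and the key weights are made up for illustration and are certainly not collision-free by themselves.

Code: Select all

/* Illustrative king-safety lookup: build a small key from the number of each
   piece type that attacks the Palace plus the Advisors inside it, and index
   a tuned 512-entry bonus table with it.  The weights below are arbitrary;
   a real scheme would be chosen so that the frequent attacker combinations
   map to unique slots, as described above.                                  */
enum { PAWN, HORSE, ELEPHANT, CHARIOT, CANNON, NPIECETYPES };

static int ksBonus[512];                              /* 9-bit key -> bonus (cp) */
static const int keyWeight[NPIECETYPES] = { 1, 7, 11, 23, 41 };

int kingSafety(const int attackers[NPIECETYPES], int advisorsInPalace)
{
    int key = 151 * advisorsInPalace;                 /* arbitrary multiplier    */
    for (int p = 0; p < NPIECETYPES; p++)
        key += keyWeight[p] * attackers[p];
    return ksBonus[key & 511];                        /* fold to 9 bits          */
}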

Tuning this reduced the TSE from 8741.82 to 8708.51 (and was not even fully converged). I am very curious what this will do in game play.
Joost Buijs
Posts: 1563
Joined: Thu Jul 16, 2009 10:47 am
Location: Almere, The Netherlands

Re: Reinforcement learning project

Post by Joost Buijs »

A lower TSE/MSE doesn't always mean that the result will be better. With just 105K positions there is a very big chance of over-fitting, especially if you don't use some kind of regularization.

I usually keep 2 independent data-sets, one for training and another for validation after each epoch; it often happens that the MSE on the training set goes down while the MSE on the validation set goes up. In the end, the only way to determine the quality of the tuning is to have the engine play several thousand games, which makes the whole process very time-consuming.
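
The validation discipline is simple to bolt onto any tuner; a minimal sketch is below, where runEpoch() and mseOn() are placeholders for the tuner's own gradient-descent pass and error evaluation.

Code: Select all

/* Minimal early-stopping sketch: after every epoch, measure the MSE on a
   held-out validation set and keep the parameter vector with the best
   validation error.  The helpers are placeholders, not real functions.      */
#include <string.h>

#define NPARAM 800

double param[NPARAM];

extern void   runEpoch(double *param);              /* one GD pass over the training set  */
extern double mseOn(const double *param, int set);  /* set: 0 = training, 1 = validation  */

void trainWithValidation(int maxEpochs)
{
    double bestParam[NPARAM];
    double bestVal = mseOn(param, 1);
    memcpy(bestParam, param, sizeof bestParam);

    for (int epoch = 0; epoch < maxEpochs; epoch++) {
        runEpoch(param);
        double val = mseOn(param, 1);
        if (val < bestVal) {                /* validation error still improving */
            bestVal = val;
            memcpy(bestParam, param, sizeof bestParam);
        }
        /* the training MSE keeps falling, but once the validation MSE rises
           we are over-fitting; the best snapshot so far is what gets used    */
    }
    memcpy(param, bestParam, sizeof bestParam);
}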

My current data-set consists of several billion unique positions, and I notice there is still some over-fitting; often the engine plays better after letting the GD run for just 3 epochs instead of letting it run until the tuning has fully converged. ML is not for the faint of heart: programming is a very exact pastime, whereas ML is so fuzzy that nobody really grasps what actually goes on behind the curtains.
hgm
Posts: 27790
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Reinforcement learning project

Post by hgm »

Indeed, this is what I see. The fully tuned engine is not just weaker than the untuned one: it is enormously weaker. Basically it loses each and every game against the untuned engine.

It is also obvious why this is: the PSTs get such high values that it very quickly sacrifices two or three Pawns to put its other pieces on the 'optimal' squares. The problem is that these squares are only optimal for attacking the Palace under conditions where sufficiently many Pawns participate in the attack: in Janggi you need those Pawns to compensate for the fact that the opponent's Advisors defend the Palace, while your own Advisors cannot attack it. Without enough Pawns to trade away the Advisors, an attack has zero chance of succeeding, no matter how well executed. So the PST bonuses you 'gained' at the expense of a few Pawns are in fact worthless. They are even worse than worthless, because with all hope of a victory gone, your only resort is to salvage the draw by defending your own Palace. And the locations good for attacking the enemy Palace are in most cases very bad for defending your own. So you get the worst of both worlds: you sacrificed Pawns to get your pieces on the positionally worst possible squares.

This is of course all caused by the fact that the evaluation function that was tuned was fundamentally flawed. It assumed the evaluation was purely additive, and did not account for the fact that you need completely different PSTs when ahead or equal in material (or at least in Pawns) than when behind.