A database for learning evaluation functions

AlvaroBegue · Post by **AlvaroBegue** » Fri Oct 28, 2016 11:11 am

Hi,

I have been thinking of doing this for a long time, and I am finally doing it. I am building a database of ~1.34M positions and associated game results, trying to have as few biases as possible. It should be suitable for learning an evaluation function with a neural network, or for tuning the parameters of a more traditional evaluation function.

This is the procedure I am using:
(1) Download a PGN database with all CCRL 40/4 games.
(2) Convert games to simple lists of moves, with one game per line.
(3) Use a modified version of my engine RuyDos that starts a search from a random position in each game (at least 20 plies into the game, and at least 10 plies before the end of the game) and extracts the position seen at the 1,000th call to the static evaluation function.
(4) Play one game per position using Stockfish 6, using cutechess-cli with time control 30/1 (about 2 seconds per game, and with my 4 cores working concurrently, I expect this will take about a week).
(5) Join the positions and the results.
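
The sampling rule in step (3) can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, a "game" is taken to be one whitespace-separated line of moves as produced by step (2), and the RuyDos part (searching from the chosen position and dumping the 1,000th evaluated position) is not shown.

```python
import random

MIN_PLIES_FROM_START = 20  # "at least 20 plies into the game"
MIN_PLIES_BEFORE_END = 10  # "at least 10 plies before the end of the game"

def pick_start_ply(num_plies, rng=random):
    """Return how many plies to replay before handing the position to the engine.

    Returns None when the game is too short to satisfy both constraints.
    """
    lo = MIN_PLIES_FROM_START
    hi = num_plies - MIN_PLIES_BEFORE_END
    if hi < lo:
        return None
    return rng.randint(lo, hi)  # inclusive on both ends

def sample_start_positions(lines, seed=0):
    """Yield (moves_to_replay, all_moves) for each usable one-game-per-line record."""
    rng = random.Random(seed)
    for line in lines:
        moves = line.split()
        ply = pick_start_ply(len(moves), rng)
        if ply is not None:
            yield moves[:ply], moves
```

Games shorter than 30 plies cannot satisfy both limits and are simply skipped in this sketch.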

The first ten positions look like this:

Code: Select all

3r4/4k3/8/5p1R/8/1b2PB2/1P6/4K3 b - - 1-0
3nk2r/rp1b2pp/pR3p2/3P4/1Q3q2/3B1N2/5PPP/5RK1 w k - 1-0
1R6/7p/4k1pB/p1PpP3/2n4P/3K4/P4r2/8 b - - 0-1
3R4/5r1k/2b4p/5p2/1PB5/4q3/P4RPP/6K1 w - - 1/2-1/2
8/5kp1/p4n1p/3pK3/1B6/8/8/8 w - - 0-1
3q3k/1br2pp1/1p6/pP1pR1b1/3P4/P2Q2P1/1B5P/5RK1 b - - 1-0
2b1rbk1/1p1n1pp1/3p3p/6q1/1BB1P3/2N2P1P/R2Q2P1/6K1 w - - 1/2-1/2
2q3k1/5pp1/p3p2p/1p6/1n1P4/5PP1/PP1QN2P/3R2K1 w - - 1-0
8/3b1Q1p/p2p1pp1/4bk2/4r1q1/7P/P4PP1/3R1RK1 w - - 1-0
rq3rk1/2p2ppp/p2b4/1p1Rp1BQ/4P3/1P5P/1PP2PP1/3R2K1 b - - 1-0
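
Each line above has five whitespace-separated fields: piece placement, side to move, castling rights, en passant square, and the game result. A minimal reader might look like the sketch below; the function name is hypothetical, and it assumes the result uses the usual PGN convention (1-0 means White won), so the numeric label is from White's point of view.

```python
RESULT_TO_SCORE = {"1-0": 1.0, "1/2-1/2": 0.5, "0-1": 0.0}

def parse_labeled_position(line):
    """Split 'board side castling en-passant result' into (fen_prefix, white_score)."""
    fields = line.split()
    if len(fields) != 5:
        raise ValueError(f"expected 5 fields, got {len(fields)}: {line!r}")
    *fen_fields, result = fields
    return " ".join(fen_fields), RESULT_TO_SCORE[result]
```

An unknown result token raises a KeyError, which is probably what you want when joining positions and results in step (5).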

Will anybody be interested in a database like this? Is there anything you would have done differently?

Thanks!
Álvaro.

brianr · Post by **brianr** » Fri Oct 28, 2016 1:21 pm

It depends on what you want, as tuning approaches vary (score, best move, result). With the CCRL games you already have a result.

I just modified Stockfish to provide scores for FENs and now have 100+ million scored positions. You could also use another super-strong oracle engine like Komodo (less straightforward, but doable with 'go depth <shallow>').

Giraffe's approach of making random moves is also helpful as tuning seems to require poor positions as well as "good" ones.

In any case, on the order of 1 million positions is far too few, IMO.

Finally, testing methodology and process must be precise. Adding tuning to testing for actual improvements requires meticulous attention to detail. I have been spinning my wheels with this combo for years, with occasional 100 Elo jumps (very few and far between).

kinderchocolate · Post by **kinderchocolate** » Fri Oct 28, 2016 2:31 pm

I'm indeed doing something like that. I prefer all three:

1. Get high-quality games from CCRL
2. Get low-quality games from FICS
3. Randomly generate

The assumption is that the network will be able to learn from high-quality positions without being biased away from very common club-level positions. Random generation will hopefully fill in the remaining sampling space.

AlvaroBegue · Post by **AlvaroBegue** » Fri Oct 28, 2016 2:45 pm

brianr wrote: It depends on what you want, as tuning approaches vary (score, best move, result). With the CCRL games you already have a result.

Of course, but using the result from the database has several problems.

* You don't see the same kind of positions that you will see during a search, because some moves in the search have terrible quality. Giraffe's random-move trick is a way to address this. I am addressing it much more directly, by extracting my positions from actual calls to the evaluation function.

* If you are trying to learn how good or bad a particular feature is (say, having a white queen on b7 early in the game), it is possible that engines only do that when it's a good idea, for reasons that your evaluation function may not understand. But the training would learn that it is a good idea, period.

* Using all the positions from each game means you have many positions with similar inputs and identical results, so in effect you have a lot less data than you think you have.

* You don't know if an early position was won by black because it was a good position for black or if the engine playing white was weak. You can try to use the Elo rating difference to adjust for this, but what I am producing is a more pure data set to concentrate on the virtues of the position.

The CPW page on Texel's Tuning Method expresses similar concerns.

brianr wrote: I just modified Stockfish to provide scores for FENs and now have 100+ million scored positions. You could also use another super-strong oracle engine like Komodo (less straightforward, but doable with 'go depth <shallow>').

Yes, this is a reasonable alternative, but I am afraid this could result in the network repeating the vices of the oracle. Imagine Stockfish doesn't like non-isolated doubled pawns in the opening (it's hard to find a realistic example, so just bear with me), but it turns out they don't actually hurt your winning chances. I think using results of games will be less affected by the vice in the engine I am learning from.

brianr wrote: Giraffe's approach of making random moves is also helpful as tuning seems to require poor positions as well as "good" ones.

Agreed. But is one random move enough, or should it be several? I think sampling from the positions that the evaluation function actually sees eliminates this issue.

brianr wrote: In any case, on the order of 1 million positions is far too few, IMO.

This is possibly true, and getting a sense for it is part of why I posted this question. I know AlphaGo used 30M positions to train their evaluation function, but I intend to have much simpler networks than what they used, so perhaps I don't need as much data.

We'll see. I might have to run this for much longer, or perhaps I have to switch to some cheaper alternative, if I find out I am starved for more positions.

brianr wrote: Finally, testing methodology and process must be precise. Adding tuning to testing for actual improvements requires meticulous attention to detail. I have been spinning my wheels with this combo for years, with occasional 100 Elo jumps (very few and far between).

Agreed, but I haven't gotten that far yet. I have several ideas for how to create evaluation functions once I have this data, and in deciding among them I will definitely need to be disciplined about testing.

EDIT: Oh, and thanks for your comments! Much appreciated.

jdart · Post by **jdart** » Fri Oct 28, 2016 9:13 pm

MMTO, which is a technique widely used in Shogi programs, basically tunes the evaluation function to match moves made by a strong player. For computer chess this could be an engine used as an "oracle" such as Stockfish. Or known good players such as strong correspondence GMs. What this technique does is to train an evaluation to be similar to that of the oracle. For this you want a library of strong games with (mostly) correct moves.

A more common procedure in chess is to extract all FENs from a PGN (except those in the early opening and those in the late endgame, i.e. tablebase range), obtain a score for each, and do logistic regression against the game result. What this does is train the eval function to prefer moves that lead to winning results. For this you typically want games that include both good and not-so-good players, so that imbalanced positions are included. CCRL games would probably work. (Beware, though: some PGN collections contain games lost on time or by forfeit, and for these the game result is not valid for tuning.)
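
For reference, the error being minimized in that kind of logistic fit can be sketched as below. It follows the form given on the CPW page for Texel's Tuning Method (a sigmoid with a scaling constant K applied to a centipawn score); the function names and the `evaluate` callable are hypothetical stand-ins for whatever your engine provides, and scores and results are both assumed to be from White's point of view.

```python
def expected_score(eval_cp, k=1.0):
    """Map a centipawn score (White's view) to an expected result in [0, 1]."""
    return 1.0 / (1.0 + 10.0 ** (-k * eval_cp / 400.0))

def mean_squared_error(samples, evaluate, k=1.0):
    """Average squared gap between game results and predicted results.

    samples  : iterable of (position, result), result in {0.0, 0.5, 1.0}
    evaluate : callable returning a centipawn score for a position
    """
    total = 0.0
    count = 0
    for position, result in samples:
        total += (result - expected_score(evaluate(position), k)) ** 2
        count += 1
    return total / count
```

Tuning then means adjusting the evaluation parameters (with K held fixed) so that this number goes down.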

--Jon

cetormenter · Post by **cetormenter** » Sat Oct 29, 2016 2:10 am

Here is a database I am currently using for running the Texel method. It contains ~7.5 million positions from random runs of various dev versions of Nirvana.

http://www.mediafire.com/file/azvxcpzp4 ... Method.rar

The end result is 3 files separated by wins/draws/losses from white's perspective.

brtzsnr · Post by **brtzsnr** » Sat Oct 29, 2016 5:03 am

Hi!

I'm traveling and thus have little time to reply inline to all, but here are my observations.

1) I already published a set of 725K test positions at https://bitbucket.org/zurichess/tuner/downloads . See quiet-labeled.epd. I use this set to train Zurichess, which is ~2700 on CCRL.
2) These are very diverse quiet positions. As Matthew Lai pointed out a while ago, using positions from CCRL games doesn't produce very good results for Giraffe. He made some random moves; I sampled from the millions of positions reached during many self-play games of Zurichess.
3) I use quiet positions to reduce the effect of tactical play. I just want to tune the evaluation without worrying about the search.
4) It's important that the score of the position (1-0, 1/2-1/2, 0-1) comes from equal-strength engines; otherwise you can't know for sure whether it's a losing position or a winning position. I used Stockfish 7 playing 40/5+0.05 games to evaluate each position.
5) Using 725K over 200K only gives about 3-6 Elo extra. Going to millions of positions won't help much; it will only slow the tuner needlessly.
6) I'm using ADAM to tune my engine's weights, and as of recently Arasan uses it too: http://www.arasanchess.org/blog.shtml . ADAM gives a pretty accurate minimum, so I'm reasonably confident that this set is better than what I had before.
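
For readers who have not met ADAM (item 6), here is a minimal sketch of its update rule on a toy objective. The hyperparameters beta1, beta2 and eps are the defaults from the Adam paper; the learning rate, step count, function names and the toy objective are assumptions chosen for illustration, not Zurichess's or Arasan's actual settings.

```python
import math

def adam_minimize(grad, w0, steps=2000, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """Minimize a function of a weight vector, given its gradient, using Adam."""
    w = list(w0)
    m = [0.0] * len(w)  # running mean of gradients
    v = [0.0] * len(w)  # running mean of squared gradients
    for t in range(1, steps + 1):
        g = grad(w)
        for i in range(len(w)):
            m[i] = beta1 * m[i] + (1.0 - beta1) * g[i]
            v[i] = beta2 * v[i] + (1.0 - beta2) * g[i] * g[i]
            m_hat = m[i] / (1.0 - beta1 ** t)  # bias correction for the zero start
            v_hat = v[i] / (1.0 - beta2 ** t)
            w[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return w

# Toy objective: f(w) = (w[0] - 3)^2 + (w[1] + 1)^2
def toy_grad(w):
    return [2.0 * (w[0] - 3.0), 2.0 * (w[1] + 1.0)]
```

In a real tuner, `grad` would be the gradient of the tuning error with respect to the evaluation weights, usually computed on mini-batches of positions.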

Hope this helps,

Robert Pope · Post by **Robert Pope** » Tue Nov 29, 2016 3:52 am

brtzsnr wrote:Hi!

I'm traveling and thus have little time to reply inline to all, but here are my observations.

1) I already published a set of 725K test positions at https://bitbucket.org/zurichess/tuner/downloads . See quiet-labeled.epd. I use this set to train Zurichess, which is ~2700 on CCRL.
2) These are very diverse quiet positions. As Matthew Lai pointed out a while ago, using positions from CCRL games doesn't produce very good results for Giraffe. He made some random moves; I sampled from the millions of positions reached during many self-play games of Zurichess.
3) I use quiet positions to reduce the effect of tactical play. I just want to tune the evaluation without worrying about the search.
4) It's important that the score of the position (1-0, 1/2-1/2, 0-1) comes from equal-strength engines; otherwise you can't know for sure whether it's a losing position or a winning position. I used Stockfish 7 playing 40/5+0.05 games to evaluate each position.
5) Using 725K over 200K only gives about 3-6 Elo extra. Going to millions of positions won't help much; it will only slow the tuner needlessly.
6) I'm using ADAM to tune my engine's weights, and as of recently Arasan uses it too: http://www.arasanchess.org/blog.shtml . ADAM gives a pretty accurate minimum, so I'm reasonably confident that this set is better than what I had before.

Hope this helps,

After having bad luck with TD-Leaf, I went back to Texel tuning and tuned Abbess' weights using the quiet-labeled file above. After 8 hours, I ended up with a set of weights that scored 75% (+187 Elo) relative to the untuned version. Not bad!
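
The relation between a match score and an Elo difference under the standard logistic Elo model can be sketched like this (function names are hypothetical):

```python
import math

def elo_from_score(score):
    """Elo difference implied by an expected score strictly between 0 and 1."""
    if not 0.0 < score < 1.0:
        raise ValueError("score must be strictly between 0 and 1")
    return 400.0 * math.log10(score / (1.0 - score))

def score_from_elo(elo_diff):
    """Expected score for the side that is elo_diff points stronger."""
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))
```

Under this model an exact 75% score corresponds to about +191 Elo, and +187 Elo to about 74.6%, so the two figures quoted above agree once the percentage is read as rounded.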

Next step: look for bizarre values that need to be smoothed out.

brtzsnr · Post by **brtzsnr** » Wed Nov 30, 2016 12:04 pm

Hi, Robert!

Glad to hear that I could help. +187 Elo is a lot and I look forward to the next release of your chess engine!

I have a few questions:
1) Did you use any regularization of the weights?
2) What are the computed piece values (pawn ... queen)?
3) Which algorithm did you use for optimizing? I use AdamGrad, but I was wondering whether another optimizer could improve the minimum.
4) What is the value of your loss function? That is the value of E in https://chessprogramming.wikispaces.com ... ing+Method
5) Did you try with tanh activation function?

Robert Pope · Post by **Robert Pope** » Wed Nov 30, 2016 5:22 pm

brtzsnr wrote:Hi, Robert!

Glad to hear that I could help. +187 Elo is a lot and I look forward to the next release of your chess engine!

I have a few questions:
1) Did you use any regularization of the weights?
2) What are the computed piece values (pawn ... queen)?
3) Which algorithm did you use for optimizing? I use AdamGrad, but I was wondering whether another optimizer could improve the minimum.
4) What is the value of your loss function? That is the value of E in https://chessprogramming.wikispaces.com ... ing+Method
5) Did you try with tanh activation function?

I should know this, but what do you mean by weight regularization?

I am using a super-dumb approach for optimizing: pick a term, and increment/decrement it by 1/2 centipawn until I hit a local minimum or a 50 cp change in weight. Then go to the next term.
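
A minimal sketch of that one-term-at-a-time search, assuming a `loss` callable that returns the tuning error for a full weight list. The function name is hypothetical, and details the post does not state (which direction is tried first, whether several passes are made) are guesses.

```python
def tune_one_term_at_a_time(weights, loss, step=0.5, max_change=50.0):
    """Local search: nudge each weight in turn while the loss keeps improving.

    weights    : list of floats (centipawns), modified in place and returned
    loss       : callable taking the weight list and returning a float to minimize
    step       : increment size (1/2 centipawn in the post)
    max_change : cap on how far one term may move in a visit (50 cp in the post)
    """
    best = loss(weights)
    for i in range(len(weights)):
        start = weights[i]
        for direction in (+1.0, -1.0):
            moved = False
            while abs(weights[i] + direction * step - start) <= max_change:
                weights[i] += direction * step
                trial = loss(weights)
                if trial < best:
                    best = trial
                    moved = True
                else:
                    weights[i] -= direction * step  # undo the failed step
                    break
            if moved:
                break  # improvement found this way; skip the opposite direction
    return weights
```

Every trial step costs one full pass over the position set, which helps explain why a run can take hours.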

I only look at n*E, which is about 45000, down from 49000 at the start, IIRC. I don't know if that measure can compare across programs, though.

I only used the logistic function.

Maybe now that my eval terms are a bit more reasonable, some things like LMR will be more workable.

Here are my piece weights. I haven't had a chance to see how my piece/square tables skew them.

Code: Select all

material    midgame  endgame
1 pawn          100      100
2 pawns         255      201
3 pawns         338      299
4 pawns         423      394
5 pawns         510      491
6 pawns         590      594
7 pawns         681      682
8 pawns         775      760
1 knight        317      280
2 knights       666      542
1 bishop        326      294
2 bishops       701      623
1 rook          466      517
2 rooks         924    1,031
queen           967      948

Perhaps the most interesting thing to come out so far is that my mobility bonus switched from positive to negative. I'll have to test removing it entirely.

Another idea I had was to change the weighting of each epd line: it seems that the earlier in the game you are, the more likely the result will be influenced by a blunder by one side, so early moves should be given a lower weight in the optimization process.
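
One way to express that idea is a weighted version of the squared error, sketched below. Everything here is an assumption for illustration: the linear ramp, the `full_weight_ply` cutoff, the function names, and the availability of a ply number for each position (the EPD lines may not carry one).

```python
def ply_weight(ply, full_weight_ply=80):
    """Hypothetical ramp: weight grows linearly with ply, reaching 1.0 at full_weight_ply."""
    return min(1.0, max(0.0, ply / full_weight_ply))

def weighted_squared_error(samples, predict, weight=ply_weight):
    """Weighted mean of squared errors.

    samples : iterable of (position, ply, result) with result in {0.0, 0.5, 1.0}
    predict : callable returning the predicted result in [0, 1] for a position
    weight  : callable mapping a ply number to a non-negative weight
    """
    num = 0.0
    den = 0.0
    for position, ply, result in samples:
        w = weight(ply)
        num += w * (result - predict(position)) ** 2
        den += w
    return num / den if den > 0.0 else 0.0
```

Dividing by the sum of weights keeps the value comparable to the unweighted error, so the same optimizer can be reused unchanged.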
