A database for learning evaluation functions

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

AlvaroBegue
Posts: 931
Joined: Tue Mar 09, 2010 3:46 pm
Location: New York
Full name: Álvaro Begué (RuyDos)

A database for learning evaluation functions

Post by AlvaroBegue »

Hi,

I have been thinking of doing this for a long time, and I am finally doing it. I am building a database of ~1.34M positions with associated game results, trying to introduce as little bias as possible. It should be suitable for learning an evaluation function with a neural network, or for tuning parameters in a more traditional evaluation function.

This is the procedure I am using:
(1) Download a PGN database with all CCRL 40/4 games.
(2) Convert games to simple lists of moves, with one game per line.
(3) Use a modified version of my engine RuyDos that starts from a random position in each game (at least 20 plies into the game and at least 10 plies before its end) and extracts the position of the 1,000th call to the static evaluation function (a sketch of the sampling follows this list).
(4) Play one game from each position with Stockfish 6 via cutechess-cli at time control 30/1 (about 2 seconds per game; with my 4 cores working concurrently, I expect this to take about a week).
(5) Join the positions and the results.
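
For step (3), the sampling of the starting position looks roughly like this (a python-chess sketch of the sampling step only, assuming the move lists hold UCI strings; the eval-call extraction itself happens inside RuyDos):

Code: Select all

import random
import chess

def sample_position(moves):
    """Pick a random position from one game: at least 20 plies in and
    at least 10 plies before the end. Returns a FEN, or None if the
    game is too short. `moves` is the game as a list of UCI strings."""
    if len(moves) < 30:
        return None
    ply = random.randint(20, len(moves) - 10)
    board = chess.Board()
    for move in moves[:ply]:
        board.push_uci(move)
    return board.fen()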

The first ten positions look like this:

Code: Select all

3r4/4k3/8/5p1R/8/1b2PB2/1P6/4K3 b - - 1-0
3nk2r/rp1b2pp/pR3p2/3P4/1Q3q2/3B1N2/5PPP/5RK1 w k - 1-0
1R6/7p/4k1pB/p1PpP3/2n4P/3K4/P4r2/8 b - - 0-1
3R4/5r1k/2b4p/5p2/1PB5/4q3/P4RPP/6K1 w - - 1/2-1/2
8/5kp1/p4n1p/3pK3/1B6/8/8/8 w - - 0-1
3q3k/1br2pp1/1p6/pP1pR1b1/3P4/P2Q2P1/1B5P/5RK1 b - - 1-0
2b1rbk1/1p1n1pp1/3p3p/6q1/1BB1P3/2N2P1P/R2Q2P1/6K1 w - - 1/2-1/2
2q3k1/5pp1/p3p2p/1p6/1n1P4/5PP1/PP1QN2P/3R2K1 w - - 1-0
8/3b1Q1p/p2p1pp1/4bk2/4r1q1/7P/P4PP1/3R1RK1 w - - 1-0
rq3rk1/2p2ppp/p2b4/1p1Rp1BQ/4P3/1P5P/1PP2PP1/3R2K1 b - - 1-0
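
Each line is just the first four FEN fields followed by the game result, so loading the set is trivial (a quick Python sketch; the helper name is mine, and the missing move counters get padded back in for FEN-consuming tools):

Code: Select all

def load_positions(path):
    """Parse lines like '<board> <side> <castling> <ep> <result>' into
    (fen, result) pairs, mapping the result to 1 / 0.5 / 0 from
    White's perspective."""
    result_map = {"1-0": 1.0, "1/2-1/2": 0.5, "0-1": 0.0}
    data = []
    with open(path) as f:
        for line in f:
            *fields, result = line.split()
            fen = " ".join(fields) + " 0 1"  # pad the missing counters
            data.append((fen, result_map[result]))
    return data
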
Would anybody be interested in a database like this? Is there anything you would have done differently?

Thanks!
Álvaro.
brianr
Posts: 536
Joined: Thu Mar 09, 2006 3:01 pm

Re: A database for learning evaluation functions

Post by brianr »

It depends on what you want, as tuning approaches vary (score, best move, result). With the CCRL games you already have a result.

I just modified Stockfish to provide scores for FENs and have 100+MM positions. You could also use another super-strong oracle engine like Komodo (less straightforward, but doable with 'go depth <shallow>').
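
An unmodified engine can also be driven over UCI for this (a python-chess sketch; the engine path and the depth are placeholders):

Code: Select all

import chess
import chess.engine

# Score a batch of FENs with a shallow fixed-depth search.
fens = ["3r4/4k3/8/5p1R/8/1b2PB2/1P6/4K3 b - - 0 1"]

engine = chess.engine.SimpleEngine.popen_uci("./stockfish")  # placeholder path
scores = {}
for fen in fens:
    info = engine.analyse(chess.Board(fen), chess.engine.Limit(depth=8))
    # Centipawn score from White's point of view; mates mapped to a big value.
    scores[fen] = info["score"].white().score(mate_score=32000)
engine.quit()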

Giraffe's approach of making random moves is also helpful, as tuning seems to require poor positions as well as "good" ones.

In any case, on the order of 1MM positions is far too few, IMO.

Finally, testing methodology and process must be precise. Adding tuning to testing for actual improvements requires meticulous attention to detail. I have been spinning my wheels with this combination for years, with occasional 100 Elo jumps (very few and far between).
kinderchocolate
Posts: 454
Joined: Mon Nov 01, 2010 6:55 am
Full name: Ted Wong

Re: A database for learning evaluation functions

Post by kinderchocolate »

I'm indeed doing something like that. I prefer a mix of all three:

1. Get high-quality games from CCRL
2. Get low-quality games from FICS
3. Randomly generate

The assumption is that the network will be able to learn from high-quality positions while not being biased away from very common club-level positions. The random generator will hopefully fill in the remaining sample space.
AlvaroBegue
Posts: 931
Joined: Tue Mar 09, 2010 3:46 pm
Location: New York
Full name: Álvaro Begué (RuyDos)

Re: A database for learning evaluation functions

Post by AlvaroBegue »

brianr wrote:It depends on what you want, as tuning approaches vary (score, best move, result). With the CCRL games you already have a result.
Of course, but using the result from the database has several problems.

* You don't see the same kind of positions that you will see during a search, because some moves in the search are of terrible quality. Giraffe's random-move trick is one way to address this. I am addressing it much more directly, by extracting my positions from actual calls to the evaluation function.

* If you are trying to learn how good or bad a particular feature is (say, having a white queen on b7 early in the game), it is possible that engines only do that when it's a good idea, for reasons that your evaluation function may not understand. But the training would learn that it is a good idea, period.

* Using all the positions from each game means you have many positions with similar inputs and identical results, so in effect you have a lot less data than you think you have.

* You don't know if an early position was won by Black because it was a good position for Black or because the engine playing White was weak. You can try to use the Elo rating difference to adjust for this, but what I am producing is a purer data set that concentrates on the merits of the position itself.

The CPW page on Texel's Tuning Method expresses similar concerns.

I just modified Stockfish to provide scores for FENs and have 100+MM positions. You could also use another super-strong oracle engine like Komodo (less straightforward, but doable with 'go depth <shallow>').
Yes, this is a reasonable alternative, but I am afraid it could result in the network repeating the vices of the oracle. Imagine Stockfish doesn't like non-isolated doubled pawns in the opening (it's hard to find a realistic example, so just bear with me), but it turns out they don't actually hurt your winning chances. I think using game results will be less affected by vices of the engine I am learning from.

Giraffe's approach of making random moves is also helpful, as tuning seems to require poor positions as well as "good" ones.
Agreed. But is one random move enough, or should it be several? I think sampling from the positions that the evaluation function actually sees eliminates this issue.

In any case, on the order of 1MM positions is far too few, IMO.
This is possibly true, and getting a sense for it is part of why I posted this question. I know AlphaGo used 30M positions to train their evaluation function, but I intend to have much simpler networks than what they used, so perhaps I don't need as much data.

We'll see. I might have to run this for much longer, or perhaps switch to some cheaper alternative if I find out I am starved for positions.

Finally, testing methodology and process must be precise. Adding tuning to testing for actual improvements requires meticulous attention to detail. I have been spinning my wheels with this combination for years, with occasional 100 Elo jumps (very few and far between).
Agreed, but I haven't gotten that far yet. I have several ideas for how to create evaluation functions once I have this data, and in deciding among them I will definitely need to be disciplined about testing.


EDIT: Oh, and thanks for your comments! Much appreciated.
jdart
Posts: 4366
Joined: Fri Mar 10, 2006 5:23 am
Location: http://www.arasanchess.org

Re: A database for learning evaluation functions

Post by jdart »

MMTO, a technique widely used in Shogi programs, basically tunes the evaluation function to match moves made by a strong player. For computer chess the oracle could be a strong engine such as Stockfish, or known good players such as strong correspondence GMs. What this technique does is train an evaluation to be similar to that of the oracle. For this you want a library of strong games with (mostly) correct moves.
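
A minimal sketch of the move-matching idea, at 1 ply only (real MMTO backs values up through a search; `positions` and `evaluate` here are assumed interfaces, with python-chess boards):

Code: Select all

import math

def sigmoid(x, scale=100.0):
    return 1.0 / (1.0 + math.exp(-x / scale))

def move_match_loss(positions, evaluate):
    """Penalize every legal move that the current evaluation ranks
    above the move the oracle actually played; minimizing this pushes
    the eval toward agreeing with the oracle. `positions` holds
    (board, oracle_move) pairs and `evaluate(board, move)` returns a
    score for the side to move after `move` is made."""
    loss = 0.0
    for board, oracle_move in positions:
        oracle_score = evaluate(board, oracle_move)
        for move in board.legal_moves:
            if move != oracle_move:
                loss += sigmoid(evaluate(board, move) - oracle_score)
    return loss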

A more common procedure in chess is to extract all FENs from a PGN (except those in the early opening and those in the late endgame, i.e. tablebase range), obtain a score for each, and do logistic regression against the game result. This trains the eval function to prefer moves that lead to winning results. For this you typically want games that include both good and not-so-good players, so that imbalanced positions are included. CCRL games would probably work. (Beware, though: some PGN collections include games lost on time or by forfeit, and for these the game result is not valid for tuning.)
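
Concretely, that regression minimizes the mean squared error between the game result and a sigmoid of the score, as on the CPW Texel tuning page (a Python sketch; K = 1.13 is only an illustrative scaling constant, which has to be fitted per engine):

Code: Select all

def texel_error(data, K=1.13):
    """Mean squared error between game results and the logistic
    win probability implied by the engine's score. `data` is a list
    of (score_cp, result) pairs, with the result 1 / 0.5 / 0 from
    White's perspective."""
    e = 0.0
    for score_cp, result in data:
        win_prob = 1.0 / (1.0 + 10.0 ** (-K * score_cp / 400.0))
        e += (result - win_prob) ** 2
    return e / len(data)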

--Jon
cetormenter
Posts: 170
Joined: Sun Oct 28, 2012 9:46 pm

Re: A database for learning evaluation functions

Post by cetormenter »

Here is a database I am currently using for the Texel method. It contains ~7.5 million positions from random runs of various dev versions of Nirvana.

http://www.mediafire.com/file/azvxcpzp4 ... Method.rar

The end result is three files, separated into wins/draws/losses from White's perspective.
brtzsnr
Posts: 433
Joined: Fri Jan 16, 2015 4:02 pm

Re: A database for learning evaluation functions

Post by brtzsnr »

Hi!

I'm traveling and thus have little time to reply inline to all, but here are my observations.

1) I already published a set of 725K test positions at https://bitbucket.org/zurichess/tuner/downloads . See quiet-labeled.epd. I use this set to train zurichess, which is ~2700 on CCRL.
2) These are very diverse quiet positions. As Matthew Lai pointed out a while ago, using positions from CCRL games didn't produce very good results for Giraffe. He played some random moves; I sampled from the millions of positions reached during many zurichess self-play games.
3) I use quiet positions to reduce the effect of tactical play. I just want to tune the evaluation without worrying about the search.
4) It's important that the result attached to each position (1-0, 1/2-1/2, 0-1) comes from equal-strength engines; otherwise you can't know for sure whether it's a winning or a losing position. I used Stockfish 7 playing 40/5+0.05 games to label each position.
5) Using 725K positions instead of 200K only gives about 3-6 Elo extra. Going to millions of positions won't help much; it will only slow the tuner needlessly.
6) I'm using Adam to tune my engine's weights, and recently Arasan adopted it too: http://www.arasanchess.org/blog.shtml . Adam finds a pretty accurate minimum, so I'm reasonably confident that this set is better than what I had before.
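
For reference, the Adam update itself is tiny (a numpy sketch with the standard default hyperparameters; not the actual zurichess tuner):

Code: Select all

import numpy as np

def adam_step(w, grad, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update of the weight vector w given the gradient grad.
    m and v are running first/second moment estimates, t is the
    1-based step count. Returns the updated (w, m, v)."""
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad * grad
    m_hat = m / (1 - b1 ** t)  # bias-corrected moments
    v_hat = v / (1 - b2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v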

Hope this helps,
Robert Pope
Posts: 558
Joined: Sat Mar 25, 2006 8:27 pm

Re: A database for learning evaluation functions

Post by Robert Pope »

brtzsnr wrote:Hi!

I'm traveling and thus have little time to reply inline to all, but here are my observations.

1) I already published a set of 725K test positions at https://bitbucket.org/zurichess/tuner/downloads . See quiet-labeled.epd. I use this set to train zurichess, which is ~2700 on CCRL.
2) These are very diverse quiet positions. As Matthew Lai pointed out a while ago, using positions from CCRL games didn't produce very good results for Giraffe. He played some random moves; I sampled from the millions of positions reached during many zurichess self-play games.
3) I use quiet positions to reduce the effect of tactical play. I just want to tune the evaluation without worrying about the search.
4) It's important that the result attached to each position (1-0, 1/2-1/2, 0-1) comes from equal-strength engines; otherwise you can't know for sure whether it's a winning or a losing position. I used Stockfish 7 playing 40/5+0.05 games to label each position.
5) Using 725K positions instead of 200K only gives about 3-6 Elo extra. Going to millions of positions won't help much; it will only slow the tuner needlessly.
6) I'm using Adam to tune my engine's weights, and recently Arasan adopted it too: http://www.arasanchess.org/blog.shtml . Adam finds a pretty accurate minimum, so I'm reasonably confident that this set is better than what I had before.

Hope this helps,
After having bad luck with TD-leaf, I went back to Texel tuning and tuned Abbess' weights using the quiet-labeled file above. After 8 hours, I ended up with a set of weights that scored 75% (+187 Elo) relative to the untuned version. Not bad!

Next step: look for bizarre values that need to be smoothed out.
brtzsnr
Posts: 433
Joined: Fri Jan 16, 2015 4:02 pm

Re: A database for learning evaluation functions

Post by brtzsnr »

Hi, Robert!

Glad to hear that I could help. +187 Elo is a lot, and I look forward to the next release of your chess engine!

I have a few questions:
1) Did you use any regularization of the weights?
2) What are the computed piece values (pawn ... queen)?
3) Which algorithm did you use for optimizing? I use Adam, but I was looking into whether another optimizer can improve the minimum.
4) What is the value of your loss function? That is, the value of E in https://chessprogramming.wikispaces.com ... ing+Method
5) Did you try the tanh activation function?
Robert Pope
Posts: 558
Joined: Sat Mar 25, 2006 8:27 pm

Re: A database for learning evaluation functions

Post by Robert Pope »

brtzsnr wrote:Hi, Robert!

Glad to hear that I could help. +187 Elo is a lot and I look forward to the next release of your chess engine!

I have a few questions:
1) Did you use any regularization of the weights?
2) What are the computed piece values (pawn ... queen)?
3) Which algorithm did you use for optimizing? I use Adam, but I was looking into whether another optimizer can improve the minimum.
4) What is the value of your loss function? That is, the value of E in https://chessprogramming.wikispaces.com ... ing+Method
5) Did you try the tanh activation function?
I should know this, but what do you mean by weight regularization?

My optimization method is super-dumb: pick a term, and increment/decrement it by 1/2 centipawn until I hit a local minimum or a 50 cp total change in the weight. Then go to the next term.
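
In rough Python, the loop is just this (omitting the 50 cp cap):

Code: Select all

def tune(weights, error, step=0.5):
    """Coordinate-wise local search: nudge one term at a time and keep
    any change that lowers the data-set error, looping until a full
    pass makes no improvement. `error(weights)` is n*E as below."""
    best = error(weights)
    improved = True
    while improved:
        improved = False
        for i in range(len(weights)):
            for delta in (step, -step):
                weights[i] += delta
                e = error(weights)
                if e < best:
                    best = e
                    improved = True
                else:
                    weights[i] -= delta  # revert
    return weights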

I only look at n*E, which is about 45,000, down from 49,000 at the start, IIRC. I don't know whether that measure is comparable across programs, though.

I only used the logistic function.

Maybe now that my eval terms are a bit more reasonable, some things like LMR will be more workable.

Here are my piece weights. I haven't had a chance to see how my piece/square tables skew them.

Code: Select all

material    midgame  endgame
1 pawn          100      100
2 pawns         255      201
3 pawns         338      299
4 pawns         423      394
5 pawns         510      491
6 pawns         590      594
7 pawns         681      682
8 pawns         775      760
1 knight        317      280
2 knights       666      542
1 bishop        326      294
2 bishops       701      623
1 rook          466      517
2 rooks         924     1031
queen           967      948
Perhaps the most interesting thing to come out so far is that my mobility bonus switched from positive to negative. I'll have to test removing it entirely.

Another idea I had was to change the weighting of each EPD line: the earlier in the game a position occurs, the more likely the result was influenced by a later blunder from one side, so early positions should be given a lower weight in the optimization process.
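
Something like this, scaling each position's squared error by a ply-dependent weight (the linear ramp and the 40-ply cutoff are made-up numbers, just to illustrate):

Code: Select all

def position_weight(ply, full_weight_ply=40):
    """Down-weight positions from early in the game, since their
    results are more likely to have been decided by a later blunder."""
    return min(1.0, ply / full_weight_ply)

def weighted_error(data, K=1.13):
    """Texel-style error with per-position weights. `data` holds
    (score_cp, result, ply) triples, result from White's perspective."""
    num = den = 0.0
    for score_cp, result, ply in data:
        w = position_weight(ply)
        p = 1.0 / (1.0 + 10.0 ** (-K * score_cp / 400.0))
        num += w * (result - p) ** 2
        den += w
    return num / den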