Evaluation & Tuning in Chess Engines

Discussion of chess software programming and technical issues.

Moderators: hgm, Dann Corbit, Harvey Williamson

Kiudee
Posts: 27
Joined: Tue Feb 02, 2010 9:12 pm

Re: Evaluation & Tuning in Chess Engines

Post by Kiudee » Tue Aug 25, 2020 10:08 am

If you are interested in a global optimizer specifically adapted to the application of chess, I recently refactored the tuning code we use for Lc0 into its own tuning library:
https://github.com/kiudee/chess-tuning-tools

Since I want to make it as easy to use as possible, I also started work on the documentation here:
https://chess-tuning-tools.readthedocs.io/en/latest/

Let me know if anything is unclear there and could be improved.

jdart
Posts: 4096
Joined: Fri Mar 10, 2006 4:23 am
Location: http://www.arasanchess.org

Re: Evaluation & Tuning in Chess Engines

Post by jdart » Thu Aug 27, 2020 1:25 am

A few notes on Andy's paper:

1. One reason to use full batches is that the step computation can be parallelized by having each thread visit a subset of the positions. It is harder to parallelize stochastic (mini-batch) gradient descent.

2. I use real numbers (doubles) when tuning and then convert to integer values (used in the non-tuning runtime eval) when outputting the final tuning result.

3. I have found it very useful for debugging to calculate the gradient for each term by finite differences and compare to the gradient calculated by the tuner. This has found many errors for me.
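Point 3 can be sketched as follows. The quadratic texel-style loss and the linear feature representation here are illustrative assumptions for the sake of a runnable example, not Arasan's actual code:

```python
import math

def loss(params, positions):
    """Toy texel-style loss: mean squared error between the game result
    (1/0.5/0) and a sigmoid of a linear evaluation."""
    k = 1.0 / 400.0
    total = 0.0
    for features, result in positions:
        score = sum(p * f for p, f in zip(params, features))
        total += (result - 1.0 / (1.0 + math.exp(-k * score))) ** 2
    return total / len(positions)

def analytic_gradient(params, positions):
    """Hand-derived gradient of the loss above (what the tuner computes)."""
    k = 1.0 / 400.0
    grad = [0.0] * len(params)
    for features, result in positions:
        score = sum(p * f for p, f in zip(params, features))
        s = 1.0 / (1.0 + math.exp(-k * score))
        common = -2.0 * (result - s) * s * (1.0 - s) * k
        for i, f in enumerate(features):
            grad[i] += common * f
    return [g / len(positions) for g in grad]

def numeric_gradient(params, positions, h=1e-6):
    """Central finite differences, term by term, as a debugging cross-check."""
    grad = []
    for i in range(len(params)):
        up = params[:]; up[i] += h
        dn = params[:]; dn[i] -= h
        grad.append((loss(up, positions) - loss(dn, positions)) / (2 * h))
    return grad
```

If the two gradients disagree by more than a small tolerance for some term, that term's analytic derivative (or its feature extraction) is the place to look for the bug.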

--Jon

AndrewGrant
Posts: 1023
Joined: Tue Apr 19, 2016 4:08 am
Location: U.S.A
Full name: Andrew Grant
Contact:

Re: Evaluation & Tuning in Chess Engines

Post by AndrewGrant » Thu Aug 27, 2020 3:34 am

jdart wrote:
Thu Aug 27, 2020 1:25 am
A few notes on Andy's paper:

1. One reason to use full batches is that the step computation can be parallelized by having each thread visit a subset of the positions. It is harder to parallelize stochastic (mini-batch) gradient descent.

2. I use real numbers (doubles) when tuning and then convert to integer values (used in the non-tuning runtime eval) when outputting the final tuning result.

3. I have found it very useful for debugging to calculate the gradient for each term by finite differences and compare to the gradient calculated by the tuner. This has found many errors for me.

--Jon
1. That is the main motivation. I have the option to do mini-batches, but the results tend to be worse for some reason. I think mini-batches are more prone to finding rare data and exploiting it, for the worse.

2. Likewise I use doubles for the entire tuning session, but I output rounded parameters. From time to time I consider increasing the eval "grain" so as not to lose so much precision to rounding, but I feel it's arrogant to think I can tune something to within 0.5 centipawns.
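The round-at-the-end step might look like the following. The `grain` handling is my own assumption about how a finer eval unit would be wired in, not Ethereal's actual code:

```python
def finalize_params(tuned_doubles, grain=1):
    """Convert the doubles used during tuning into the integers used by
    the non-tuning runtime eval.  With grain=1 the eval works in whole
    centipawns; grain=2 would keep half-centipawn resolution by scaling
    the internal eval unit up by two."""
    return [round(p * grain) for p in tuned_doubles]
```

For example, `finalize_params([1.26, -3.74])` yields `[1, -4]`, while `grain=2` would preserve the half-centipawn information as `[3, -7]`.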

AndrewGrant
Posts: 1023
Joined: Tue Apr 19, 2016 4:08 am
Location: U.S.A
Full name: Andrew Grant
Contact:

Re: Evaluation & Tuning in Chess Engines

Post by AndrewGrant » Thu Aug 27, 2020 7:48 am

Another +7 or so elo patch committed today based on these ideas. I added an additional ~10M positions to the dataset. This time they are taken as random samples from a batch of 1M self-play games of the latest Ethereal at the time (12.41), then resolved by playing out a deep principal variation (depth 12 here). Merging these two datasets proved worthy.

Interestingly, in this commit I only tuned the NORMAL terms, none of the SAFETY or COMPLEXITY terms. Trying to tune everything at once was not doing much to swing the cost function in a meaningful way. I think this is because the interdependence between term types is a hindrance, not an asset, in tuning. For example, COMPLEXITY is expressly aimed at reducing the impact of the NORMAL terms. Those competing interests appear to get in the way, adding enough noise that, on top of the typical GD noise, the descent is fruitless.

jdart
Posts: 4096
Joined: Fri Mar 10, 2006 4:23 am
Location: http://www.arasanchess.org

Re: Evaluation & Tuning in Chess Engines

Post by jdart » Fri Aug 28, 2020 4:54 pm

What was your time control for those 1 million games?

I have used a training set produced by sampling positions from games (including datasets from FICS) and then playing them out with Stockfish at a fairly fast time control. I also did some pruning of the positions to remove ones with very large imbalances/very large scores, and positions like KP vs K, for which I have embedded bitbases.
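A toy version of that pruning pass, with illustrative thresholds and material signatures (not Arasan's actual values):

```python
def keep_for_tuning(score_cp, material_signature, max_abs_score=600):
    """Drop positions that are unhelpful for tuning: wildly unbalanced
    ones (very large scores), and trivial endgames like KP vs K that are
    already covered by embedded bitbases."""
    if abs(score_cp) > max_abs_score:
        return False
    if material_signature in ("KPvK", "KvKP"):  # bitbase territory
        return False
    return True
```

Filtering like this keeps the tuner focused on positions where the hand-crafted eval terms, rather than tablebase knowledge or an already-decided result, determine the outcome.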

AndrewGrant
Posts: 1023
Joined: Tue Apr 19, 2016 4:08 am
Location: U.S.A
Full name: Andrew Grant
Contact:

Re: Evaluation & Tuning in Chess Engines

Post by AndrewGrant » Fri Sep 04, 2020 1:10 am

As a testament to the data creation process, which was as follows:
1) Generate as many games as possible, using a mix of 1s+.01s and 2s+.02s time controls with heavy adjudication
2) Select 10 positions at random from each of those games, and perform depth 12 searches on them.
3) Apply all of the moves in the PV of the depth 12 search to the position
4) Save the results
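The four steps above can be sketched as follows. Here `search` and `push` are hypothetical stand-ins for an engine's search and make-move routines, passed in as parameters rather than taken from Ethereal's actual code:

```python
import random

def generate_tuning_positions(games, search, push, n_samples=10, depth=12):
    """Sketch of the four-step recipe: sample positions from finished
    games, resolve each by walking the PV of a fixed-depth search, and
    keep the end-of-PV position paired with the original game's result."""
    dataset = []
    for positions, result in games:          # each game: (positions, WDL result)
        if len(positions) < n_samples:       # toss out games that are too short
            continue
        for pos in random.sample(positions, n_samples):
            pv = search(pos, depth)          # step 2: depth-12 search, keep the PV
            for move in pv:                  # step 3: apply every PV move
                pos = push(pos, move)
            dataset.append((pos, result))    # step 4: label with the game result
    return dataset
```

Note that the label attached to the resolved position is still the result of the original short game, which is exactly the "big flaw" discussed later in the thread.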

Ethereal, Weiss, and now FabChess have found large elo gains tuning with these datasets. Weiss was able to implement the tuner following the paper and found +50 elo at 60s+.6s with the initial tune. I believe Weiss was already tuning using an older dataset of mine, with his own Python framework. And I believe FabChess has been tuning for a long time, although I don't know with what data or methodology, aside from the addition of AdaGrad.

I also tuned Ethereal using an FRC dataset, which returned +25 elo in 60s+.6s testing under FRC, and also passed [-3, 1] SPRT testing for standard chess. That dataset was ~15% the size of the others, suggesting strongly that FRC data is more diverse and can make up for a lower sample size as a result.

Terje
Posts: 300
Joined: Tue Nov 19, 2019 3:34 am
Location: https://github.com/TerjeKir/weiss
Full name: Terje Kirstihagen

Re: Evaluation & Tuning in Chess Engines

Post by Terje » Fri Sep 04, 2020 2:37 am

For Weiss the small FRC set was slightly worse on its own, but adding it to the previous dataset gave a small gainer :)

fabianVDW
Posts: 146
Joined: Fri Mar 15, 2019 7:46 pm
Location: Germany
Full name: Fabian von der Warth

Re: Evaluation & Tuning in Chess Engines

Post by fabianVDW » Fri Sep 04, 2020 10:44 am

AndrewGrant wrote:
Fri Sep 04, 2020 1:10 am
Ethereal, Weiss, and now FabChess have found large elo gains tuning with these datasets. Weiss was able to implement the tuner following the paper and found +50 elo at 60s+.6s with the initial tune. I believe Weiss was already tuning using an older dataset of mine, with his own Python framework. And I believe FabChess has been tuning for a long time, although I don't know with what data or methodology, aside from the addition of AdaGrad.
FabChess has been using standard stochastic gradient descent for a long time now. I believe both the adoption of AdaGrad and the use of the higher-quality data made the new elo gainer successful.
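For reference, the AdaGrad update amounts to the following. This is the textbook form of the rule, not FabChess's actual (Rust) implementation:

```python
import math

def adagrad_step(params, grads, cache, lr=1.0, eps=1e-8):
    """One AdaGrad step: each parameter gets its own effective learning
    rate, which shrinks as squared gradients accumulate in `cache`.
    Rarely-updated terms therefore keep taking larger steps than
    frequently-updated ones."""
    for i, g in enumerate(grads):
        cache[i] += g * g
        params[i] -= lr * g / (math.sqrt(cache[i]) + eps)
    return params
```

The per-parameter scaling is what makes it attractive for eval tuning, where some terms (e.g. rare material configurations) see gradients far less often than others.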
Author of FabChess: https://github.com/fabianvdW/FabChess
A UCI compliant chess engine written in Rust.
FabChessWiki: https://github.com/fabianvdW/FabChess/wiki
fabianvonderwarth@gmail.com

RubiChess
Posts: 275
Joined: Fri Mar 30, 2018 5:20 am
Full name: Andreas Matthies

Re: Evaluation & Tuning in Chess Engines

Post by RubiChess » Wed Sep 09, 2020 11:28 am

AndrewGrant wrote:
Fri Sep 04, 2020 1:10 am
As a testament to the data creation process, which was as follows:
1) Generate as many games as possible, using a mix of 1s+.01s and 2s+.02s time controls with heavy adjudication
2) Select 10 positions at random from each of those games, and perform depth 12 searches on them.
3) Apply all of the moves in the PV of the depth 12 search to the position
4) Save the results

Ethereal, Weiss, and now FabChess have found large elo gains tuning with these datasets. Weiss was able to implement the tuner following the paper and found +50 elo at 60s+.6s with the initial tune. I believe Weiss was already tuning using an older dataset of mine, with his own Python framework. And I believe FabChess has been tuning for a long time, although I don't know with what data or methodology, aside from the addition of AdaGrad.

I also tuned Ethereal using an FRC dataset, which returned +25 elo in 60s+.6s testing under FRC, and also passed [-3, 1] SPRT testing for standard chess. That dataset was ~15% the size of the others, suggesting strongly that FRC data is more diverse and can make up for a lower sample size as a result.
Hi Andrew.

I don't understand steps 3 and 4 in your "generate tuning data" howto.
You have a random position from some game you played before and a PV with 12 moves following this position. And then?
What does "apply all of the moves in the depth 12 search" exactly mean? Go to the position 12 plies later? And then?
Or "for pvmoves = 1 to 12: apply move and do something (what?)"
And where do the results come from? We usually need a WDL score for each position, don't we? Where does it come from? Probably not from the original game, as that would make the depth 12 search useless?

AndrewGrant
Posts: 1023
Joined: Tue Apr 19, 2016 4:08 am
Location: U.S.A
Full name: Andrew Grant
Contact:

Re: Evaluation & Tuning in Chess Engines

Post by AndrewGrant » Wed Sep 09, 2020 11:39 am

RubiChess wrote:
Wed Sep 09, 2020 11:28 am
AndrewGrant wrote:
Fri Sep 04, 2020 1:10 am
As a testament to the data creation process, which was as follows:
1) Generate as many games as possible, using a mix of 1s+.01s and 2s+.02s time controls with heavy adjudication
2) Select 10 positions at random from each of those games, and perform depth 12 searches on them.
3) Apply all of the moves in the PV of the depth 12 search to the position
4) Save the results
Hi Andrew.

I don't understand steps 3 and 4 in your "generate tuning data" howto.
You have a random position from some game you played before and a PV with 12 moves following this position. And then?
What does "apply all of the moves in the depth 12 search" exactly mean? Go to the position 12 plies later? And then?
Or "for pvmoves = 1 to 12: apply move and do something (what?)"
And where do the results come from? We usually need a WDL score for each position, don't we? Where does it come from? Probably not from the original game, as that would make the depth 12 search useless?

Here is an exchange I had with Alayan the other day that paints a better picture.

Code: Select all

[8:50 AM] Andrews: Let me just lay out the whole process:
[8:50 AM] Alayan: Do it
[8:50 AM] Andrews: 1) I play a set of 1 million games of Ethereal vs Ethereal at 2s+.02s.
[8:50 AM] Andrews: 2) I parse those PGNs, toss out any games with fewer than 10 moves
[8:50 AM] Andrews: 3) From each PGN, I randomly sample 10 positions from the game.
[8:51 AM] Andrews: 4) Now I have 10 million positions. I perform a depth 12 search on all of them, and save the principal variation.
[8:51 AM] Andrews: 5) Now for each (position, principal variation), I take the position and apply each move in the PV to it.
[8:51 AM] Andrews: 6) I save that final position (i.e., the last position of the PV). Those are the lines listed in the final dataset.
[8:52 AM] Andrews: Note that my PV includes qsearch, so "tactical" resolutions are somewhat inherent. I demonstrated this by tuning with and without resolving the positions with an additional qsearch, and saw no difference.
[8:53 AM] Andrews: There is one big flaw: If position A has a known result of R, who is to say that A + 12 or more moves STILL should have the result of R.
[8:53 AM] Alayan: So, the final position that comes from the d12 PV is rated depending on the 2s+0.02s game result from which the position at the start of PV was extracted from
[8:53 AM] Andrews: Precisely.
[8:53 AM] Andrews: I can rationalize how I ended up with this (seemingly/somewhat) flawed system, but that's for another day.
[8:53 AM] Alayan: Yeah, it's because of this big flaw that I didn't understand correctly when you explained it to me the first time; it simply didn't occur to me that this could actually work
[8:54 AM] Andrews: TL;DR: Its important that the positions in the dataset are ones that Ethereal would reach if allowed to make a bunch of moves.
[8:54 AM] Andrews: Everything else is up for debate.
[8:54 AM] Alayan: Playing ultra-short games is good for reaching a lot of uncommon positions. The d12 PV is also important, for reaching positions that are uncommon in games but relevant in the search tree.
[8:55 AM] Alayan: That's the flaw of "high-quality games" datasets, if you only get positions that end up being played in good quality games, you miss out on the mass of positions that will never get played but that need to be evaluated well to actually not make mistakes
[8:55 AM] Andrews: If I had all the compute in the world, I would go back and take my ~32 million positions and play fresh games using 10s+.1s on them, and update the results for each entry accordingly. This is what I STARTED doing with our old datasets, before throwing in the towel, being a man, and just doing the math.
[8:56 AM] Andrews: Yeah, so we have "diverse" positions, but not necessarily "highly accurate" results for them.
