Evaluation & Tuning in Chess Engines

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

jdart
Posts: 4367
Joined: Fri Mar 10, 2006 5:23 am
Location: http://www.arasanchess.org

Re: Evaluation & Tuning in Chess Engines

Post by jdart »

towforce wrote: Tue Aug 25, 2020 12:10 am I apologise if this comes across as critical, but I'd like to share a few quick thoughts on chess tuning.

* the constants are not going to be constant between position types (my remedy would be a polynomial relationship which would allow powers of the position attributes to multiply with each other. This would give rise to the problem of an infinite number of possible solutions, to which my remedy would be to define some rules for simplicity and then optimise for the simplest polynomial fit)
There are many effective algorithms. Most are approximate and are not guaranteed to find a global minimum. However, linear expressions are well-behaved and so the problem is more tractable than would otherwise be the case.

Interacting terms, non-linear terms etc. can be handled but with more complexity.
* most methods of optimising the eval expression are going to find local maxima rather than a global maximum. My remedy would be to use one of these big linear number crunchers like the type of software used to test supercomputer flops in "real world software" to find the global maximum.
In general, the only way to find a global optimum is exhaustive search (examining all possible points, at least to some granularity). There are global solvers (mostly proprietary and expensive) that use heuristic algorithms to help, but you really only need something like that if you are fairly sure your problem is badly behaved, with multiple local minima. Gradient descent is generally good enough for most machine learning applications.
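To make the gradient-descent route concrete, here is a minimal sketch of tuning a linear evaluation against game results (a Texel-style logistic fit). It is not any particular engine's tuner; the feature extraction, data loading, learning rate, and scaling constant are assumptions.

Code:

import numpy as np

K = 1.0 / 400.0  # centipawns-to-expected-score scaling; illustrative value

def sigmoid(score):
    # Map an eval score to an expected game result in [0, 1].
    return 1.0 / (1.0 + np.exp(-K * score))

def tune_linear_eval(features, results, lr=1.0, epochs=500):
    """Plain gradient descent on a linear eval: score = features @ weights.

    features: (N, M) array of per-position feature counts (material, PSTs, ...)
    results:  (N,) array of game outcomes from White's view (1.0, 0.5, 0.0)
    lr:       learning rate; needs tuning in practice
    """
    weights = np.zeros(features.shape[1])
    for _ in range(epochs):
        pred = sigmoid(features @ weights)
        error = pred - results
        # Gradient of the mean squared error with respect to the weights.
        grad = 2.0 * features.T @ (error * K * pred * (1.0 - pred)) / len(results)
        weights -= lr * grad
    return weights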
Kiudee
Posts: 29
Joined: Tue Feb 02, 2010 10:12 pm
Location: Germany
Full name: Karlson Pfannschmidt

Re: Evaluation & Tuning in Chess Engines

Post by Kiudee »

If you are interested in a global optimizer specifically adapted to the application of chess, I recently refactored the tuning code we use for Lc0 into its own tuning library:
https://github.com/kiudee/chess-tuning-tools

Since I want to make it as easy to use as possible, I also started work on the documentation here:
https://chess-tuning-tools.readthedocs.io/en/latest/

Let me know if anything is unclear there and could be improved.
jdart
Posts: 4367
Joined: Fri Mar 10, 2006 5:23 am
Location: http://www.arasanchess.org

Re: Evaluation & Tuning in Chess Engines

Post by jdart »

A few notes on Andy's paper:

1. One reason to use full batches is that the step computation can be parallelized, by having each thread visit a subset of the positions. It is harder to parallelize stochastic (mini-batch) gradient descent.

2. I use real numbers (doubles) when tuning and then convert to integer values (used in the non-tuning runtime eval) when outputting the final tuning result.

3. I have found it very useful for debugging to calculate the gradient for each term by finite differences and compare to the gradient calculated by the tuner. This has found many errors for me.
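A sketch of that finite-difference cross-check; the loss and analytic_gradient callables here are hypothetical stand-ins for the tuner's own functions.

Code:

def check_gradients(loss, analytic_gradient, params, eps=1e-5, tol=1e-4):
    """Compare the tuner's analytic gradient against central finite differences."""
    analytic = analytic_gradient(params)
    for i in range(len(params)):
        up, down = params.copy(), params.copy()
        up[i] += eps
        down[i] -= eps
        numeric = (loss(up) - loss(down)) / (2.0 * eps)
        # Flag terms whose analytic derivative disagrees with the numeric estimate.
        if abs(numeric - analytic[i]) > tol * max(1.0, abs(numeric)):
            print(f"term {i}: analytic {analytic[i]:.6g} vs numeric {numeric:.6g}")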

--Jon
AndrewGrant
Posts: 1756
Joined: Tue Apr 19, 2016 6:08 am
Location: U.S.A
Full name: Andrew Grant

Re: Evaluation & Tuning in Chess Engines

Post by AndrewGrant »

jdart wrote: Thu Aug 27, 2020 3:25 am A few notes on Andy's paper:

1. One reason to use full batches is that the step computation can be parallelized, by having each thread visit a subset of the positions. It is harder to parallelize stochastic (mini-batch) gradient descent.

2. I use real numbers (doubles) when tuning and then convert to integer values (used in the non-tuning runtime eval) when outputting the final tuning result.

3. I have found it very useful for debugging to calculate the gradient for each term by finite differences and compare to the gradient calculated by the tuner. This has found many errors for me.

--Jon
1. That is the main motivation. I have the option to do mini-batches, but the results tend to be worse for some reason. I think mini-batches are better at finding rare data and exploiting it, to the detriment of the overall fit.

2. Likewise, I use doubles for the entire tuning session, but I output rounded parameters. From time to time I consider increasing the eval "grain" so that less precision is lost to rounding, but I feel it's arrogant to think I can tune something to within 0.5 centipawns.
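A toy illustration of that conversion (a hypothetical helper, not Ethereal's actual code): the tuned doubles get rounded to the integer units the runtime eval uses, and a finer grain just means more units per centipawn.

Code:

def to_runtime_units(weight_cp, units_per_cp=1):
    """Round a tuned floating-point weight (in centipawns) to runtime integer units.

    With units_per_cp=1 the worst-case rounding error is 0.5 cp; doubling the
    grain to units_per_cp=2 halves that to 0.25 cp.
    """
    return int(round(weight_cp * units_per_cp))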
#WeAreAllDraude #JusticeForDraude #RememberDraude #LeptirBigUltra
"Those who can't do, clone instead" - Eduard ( A real life friend, not this forum's Eduard )
AndrewGrant
Posts: 1756
Joined: Tue Apr 19, 2016 6:08 am
Location: U.S.A
Full name: Andrew Grant

Re: Evaluation & Tuning in Chess Engines

Post by AndrewGrant »

Another +7 or so elo patch committed today based on these ideas. I added an additional ~10M positions to the dataset. This time they were sampled at random from a batch of 1M self-play games of the latest Ethereal at the time (12.41), then resolved by playing out a deep principal variation (depth 12 here). Merging these two datasets proved worthwhile.

Interestingly, in this commit I only tuned the NORMAL terms, none of the SAFETY or COMPLEXITY terms. Trying to tune everything at once was not doing much to swing the cost function in a meaningful way. I think this is because the interdependence between term types is a hindrance, not an asset, in tuning. For example, COMPLEXITY is expressly aimed at reducing the impact of the NORMAL terms. Those competing interests appear to get in the way: they add enough noise that, on top of the typical GD noise, the descent is fruitless.
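One way to picture that coupling, purely schematic and not Ethereal's actual formula: if a COMPLEXITY-style factor scales the NORMAL score toward zero, every NORMAL gradient carries that factor and vice versa, so the two groups of weights pull against each other during tuning.

Code:

import numpy as np

def eval_with_complexity(normal_terms, normal_weights, complexity_terms, complexity_weights):
    """Schematic only: a complexity factor in (0, 1] shrinks the normal score.

    The product means the loss gradient for each NORMAL weight is multiplied by
    the complexity factor (and each COMPLEXITY weight by the normal score), so
    the two term groups interact rather than tune independently.
    """
    normal_score = normal_terms @ normal_weights
    scale = 1.0 / (1.0 + np.exp(-(complexity_terms @ complexity_weights)))
    return normal_score * scale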
#WeAreAllDraude #JusticeForDraude #RememberDraude #LeptirBigUltra
"Those who can't do, clone instead" - Eduard ( A real life friend, not this forum's Eduard )
jdart
Posts: 4367
Joined: Fri Mar 10, 2006 5:23 am
Location: http://www.arasanchess.org

Re: Evaluation & Tuning in Chess Engines

Post by jdart »

What was your time control for those 1 million games?

I have used a training set produced by sampling positions from games (including datasets from FICS) and then playing them out with Stockfish at a fairly fast time control. I also did some pruning of the positions to remove ones with very large imbalances/very large scores, and positions like KP vs K, for which I have embedded bitbases.
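A sketch of that kind of filtering with python-chess; the score cutoff and the trivial-endgame test are illustrative choices, not Arasan's actual rules.

Code:

import chess

def keep_position(fen, score_cp, max_abs_score=600):
    """Return False for positions that add noise rather than signal to the tuning set."""
    # Very large scores / material imbalances: the game is effectively decided.
    if abs(score_cp) > max_abs_score:
        return False
    board = chess.Board(fen)
    non_kings = [p for p in board.piece_map().values() if p.piece_type != chess.KING]
    # Trivial endgames such as KP vs K, which a bitbase already scores exactly.
    if len(non_kings) <= 1:
        return False
    return True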
AndrewGrant
Posts: 1756
Joined: Tue Apr 19, 2016 6:08 am
Location: U.S.A
Full name: Andrew Grant

Re: Evaluation & Tuning in Chess Engines

Post by AndrewGrant »

As a testament to the data creation, which was as follows:
1) Generate as many games as possible, using a mix of 1s+0.01s and 2s+0.02s time controls, with heavy adjudication
2) Select at random 10 positions from each of those games, and perform depth 12 searches on them.
3) Apply all of the moves in the PV of the depth 12 search to the position
4) Save the results
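A rough sketch of steps 2-4 with python-chess and a UCI engine; the engine path, file handling, and helper names are assumptions, and step 1's game generation is assumed to happen elsewhere (e.g. with cutechess-cli).

Code:

import random
import chess
import chess.engine
import chess.pgn

ENGINE_PATH = "./engine"  # assumed path to any UCI engine

def build_tuning_data(pgn_path, out_path, per_game=10, depth=12):
    engine = chess.engine.SimpleEngine.popen_uci(ENGINE_PATH)
    with open(pgn_path) as pgn, open(out_path, "w") as out:
        while (game := chess.pgn.read_game(pgn)) is not None:
            result = game.headers.get("Result", "*")  # game outcome used as the label
            boards = [node.board() for node in game.mainline()]
            # Step 2: pick random positions and search them to a fixed depth.
            for board in random.sample(boards, min(per_game, len(boards))):
                info = engine.analyse(board, chess.engine.Limit(depth=depth))
                # Step 3: play out the PV so the saved position is "resolved".
                resolved = board.copy()
                for move in info.get("pv", []):
                    resolved.push(move)
                # Step 4: save the resolved FEN together with the game result.
                out.write(f"{resolved.fen()} {result}\n")
    engine.quit()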

Ethereal, Weiss, and now Fabchess have found large elo gains tuning using these datasets. Weiss was able to implement the tuner a la the paper and found +50 elo in 60s+.6s with the initial tune. I believe Weiss was already tuning using an older dataset of mine, and with his own python framework. And I believe Fabchess has been tuning for a long time, although I don't know with what or with what methodology, aside from the addition of Adagrad.

I also tuned Ethereal using an FRC dataset, which returned +25 elo in 60s+.6s testing under FRC, and also passed [-3, 1] SPRT testing for standard chess. That dataset was ~15% the size of the others, strongly suggesting that FRC data is more diverse and can make up for a lower sample size as a result.
#WeAreAllDraude #JusticeForDraude #RememberDraude #LeptirBigUltra
"Those who can't do, clone instead" - Eduard ( A real life friend, not this forum's Eduard )
Terje
Posts: 347
Joined: Tue Nov 19, 2019 4:34 am
Location: https://github.com/TerjeKir/weiss
Full name: Terje Kirstihagen

Re: Evaluation & Tuning in Chess Engines

Post by Terje »

For Weiss the small FRC set was slightly worse on its own, but adding it to the previous dataset gave a small gainer :)
fabianVDW
Posts: 146
Joined: Fri Mar 15, 2019 8:46 pm
Location: Germany
Full name: Fabian von der Warth

Re: Evaluation & Tuning in Chess Engines

Post by fabianVDW »

AndrewGrant wrote: Fri Sep 04, 2020 3:10 am Ethereal, Weiss, and now Fabchess have found large elo gains tuning using these datasets. Weiss was able to implement the tuner a la the paper and found +50 elo in 60s+.6s with the initial tune. I believe Weiss was already tuning using an older dataset of mine, and with his own python framework. And I believe Fabchess has been tuning for a long time, although I don't know with what or with what methodology, aside from the addition of Adagrad.
FabChess has been using standard stochastic gradient descent for a long time now. I believe both the adoption of Adagrad and the use of the higher-quality data made the new elo gainer successful.
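For reference, the Adagrad idea is just a per-parameter step size that shrinks as squared gradients accumulate; a minimal sketch, not FabChess's actual code.

Code:

import numpy as np

def adagrad_step(weights, grad, grad_sq_sum, lr=1.0, eps=1e-8):
    """One Adagrad update: parameters with consistently large gradients get
    smaller steps, while rarely-updated terms keep a larger effective rate."""
    grad_sq_sum = grad_sq_sum + grad ** 2
    weights = weights - lr * grad / (np.sqrt(grad_sq_sum) + eps)
    return weights, grad_sq_sum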
Author of FabChess: https://github.com/fabianvdW/FabChess
A UCI compliant chess engine written in Rust.
FabChessWiki: https://github.com/fabianvdW/FabChess/wiki
fabianvonderwarth@gmail.com
RubiChess
Posts: 585
Joined: Fri Mar 30, 2018 7:20 am
Full name: Andreas Matthies

Re: Evaluation & Tuning in Chess Engines

Post by RubiChess »

AndrewGrant wrote: Fri Sep 04, 2020 3:10 am As a testament to the data creation, which was as follows:
1) Generate as many games as possible, using a mix of 1s+0.01s and 2s+0.02s time controls, with heavy adjudication
2) Select at random 10 positions from each of those games, and perform depth 12 searches on them.
3) Apply all of the moves in the PV of the depth 12 search to the position
4) Save the results

Ethereal, Weiss, and now Fabchess have found large elo gains tuning using these datasets. Weiss was able to implement the tuner a la the paper and found +50 elo in 60s+.6s with the initial tune. I believe Weiss was already tuning using an older dataset of mine, and with his own python framework. And I believe Fabchess has been tuning for a long time, although I don't know with what or with what methodology, aside from the addition of Adagrad.

I also tuned Ethereal using an FRC dataset, which returned +25 elo in 60s+.6s testing under FRC, and also passed [-3, 1] SPRT testing for standard chess. That dataset was ~15% the size of the others, strongly suggesting that FRC data is more diverse and can make up for a lower sample size as a result.
Hi Andrew.

I don't understand steps 3 and 4 in your "generate tuning data howto".
You have a random position from some game you played before and a pv with 12 moves following this position. And then?
What does "apply all of the moves in the depth 12 search" exactly mean? Go to the position 12 plies later? And then??
Or "for pvmoves = 1 to 12: apply move and do something (what?)"
And where do the results come from? We usually need a WDL score for each position, don't we? Where does it come from? Probably not from the original game, as then the depth 12 search would be useless?