Experiments in generating Texel Tuning data

Discussion of chess software programming and technical issues.

Moderator: Ras

lithander
Posts: 915
Joined: Sun Dec 27, 2020 2:40 am
Location: Bremen, Germany
Full name: Thomas Jahn

Re: Experiments in generating Texel Tuning data

Post by lithander »

algerbrex wrote: Mon Jan 24, 2022 3:26 pm I first downloaded the 3.5 million high-quality human chess games from here: https://rebel13.nl/download/data.html, then used Scid to select all games played between players rated 2700 and higher, keeping only games from the last 20 or so years (2001-2020). This gave me ~21K games to select my FENs from.
Why only high quality games? The positions encountered and moves played by master players will only be a small fraction of what your engine is likely required to evaluate during search.
algerbrex wrote: Mon Jan 24, 2022 3:26 pm I'm going to be doing a series of experiments with generating and using Texel Tuning data, so for this first trial run, I went as basic as possible. I only excluded FENs:
  • that had a king in check or checkmate
  • where the qsearch score and the raw eval score differed by more than 25 centipawns
I've never done it myself (I've also used Zurichess so far, as you know^^), but I'd try not to exclude non-quiet positions and instead make them quiet myself in exactly the way my engine does it: call qsearch, get a PV, play the moves, and use the resulting quiet position to train on. Because if my engine encountered the original position, that's exactly what it would do before calling eval.
algerbrex wrote: Mon Jan 24, 2022 3:26 pm This selection criterion gave me ~1.4M positions, which I then used to tune all of Blunder's evaluation parameters from scratch. The tuning session ran for roughly 13 hours, with rather anticlimactic results:
...and that's why I never did that with MinimalChess and am waiting for (hopefully faster) Leorik instead. Yes, like you I want to get rid of the "flaw" of having tuned on external data but I don't want to waste excessive amounts of computation and time on it. ;)
algerbrex wrote: Mon Jan 24, 2022 3:26 pm But at least for now, Zurichess's dataset has still been the best dataset Blunder's used for tuning.
I would probably, personally, be okay with paying a few Elo for the "100% by my own making" tag on my engine. But certainly not 153, haha.
algerbrex wrote: Mon Jan 24, 2022 3:26 pm And he also made sure to take each position and play it out with Stockfish, to ensure the accuracy of the W/L/D scores. This clearly gave a very high-quality dataset. I may explore doing this for this next go round. But I would like to find a way to avoid having to introduce Stockfish into the equation.
I totally get your motivation to exclude Stockfish. I would even want to exclude any external source for the FENs: e.g. where you used a set of PGNs from Rebel, I plan to play my own games. But that's more of an end goal. If you suspect that the accuracy of the W/L/D labels is at fault, then use Stockfish to confirm or falsify your theory, and act accordingly. Only the final result has to be Stockfish-free (of course it doesn't even have to be, but I respect your goal); you can always use it as a tool to set up a working process.
Minimal Chess (simple, open source, C#) - Youtube & Github
Leorik (competitive, in active development, C#) - Github & Lichess
algerbrex
Posts: 608
Joined: Sun May 30, 2021 5:03 am
Location: United States
Full name: Christian Dean

Re: Experiments in generating Texel Tuning data

Post by algerbrex »

chrisw wrote: Mon Jan 24, 2022 5:01 pm There may be some purists who complain if you use SF, but if you use it for WDL game check/result only, then there is no extraction of evaluation function values. Seems perfectly legit.
Sure, I get that. I'm more curious about what I could achieve without using external knowledge. But this isn't necessarily a hill I have to die on.
chrisw wrote: Mon Jan 24, 2022 5:01 pm In any case you can only afford to do relatively low-depth playouts, with inevitable noise. Is there any good reason why SF self-play would be any better than any other engine for this purpose? SF is not known for low-depth search with high accuracy.
No particular reason, and that's a good point I didn't consider. I'm perfectly fine with using any well-known, strong, high-quality engine. I'm open to trying whatever engine will give reasonably accurate WDL results with low-depth playouts.
algerbrex
Posts: 608
Joined: Sun May 30, 2021 5:03 am
Location: United States
Full name: Christian Dean

Re: Experiments in generating Texel Tuning data

Post by algerbrex »

lithander wrote: Mon Jan 24, 2022 5:33 pm Why only high quality games? The positions encountered and moves played by master players will only be a small fraction of what your engine is likely required to evaluate during search.
Huh, good point. Not sure why the need for low-quality games slipped my mind. That's likely what's causing a lot of the issues: I'm only "teaching" Blunder examples of good patterns, with no examples of bad patterns, so it's not learning about bad squares for pieces, bad pawn structure, bad king safety, etc., at least not to a high enough degree to improve the evaluation's knowledge.

Thanks, I'm going to make sure to try to account for this in my next experiment this weekend :)
lithander wrote: Mon Jan 24, 2022 5:33 pm I've never done it myself (I've also used Zurichess so far, as you know^^), but I'd try not to exclude non-quiet positions and instead make them quiet myself in exactly the way my engine does it: call qsearch, get a PV, play the moves, and use the resulting quiet position to train on. Because if my engine encountered the original position, that's exactly what it would do before calling eval.
Right, that's on my list of methods to try. In these experiments, I wanted to try the most basic approach first, which in my mind was just excluding non-quiet positions. For the next couple of sessions, I want to take each FEN string, do a depth 2-3 search, play out the PV, and save the resulting position.
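Something along these lines is what I have in mind, as a rough Go sketch (Position, Move, Apply, and ShallowSearch are just placeholders here, not Blunder's actual types or functions):

Code: Select all

package main

// Rough sketch: resolve a training position to a quiet one by playing out
// the PV of a shallow search (or qsearch). All names below are placeholders.

type Move int

type Position struct{} // stand-in for the engine's board representation

// Apply would make a move on the board and return the resulting position.
func (p Position) Apply(m Move) Position { return p }

// ShallowSearch stands in for a depth-limited search (or a quiescence
// search) that returns a score together with its principal variation.
func ShallowSearch(p Position, depth int) (int, []Move) { return 0, nil }

// MakeQuiet plays out the PV so that the saved position is the quiet one
// the engine would actually evaluate at the end of that line.
func MakeQuiet(pos Position, depth int) Position {
	_, pv := ShallowSearch(pos, depth)
	for _, m := range pv {
		pos = pos.Apply(m)
	}
	return pos
}

func main() {}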
lithander wrote: Mon Jan 24, 2022 5:33 pm ...and that's why I never did that with MinimalChess and am waiting for (hopefully faster) Leorik instead. Yes, like you I want to get rid of the "flaw" of having tuned on external data but I don't want to waste excessive amounts of computation and time on it. ;)
Yep. Although I think what you said about not necessarily needing millions of positions to tune is also true, as long as your dataset is high-quality. For a long time I got away with using only 400K positions from Zurichess's dataset, and I only gained about ~25 Elo when I expanded that to 800K. So I think for my next session I'll only be using 800K.
lithander wrote: Mon Jan 24, 2022 5:33 pm I would probably, personally, be okay with paying a few Elo for the "100% by my own making" tag on my engine. But certainly not 153, haha.
Yep, me neither. In fact, I'd like an Elo gain, but perhaps that's too ambitious right now :lol:
lithander wrote: Mon Jan 24, 2022 5:33 pm I totally get your motivation to exclude Stockfish. I would even want to exclude any external source for the FENs: e.g. where you used a set of PGNs from Rebel, I plan to play my own games. But that's more of an end goal. If you suspect that the accuracy of the W/L/D labels is at fault, then use Stockfish to confirm or falsify your theory, and act accordingly. Only the final result has to be Stockfish-free (of course it doesn't even have to be, but I respect your goal); you can always use it as a tool to set up a working process.
True. I'm approaching all of this with a very experimental mindset, and I'm open to trying many, many different approaches to find one that balances originality and strength. So using Stockfish for WDL results will happen at some point down my line of experiments.

At the end of all of this, I'd like to write up a little "paper" documenting the whole process, so it might be helpful to future developers.
Last edited by algerbrex on Mon Jan 24, 2022 6:49 pm, edited 1 time in total.
lithander
Posts: 915
Joined: Sun Dec 27, 2020 2:40 am
Location: Bremen, Germany
Full name: Thomas Jahn

Re: Experiments in generating Texel Tuning data

Post by lithander »

algerbrex wrote: Mon Jan 24, 2022 6:42 pm True. I'm approaching all of this with a very experimental mindset, and I'm open to trying many, many different approaches to find one that balances originality and strength. So using Stockfish for WDL results will happen at some point down my line of experiments.

At the end of all of this, I'd like to write up a little "paper" documenting the whole process, so it might be helpful to future developers.
Much appreciated. I'll follow your progress reports with great interest as I imagine myself to be in a similar situation in a few months.
Minimal Chess (simple, open source, C#) - Youtube & Github
Leorik (competitive, in active development, C#) - Github & Lichess
algerbrex
Posts: 608
Joined: Sun May 30, 2021 5:03 am
Location: United States
Full name: Christian Dean

Re: Experiments in generating Texel Tuning data

Post by algerbrex »

lithander wrote: Mon Jan 24, 2022 6:47 pm Much appreciated. I'll follow your progress reports with great interest as I imagine myself to be in a similar situation in a few months.
Thanks! I'll be sure to try to keep posting consistent updates then and merge them into a paper in the end.
jp
Posts: 1480
Joined: Mon Apr 23, 2018 7:54 am

Re: Experiments in generating Texel Tuning data

Post by jp »

algerbrex wrote: Sat Oct 30, 2021 12:14 am From these positions, I loaded 400K into the tuner and let it run while observing its tweaking.

From both sets, it seems the values the parameters were being tweaked to are inferior to the current values. In particular, the mobility parameters (knight mobility, bishop mobility, rook midgame and endgame mobility, and queen midgame and endgame mobility) are driven from the current values to one or zero. This seems to indicate the tuner is attempting to make mobility irrelevant in the evaluation. This definitely seems to minimize the mean squared error, but I highly doubt making the engine blind to mobility is going to make it play better (from my current testing, with the features Blunder now has, mobility is currently worth ~50-60 Elo).
To what exactly do you refer when you write "the tuner"? Is it your own code?
algerbrex
Posts: 608
Joined: Sun May 30, 2021 5:03 am
Location: United States
Full name: Christian Dean

Re: Experiments in generating Texel Tuning data

Post by algerbrex »

jp wrote: Sat Jan 29, 2022 7:34 am
algerbrex wrote: Sat Oct 30, 2021 12:14 am From these positions, I loaded 400K into the tuner and let it run while observing its tweaking.

From both sets, it seems the values the parameters were being tweaked to are inferior to the current values. In particular, the mobility parameters (knight mobility, bishop mobility, rook midgame and endgame mobility, and queen midgame and endgame mobility) are driven from the current values to one or zero. This seems to indicate the tuner is attempting to make mobility irrelevant in the evaluation. This definitely seems to minimize the mean squared error, but I highly doubt making the engine blind to mobility is going to make it play better (from my current testing, with the features Blunder now has, mobility is currently worth ~50-60 Elo).
To what exactly do you refer when you write "the tuner"? Is it your own code?
Yup, Blunder has its own tuner: https://github.com/algerbrex/blunder/bl ... r/tuner.go
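The core of it is the usual Texel setup: each training position carries a game result (1 for a White win, 0.5 for a draw, 0 for a loss), and the tuner minimizes the mean squared error between that result and a sigmoid of the static eval. Here's a generic sketch of that objective in Go (not the actual code from tuner.go; Entry is just a placeholder type):

Code: Select all

package main

import "math"

// Entry is a placeholder for one training position: the game's result and
// the static evaluation under the current parameter values.
type Entry struct {
	Result float64 // 1.0 = White win, 0.5 = draw, 0.0 = White loss
	Eval   float64 // static eval in centipawns, from White's point of view
}

// sigmoid maps a centipawn score to an expected game result. K is the
// scaling constant, typically fitted once before tuning the weights.
func sigmoid(score, K float64) float64 {
	return 1.0 / (1.0 + math.Pow(10, -K*score/400))
}

// meanSquaredError is the quantity the tuner tries to drive down by
// nudging evaluation parameters (which changes each Entry's Eval).
func meanSquaredError(data []Entry, K float64) float64 {
	var sum float64
	for _, e := range data {
		diff := e.Result - sigmoid(e.Eval, K)
		sum += diff * diff
	}
	return sum / float64(len(data))
}

func main() {}

The classic local-search version then just loops over the parameters, adjusts each one by a small step, and keeps the change whenever this error goes down.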
jp
Posts: 1480
Joined: Mon Apr 23, 2018 7:54 am

Re: Experiments in generating Texel Tuning data

Post by jp »

algerbrex wrote: Sat Jan 29, 2022 4:38 pm Yup, Blunder has its own tuner: https://github.com/algerbrex/blunder/bl ... r/tuner.go
Thanks. It'd be useful if there were some semi-universal tuner that one could use to experiment with generic engines, but I suppose that would need some sort of standard format for (the output of) eval functions.

(I see that RuyTune might be something along those lines.)
algerbrex
Posts: 608
Joined: Sun May 30, 2021 5:03 am
Location: United States
Full name: Christian Dean

Re: Experiments in generating Texel Tuning data

Post by algerbrex »

jp wrote: Fri Feb 04, 2022 9:50 am
algerbrex wrote: Sat Jan 29, 2022 4:38 pm Yup, Blunder has its own tuner: https://github.com/algerbrex/blunder/bl ... r/tuner.go
Thanks. It'd be useful if there were some semi-universal tuner that one could use to experiment with generic engines, but I suppose that would need some sort of standard format for (the output of) eval functions.

(I see that RuyTune might be something along those lines.)
Agreed. I wanted something similar when I started experimenting with Texel Tuning in Blunder. Of course, it wasn't very difficult to hand-roll, but it'd still be convenient.

I haven't checked out RuyTune much, but I remember the name. Maybe that'll be something to look into as well.
algerbrex
Posts: 608
Joined: Sun May 30, 2021 5:03 am
Location: United States
Full name: Christian Dean

Re: Experiments in generating Texel Tuning data

Post by algerbrex »

After releasing Blunder 8.0.0, I realized I had a lot of games saved from testing over the past month, so I decided to do a quick experiment revisiting my old FEN-generation code from Blunder 7.6.0 to see if I could extend the Zurichess quiet dataset and gain some strength.

I started by doing a little research, checking the conditions used in other FEN generators to remind myself what sort of positions are undesirable to include in tuning sessions (e.g. positions very deep into a game (ply > 200), book positions (ply < 10), positions where one side is in check, etc.), and made sure to exclude them. Most importantly, however, rather than using my old approach of just grabbing as many valid FENs as possible from each game, I randomly selected at most 10 from each game. Doing this seemed to help significantly with getting quality data and avoiding too many similar positions.

After this, I removed all duplicate positions from the whole set of FENs (probably around 200-300), which in the end gave me roughly 1.3M FENs to use for tuning.
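For anyone curious, the selection logic boils down to something like this rough sketch (GamePosition, Ply, and InCheck are hypothetical stand-ins for whatever the PGN parsing exposes, not the actual generator code):

Code: Select all

package main

import "math/rand"

// GamePosition is a placeholder for one position extracted from a game.
type GamePosition struct {
	FEN     string
	Ply     int
	InCheck bool
}

// usable mirrors the exclusion rules: skip book plies, very long games,
// and positions where the side to move is in check.
func usable(p GamePosition) bool {
	return p.Ply >= 10 && p.Ply <= 200 && !p.InCheck
}

// sampleGame keeps at most maxPerGame randomly chosen usable positions
// from one game, limiting how many near-identical positions from the same
// game end up in the training set.
func sampleGame(positions []GamePosition, maxPerGame int, rng *rand.Rand) []string {
	var candidates []string
	for _, p := range positions {
		if usable(p) {
			candidates = append(candidates, p.FEN)
		}
	}
	rng.Shuffle(len(candidates), func(i, j int) {
		candidates[i], candidates[j] = candidates[j], candidates[i]
	})
	if len(candidates) > maxPerGame {
		candidates = candidates[:maxPerGame]
	}
	return candidates
}

// dedupe drops exact duplicate FENs across the whole set.
func dedupe(fens []string) []string {
	seen := make(map[string]bool)
	var out []string
	for _, f := range fens {
		if !seen[f] {
			seen[f] = true
			out = append(out, f)
		}
	}
	return out
}

func main() {}

Each game's surviving FENs get appended to one big list, which is then run through the dedupe step before writing everything out.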

Rather than re-tuning the entire evaluation using only this new dataset, I decided to first try extending the Zurichess dataset with 300K randomly chosen positions from the 1.3M, which brought the Zurichess dataset to ~1M FENs. I then re-tuned Blunder's entire evaluation using this extended dataset and ran a quick 2000-game match @ 10+0.1s with a 16MB hash, and the result was actually pretty nice:

Code: Select all

Score of Blunder 8.0.0-enhanced-eval vs Blunder 8.0.0: 617 - 529 - 854  [0.522] 2000
...      Blunder 8.0.0-enhanced-eval playing White: 335 - 234 - 431  [0.550] 1000
...      Blunder 8.0.0-enhanced-eval playing Black: 282 - 295 - 423  [0.493] 1000
...      White vs Black: 630 - 516 - 854  [0.528] 2000
Elo difference: 15.3 +/- 11.5, LOS: 99.5 %, DrawRatio: 42.7 %
SPRT: llr 1.42 (48.4%), lbound -2.94, ubound 2.94
I'm going to do a little more experimenting with re-tuning Blunder's evaluation, including longer time controls and re-tuning using only the dataset I generated, but for now I've accepted the changes in the dev branch as a strength gain.

If anyone's interested, I can upload the dataset a little later. I have no clue whether it'll be beneficial to any engine besides Blunder.
Last edited by algerbrex on Fri Jul 01, 2022 4:13 pm, edited 3 times in total.