Generating your own datasets

Discussion of chess software programming and technical issues.

jmcd
Posts: 58
Joined: Wed Mar 18, 2020 10:00 pm
Full name: Jonathan McDermid

Generating your own datasets

Post by jmcd »

Recently I decided that I wanted to move away from using the Zurichess dataset for tuning and make my own. I've generated a PGN of about 20 thousand self-play games using the same cutechess parameters as Zurichess, and I parse the file into a set of ~1.5 million positions for tuning. The main problem is that I seem to be doing something wrong with how I select positions for the dataset: with the Zurichess dataset I get an MSE of ~0.055, but on my own dataset it's around 0.09, and the resulting values are pretty useless.
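
For reference, the error being compared is the usual Texel-style tuning objective: squash the static evaluation through a sigmoid and compare it against the game result. A minimal sketch, assuming each entry is a (centipawn score, result) pair from White's point of view and a scaling constant k that would normally be fitted to the data first:

```python
import math

def dataset_mse(entries, k=1.0 / 400.0):
    """Mean squared error between predicted and actual game outcomes.

    entries: iterable of (score_cp, result) pairs, where score_cp is the
    static evaluation in centipawns from White's point of view and result
    is 1.0 / 0.5 / 0.0 for a White win / draw / loss.
    k is a scaling constant, normally fitted to the data beforehand.
    """
    total = 0.0
    count = 0
    for score_cp, result in entries:
        predicted = 1.0 / (1.0 + math.exp(-k * score_cp))  # eval -> win probability
        total += (result - predicted) ** 2
        count += 1
    return total / count

# A dataset is "easy" when the evaluations already separate the outcomes
# well, which pushes the MSE down; harder data gives a higher floor.
print(dataset_mse([(300, 1.0), (-250, 0.0), (10, 0.5)]))
```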

My question is: what are the standard guidelines for selecting positions for the dataset? Currently I omit the first 8 moves of the game, positions in check, and checkmates. For each remaining position I search to a depth of 3 and use the position at the end of the principal variation (my PV doesn't include quiescent moves; perhaps this is an issue?).
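
Written out, the selection looks roughly like the sketch below (using the python-chess library purely for illustration; my real code lives in the engine, so the depth-3 PV step is only marked with a comment):

```python
import chess
import chess.pgn

def extract_positions(pgn_path, skip_plies=16):
    """Yield (FEN, result) pairs from a PGN, skipping the opening moves
    and any position that is in check or checkmate.

    skip_plies=16 corresponds to omitting the first 8 full moves.
    """
    results = {"1-0": 1.0, "0-1": 0.0, "1/2-1/2": 0.5}
    with open(pgn_path) as f:
        while True:
            game = chess.pgn.read_game(f)
            if game is None:
                break
            result = results.get(game.headers.get("Result", "*"))
            if result is None:
                continue  # skip unfinished/unknown results
            board = game.board()
            for ply, move in enumerate(game.mainline_moves()):
                board.push(move)
                if ply < skip_plies:
                    continue
                if board.is_check():
                    continue  # also covers checkmate
                # Here the engine would search to depth 3 and record the
                # position at the end of the PV instead of this one.
                yield board.fen(), result
```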

Any help would be appreciated!
Clovis GitHub
jmcd
Posts: 58
Joined: Wed Mar 18, 2020 10:00 pm
Full name: Jonathan McDermid

Re: Generating your own datasets

Post by jmcd »

Update: I added quiescent moves to the PV and it improved things a little, but the results are still far worse than with the Zurichess dataset.
Clovis GitHub
lithander
Posts: 915
Joined: Sun Dec 27, 2020 2:40 am
Location: Bremen, Germany
Full name: Thomas Jahn

Re: Generating your own datasets

Post by lithander »

I recently weaned Leorik off the Zurichess dataset and wrote a few posts about it in my devlog.

Not only did I want to create my own data; as an extra challenge I made Leorik forget everything it had learned from tuning on 3rd-party data. The first version I ran self-play games with only knew basic material values and played pretty dumb, but the data I generated with it was good enough to tune a stronger version, and with that version I created better data. Repeating that cycle a dozen times, I arrived back at the old strength.

Things I observed that might help you, too:
  • Don't worry about the MSE being worse than on the Zurichess dataset. It just means that your data is harder to predict correctly, not that it is unfit for tuning an evaluation.
  • Sourcing 1.5M positions from just 20K games means many positions are very similar to each other. Sampling only 5 positions per game from a set of 300K games would also give you 1.5M positions, but you'd likely get better results (a small sketch of this per-game sampling, together with the labelling and filtering mentioned below, follows this list).
  • If you start self-play from the start position with no book, the deterministic nature of engines will produce very similar games. You want variety, so an opening book helps; I also added some randomness to my engine to get more variety in the games.
  • Once you have positions, the question is how to label them. I just labeled every position taken from a game with that game's outcome, e.g. all positions from a game won by White were labeled as "winning for White", and it worked surprisingly well. I'm sure many positions were mislabeled due to blunders (explaining the high MSE = 0.42024 compared with MSE = 0.247370 on the Zurichess dataset), but it erred in a way that balanced out.
  • How to choose positions? I think the common wisdom is to pick only quiet ones, because that's what your evaluation is going to be called on. But I didn't exclude noisy positions; instead I ran them through my QSearch to make them quiet.
  • At some point, when my evaluation was already decent, I added some "confirmation bias" and filtered out positions where the label did not fit the evaluation. It's labelled as a win for White, but the evaluation thinks Black is 200cp ahead? Either the label is wrong for that position (because Black blundered later in the game) or the evaluation isn't sophisticated enough to make sense of why White is winning. In either case: let's just ignore it.
  • I always tuned my evaluation from scratch again, but with different sets of data. Over time I would phase the oldest training data out, but I never tuned only on the most recent set. With my approach, quantity seemed to be more important than quality, but something around 2M positions was always enough; once I hit a plateau, generating more data didn't make any difference anymore.
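
Written as a sketch, the sampling, labelling and "confirmation bias" steps above look roughly like this (eval_fn, per_game and veto_cp are placeholder names; the 200cp figure is just the example from the list):

```python
import random

def sample_and_label(game_positions, result, eval_fn,
                     per_game=5, veto_cp=200):
    """Pick a handful of positions from one game, label each with the game
    outcome, and drop positions where the current evaluation contradicts
    that outcome by more than veto_cp centipawns.

    game_positions: list of FENs from a single game
    result:         1.0 / 0.5 / 0.0 (game outcome from White's view)
    eval_fn:        static evaluation in centipawns from White's view
    """
    picked = random.sample(game_positions, min(per_game, len(game_positions)))
    kept = []
    for fen in picked:
        score = eval_fn(fen)
        if result == 1.0 and score < -veto_cp:
            continue  # labelled as a White win, but eval says Black is ahead
        if result == 0.0 and score > veto_cp:
            continue  # labelled as a Black win, but eval says White is ahead
        kept.append((fen, result))
    return kept
```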
Minimal Chess (simple, open source, C#) - Youtube & Github
Leorik (competitive, in active development, C#) - Github & Lichess
jmcd
Posts: 58
Joined: Wed Mar 18, 2020 10:00 pm
Full name: Jonathan McDermid

Re: Generating your own datasets

Post by jmcd »

Thanks for the thorough response.

I plan on tuning from scratch as well once I am confident my method for generating datasets works, but first that means producing something that matches or beats the Zurichess values. It's a relief that a lower MSE does not equate to better positions, but I am still curious what specifically makes the Zurichess dataset so easy to predict. I'm trying to adhere to their methods until I get this working, and will expand with my own ideas later.

Currently I am using the same opening book as the Zurichess dataset, 2moves_v1.epd. I am not sure it's great for getting a diversity of games since it's only two moves deep, but since Zurichess used it I assume it's sufficient. Your point about using fewer positions from each game seems like a great idea, and from the looks of it, it is the most likely explanation for why my method was failing. I'm generating a larger dataset now, but it takes me forever to get 200-300k games. Your idea about confirmation bias is something I had been considering as well; I'll probably try it out eventually.

Edit: I've also disabled the 'repeat' argument in the cutechess command, since it seems pointless and a good way to get duplicate positions.
Clovis GitHub
JoAnnP38
Posts: 253
Joined: Mon Aug 26, 2019 4:34 pm
Location: Clearwater, Florida USA
Full name: JoAnn Peeler

Re: Generating your own datasets

Post by JoAnnP38 »

lithander wrote: Tue Mar 14, 2023 1:32 pm Don't worry about the MSE being worse than on the Zurichess dataset. It just means that your data is harder to predict correctly, not that it is unfit for tuning an evaluation.
+1. This is my understanding as well.

I myself am not collecting self-play games yet, not until I feel comfortable that I'm getting enough variety. For now I am tuning on a mixture of grandmaster and computer-vs-computer games (other than my own). Now that my tuning seems to be working well (if not necessarily fast enough), I am going to strip my evaluation back down to material only and build it back up one feature at a time, verifying that each feature is worthwhile in terms of Elo as I go. I started off with everything (and the kitchen sink) in my evaluation, and that resulted in play that is too slow and too low in Elo. I may also do the same thing in search: strip it back down to basic negamax with alpha-beta pruning, then build it back up feature by feature so I can see whether each algorithm or heuristic actually helps or hurts.
jmcd
Posts: 58
Joined: Wed Mar 18, 2020 10:00 pm
Full name: Jonathan McDermid

Re: Generating your own datasets

Post by jmcd »

So I've generated a much larger dataset and switched to using only 1 in every 5 positions: 1.1M positions, with duplicates removed if they occurred in the same game. Even with these changes, though, I'm still getting bad values from tuning: -150 Elo from retuning everything, with certain parameters converging on very poor values. Normally tempo is 20,20 on the Zurichess dataset, but it converges on 60,20 with my self-generated dataset. If I force tempo to 20,20 with the new dataset, the Elo loss drops to -70. Is there any likely explanation for why this isn't working? I thought perhaps the tempo failure might be a hint.
Clovis GitHub
KhepriChess
Posts: 93
Joined: Sun Aug 08, 2021 9:14 pm
Full name: Kurt Peters

Re: Generating your own datasets

Post by KhepriChess »

lithander wrote: Tue Mar 14, 2023 1:32 pm But I didn't exclude noisy positions; instead I ran them through my QSearch to make them quiet.
I'm having trouble understanding how this produces accurate results. If you have a game that white won and you grab a random noisy position from that game, running it through QSearch to make it quiet might produce a position that isn't winning for white? Or am I wrong?
Puffin: Github
KhepriChess: Github
lithander
Posts: 915
Joined: Sun Dec 27, 2020 2:40 am
Location: Bremen, Germany
Full name: Thomas Jahn

Re: Generating your own datasets

Post by lithander »

KhepriChess wrote: Sat Apr 01, 2023 9:59 pm I'm having trouble understanding how this produces accurate results. If you have a game that white won and you grab a random noisy position from that game, running it through QSearch to make it quiet might produce a position that isn't winning for white? Or am I wrong?
I don't know if every engine is like that, but in my engine the evaluation is only ever asked to evaluate quiet positions, and those positions are made quiet by running them through QSearch. As long as there are winning captures we play them; only when there is nothing left to capture without losing more than you gain does the side to move return the "stand pat" score. That score is computed by the static evaluation we're trying to tune, so it makes sense to me that the training data should also contain only such positions.
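
A minimal sketch of that kind of quiescence search (not Leorik's actual code), assuming an evaluate(board) callback that scores from the side to move's point of view and using python-chess just for move generation:

```python
import chess

def qsearch(board, alpha, beta, evaluate):
    """The side to move may either accept the static evaluation
    ("stand pat") or keep capturing, so the value that comes back
    always belongs to a quiet position."""
    stand_pat = evaluate(board)
    if stand_pat >= beta:
        return beta
    alpha = max(alpha, stand_pat)
    for move in list(board.generate_legal_captures()):
        board.push(move)
        score = -qsearch(board, -beta, -alpha, evaluate)
        board.pop()
        if score >= beta:
            return beta
        alpha = max(alpha, score)
    return alpha
```

For data generation you would additionally remember the position at which the stand-pat score was finally accepted and store that quiet position instead of the original noisy one.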
Minimal Chess (simple, open source, C#) - Youtube & Github
Leorik (competitive, in active development, C#) - Github & Lichess
KhepriChess
Posts: 93
Joined: Sun Aug 08, 2021 9:14 pm
Full name: Kurt Peters

Re: Generating your own datasets

Post by KhepriChess »

Well, I'm actually a little surprised that it... works. Of course it should, but it's just nice to see everything come together and work. I generated roughly 190,000 positions (from 55,000 games), ran them through the tuner, and got back values that increased playing strength. There are definitely some odd results, like a lot of negative numbers in the PSTs, but apparently it's better than nothing.

It has highlighted yet another downside to doing this all in JavaScript: it takes forever to parse even just 55,000 games. That alone took 11-ish hours just to generate the positions (though I did find one slow spot in the code while it was running, I don't think fixing it will be a huge improvement). Plus the days it takes just to play tens of thousands of games. But I'll keep at it and see how far I can get. Worst case, I go back to using another dataset.
Puffin: Github
KhepriChess: Github
Witek
Posts: 87
Joined: Thu Oct 07, 2021 12:48 am
Location: Warsaw, Poland
Full name: Michal Witanowski

Re: Generating your own datasets

Post by Witek »

1.5M positions is way too small a dataset. I got huge gains (over 100 Elo) just by increasing the data size from 80M to 800M positions. AFAIK Stockfish uses > 20B positions.

The quality of the games is not that important. A common practice is to run games at a low node count per move or at low depth. Top engines run games at 10k nodes per move or even lower and achieve the best results. This way I'm able to generate ~1M games in 24 hours.

The opening is really important. You shouldn't use "normal" openings because they are balanced and repetitive. For example, I'm using DFRC openings with a few (4-12) random moves played at the beginning. This way there is great variety, and there will be totally winning/losing openings present in the training set, which is good. You can also try multi-PV to introduce more variety, or inject some random moves during the game, but this will mess up the game outcome.
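
A tiny sketch of the random-opening-moves idea, assuming python-chess; plies and base_fen are placeholder names (for DFRC you would pass in a randomly chosen DFRC start position instead of the standard one):

```python
import random
import chess

def random_opening(plies=8, base_fen=chess.STARTING_FEN):
    """Play a handful of random legal moves from a starting position to get
    a varied (and possibly unbalanced) opening for data-generation games."""
    board = chess.Board(base_fen)
    for _ in range(plies):
        moves = list(board.legal_moves)
        if not moves:
            break  # reached mate/stalemate; discard or retry in practice
        board.push(random.choice(moves))
    return board.fen()
```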

For position selection, I'm removing positions in check and positions where the played move was not quiet (a capture or promotion). I'm not doing any leaf-position extraction from qsearch, because it doesn't make sense IMO: the move score / game result relates to the root position played in the game, not to some random leaf position deep in the PV line. I'm also removing positions whose score is above some large threshold (e.g. +2000cp), and positions with a high half-move counter where the game ended in a draw.
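
As a sketch, that filter could look something like the following (assuming python-chess board objects; the 2000cp limit matches the figure above, while keep_position and the half-move limit are just placeholders):

```python
import chess

def keep_position(board, played_move, score_cp, game_result,
                  score_limit=2000, halfmove_limit=40):
    """Return True if the position should go into the training set.

    board:        position before played_move
    played_move:  the move actually played in the game
    score_cp:     search score for the position, from White's view
    game_result:  1.0 / 0.5 / 0.0
    """
    if board.is_check():
        return False                      # in check
    if board.is_capture(played_move) or played_move.promotion:
        return False                      # played move was not quiet
    if abs(score_cp) > score_limit:
        return False                      # already hopelessly decided
    if game_result == 0.5 and board.halfmove_clock > halfmove_limit:
        return False                      # shuffling toward a draw
    return True
```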
Author of Caissa Chess Engine: https://github.com/Witek902/Caissa