I think it depends on what you are tuning. I believe you have to keep in mind that tuning a PST requires much less data than tuning an NN, because during regression all of that data gets combined into a single value per square for each piece. It's very simple and is only intended to make sure your pieces are on the right squares, to increase the probability of introducing a tactical imbalance or a mate. There is only so much information that can be encoded into a single square of a PST. I find that there is not much difference between a well-sampled 1 million position data set and an 8 million one. If your sampling is biased or drawn from a restricted population, then perhaps more is better. But I think the law of diminishing returns applies in this case.
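To make the "single square value" point concrete, here is a minimal sketch of Texel-style logistic-regression tuning of one PST cell. Everything in it (the synthetic data, the scaling constant, the learning rate) is invented for illustration and not taken from any particular engine:

```python
import math
import random

# Toy Texel-style tuning: fit one PST cell against game results with
# logistic regression. All data below is synthetic.
K = 1.0 / 400.0  # assumed scaling constant mapping centipawns to win probability

def win_prob(eval_cp):
    return 1.0 / (1.0 + math.exp(-K * eval_cp))

random.seed(42)
true_value = 35.0  # the "true" bonus in centipawns for this cell
# Each sample: (white occurrences of the feature minus black's, game result)
data = []
for _ in range(5000):
    count = random.choice([-2, -1, 0, 1, 2])
    eval_cp = true_value * count + random.gauss(0, 150)  # rest of eval as noise
    result = 1.0 if random.random() < win_prob(eval_cp) else 0.0
    data.append((count, result))

value = 0.0  # the single parameter we tune
lr = 50.0
for epoch in range(200):
    grad = 0.0
    for count, result in data:
        p = win_prob(value * count)
        grad += (p - result) * count  # cross-entropy gradient, up to the constant K
    value -= lr * grad / len(data)

print(round(value, 1))  # should land in the neighbourhood of true_value
```

The point of the sketch: millions of positions collapse into a handful of sufficient statistics per square, which is why a well-sampled 1M set already pins the values down.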
Generating your own datasets
Moderator: Ras
JoAnn Peeler · Posts: 253 · Joined: Mon Aug 26, 2019 4:34 pm · Location: Clearwater, Florida USA
Re: Generating your own datasets
Thomas Jahn · Posts: 915 · Joined: Sun Dec 27, 2020 2:40 am · Location: Bremen, Germany
Re: Generating your own datasets
Thanks for all the details! That's a really interesting insight into your data generation approach.

Witek wrote: ↑Thu Apr 06, 2023 4:07 am
1.5M positions is way too small a data set. I got huge gains (over 100 Elo) just by increasing the data size from 80M to 800M positions. AFAIK Stockfish uses > 20B positions.
Quality of games is not that important. A common practice is to run games at a low node count per move or at low depth. Top engines run games at 10k nodes per move or even lower and achieve the best results. This way I'm able to generate ~1M games in 24 hours.
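As a sanity check on those throughput numbers, here is a back-of-the-envelope calculation. The average game length is my own assumption, not something stated in the post:

```python
# Rough throughput estimate for low-node self-play generation.
games_per_day = 1_000_000
plies_per_game = 100        # assumption: ~100 plies per game on average
nodes_per_move = 10_000     # from the post

total_nodes = games_per_day * plies_per_game * nodes_per_move  # 1e12 nodes/day
seconds_per_day = 24 * 3600
required_nps = total_nodes / seconds_per_day

print(f"{required_nps:,.0f} nodes/sec overall")
```

Roughly 11.6M nodes/sec summed across all worker threads, which is well within reach of a modern multi-core machine running a fast engine.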
Openings are really important. You shouldn't use "normal" openings because they are balanced and repetitive. For example, I'm using DFRC openings with a few (4-12) random moves played at the beginning. This way there is great variety, and totally winning/losing openings will be present in the training set, which is good. You can also try multi-PV to introduce more variety, or inject some random moves during the game, but this will mess up the game outcome.
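The multi-PV variety idea mentioned above can be sketched like this: instead of always playing the best move, pick uniformly among moves within some margin of the best score. The move list, scores, and margin below are made up for illustration:

```python
import random

# Sketch: diversify play using multi-PV output. `multipv` is a list of
# (move, score_cp) pairs as an engine might report them; values are invented.
def pick_varied(multipv, margin=30, rng=random):
    best = max(score for _, score in multipv)
    candidates = [move for move, score in multipv if best - score <= margin]
    return rng.choice(candidates)

multipv = [("e2e4", 25), ("d2d4", 20), ("g1f3", 10), ("b1c3", -40)]
rng = random.Random(0)
move = pick_varied(multipv, margin=30, rng=rng)
print(move)  # one of e2e4 / d2d4 / g1f3; b1c3 is outside the margin
```

As the post notes, any such injected variety slightly decouples the game outcome from best play, so the labels get noisier.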
For position selection, I'm removing positions in check and positions where the played move was not quiet (a capture or promotion). I'm not doing any leaf-position extraction from qsearch, because it doesn't make sense IMO: the move score / game result relates to the root position played in the game, not to some random leaf position deep in the PV line. I'm also removing positions whose score is above some big threshold (e.g. +2000cp) and positions with a high half-move counter where the game ended in a draw.
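Those filtering rules can be written down as a simple predicate. The record layout (dict fields) and the half-move limit are my assumptions; the post only says "high half-move counter":

```python
# Position filter following the rules described above. The field names and
# the halfmove_limit default are hypothetical; adapt to your own data format.
def keep_position(pos, score_limit=2000, halfmove_limit=80):
    if pos["in_check"]:
        return False
    if pos["move_was_capture"] or pos["move_was_promotion"]:
        return False  # only keep positions where the played move was quiet
    if abs(pos["score_cp"]) > score_limit:
        return False  # hugely lopsided scores add little signal
    if pos["halfmove_clock"] > halfmove_limit and pos["result"] == 0.5:
        return False  # long shuffling that fizzled into a draw
    return True

sample = {"in_check": False, "move_was_capture": False,
          "move_was_promotion": False, "score_cp": 35,
          "halfmove_clock": 12, "result": 1.0}
print(keep_position(sample))  # True
```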
But like JoAnn I wonder if you can really say that 1.5M is generally too small a dataset. When tuning PeSTO-style evaluations with the slow Texel approach, many people used the Zurichess dataset with only 725K positions and had great results!
Intuitively it would make sense that the necessary size of the dataset depends on the complexity of the evaluation.
But let's say you want to tune multiple PSTs based on the king position (I think you mentioned you did something like that at some point with Caissa). Then you probably need 100x more games or so, because otherwise you just don't get enough samples for rare king positions.
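The sample-scarcity argument can be made concrete with a toy simulation. The 4-bucket scheme and the king-square distribution below are completely invented; the point is only that samples split very unevenly across king buckets:

```python
import random
from collections import Counter

# Toy illustration: split PSTs by the friendly king's board quadrant
# (a made-up 4-bucket scheme, squares indexed 0..63 with a1 = 0).
def king_bucket(king_sq):
    file, rank = king_sq % 8, king_sq // 8
    return (0 if file < 4 else 1) + (0 if rank < 4 else 2)

random.seed(1)
# Synthetic king-square distribution with a castled-king bias:
# mostly g1/c1-like squares, rarely an advanced king.
kings = random.choices(population=[6, 2, 14, 36],
                       weights=[60, 25, 10, 5], k=10_000)
counts = Counter(king_bucket(sq) for sq in kings)
print(dict(counts))  # the advanced-king bucket gets only a few percent
```

If each bucket has its own PST, the rare buckets see orders of magnitude fewer samples, so the total dataset must grow accordingly.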
And if you use NNUE you need more. If you use a king-centric NNUE like Stockfish, with a net that has tens of thousands of inputs, you need more again, and I guess that's why they have 20B positions.
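The "tens of thousands of inputs" claim is easy to see from the feature-set arithmetic. This ignores padding details of real HalfKP layouts and just counts (king square, piece square, piece type) combinations:

```python
# Why king-centric nets need so much data: the input layer grows with every
# (king square, piece square, piece type) combination.
king_squares = 64
piece_squares = 64
piece_types = 10          # P, N, B, R, Q for each color; kings excluded

halfkp_like_inputs = king_squares * piece_squares * piece_types
print(halfkp_like_inputs)  # 40960 inputs per perspective

# A plain piece-square-table evaluation, by contrast:
pst_inputs = 6 * 64
print(pst_inputs)  # 384 parameters per game phase
```

Each first-layer weight only gets trained on positions where its particular king square occurs, so the effective samples-per-parameter ratio shrinks by roughly the factor of 64 that the king dimension adds.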