Do You need more than 1000 games really?

Discussion of chess software programming and technical issues.


Jouni
Posts: 3281
Joined: Wed Mar 08, 2006 8:15 pm

Do You need more than 1000 games really?

Post by Jouni »

SF dev version against SF14. 2 successive runs:

+131=810-59 +25 ELO
+132=808-60 +25 ELO
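
For reference, a minimal Python sketch of how the +25 figure follows from those totals, assuming the usual logistic Elo model (point estimate only, no error bars):

import math

def elo_from_wdl(wins, draws, losses):
    """Logistic Elo estimate from a single match result."""
    score = (wins + 0.5 * draws) / (wins + draws + losses)
    return -400 * math.log10(1 / score - 1)

print(elo_from_wdl(131, 810, 59))  # ~ +25.1
print(elo_from_wdl(132, 808, 60))  # ~ +25.1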
Jouni
sovaz1997
Posts: 261
Joined: Sun Nov 13, 2016 10:37 am

Re: Do You need more than 1000 games really?

Post by sovaz1997 »

Jouni wrote: Thu Sep 09, 2021 4:46 pm SF dev version against SF14. 2 successive runs:

+131=810-59 +25 ELO
+132=808-60 +25 ELO
Run a 1000-game match 1000 times and see the different results ;)
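
A quick way to see the spread is to simulate it; a sketch, where the per-game win/draw probabilities are assumptions chosen to roughly match the runs above:

import math
import random

def elo(score):
    return -400 * math.log10(1 / score - 1)

p_win, p_draw = 0.13, 0.81   # assumed per-game probabilities, roughly like the runs above

estimates = []
for _ in range(1000):                     # 1000 simulated 1000-game matches
    score = 0.0
    for _ in range(1000):
        r = random.random()
        score += 1.0 if r < p_win else 0.5 if r < p_win + p_draw else 0.0
    estimates.append(elo(score / 1000))

estimates.sort()
print(f"median {estimates[500]:+.1f} Elo, central 95% roughly {estimates[25]:+.1f} .. {estimates[975]:+.1f}")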
Zevra 2 is my chess engine. Binary, source and description here: https://github.com/sovaz1997/Zevra2
Zevra v2.5 is the latest version of Zevra: https://github.com/sovaz1997/Zevra2/releases
yurikvelo
Posts: 710
Joined: Sat Dec 06, 2014 1:53 pm

Re: Do You need more than 1000 games really?

Post by yurikvelo »

Jouni wrote: Thu Sep 09, 2021 4:46 pm SF dev version against SF14. 2 successive runs:

+131=810-59 +25 ELO
+132=808-60 +25 ELO
Take any run from fishtest; the results come in 200-game chunks:
https://tests.stockfishchess.org/tests/ ... d8b78eb5f4


Make a cumulative graph of WDL or Elo in 1000-game increments and see the deviation.
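
A sketch of such a cumulative plot, assuming the results are available as a list of (wins, draws, losses) chunks; the function name and layout are just illustrative:

import math

def elo(score):
    return -400 * math.log10(1 / score - 1)

def cumulative_elo(chunks, step=1000):
    """Print the running Elo estimate every `step` games from 200-game chunks."""
    w = d = l = 0
    next_mark = step
    for cw, cd, cl in chunks:
        w, d, l = w + cw, d + cd, l + cl
        n = w + d + l
        if n >= next_mark:
            print(f"{n:6d} games: {elo((w + 0.5 * d) / n):+6.1f} Elo")
            next_mark += step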
hgm
Posts: 27788
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Do You need more than 1000 games really?

Post by hgm »

Need for what? Obviously one 1000-game match is enough here to get a likelihood of superiority very close to 100%. You have about 200 non-draws, and the standard deviation in the number of wins for that would be 50%*sqrt(200) = 7 points. And you are 36 points away from equality. That is about 5 standard deviations.
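
The same back-of-envelope as a Python sketch (a normal approximation of the win count, not any particular tool's code):

import math

wins, draws, losses = 131, 810, 59
decisive = wins + losses                      # ~200 non-draws
sigma = 0.5 * math.sqrt(decisive)             # std dev of the win count if the engines were equal, ~7
excess = wins - decisive / 2                  # 131 wins vs the ~95 an equal match would average: 36
z = excess / sigma                            # ~5 standard deviations
los = 0.5 * (1 + math.erf(z / math.sqrt(2)))  # likelihood of superiority
print(f"sigma {sigma:.1f}, excess {excess:.0f}, z {z:.1f}, LOS {los:.7f}")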
Jouni
Posts: 3281
Joined: Wed Mar 08, 2006 8:15 pm

Re: Do You need more than 1000 games really?

Post by Jouni »

A 3rd try shows you need more than 1000 games :) :

+134=788-78 +19 ELO
Jouni
mvanthoor
Posts: 1784
Joined: Wed Jul 03, 2019 4:42 pm
Location: Netherlands
Full name: Marcel Vanthoor

Re: Do You need more than 1000 games really?

Post by mvanthoor »

Maybe you need more than 1000 games when you're testing 5-10 Elo differences.

When I test my engine, doing more than 1000 games is useless. At 1000 games, the Elo difference between the two versions is basically locked in, and the only thing that happens after that is that the error margin gets lower and lower. If the Elo difference is +100 for the dev version, and the error margin is 15 or 20 Elo, I can be quite sure that the dev version is stronger than the latest release.
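
A sketch of that kind of error bar, with made-up numbers that happen to work out to roughly +100 Elo over 1000 games; the function and the example result are illustrative, not from any real test:

import math

def elo(score):
    return -400 * math.log10(1 / score - 1)

def elo_with_margin(wins, draws, losses):
    """Point estimate and approximate 95% margin from one match's W/D/L."""
    n = wins + draws + losses
    s = (wins + 0.5 * draws) / n
    var = (wins * (1 - s) ** 2 + draws * (0.5 - s) ** 2 + losses * (0 - s) ** 2) / n
    se = math.sqrt(var / n)                      # standard error of the per-game score
    return elo(s), elo(s + 1.96 * se) - elo(s)   # Elo, +/- margin

print(elo_with_margin(440, 400, 160))   # hypothetical 1000-game result: ~ (+100, +/-17)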
Author of Rustic, an engine written in Rust.
Releases | Code | Docs | Progress | CCRL
BrokenKeyboard
Posts: 24
Joined: Tue Mar 16, 2021 11:11 pm
Full name: Het Satasiya

Re: Do You need more than 1000 games really?

Post by BrokenKeyboard »

You could also just use an SPRT test to let the computer find the gain within a confidence bound that you find acceptable.
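
For anyone curious what such a test computes, a minimal sketch of the generalized SPRT log-likelihood ratio under a normal approximation; the elo0/elo1 hypotheses and the 5% error rates are example choices, not anyone's actual settings:

import math

def expected_score(elo):
    return 1 / (1 + 10 ** (-elo / 400))

def sprt_llr(wins, draws, losses, elo0=0, elo1=5):
    """Log-likelihood ratio of H1 (gain >= elo1) vs H0 (gain <= elo0),
    using the observed score and its variance (normal approximation)."""
    n = wins + draws + losses
    s = (wins + 0.5 * draws) / n
    var = (wins + 0.25 * draws) / n - s ** 2          # per-game score variance
    s0, s1 = expected_score(elo0), expected_score(elo1)
    return n * (s1 - s0) * (2 * s - s0 - s1) / (2 * var)

alpha = beta = 0.05
lower, upper = math.log(beta / (1 - alpha)), math.log((1 - beta) / alpha)
llr = sprt_llr(131, 810, 59)                          # the first run from the opening post
print(f"LLR {llr:.2f}; accept H1 above {upper:.2f}, accept H0 below {lower:.2f}")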
Sopel
Posts: 389
Joined: Tue Oct 08, 2019 11:39 pm
Full name: Tomasz Sobczyk

Re: Do You need more than 1000 games really?

Post by Sopel »

WOW SPRT is so clever! Why did no one think of that earlier? Maybe in a few years it will become a standard way of testing engine patches and we will be able to abandon our flawed testing methods. I can already imagine a place where people can submit patches and have SPRT testing performed on some sort of cloud or something to determine whether a patch is good or not, and with no human bias at that!
dangi12012 wrote:No one wants to touch anything you have posted. That proves you now have negative reputations since everyone knows already you are a forum troll.

Maybe you copied your stockfish commits from someone else too?
I will look into that.
mvanthoor
Posts: 1784
Joined: Wed Jul 03, 2019 4:42 pm
Location: Netherlands
Full name: Marcel Vanthoor

Re: Do You need more than 1000 games really?

Post by mvanthoor »

BrokenKeyboard wrote: Wed Sep 15, 2021 11:09 pm You could also just use an SPRT test to let the computer find the gain within a confidence bound that you find acceptable.
Yes, that's a quick test to verify self-play Elo, but if you want a 'real' estimate of the actual gain in competition, you should play against other engines. In self-play the gain is inflated because the newer engine has functionality or optimizations the other doesn't, so it will exploit that advantage over and over again. (Even so, if the expected gain is very small and the confidence needs to be very high, you'd need thousands and thousands of games to actually prove that the newer engine is stronger than the previous version.)
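
To put rough numbers on that last point, a sketch under a normal approximation; the 60% draw rate and the 1.96 z-value are assumptions, and the counts are only ballpark:

import math

def games_needed(elo_diff, draw_rate=0.6, z=1.96):
    """Rough number of games before a gain of elo_diff is distinguishable
    from zero at about z standard errors (normal approximation)."""
    s = 1 / (1 + 10 ** (-elo_diff / 400))   # expected score of the stronger side
    w = s - draw_rate / 2                    # implied win rate
    var = w + draw_rate / 4 - s ** 2         # per-game score variance
    return math.ceil((z / (s - 0.5)) ** 2 * var)

for diff in (2, 5, 10, 20, 50, 100):
    print(f"+{diff:3d} Elo: ~{games_needed(diff):6d} games")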
Author of Rustic, an engine written in Rust.
Releases | Code | Docs | Progress | CCRL