Do you really need more than 1000 games?

Jouni · Post by **Jouni** » Thu Sep 09, 2021 4:46 pm

SF dev version against SF14. Two successive runs:

+131=810-59 +25 ELO
+132=808-60 +25 ELO

sovaz1997 · Post by **sovaz1997** » Thu Sep 09, 2021 6:49 pm

Jouni wrote: ↑Thu Sep 09, 2021 4:46 pm SF dev version against SF14. Two successive runs:

+131=810-59 +25 ELO
+132=808-60 +25 ELO

Run 1000 games 1000 times and you will see different results.
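
The experiment suggested above can be approximated with a small simulation. This is only a sketch under loud assumptions: games are treated as independent, and the "true" win and draw probabilities are simply taken from the first run (13.1% wins, 81.0% draws). The function names and the seed are made up for illustration.

```python
import math
import random

def elo_from_score(score):
    """Convert an average score (0..1) to an Elo difference (logistic model)."""
    return -400 * math.log10(1 / score - 1)

def simulate_match(rng, p_win, p_draw, games=1000):
    """Play one simulated match and return its measured Elo difference."""
    p_loss = 1 - p_win - p_draw
    results = rng.choices([1.0, 0.5, 0.0],
                          weights=[p_win, p_draw, p_loss], k=games)
    return elo_from_score(sum(results) / games)

# Assumption: the first run's frequencies are the true probabilities.
rng = random.Random(2021)
elos = [simulate_match(rng, 0.131, 0.810) for _ in range(1000)]
print(f"min {min(elos):+.1f}, max {max(elos):+.1f}")
```

Even with identical engines and settings, the individual 1000-game matches land in a fairly wide band around the underlying value.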

yurikvelo · Post by **yurikvelo** » Thu Sep 09, 2021 7:52 pm

Jouni wrote: ↑Thu Sep 09, 2021 4:46 pm SF dev version against SF14. Two successive runs:

+131=810-59 +25 ELO
+132=808-60 +25 ELO

Take any run from fishtest; the results come in 200-game chunks:
https://tests.stockfishchess.org/tests/ ... d8b78eb5f4

Make a cumulative graph of WDL or Elo in 1000-game increments and look at the deviation.

hgm · Post by **hgm** » Thu Sep 09, 2021 7:53 pm

Needed for what? Obviously one 1000-game match is enough here to get a likelihood of superiority very close to 100%. You have about 200 non-draws, so the standard deviation in the number of wins would be 50%*sqrt(200) ≈ 7 points. You are 36 points away from equality, which is about 5 standard deviations.
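
The arithmetic above can be reproduced with the exact counts from the first run (131 wins, 59 losses, so 190 decisive games). This is a minimal sketch using the normal approximation and ignoring draws; the function name is hypothetical.

```python
import math

def likelihood_of_superiority(wins, losses):
    """Return (z, LOS) from decisive games only, via the normal approximation."""
    decisive = wins + losses
    # Std dev of the win count if both engines were equally strong.
    sigma = 0.5 * math.sqrt(decisive)
    # Distance of the observed win count from equality, in std devs.
    z = (wins - decisive / 2) / sigma
    # Probability mass of the standard normal below z.
    los = 0.5 * (1 + math.erf(z / math.sqrt(2)))
    return z, los

z, los = likelihood_of_superiority(131, 59)
print(f"z = {z:.2f}, LOS = {los:.7f}")
```

With 190 decisive games instead of the rounded 200, the result still comes out at a little over 5 standard deviations.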

Jouni · Post by **Jouni** » Fri Sep 10, 2021 4:19 pm

A third try shows you do need more than 1000 games:

+134=788-78 +19 ELO

mvanthoor · Post by **mvanthoor** » Tue Sep 14, 2021 11:38 pm

Maybe you need more than 1000 games when you're testing 5-10 Elo differences.

When I test my engine, playing more than 1000 games is useless. At 1000 games, the Elo difference between the two versions is basically locked in, and the only thing that happens after that is the error margin getting lower and lower. If the Elo difference is +100 for the dev version, and the error margin is 15 or 20 Elo, I can be quite sure that the dev version is stronger than the latest release.
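
To put a number on the error margin, here is a rough sketch that turns a W/D/L result into an Elo estimate with an approximate 95% interval. It assumes independent games and a normal approximation for the mean score; the function names are hypothetical.

```python
import math

def elo_from_score(score):
    """Convert an average score (0..1) to an Elo difference (logistic model)."""
    return -400 * math.log10(1 / score - 1)

def elo_with_margin(wins, draws, losses, z=1.96):
    """Return (elo, low, high) for an approximate 95% interval."""
    n = wins + draws + losses
    score = (wins + 0.5 * draws) / n
    # Per-game variance of the result (win = 1, draw = 0.5, loss = 0).
    var = (wins + 0.25 * draws) / n - score ** 2
    se = math.sqrt(var / n)  # standard error of the mean score
    return (elo_from_score(score),
            elo_from_score(score - z * se),
            elo_from_score(score + z * se))

for wdl in [(131, 810, 59), (134, 788, 78)]:
    elo, low, high = elo_with_margin(*wdl)
    print(f"{wdl}: {elo:+.1f} Elo, interval [{low:+.1f}, {high:+.1f}]")
```

For the runs quoted in this thread, with roughly 80% draws, this gives a margin of around ±10 Elo at 1000 games, so the +25 and +19 results have overlapping intervals.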

BrokenKeyboard · Post by **BrokenKeyboard** » Wed Sep 15, 2021 11:09 pm

You could also just use SPRT to let the computer find the gain within a confidence bound that you find acceptable.

Sopel · Post by **Sopel** » Thu Sep 16, 2021 12:08 pm

WOW, SPRT is so clever! Why did no one think of that earlier? Maybe in a few years it will become a standard way of testing engine patches and we will be able to abandon our flawed testing methods. I can already imagine a place where people can submit patches and have SPRT testing performed on some sort of cloud or something to determine whether a patch is good or not, and with no human bias at that!

mvanthoor · Post by **mvanthoor** » Thu Sep 16, 2021 3:39 pm

BrokenKeyboard wrote: ↑Wed Sep 15, 2021 11:09 pm You could also just use SPRT to let the computer find the gain within a confidence bound that you find acceptable.

Yes, that's a quick test to verify self-play Elo, but if you want to have a 'real' estimate of the actual gain in competition, you should play against other engines. In self-play the gain is inflated because the newer engine has functionality or optimizations the other doesn't, so it'll use that advantage over and over again. (Even so, if the expected gain is very small, and the confidence needs to be very high, you'd need thousands and thousands of games to actually prove that the newer engine is stronger than the previous version.)
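
For reference, a simplified SPRT check can be sketched as below. It uses a normal approximation to the log-likelihood ratio on plain win/draw/loss counts, which is not necessarily what any particular testing framework implements. The hypotheses (0 Elo vs. 5 Elo) and the 5% error rates are illustrative assumptions, not values from this thread, and the function names are hypothetical.

```python
import math

def expected_score(elo):
    """Expected score for a given Elo advantage (logistic model)."""
    return 1 / (1 + 10 ** (-elo / 400))

def sprt_llr(wins, draws, losses, elo0, elo1):
    """Approximate log-likelihood ratio of H1 (elo1) against H0 (elo0)."""
    n = wins + draws + losses
    mean = (wins + 0.5 * draws) / n
    var = (wins + 0.25 * draws) / n - mean ** 2  # per-game variance
    s0, s1 = expected_score(elo0), expected_score(elo1)
    return n * (s1 - s0) * (2 * mean - s0 - s1) / (2 * var)

def sprt_bounds(alpha=0.05, beta=0.05):
    """Stopping bounds: accept H0 below the lower, H1 above the upper."""
    return math.log(beta / (1 - alpha)), math.log((1 - beta) / alpha)

lower, upper = sprt_bounds()
llr = sprt_llr(131, 810, 59, elo0=0, elo1=5)
print(f"LLR = {llr:.2f}, bounds = ({lower:.2f}, {upper:.2f})")
```

In a real sequential test the LLR would be recomputed as games come in and the match stopped as soon as it crosses either bound, which is why small expected gains with tight error rates take so many games.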
