SF dev version against SF14. Two successive runs:
+131=810-59 +25 ELO
+132=808-60 +25 ELO
Do You need more than 1000 games really?
-
- Posts: 3283
- Joined: Wed Mar 08, 2006 8:15 pm
-
- Posts: 261
- Joined: Sun Nov 13, 2016 10:37 am
Re: Do You need more than 1000 games really?
Run 1000 games 1000 times and you will see different results.
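A minimal sketch of that experiment, for anyone who wants to see the spread without playing a million real games. The win and draw probabilities are assumed values picked to look roughly like the results in this thread, not measured ones, and the function names are made up for illustration.

```python
import math
import random

def elo_from_score(score):
    # Logistic Elo model: convert a score fraction (0..1) to an Elo difference.
    return -400.0 * math.log10(1.0 / score - 1.0)

def simulate_match(p_win, p_draw, games, rng):
    # Play one match of independent games with fixed win/draw probabilities.
    points = 0.0
    for _ in range(games):
        r = rng.random()
        if r < p_win:
            points += 1.0
        elif r < p_win + p_draw:
            points += 0.5
    return elo_from_score(points / games)

def spread_of_matches(p_win=0.13, p_draw=0.80, games=1000, matches=1000, seed=1):
    # p_win and p_draw are ASSUMED values, chosen only for illustration.
    rng = random.Random(seed)
    elos = [simulate_match(p_win, p_draw, games, rng) for _ in range(matches)]
    mean = sum(elos) / len(elos)
    sd = math.sqrt(sum((e - mean) ** 2 for e in elos) / (len(elos) - 1))
    return min(elos), mean, max(elos), sd

if __name__ == "__main__":
    lo, mean, hi, sd = spread_of_matches()
    print(f"min {lo:+.1f}  mean {mean:+.1f}  max {hi:+.1f}  sd {sd:.1f}")
```

With these assumed probabilities the true difference is about +21 Elo, and the individual 1000-game matches scatter around it by several Elo in either direction.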
Zevra 2 is my chess engine. Binary, source and description here: https://github.com/sovaz1997/Zevra2
Zevra v2.5 is last version of Zevra: https://github.com/sovaz1997/Zevra2/releases
-
- Posts: 710
- Joined: Sat Dec 06, 2014 1:53 pm
Re: Do You need more than 1000 games really?
take any run from fishtest, it is in 200-game chunks
https://tests.stockfishchess.org/tests/ ... d8b78eb5f4
Make cumulative graph for WDL or ELO, 1000 games increment and see deviation.
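A rough sketch of how the cumulative numbers could be computed from per-chunk results before plotting them. The chunk format (a list of win, draw, loss tuples, one per 200-game chunk) and the function name are assumptions for illustration; this is not how fishtest exports its data.

```python
import math

def cumulative_elo(chunks, step=1000):
    # chunks: iterable of (wins, draws, losses), one tuple per chunk of games.
    # Returns (games_played, elo) each time the running total passes a multiple of step.
    points = []
    wins = draws = losses = 0
    next_mark = step
    for cw, cd, cl in chunks:
        wins += cw
        draws += cd
        losses += cl
        games = wins + draws + losses
        if games >= next_mark:
            score = (wins + 0.5 * draws) / games
            points.append((games, -400.0 * math.log10(1.0 / score - 1.0)))
            next_mark = (games // step + 1) * step
    return points

if __name__ == "__main__":
    # Made-up placeholder chunks, NOT real fishtest data.
    example = [(26, 162, 12), (30, 158, 12), (24, 160, 16), (27, 163, 10), (25, 161, 14)]
    for games, elo in cumulative_elo(example):
        print(games, round(elo, 1))
```

Plotting the second column against the first shows how much the estimate still moves from one 1000-game mark to the next.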
-
- Posts: 27790
- Joined: Fri Mar 10, 2006 10:06 am
- Location: Amsterdam
- Full name: H G Muller
Re: Do You need more than 1000 games really?
Need for what? Obviously one 1000-game match is enough here to get a likelihood of superiority very close to 100%. You have about 200 non-draws, and the standard deviation in the number of wins for that would be 50%*sqrt(200) = 7 points. And you are 36 points away from equality. That is about 5 standard deviations.
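The arithmetic above can be written out as a small sketch. It uses the same normal approximation (draws ignored, each decisive game treated as a fair coin flip under the null hypothesis); the function name is just for illustration.

```python
import math

def los(wins, losses):
    # Likelihood of superiority from decisive games only (draws carry no information here).
    decisive = wins + losses
    sd_wins = 0.5 * math.sqrt(decisive)   # standard deviation of the win count if both sides are equal
    lead = (wins - losses) / 2.0          # distance of the win count from equality
    z = lead / sd_wins                    # number of standard deviations
    p = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return z, p

if __name__ == "__main__":
    z, p = los(131, 59)   # first run from the opening post
    print(f"z = {z:.2f}, LOS = {p:.7f}")
```

For the first run (131 wins, 59 losses) this gives a z of about 5.2, in line with the "about 5 standard deviations" estimate.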
-
- Posts: 3283
- Joined: Wed Mar 08, 2006 8:15 pm
Re: Do You need more than 1000 games really?
A third try shows you do need more than 1000 games:
+134=788-78 +19 ELO
Jouni
-
- Posts: 1784
- Joined: Wed Jul 03, 2019 4:42 pm
- Location: Netherlands
- Full name: Marcel Vanthoor
Re: Do You need more than 1000 games really?
Maybe you need more than 1000 games when you're testing 5-10 Elo differences.
When I test my engine, doing more than 1000 games is useless. At 1000 games, the Elo difference between the two versions is basically locked in, and the only thing that happens is the error margin getting lower and lower. If the Elo difference is +100 for the dev version, and the error margin is 15 or 20 Elo, I can be quite sure that the dev version is stronger than the latest release.
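To put numbers on the error margin, here is a minimal sketch that turns a win/draw/loss count into an Elo estimate with an approximate 95% interval. It assumes independent games and a normal approximation, which is a simplification; the function name is made up for illustration.

```python
import math

def elo_with_margin(wins, draws, losses, z=1.96):
    # Returns (elo, elo_low, elo_high) for an approximate 95% interval (z = 1.96).
    games = wins + draws + losses
    score = (wins + 0.5 * draws) / games
    # Per-game variance of the result (1, 0.5 or 0 points).
    var = (wins * (1.0 - score) ** 2
           + draws * (0.5 - score) ** 2
           + losses * (0.0 - score) ** 2) / games
    se = math.sqrt(var / games)           # standard error of the mean score

    def to_elo(s):
        return -400.0 * math.log10(1.0 / s - 1.0)

    return to_elo(score), to_elo(score - z * se), to_elo(score + z * se)

if __name__ == "__main__":
    for result in [(131, 810, 59), (132, 808, 60), (134, 788, 78)]:
        elo, low, high = elo_with_margin(*result)
        print(result, f"{elo:+.1f} [{low:+.1f}, {high:+.1f}]")
```

For the 1000-game runs posted in this thread the margin comes out at roughly 10 Elo either way, so a +25 and a +19 result have overlapping intervals. That margin matters little when the difference is +100, and a lot when it is 5 to 10 Elo.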
-
- Posts: 24
- Joined: Tue Mar 16, 2021 11:11 pm
- Full name: Het Satasiya
Re: Do You need more than 1000 games really?
You could also just use an sprt test to let the computer find the gain within a confidence bound that you find acceptable.
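A rough sketch of what such a test does. It uses a common normal approximation to the log-likelihood ratio and is not necessarily what any particular testing framework implements. The Elo bounds (0 and 5) and the error rates (5% each) are assumed defaults for illustration.

```python
import math

def expected_score(elo):
    # Logistic Elo model: expected score for a given Elo advantage.
    return 1.0 / (1.0 + 10.0 ** (-elo / 400.0))

def sprt_llr(wins, draws, losses, elo0, elo1):
    # Approximate log-likelihood ratio of H1 (elo1) against H0 (elo0).
    games = wins + draws + losses
    if games == 0:
        return 0.0
    score = (wins + 0.5 * draws) / games
    var = (wins + 0.25 * draws) / games - score ** 2   # per-game variance
    if var <= 0.0:
        return 0.0
    s0, s1 = expected_score(elo0), expected_score(elo1)
    return games * (s1 - s0) * (2.0 * score - s0 - s1) / (2.0 * var)

def sprt_decision(wins, draws, losses, elo0=0.0, elo1=5.0, alpha=0.05, beta=0.05):
    # elo0, elo1, alpha and beta are ASSUMED settings, not taken from this thread.
    lower = math.log(beta / (1.0 - alpha))
    upper = math.log((1.0 - beta) / alpha)
    llr = sprt_llr(wins, draws, losses, elo0, elo1)
    if llr >= upper:
        return "accept H1", llr
    if llr <= lower:
        return "accept H0", llr
    return "continue", llr

if __name__ == "__main__":
    print(sprt_decision(131, 810, 59))
```

In practice the decision is checked after every game (or batch of games) and the match stops as soon as one bound is crossed, so clear differences need few games and small ones need many.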
-
- Posts: 389
- Joined: Tue Oct 08, 2019 11:39 pm
- Full name: Tomasz Sobczyk
Re: Do You need more than 1000 games really?
WOW SPRT is so clever! Why did no one think of that earlier? Maybe in a few years it will become a standard way of testing engine patches and we will be able to abandon our flawed testing methods. I can already imagine a place where people can submit patches and which performs SPRT testing on some sort of cloud or something to determine whether a patch is good or not, and with no human bias at that!
dangi12012 wrote:No one wants to touch anything you have posted. That proves you now have negative reputations since everyone knows already you are a forum troll.
Maybe you copied your stockfish commits from someone else too?
I will look into that.
-
- Posts: 1784
- Joined: Wed Jul 03, 2019 4:42 pm
- Location: Netherlands
- Full name: Marcel Vanthoor
Re: Do You need more than 1000 games really?
BrokenKeyboard wrote: ↑Wed Sep 15, 2021 11:09 pm
You could also just use an sprt test to let the computer find the gain within a confidence bound that you find acceptable.
Yes, that's a quick test to verify self-play Elo, but if you want to have a 'real' estimate of the actual gain in competition, you should play against other engines. In self-play the gain is inflated because the newer engine has functionality or optimizations the other doesn't, so it'll use that advantage over and over again. (Even so, if the expected gain is very small, and the confidence needs to be very high, you'd need thousands and thousands of games to actually prove that the newer engine is stronger than the previous version.)