Why you play many games when testing

Discussion of anything and everything relating to chess playing software and machines.

Moderators: hgm, Rebel, chrisw

AndrewGrant
Posts: 1754
Joined: Tue Apr 19, 2016 6:08 am
Location: U.S.A
Full name: Andrew Grant

Why you play many games when testing

Post by AndrewGrant »

Time and time again, users of this forum will claim to know the value of an engine pairing based on only 100 games, or even 10 games. Those people lack a basic understanding of statistics, and here is a live example for you.

I launched a tuning test on Ethereal's hand crafted eval using some new ideas. Here were the results initially...

Code: Select all

ELO   | 42.43 +- 25.95 (95%)
SPRT  | 10.0+0.1s Threads=1 Hash=8MB
LLR   | 0.63 (-2.94, 2.94) [0.00, 4.00]
Games | N: 288 W: 77 L: 42 D: 169
If you ignore the bias introduced by SPRT cutoffs, the patch gains Elo with well over 95% confidence. Multiple standard deviations of confidence. But the people who actually program the engines tend to know a bit better.
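The ±25.95 error bar above can be reproduced from the raw W/L/D counts alone. Here is a minimal sketch, assuming the logistic Elo model and independent games; the function names are illustrative, not OpenBench's actual code:

```python
import math

def elo_from_score(s):
    # Logistic Elo model: s = 1 / (1 + 10^(-elo/400)), inverted for elo.
    return -400.0 * math.log10(1.0 / s - 1.0)

def elo_with_error(wins, losses, draws, z=1.96):
    """Point estimate and ~95% interval for Elo from a W/L/D record,
    assuming independent games (no pairing correction)."""
    n = wins + losses + draws
    s = (wins + 0.5 * draws) / n                  # mean score per game
    # Per-game score variance around the mean (win=1, draw=0.5, loss=0).
    var = (wins * (1 - s) ** 2 + draws * (0.5 - s) ** 2 + losses * s ** 2) / n
    stderr = math.sqrt(var / n)                   # standard error of the mean
    lo, hi = s - z * stderr, s + z * stderr
    elo = elo_from_score(s)
    return elo, elo_from_score(hi) - elo, elo - elo_from_score(lo)

# The first snapshot from the post: 288 games, +77 -42 =169.
elo, plus, minus = elo_with_error(77, 42, 169)
```

Running this on the 288-game snapshot reproduces the +42.43 estimate and an error bar of roughly 26 Elo (the interval is slightly asymmetric in Elo space; testers usually report a symmetrized value).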

Here are the results now, as I type this...

Code: Select all

ELO   | -4.22 +- 6.11 (95%)
SPRT  | 10.0+0.1s Threads=1 Hash=8MB
LLR   | -1.37 (-2.94, 2.94) [0.00, 4.00]
Games | N: 4688 W: 858 L: 915 D: 2915
http://chess.grantnet.us/test/9548/

Stop playing tiny samples, or you will commit garbage or claim garbage, which is what I would have done here if I did not employ SPRT.
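The LLR line in those reports comes from a sequential probability ratio test. Below is a simplified trinomial sketch of the idea (H1: the patch is worth elo1, H0: it is worth elo0, draw rate taken from the data). Real frameworks such as fishtest and OpenBench use refinements of this, so the sketch will not reproduce the reported 0.63 and -1.37 exactly, but it does show the sign flipping as the sample grows:

```python
import math

def sprt_llr(wins, losses, draws, elo0=0.0, elo1=4.0):
    """Approximate log-likelihood ratio for H1: elo=elo1 vs H0: elo=elo0,
    under a trinomial model with the empirical draw rate. A crude sketch,
    not the exact formula used by fishtest/OpenBench."""
    n = wins + losses + draws
    if wins == 0 or losses == 0:
        return 0.0  # degenerate record; a real implementation handles this better
    def probs(elo):
        s = 1.0 / (1.0 + 10.0 ** (-elo / 400.0))  # expected score at this Elo
        d = draws / n                              # empirical draw rate
        return s - d / 2.0, 1.0 - s - d / 2.0, d   # P(win), P(loss), P(draw)
    w0, l0, d0 = probs(elo0)
    w1, l1, d1 = probs(elo1)
    return (wins * math.log(w1 / w0)
            + losses * math.log(l1 / l0)
            + draws * math.log(d1 / d0))
```

The test stops when the LLR crosses the lower or upper bound (here -2.94 / 2.94, corresponding to 5% error rates); at 288 games it had crossed neither, which is exactly why stopping early would have been premature.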
#WeAreAllDraude #JusticeForDraude #RememberDraude #LeptirBigUltra
"Those who can't do, clone instead" - Eduard ( A real life friend, not this forum's Eduard )
Milos
Posts: 4190
Joined: Wed Nov 25, 2009 1:47 am

Re: Why you play many games when testing

Post by Milos »

AndrewGrant wrote: Mon Jan 18, 2021 12:53 pm Time and time again, users of this forum will claim to know the value of an engine pairing based on only 100 games, or even 10 games. [...]
The upper result claims that with roughly 98% certainty the change gains Elo.
And then you played some more games, and now with roughly 85% certainty you can conclude the change is not good.
So there is an 85% chance that you've found a 2% corner case. What a discovery. :D
You seem to have the misconception that the number of games is more important than certainty. The chance of being wrong is the same when you have the same relative Elo margin (in sigmas), no matter whether you played 10 games or 10 million games.
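The sigma point can be checked numerically: at a fixed z-score cutoff, the false-positive rate does not depend on the number of games. A toy Monte Carlo sketch, assuming two equal engines and a 50% draw rate (both assumptions, not data from the thread):

```python
import math
import random

def false_positive_rate(n_games, trials=2000, z_cut=1.96, seed=1):
    """With a true Elo difference of 0, the chance that the sample z-score
    exceeds the cutoff is about 5% two-sided, whether each trial is 100
    games or 2000 games. Outcome model: win/loss each 25%, draw 50%."""
    rng = random.Random(seed)
    var = 0.125   # per-game score variance: E[x^2] - 0.5^2 = 0.375 - 0.25
    hits = 0
    for _ in range(trials):
        mean = sum(rng.choice((1.0, 0.0, 0.5, 0.5)) for _ in range(n_games)) / n_games
        z = (mean - 0.5) / math.sqrt(var / n_games)
        hits += abs(z) > z_cut
    return hits / trials
```

What more games buy you is not a lower error rate at a fixed sigma cutoff, but a narrower interval, so smaller real differences become distinguishable from noise.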
supersharp77
Posts: 1242
Joined: Sat Jul 05, 2014 7:54 am
Location: Southwest USA

Re: Why you play many games when testing

Post by supersharp77 »

AndrewGrant wrote: Mon Jan 18, 2021 12:53 pm Time and time again, users of this forum will claim to know the value of an engine pairing based on only 100 games, or even 10 games. [...]
Trust me my friend....a strong engine is strong...10 games or 10,000....A pretty good engine is pretty good...10 games or 100 games and A superstrong engine (ex..Stockfish Dev..etc) is "Superstrong" 1 game or 100,000 games...alas..that is our reality..No? ...Why complain?..Ethereal is close to the top...Almost... :) :wink:
AndrewGrant
Posts: 1754
Joined: Tue Apr 19, 2016 6:08 am
Location: U.S.A
Full name: Andrew Grant

Re: Why you play many games when testing

Post by AndrewGrant »

Further evidence of the ignorance of talkchess
#WeAreAllDraude #JusticeForDraude #RememberDraude #LeptirBigUltra
"Those who can't do, clone instead" - Eduard ( A real life friend, not this forum's Eduard )
Uri Blass
Posts: 10282
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: Why you play many games when testing

Post by Uri Blass »

AndrewGrant wrote: Mon Jan 18, 2021 12:53 pm Time and time again, users of this forum will claim to know the value of an engine pairing based on only 100 games, or even 10 games. [...]
I think the interesting question is whether you are sure that the testing conditions are good.
I can imagine bad testing conditions when you test X against Y, such as the following:
1) In part of the games X is slowed down by a significant factor while Y is not, and in part of the games it is the opposite.
If you also have statistics on the nodes per second of both engines, you can identify this type of problem. If you see, for example, that Y gets more nodes per second than X in all of games 1-300, while X gets more nodes per second in all of games 301-600, then it is obvious that something is wrong with the testing.
2) You do not play every position with both colors.
I believe that, for the same number of games, playing every position with both colors should reduce the error in testing.

Note that I believe that with correct testing the +- 25.95 after 288 games can be reduced, because I believe this number is based on the assumption that the results of a pair of consecutive games are independent, which does not have to be the case with the best testing.
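The second point can be illustrated with a toy simulation: when openings carry a color bias, playing each opening with both colors cancels the bias within the pair and shrinks the spread of the match result. This is a simplified model (no draws, uniform opening bias), not engine play:

```python
import random
import statistics

def match_mean(n_pairs, paired, rng, bias=0.45):
    """Mean score for engine A over a match against an equal opponent.
    Each opening gives White an advantage b drawn uniformly from
    [-bias, bias]; 'paired' means both games of a pair reuse the same
    opening with colors swapped, so b cancels within the pair."""
    total = 0
    for _ in range(n_pairs):
        b1 = rng.uniform(-bias, bias)
        b2 = b1 if paired else rng.uniform(-bias, bias)
        total += 1 if rng.random() < 0.5 + b1 else 0   # A plays White
        total += 0 if rng.random() < 0.5 + b2 else 1   # A plays Black
    return total / (2 * n_pairs)

rng = random.Random(42)
paired = [match_mean(200, True, rng) for _ in range(800)]
unpaired = [match_mean(200, False, rng) for _ in range(800)]
```

In this model the per-game variance drops from 0.25 (unpaired) to 0.25 minus the variance of the opening bias (paired), so the paired match results cluster visibly tighter around 50%.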
User avatar
mvanthoor
Posts: 1784
Joined: Wed Jul 03, 2019 4:42 pm
Location: Netherlands
Full name: Marcel Vanthoor

Re: Why you play many games when testing

Post by mvanthoor »

Could, in theory, an engine under test get into some sort of "ELO swing"?

With 95% confidence after...

Test 10 games: +50 ELO
Test 100 games: +20 ELO
Test 1,000 games: -15 ELO
Test 10,000 games: -50 ELO
Test 100,000 games: +45 ELO
Test 1,000,000 games: +60 ELO
Test 10 million games: -25 ELO

and so on...

In that case, you would never be certain about the change you just committed, and gaining or losing ELO is completely dependent on the length of the test you're running.
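Such a sustained swing is essentially impossible if the true strength is fixed: the 95% half-width shrinks like 1/sqrt(N), so the running estimate gets pinned ever closer to the true value, and later intervals sit inside earlier ones. A sketch for two equal engines (true Elo 0, 50% draw rate assumed):

```python
import math
import random

def running_estimates(checkpoints=(100, 1000, 10000, 100000), seed=3):
    """Track the running mean score and its 95% half-width at increasing
    game counts for two equal engines (win/loss 25% each, draw 50%).
    The half-width shrinks like 1/sqrt(N), so a back-and-forth swing of
    many tens of Elo would require repeated multi-sigma accidents."""
    rng = random.Random(seed)
    out, total, n = [], 0.0, 0
    for target in checkpoints:
        while n < target:
            total += rng.choice((1.0, 0.0, 0.5, 0.5))
            n += 1
        mean = total / n
        half = 1.96 * math.sqrt(0.125 / n)   # per-game variance is 0.125 here
        out.append((n, mean, half))
    return out
```

Going from 100 to 100,000 games shrinks the half-width by a factor of sqrt(1000), about 32x, which is why a result that looked like +50 Elo at 10 games can quietly collapse to its true value as the sample grows, but not keep swinging back out again.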
Author of Rustic, an engine written in Rust.
Releases | Code | Docs | Progress | CCRL
Michel
Posts: 2272
Joined: Mon Sep 29, 2008 1:50 am

Re: Why you play many games when testing

Post by Michel »

mvanthoor wrote: Wed Jan 20, 2021 6:04 pm In that case, you would never be certain about the change you just committed, and gaining or losing ELO is completely dependent on the length of the test you're running.
This is basic statistics.

You fix the length of the test beforehand. By definition, with 95% probability, the Elo will be contained in the confidence interval derived from the test (after its completion).
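This can be checked empirically: fix the number of games in advance, build a 95% interval from each completed run, and the true score lands inside roughly 95% of the time. A toy coverage check, assuming a fixed 50% draw rate and a true score of 0.55 (both assumptions for the sake of the demo):

```python
import math
import random
import statistics

def coverage(n_games=1000, trials=1000, true_score=0.55, seed=9):
    """Fraction of fixed-length runs whose 95% confidence interval
    contains the true per-game score. Draw rate fixed at 50%, so
    P(win) = true_score - 0.25 and P(loss) = 0.75 - true_score."""
    rng = random.Random(seed)
    p_win = true_score - 0.25
    inside = 0
    for _ in range(trials):
        results = []
        for _ in range(n_games):
            r = rng.random()
            results.append(0.5 if r < 0.5 else 1.0 if r < 0.5 + p_win else 0.0)
        mean = statistics.fmean(results)
        half = 1.96 * statistics.pstdev(results) / math.sqrt(n_games)
        inside += (mean - half) <= true_score <= (mean + half)
    return inside / trials
```

The crucial caveat, and the reason this thread exists, is that the 95% guarantee holds only for tests whose length was fixed beforehand; peeking at the result and stopping when it looks good inflates the error rate, which is exactly the bias SPRT bounds are designed to control.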
Ideas=science. Simplification=engineering.
Without ideas there is nothing to simplify.