Why you play many games when testing

Discussion of anything and everything relating to chess playing software and machines.

Moderators: hgm, Rebel, chrisw

AndrewGrant
Posts: 1754
Joined: Tue Apr 19, 2016 6:08 am
Location: U.S.A
Full name: Andrew Grant

Why you play many games when testing

Post by AndrewGrant »

Time and time again, users of this forum will claim to know the value of an engine pairing based on only 100 games, or even 10 games. Those people lack a basic understanding of statistics, and here is a live example for you.

I launched a tuning test on Ethereal's hand crafted eval using some new ideas. Here were the results initially...

Code: Select all

ELO   | 42.43 +- 25.95 (95%)
SPRT  | 10.0+0.1s Threads=1 Hash=8MB
LLR   | 0.63 (-2.94, 2.94) [0.00, 4.00]
Games | N: 288 W: 77 L: 42 D: 169
If you ignore the bias introduced by SPRT cutoffs, the patch gains Elo with well over 95% confidence. Multiple standard deviations of confidence. But the people who actually program the engines tend to know a bit better.
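The ±25.95 error bar above can be reproduced from the raw W/L/D counts alone. Here is a minimal sketch, assuming the logistic Elo model and independent games; the function names are illustrative, not OpenBench's actual code:

```python
import math

def elo_from_score(s):
    # Logistic Elo model: s = 1 / (1 + 10^(-elo/400)), inverted for elo.
    return -400.0 * math.log10(1.0 / s - 1.0)

def elo_with_error(wins, losses, draws, z=1.96):
    """Point estimate and ~95% interval for Elo from a W/L/D record,
    assuming independent games (no pairing correction)."""
    n = wins + losses + draws
    s = (wins + 0.5 * draws) / n                  # mean score per game
    # Per-game score variance around the mean (win=1, draw=0.5, loss=0).
    var = (wins * (1 - s) ** 2 + draws * (0.5 - s) ** 2 + losses * s ** 2) / n
    stderr = math.sqrt(var / n)                   # standard error of the mean
    lo, hi = s - z * stderr, s + z * stderr
    elo = elo_from_score(s)
    return elo, elo_from_score(hi) - elo, elo - elo_from_score(lo)

# The first snapshot from the post: 288 games, +77 -42 =169.
elo, plus, minus = elo_with_error(77, 42, 169)
```

Running this on the 288-game snapshot reproduces the +42.43 estimate and an error bar of roughly 26 Elo (the interval is slightly asymmetric in Elo space; testers usually report a symmetrized value).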

Here are the results now, as I type this...

Code: Select all

ELO   | -4.22 +- 6.11 (95%)
SPRT  | 10.0+0.1s Threads=1 Hash=8MB
LLR   | -1.37 (-2.94, 2.94) [0.00, 4.00]
Games | N: 4688 W: 858 L: 915 D: 2915
http://chess.grantnet.us/test/9548/

Stop playing tiny samples, or you will commit garbage or claim garbage, which is what I would have done here if I did not employ SPRT.
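The LLR line in those reports comes from a sequential probability ratio test. Below is a simplified trinomial sketch of the idea (H1: the patch is worth elo1, H0: it is worth elo0, draw rate taken from the data). Real frameworks such as fishtest and OpenBench use refinements of this, so the sketch will not reproduce the reported 0.63 and -1.37 exactly, but it does show the sign flipping as the sample grows:

```python
import math

def sprt_llr(wins, losses, draws, elo0=0.0, elo1=4.0):
    """Approximate log-likelihood ratio for H1: elo=elo1 vs H0: elo=elo0,
    under a trinomial model with the empirical draw rate. A crude sketch,
    not the exact formula used by fishtest/OpenBench."""
    n = wins + losses + draws
    if wins == 0 or losses == 0:
        return 0.0  # degenerate record; a real implementation handles this better
    def probs(elo):
        s = 1.0 / (1.0 + 10.0 ** (-elo / 400.0))  # expected score at this Elo
        d = draws / n                              # empirical draw rate
        return s - d / 2.0, 1.0 - s - d / 2.0, d   # P(win), P(loss), P(draw)
    w0, l0, d0 = probs(elo0)
    w1, l1, d1 = probs(elo1)
    return (wins * math.log(w1 / w0)
            + losses * math.log(l1 / l0)
            + draws * math.log(d1 / d0))
```

The test stops when the LLR crosses the lower or upper bound (here -2.94 / 2.94, corresponding to 5% error rates); at 288 games it had crossed neither, which is exactly why stopping early would have been premature.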
#WeAreAllDraude #JusticeForDraude #RememberDraude #LeptirBigUltra
"Those who can't do, clone instead" - Eduard ( A real life friend, not this forum's Eduard )
Milos
Posts: 4190
Joined: Wed Nov 25, 2009 1:47 am

Re: Why you play many games when testing

Post by Milos »

AndrewGrant wrote: Mon Jan 18, 2021 12:53 pm Time and time again, users of this forum will claim to know the value of an engine pairing based on only 100 games, or even 10 games. [...]
The upper result claims that with roughly 98% certainty the change gains Elo.
And then you played some more games, and now with roughly 85% certainty you can conclude the change is not good.
So there is an 85% chance that you've found a 2% corner case. What a discovery. :D
You seem to have the misconception that the number of games is more important than certainty. The chance of being wrong is the same when you have the same relative Elo margin (in sigmas), no matter whether you played 10 games or 10 million games.
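The sigma point can be checked numerically: at a fixed z-score cutoff, the false-positive rate does not depend on the number of games. A toy Monte Carlo sketch, assuming two equal engines and a 50% draw rate (both assumptions, not data from the thread):

```python
import math
import random

def false_positive_rate(n_games, trials=2000, z_cut=1.96, seed=1):
    """With a true Elo difference of 0, the chance that the sample z-score
    exceeds the cutoff is about 5% two-sided, whether each trial is 100
    games or 2000 games. Outcome model: win/loss each 25%, draw 50%."""
    rng = random.Random(seed)
    var = 0.125   # per-game score variance: E[x^2] - 0.5^2 = 0.375 - 0.25
    hits = 0
    for _ in range(trials):
        mean = sum(rng.choice((1.0, 0.0, 0.5, 0.5)) for _ in range(n_games)) / n_games
        z = (mean - 0.5) / math.sqrt(var / n_games)
        hits += abs(z) > z_cut
    return hits / trials
```

What more games buy you is not a lower error rate at a fixed sigma cutoff, but a narrower interval, so smaller real differences become distinguishable from noise.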
supersharp77
Posts: 1242
Joined: Sat Jul 05, 2014 7:54 am
Location: Southwest USA

Re: Why you play many games when testing

Post by supersharp77 »

AndrewGrant wrote: Mon Jan 18, 2021 12:53 pm Time and time again, users of this forum will claim to know the value of an engine pairing based on only 100 games, or even 10 games. [...]
Trust me my friend....a strong engine is strong...10 games or 10,000....A pretty good engine is pretty good...10 games or 100 games and A superstrong engine (ex..Stockfish Dev..etc) is "Superstrong" 1 game or 100,000 games...alas..that is our reality..No? ...Why complain?..Ethereal is close to the top...Almost... :) :wink:
AndrewGrant
Posts: 1754
Joined: Tue Apr 19, 2016 6:08 am
Location: U.S.A
Full name: Andrew Grant

Re: Why you play many games when testing

Post by AndrewGrant »

Further evidence of the ignorance of talkchess
#WeAreAllDraude #JusticeForDraude #RememberDraude #LeptirBigUltra
"Those who can't do, clone instead" - Eduard ( A real life friend, not this forum's Eduard )
Uri Blass
Posts: 10282
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: Why you play many games when testing

Post by Uri Blass »

AndrewGrant wrote: Mon Jan 18, 2021 12:53 pm Time and time again, users of this forum will claim to know the value of an engine pairing based on only 100 games, or even 10 games. [...]
I think the interesting question is whether you are sure that the testing conditions are good.
I can imagine bad testing conditions when you test X against Y, such as the following:
1) In part of the games X is slowed down by a significant factor while Y is not, and in part of the games it is the opposite.
If you also have statistics on the nodes per second of both engines, you can identify this type of problem. If you see, for example, that Y gets more nodes per second than X in all of games 1-300, while X gets more nodes per second in all of games 301-600, then it is obvious that something is wrong with the testing.
2) You do not play every position with both colors.
I believe that, for the same number of games, playing every position with both colors should reduce the error in testing.

Note that I believe that with correct testing the +- 25.95 after 288 games can be reduced, because I believe this number is based on the assumption that the results of a pair of consecutive games are independent, which does not have to be the case with the best testing.
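The second point can be illustrated with a toy simulation: when openings carry a color bias, playing each opening with both colors cancels the bias within the pair and shrinks the spread of the match result. This is a simplified model (no draws, uniform opening bias), not engine play:

```python
import random
import statistics

def match_mean(n_pairs, paired, rng, bias=0.45):
    """Mean score for engine A over a match against an equal opponent.
    Each opening gives White an advantage b drawn uniformly from
    [-bias, bias]; 'paired' means both games of a pair reuse the same
    opening with colors swapped, so b cancels within the pair."""
    total = 0
    for _ in range(n_pairs):
        b1 = rng.uniform(-bias, bias)
        b2 = b1 if paired else rng.uniform(-bias, bias)
        total += 1 if rng.random() < 0.5 + b1 else 0   # A plays White
        total += 0 if rng.random() < 0.5 + b2 else 1   # A plays Black
    return total / (2 * n_pairs)

rng = random.Random(42)
paired = [match_mean(200, True, rng) for _ in range(800)]
unpaired = [match_mean(200, False, rng) for _ in range(800)]
```

In this model the per-game variance drops from 0.25 (unpaired) to 0.25 minus the variance of the opening bias (paired), so the paired match results cluster visibly tighter around 50%.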
User avatar
mvanthoor
Posts: 1784
Joined: Wed Jul 03, 2019 4:42 pm
Location: Netherlands
Full name: Marcel Vanthoor

Re: Why you play many games when testing

Post by mvanthoor »

Could, in theory, an engine under test get into some sort of "ELO swing"?

With 95% confidence after...

Test 10 games: +50 ELO
Test 100 games: +20 ELO
Test 1,000 games: -15 ELO
Test 10,000 games: -50 ELO
Test 100,000 games: +45 ELO
Test 1,000,000 games: +60 ELO
Test 10 million games: -25 ELO

and so on...

In that case, you would never be certain about the change you just committed, and gaining or losing ELO is completely dependent on the length of the test you're running.
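Such a sustained swing is essentially impossible if the true strength is fixed: the 95% half-width shrinks like 1/sqrt(N), so the running estimate gets pinned ever closer to the true value, and later intervals sit inside earlier ones. A sketch for two equal engines (true Elo 0, 50% draw rate assumed):

```python
import math
import random

def running_estimates(checkpoints=(100, 1000, 10000, 100000), seed=3):
    """Track the running mean score and its 95% half-width at increasing
    game counts for two equal engines (win/loss 25% each, draw 50%).
    The half-width shrinks like 1/sqrt(N), so a back-and-forth swing of
    many tens of Elo would require repeated multi-sigma accidents."""
    rng = random.Random(seed)
    out, total, n = [], 0.0, 0
    for target in checkpoints:
        while n < target:
            total += rng.choice((1.0, 0.0, 0.5, 0.5))
            n += 1
        mean = total / n
        half = 1.96 * math.sqrt(0.125 / n)   # per-game variance is 0.125 here
        out.append((n, mean, half))
    return out
```

Going from 100 to 100,000 games shrinks the half-width by a factor of sqrt(1000), about 32x, which is why a result that looked like +50 Elo at 10 games can quietly collapse to its true value as the sample grows, but not keep swinging back out again.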
Author of Rustic, an engine written in Rust.
Releases | Code | Docs | Progress | CCRL
Michel
Posts: 2272
Joined: Mon Sep 29, 2008 1:50 am

Re: Why you play many games when testing

Post by Michel »

mvanthoor wrote: Wed Jan 20, 2021 6:04 pm In that case, you would never be certain about the change you just committed, and gaining or losing ELO is completely dependent on the length of the test you're running.
This is basic statistics.

You fix the length of the test beforehand. By definition, with 95% probability, the Elo will be contained in the confidence interval derived from the test (after its completion).
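This can be checked empirically: fix the number of games in advance, build a 95% interval from each completed run, and the true score lands inside roughly 95% of the time. A toy coverage check, assuming a fixed 50% draw rate and a true score of 0.55 (both assumptions for the sake of the demo):

```python
import math
import random
import statistics

def coverage(n_games=1000, trials=1000, true_score=0.55, seed=9):
    """Fraction of fixed-length runs whose 95% confidence interval
    contains the true per-game score. Draw rate fixed at 50%, so
    P(win) = true_score - 0.25 and P(loss) = 0.75 - true_score."""
    rng = random.Random(seed)
    p_win = true_score - 0.25
    inside = 0
    for _ in range(trials):
        results = []
        for _ in range(n_games):
            r = rng.random()
            results.append(0.5 if r < 0.5 else 1.0 if r < 0.5 + p_win else 0.0)
        mean = statistics.fmean(results)
        half = 1.96 * statistics.pstdev(results) / math.sqrt(n_games)
        inside += (mean - half) <= true_score <= (mean + half)
    return inside / trials
```

The crucial caveat, and the reason this thread exists, is that the 95% guarantee holds only for tests whose length was fixed beforehand; peeking at the result and stopping when it looks good inflates the error rate, which is exactly the bias SPRT bounds are designed to control.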
Ideas=science. Simplification=engineering.
Without ideas there is nothing to simplify.