Time and time again, users of this forum will claim to know the value of an engine pairing based on only 100 games, or even 10 games. Those people lack a basic understanding of statistics, and here is a live example for you.
I launched a tuning test on Ethereal's hand-crafted eval using some new ideas. Here were the results initially...
If you ignore the bias introduced by SPRT cutoffs, the patch gains Elo with well over 95% confidence. Multiple standard deviations of confidence. But the people who actually program the engines tend to know a bit better.
Stop playing tiny samples, or you will end up committing garbage or claiming garbage, which is what I would have done here had I not employed SPRT.
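For anyone curious what SPRT actually does here, below is a minimal sketch of a classic Wald sequential probability ratio test on decisive games only. The hypotheses p0/p1, the error rates, and the result stream are all invented for illustration; real frameworks such as fishtest and OpenBench use more elaborate trinomial/pentanomial formulations.

```python
import math

def sprt_bounds(alpha=0.05, beta=0.05):
    """Wald's approximate stopping bounds for type I/II error rates."""
    lower = math.log(beta / (1.0 - alpha))   # cross -> reject the patch
    upper = math.log((1.0 - beta) / alpha)   # cross -> accept the patch
    return lower, upper

def llr_update(llr, won, p0=0.50, p1=0.55):
    """Add one decisive game to the log-likelihood ratio.

    H0: the patch wins decisive games with probability p0 (no gain);
    H1: it wins them with probability p1 (a gain). Both values invented.
    """
    return llr + (math.log(p1 / p0) if won else math.log((1 - p1) / (1 - p0)))

lower, upper = sprt_bounds()
llr = 0.0
# A lucky early streak: 6 wins, 2 losses in 8 decisive games.
for won in [True, True, True, False, True, True, False, True]:
    llr = llr_update(llr, won)
print(f"LLR after 8 games: {llr:+.3f}  (bounds {lower:+.3f} / {upper:+.3f})")
```

The point of the example: a 6-2 start feels convincing, but the LLR (about +0.36) is nowhere near either bound (about ±2.94), so the test keeps playing instead of declaring a gain.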
AndrewGrant wrote: ↑Mon Jan 18, 2021 12:53 pm
The upper result claims that with roughly 98% certainty the change gains Elo.
And then you played some more games, and now with roughly 85% certainty you can conclude the change is not good.
So there is an 85% chance that you've found a 2% corner case. What a discovery.
You seem to have a misconception that the number of games is more important than certainty. The chance of being wrong is the same whenever you have the same relative Elo margin (in sigmas), no matter whether you played 10 games or 10 million games.
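A quick sketch of that point, with made-up samples and the usual normal approximation: two results with (nearly) the same z-score give the same chance of being wrong, even though one comes from a thousand times more games.

```python
import math

def z_and_p(wins, draws, losses):
    """z-score and one-sided P(true strength is not a gain) for a W/D/L
    sample, using the normal approximation and 0.5 points per draw."""
    n = wins + draws + losses
    score = (wins + 0.5 * draws) / n             # mean score per game
    var = (wins + 0.25 * draws) / n - score ** 2 # per-game score variance
    se = math.sqrt(var / n)                      # standard error of the mean
    z = (score - 0.5) / se                       # sigmas above "equal strength"
    p = 0.5 * math.erfc(z / math.sqrt(2))        # chance the gain is illusory
    return z, p

print(z_and_p(7, 0, 3))        # 10 games:     z ~ 1.38, p ~ 0.084
print(z_and_p(5069, 0, 4931))  # 10,000 games: z ~ 1.38, p ~ 0.084 as well
```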
AndrewGrant wrote: ↑Mon Jan 18, 2021 12:53 pm
Trust me, my friend... a strong engine is strong over 10 games or 10,000. A pretty good engine is pretty good over 10 games or 100 games, and a super-strong engine (e.g. Stockfish dev) is "superstrong" over 1 game or 100,000 games... alas, that is our reality, no? Why complain? Ethereal is close to the top... almost...
AndrewGrant wrote: ↑Mon Jan 18, 2021 12:53 pm
I think that the interesting question is whether you are sure that the testing conditions are good.
I can imagine bad testing conditions, when you test X against Y, as one of the following:
1) In part of the games X is slowed down by a significant factor while Y is not, and in part of the games it is the opposite. If you also have statistics about the nodes per second of both engines, you can identify this type of problem: if you see, for example, that Y gets more nodes per second than X in all of games 1-300 while X gets more nodes per second in all of games 301-600, then it is obvious that something is wrong with the testing.
2) You do not play every position with both colors. I believe that, for the same number of games, playing every position with both colors should reduce the error in testing.
Note that I believe that with correct testing the +/- 25.95 after 288 games can be reduced, because that number is based on the assumption that the results of a pair of consecutive games are independent, which does not have to be the case with the best testing.
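As a rough illustration of point 2 and of the independence caveat, here is a small simulation with invented win/draw/loss probabilities. Each opening is played with both colors; the opening's built-in advantage then cancels inside a pair, so an error bar computed per game pair comes out smaller than one that pretends all the games are independent.

```python
import math, random, statistics

random.seed(1)

def play_game(bias):
    """One game from the tested engine's point of view; 'bias' is the
    advantage the opening hands to whichever side the engine is playing."""
    r = random.random()
    if r < 0.33 + bias:
        return 1.0       # win
    if r < 0.66 + bias:
        return 0.5       # draw
    return 0.0           # loss

game_scores, pair_means = [], []
for _ in range(5000):
    bias = random.uniform(-0.3, 0.3)   # how lopsided this opening is (invented)
    a = play_game(bias)                # engine gets the favored side first,
    b = play_game(-bias)               # then the colors are reversed
    game_scores += [a, b]
    pair_means.append((a + b) / 2)

# Error bar pretending all 10,000 games are independent (the assumption
# behind +/- figures such as the 25.95 quoted above):
se_naive = statistics.pstdev(game_scores) / math.sqrt(len(game_scores))

# Error bar treating each opening pair as one observation, which lets the
# opening advantage cancel out between the two colors:
se_pair = statistics.pstdev(pair_means) / math.sqrt(len(pair_means))

print(f"per-game SE (independence assumed): {se_naive:.5f}")
print(f"per-pair  SE (pairing exploited):   {se_pair:.5f}")
```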
Could, in theory, an engine in test get into some sort of "Elo swing"?
With 95% confidence after...
Test 10 games: +50 Elo
Test 100 games: +20 Elo
Test 1,000 games: -15 Elo
Test 10,000 games: -50 Elo
Test 100,000 games: +45 Elo
Test 1,000,000 games: +60 Elo
Test 10,000,000 games: -25 Elo
and so on...
In that case, you would never be certain about the change you just committed, and gaining or losing Elo would depend entirely on the length of the test you're running.
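A quick way to see whether such a swing is statistically plausible: simulate a pairing whose true strength is fixed (all numbers below are invented) and print the running Elo estimate with a 95% interval at those same checkpoints. The intervals shrink roughly as 1/sqrt(n) and keep straddling the true value; estimates wandering as far apart as in the list above would point to broken testing conditions, not to normal sampling noise.

```python
import math, random

random.seed(7)

def elo_from_score(score):
    """Average score in (0, 1) -> Elo difference; clamped to stay finite."""
    score = min(max(score, 1e-3), 1.0 - 1e-3)
    return -400.0 * math.log10(1.0 / score - 1.0)

# Invented pairing: win/draw/loss = 36/33/31 percent, i.e. about +17 Elo.
p_win, p_draw = 0.36, 0.33
checkpoints = {10, 100, 1_000, 10_000, 100_000}

scores = []
for n in range(1, 100_001):
    r = random.random()
    scores.append(1.0 if r < p_win else (0.5 if r < p_win + p_draw else 0.0))
    if n in checkpoints:
        mean = sum(scores) / n
        var = sum((x - mean) ** 2 for x in scores) / n
        se = math.sqrt(var / n)
        print(f"{n:>7} games: {elo_from_score(mean):+7.1f} Elo,  95% CI "
              f"[{elo_from_score(mean - 1.96 * se):+7.1f}, "
              f"{elo_from_score(mean + 1.96 * se):+7.1f}]")
```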
mvanthoor wrote: ↑Wed Jan 20, 2021 6:04 pm
In that case, you would never be certain about the change you just committed, and gaining or losing Elo would depend entirely on the length of the test you're running.
This is basic statistics.
You fix the length of the test beforehand. By definition, with 95% probability, the Elo will be contained in the confidence interval derived from the test (after its completion).
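A tiny Monte-Carlo check of that claim, with invented numbers: fix the number of games in advance, build the usual normal-approximation interval from each completed test, and count how often it contains the true score.

```python
import math, random

random.seed(42)
true_score, n_games, n_tests = 0.52, 1000, 2000
covered = 0

for _ in range(n_tests):
    # Win/loss only to keep the sketch short; draws do not change the idea.
    wins = sum(random.random() < true_score for _ in range(n_games))
    mean = wins / n_games
    se = math.sqrt(mean * (1.0 - mean) / n_games)
    if mean - 1.96 * se <= true_score <= mean + 1.96 * se:
        covered += 1

print(f"coverage: {covered / n_tests:.1%} (should be close to 95%)")
```

It comes out close to 95%, which is all the confidence level ever promised.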