Testing and statistics...
Moderator: Ras
-
- Posts: 1872
- Joined: Sat Nov 25, 2017 2:28 pm
- Location: France

We often test our dev version against the last release.
Testing against a much stronger or weaker engine is bad because the win (resp. loss) ratio would be lost in the noise.
Testing against the last release, on the other hand, leads to a huge draw rate, and we sometimes have to use a biased book.
Is there a best point in the middle? Would it be better to test against a -30 Elo and a +30 Elo engine, for instance?
What is the best point, the one that gives the best information the fastest?
Any input from the statistics guys here?
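To put numbers on "inside the noise": a minimal sketch under the standard logistic Elo model (draws ignored for simplicity), comparing the score a +3 Elo patch buys per game at various opponent gaps against the per-game standard deviation:

Code:
import math

def expected_score(elo_diff):
    # Standard logistic Elo model.
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))

for gap in (0, 30, 100, 200, 400):
    s = expected_score(gap)
    signal = expected_score(gap + 3) - s   # score gained by a +3 Elo patch
    noise = math.sqrt(s * (1.0 - s))       # per-game score std deviation
    print(f"gap {gap:+4d}: score {s:.3f}, signal {signal:.5f}, "
          f"signal/noise {signal / noise:.5f}")

Under this model the per-game signal-to-noise ratio is highest near equal strength and falls off as the gap grows, and the number of games needed scales with its inverse square.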
-
- Posts: 7299
- Joined: Thu Aug 18, 2011 12:04 pm
- Full name: Ed Schröder
Re: Testing and statistics...
Usually I first do self-play, then create a 5-engine Elo pool at -50 / +50 around the expected Elo rating from the self-play result.
90% of coding is debugging, the other 10% is writing bugs.
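A minimal sketch of that two-stage scheme, with illustrative numbers (Ed only states a 5-engine pool at -50 / +50 around the self-play estimate; the even 25 Elo spacing and the ratings are my assumptions):

Code:
def expected_score(elo_diff):
    # Standard logistic Elo model.
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))

selfplay_estimate = 2850  # stage 1: Elo expected from self-play (hypothetical value)
pool = [selfplay_estimate + off for off in (-50, -25, 0, 25, 50)]  # stage 2 pool
for opp in pool:
    print(f"vs {opp}: expected score {expected_score(selfplay_estimate - opp):.3f}")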
-
- Posts: 6889
- Joined: Wed Nov 18, 2009 7:16 pm
- Location: Gutweiler, Germany
- Full name: Frank Quisinsky
Re: Testing and statistics...
Hi there,
I ran some experiments back when the SWCR and FCP rating lists were still running, around 2010 to 2015.
- Most interesting is many games per match constellation (nicer for looking into the direct comparison between engines we like).
- Within a group, the engines should be within about 200 Elo of each other (example: best engine = 2700, weakest = 2500).
That is impossible for the bigger tournaments with many engines that I like to play.
But from my stats, another view is more interesting when the goal is:
"The greatest possible success (stable ratings) with the smallest effort (quantity of games)."
And here the quantity of games does not have to be 100, 200, or 500 for each match constellation.
Much more important: more opponents!
Around 25 is perfect, according to my statistical experiments.
The second important point is very clear:
- With more opponents, a lower quantity of games is needed.
- With higher time controls, a lower quantity of games is needed.
With 40/05 I need around 2,000 games with more than 25 opponents.
With 40/10 I need around 1,400 games with more than 25 opponents.
With 40/20 I need around 1,200 games with more than 25 opponents.
The differences with the same quantity of games and 20-25 opponents are low.
The differences are very, very small with the same quantity of games and more than 25 opponents.
... for stable ratings!
Today I basically try to build my ratings on those older experiments.
The first ones I made with the Winboard rating list around the year 2000 (before CEGT and CCRL) ... see the EloStat readme.
Best
Frank
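Frank's game counts can be sanity-checked with a rough error-bar estimate. A sketch, assuming a 50% draw rate and the usual logistic Elo curve (both are assumptions; fit them to your own data):

Code:
import math

def elo_error_bar(n_games, score=0.50, draw_rate=0.50, z=1.96):
    # Approximate 95% error bar, in Elo, of a rating estimated from n_games,
    # by propagating the per-game score variance through the logistic curve.
    var = score * (1.0 - score) - draw_rate / 4.0   # per-game score variance
    se_score = math.sqrt(var / n_games)
    slope = 400.0 / (math.log(10.0) * score * (1.0 - score))  # d(Elo)/d(score)
    return z * slope * se_score

for n in (100, 500, 1200, 1400, 2000):
    print(f"{n:5d} games -> +/- {elo_error_bar(n):4.1f} Elo")

Under these assumptions, 1,200-2,000 games correspond to error bars of roughly +/- 11-14 Elo.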
-
- Posts: 1872
- Joined: Sat Nov 25, 2017 2:28 pm
- Location: France
Re: Testing and statistics...
Engine A : our base implementation
Engine B : another engine
Engine C : our dev implementation
Let's say we already know diffAB = (EloA - EloB) with good precision.
At a given TC and book, testing engine C versus engine B, we have something like

Code:
draw_ratio = f(elo, elo_diff)

And also

Code:
win_ratio = g(elo, elo_diff)
We test a patch, expecting +3 to +5 Elo between A and C.
So elo_diff is something like diffAB + 3.
If we test a patch with sprt(diffAB, diffAB + 3, 0.05, 0.05), what are the best W/D/L values, i.e. the ones that give the answer the fastest (in the fewest games)?
And how do f, g and this number of games relate? Where is the optimal point in terms of diffAB in order to minimise the computation effort?
Can we do the math analytically? Can it be simulated somehow?
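It can at least be approximated numerically. A sketch, assuming a bayeselo-style draw model for f and g (the draw_elo constant below is a guess and should be fitted from real games at this TC and book) and the standard Wald/normal approximation for the expected SPRT length when the patch is real:

Code:
import math

def wdl_probs(elo_diff, draw_elo=250.0):
    # Bayeselo-style win/draw/loss model; draw_elo is an assumed constant.
    win = 1.0 / (1.0 + 10.0 ** ((-elo_diff + draw_elo) / 400.0))
    loss = 1.0 / (1.0 + 10.0 ** ((elo_diff + draw_elo) / 400.0))
    return win, 1.0 - win - loss, loss

def games_needed(diff_ab, patch_gain=3.0, alpha=0.05, beta=0.05):
    # Expected SPRT length when H1 is true (Wald approximation with a
    # normal model of the per-game score).
    w0, d0, _ = wdl_probs(diff_ab)
    s0 = w0 + 0.5 * d0                        # expected score under H0
    w1, d1, _ = wdl_probs(diff_ab + patch_gain)
    s1 = w1 + 0.5 * d1                        # expected score under H1
    var = s0 * (1.0 - s0) - d0 / 4.0          # per-game score variance
    drift = (s1 - s0) ** 2 / (2.0 * var)      # mean LLR per game under H1
    log_a = math.log((1.0 - beta) / alpha)
    log_b = math.log(beta / (1.0 - alpha))
    return ((1.0 - beta) * log_a + beta * log_b) / drift

# Scan opponent gaps to find the cheapest operating point.
best = min(range(-200, 201, 10), key=games_needed)
for diff in (-100, -50, -30, 0, 30, 50, 100, best):
    print(f"diffAB = {diff:+4d} Elo -> ~{games_needed(diff):,.0f} games")

At least under this draw model, the curve comes out quite flat around diffAB = 0: the draw rate, which sets the per-game variance, matters more than the exact opponent gap.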