Testing and statistics...

Discussion of chess software programming and technical issues.

Moderator: Ras

xr_a_y
Posts: 1872
Joined: Sat Nov 25, 2017 2:28 pm
Location: France

Testing and statistics...

Post by xr_a_y »

We often test our dev version against the last release.
Testing against a much stronger or weaker engine is bad because the win (resp. loss) ratio would be lost in the noise.
Testing against the last release, on the other hand, leads to a huge draw rate, so we sometimes have to use a biased opening book.

Is there a sweet spot in between? Would it be better to test against a -30 Elo and a +30 Elo engine, for instance?

What is the point that gives the most information the fastest?

Any input from statistics guys here?
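A minimal sketch one could play with (assuming a BayesElo-style "drawelo" model for the win/draw/loss probabilities; the drawelo = 200 value is only illustrative): it propagates the per-game score variance through the Elo curve to see how the standard error of the measured Elo behaves at different opponent offsets.

Code: Select all

# Assumptions (mine, illustrative only): BayesElo-style drawelo model, drawelo = 200.
import math

def wdl(d, drawelo=200.0):
    """Win/draw/loss probabilities for a true Elo difference d (us minus them)."""
    w = 1.0 / (1.0 + 10.0 ** ((-d + drawelo) / 400.0))
    l = 1.0 / (1.0 + 10.0 ** ((d + drawelo) / 400.0))
    return w, 1.0 - w - l, l

def elo_stderr_per_game(d, drawelo=200.0):
    """Standard error (in Elo) of a rating estimated from a single game
    against an opponent d Elo weaker (negative d = stronger opponent)."""
    w, dr, l = wdl(d, drawelo)
    s = w + 0.5 * dr                     # expected score
    var = w + 0.25 * dr - s * s          # per-game score variance
    # elo = 400*log10(s/(1-s)); propagate the score error through that curve
    delo_ds = 400.0 / (math.log(10.0) * s * (1.0 - s))
    return math.sqrt(var) * delo_ds

for d in (-200, -100, -30, 0, 30, 100, 200):
    se = elo_stderr_per_game(d)
    print(f"offset {d:+4d} Elo: stderr ~{se:6.1f} Elo per game, "
          f"~{se / math.sqrt(1000):4.1f} Elo after 1000 games")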
Rebel
Posts: 7299
Joined: Thu Aug 18, 2011 12:04 pm
Full name: Ed Schröder

Re: Testing and statistics...

Post by Rebel »

Usually I first do self-play, then create a 5-engine Elo pool of -50 / +50 based on the expected Elo rating from the self-play result.
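A minimal sketch of that idea (my reading of it; the offsets and the 3000 Elo figure are only illustrative), using the plain logistic Elo model for the expected score against each pool member:

Code: Select all

# My reading of the procedure (offsets and numbers are illustrative only):
# take the Elo suggested by self-play and pick five opponents spread
# -50..+50 around it, then look at the expected score against each one.
def expected_score(diff):
    """Plain logistic Elo model: expected score when we are `diff` Elo stronger."""
    return 1.0 / (1.0 + 10.0 ** (-diff / 400.0))

selfplay_estimate = 3000                  # Elo suggested by the self-play result
pool = [selfplay_estimate + off for off in (-50, -25, 0, 25, 50)]

for opponent in pool:
    diff = selfplay_estimate - opponent
    print(f"opponent at {opponent}: expected score {expected_score(diff):.3f}")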
90% of coding is debugging, the other 10% is writing bugs.
Frank Quisinsky
Posts: 6888
Joined: Wed Nov 18, 2009 7:16 pm
Location: Gutweiler, Germany
Full name: Frank Quisinsky

Re: Testing and statistics...

Post by Frank Quisinsky »

Hi there,

I did some experiments while the SWCR and FCP rating lists were still running, around the years 2010 to 2015.

- most interesting is many games per match-up (nicer for looking at the direct comparison between engines we like).
- in a group of engines, the engines should be within about 200 Elo of each other (example ... best engine = 2700, weakest = 2500)

That is impossible for bigger tournaments with the many engines I like to play.

But from my stats another view is more interesting when the goal is:
"the greatest possible success (stable ratings) with the smallest effort (quantity of games)".

And here the quantity of games need not be 100, 200, 500 for each match-up.
Much more important: more opponents!

Around 25 is perfect according to my statistical experiments.

Second important point:
Very clearly ...
With more opponents, a lower quantity of games is needed.
With higher time controls, a lower quantity of games is needed.

With 40/05 I need around 2000 games with more than 25 opponents.
With 40/10 I need around 1400 games with more than 25 opponents.
With 40/20 I need around 1200 games with more than 25 opponents.

The differences with the same quantity of games and 20-25 opponents are low.
Very, very small differences with the same quantity of games and more than 25 opponents
... for stable ratings!

Basically, I build my ratings today from those older experiments.
The first ones I made with the Winboard rating list around the year 2000 (before CEGT, CCRL) ... see the EloStat readme.
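A rough simulation sketch of that trade-off (assuming a logistic model with a drawelo parameter, plus a random per-pairing bias of 30 Elo standing in for non-transitive style effects; all numbers are only illustrative): the per-pairing bias is exactly what a larger opponent pool averages out, while the plain game noise depends only on the total number of games.

Code: Select all

# Assumptions (mine, purely illustrative): logistic-with-drawelo W/D/L model,
# plus a random per-pairing bias (sd 30 Elo) standing in for non-transitive
# style effects; that bias is what a larger opponent pool averages out.
import math, random

def wdl(d, drawelo=200.0):
    w = 1.0 / (1.0 + 10.0 ** ((-d + drawelo) / 400.0))
    l = 1.0 / (1.0 + 10.0 ** ((d + drawelo) / 400.0))
    return w, 1.0 - w - l, l

def rating_spread(n_opponents, total_games, bias_sd=30.0, trials=500):
    """Standard deviation of the measured performance rating over many repeats."""
    games_each = total_games // n_opponents
    ratings = []
    for _ in range(trials):
        score = 0.0
        for _ in range(n_opponents):
            w, dr, l = wdl(random.gauss(0.0, bias_sd))   # true edge vs this opponent
            for _ in range(games_each):
                r = random.random()
                score += 1.0 if r < w else (0.5 if r < w + dr else 0.0)
        s = min(max(score / (games_each * n_opponents), 1e-6), 1.0 - 1e-6)
        ratings.append(400.0 * math.log10(s / (1.0 - s)))  # Elo vs. the pool average
    mean = sum(ratings) / len(ratings)
    return math.sqrt(sum((x - mean) ** 2 for x in ratings) / len(ratings))

for k in (5, 10, 25, 40):
    print(f"{k:2d} opponents, 2000 games total: rating sd ~ {rating_spread(k, 2000):.1f} Elo")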

Best
Frank
xr_a_y
Posts: 1872
Joined: Sat Nov 25, 2017 2:28 pm
Location: France

Re: Testing and statistics...

Post by xr_a_y »

Engine A : our base implementation
Engine B : another engine
Engine C : our dev implementation

Let's say we already know diffAB = (EloA - EloB) with good precision.

At a given TC and book, testing engine C versus engine B, we have something like

Code: Select all

draw_ratio = f(elo, elo_diff)
And also

Code: Select all

win_ratio = g(elo, elo_diff)
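For concreteness, one possible shape for f and g (an assumption on my side, not what any particular tool necessarily uses) is the BayesElo-style model, where a single drawelo parameter stands in for the dependence of the draw rate on absolute strength, TC and book:

Code: Select all

# One possible concrete shape for f and g (assumption: a BayesElo-style model;
# in reality the draw rate also depends on absolute strength, TC and book,
# which is folded here into the single drawelo parameter).
def win_ratio(elo_diff, drawelo=200.0):
    return 1.0 / (1.0 + 10.0 ** ((-elo_diff + drawelo) / 400.0))

def loss_ratio(elo_diff, drawelo=200.0):
    return 1.0 / (1.0 + 10.0 ** ((elo_diff + drawelo) / 400.0))

def draw_ratio(elo_diff, drawelo=200.0):
    return 1.0 - win_ratio(elo_diff, drawelo) - loss_ratio(elo_diff, drawelo)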
We test a patch, expecting +3 to +5 Elo between A and C.

So that elo_diff is something like diffAB + 3

If we test a patch with sprt(diffAB, diffAB + 3, 0.05, 0.05), what are the W/D/L values that give the answer the fastest (in the fewest games)?

And how do f, g and this number of games relate? Where is the optimal point in terms of diffAB in order to minimise the computational effort?

Can we do the math analytically? Can it be simulated somehow?
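One way to explore the simulation route (a minimal sketch, assuming the drawelo-style model above for the true W/D/L probabilities and a plain trinomial SPRT whose per-game LLR increments come from the same model at elo0 = diffAB and elo1 = diffAB + 3; this is an illustration, not any framework's actual implementation):

Code: Select all

# Rough SPRT run-length simulation (assumptions: drawelo model for the true
# W/D/L probabilities, plain trinomial SPRT, alpha = beta = 0.05; slow in
# pure Python, numbers are only indicative).
import math, random

def wdl(d, drawelo=200.0):
    w = 1.0 / (1.0 + 10.0 ** ((-d + drawelo) / 400.0))
    l = 1.0 / (1.0 + 10.0 ** ((d + drawelo) / 400.0))
    return w, 1.0 - w - l, l

def sprt_games(true_diff, elo0, elo1, alpha=0.05, beta=0.05, drawelo=200.0):
    """Games played until this simulated SPRT run accepts H0 or H1."""
    upper = math.log((1.0 - beta) / alpha)    # accept H1
    lower = math.log(beta / (1.0 - alpha))    # accept H0
    p_true, p0, p1 = wdl(true_diff, drawelo), wdl(elo0, drawelo), wdl(elo1, drawelo)
    llr, games = 0.0, 0
    while lower < llr < upper and games < 200000:
        r, games = random.random(), games + 1
        i = 0 if r < p_true[0] else (1 if r < p_true[0] + p_true[1] else 2)
        llr += math.log(p1[i] / p0[i])        # per-game log-likelihood ratio
    return games

# Average run length when the patch really is worth +3 Elo, for several diffAB.
for diffAB in (-100, -30, 0, 30, 100):
    runs = [sprt_games(diffAB + 3, diffAB, diffAB + 3) for _ in range(50)]
    print(f"diffAB = {diffAB:+4d}: ~{sum(runs) / len(runs):7.0f} games on average")

For the analytic side, Wald's approximation says the expected number of games is roughly the log boundary divided by the per-game drift of the LLR, i.e. the KL divergence between the W/D/L distributions at elo1 and elo0, so f and g enter exactly through that divergence.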