Testing and statistics...

Discussion of chess software programming and technical issues.

Moderator: Ras

xr_a_y
Posts: 1872
Joined: Sat Nov 25, 2017 2:28 pm
Location: France

Testing and statistics...

Post by xr_a_y »

We often test our dev version against the last release.
Testing against a much stronger or weaker engine is bad because the win (resp. loss) ratio would be lost in the noise.
Testing against the last release, on the other hand, leads to a huge draw rate, so we sometimes have to use a biased opening book.

Is there a sweet spot in between? Would it be better to test against a -30 Elo and a +30 Elo engine, for instance?

What is the point that gives the most information the fastest?

Any input from statistics guys here?
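A minimal sketch one could play with (assuming a BayesElo-style "drawelo" model for the win/draw/loss probabilities; the drawelo = 200 value is only illustrative): it propagates the per-game score variance through the Elo curve to see how the standard error of the measured Elo behaves at different opponent offsets.

Code: Select all

# Assumptions (mine, illustrative only): BayesElo-style drawelo model, drawelo = 200.
import math

def wdl(d, drawelo=200.0):
    """Win/draw/loss probabilities for a true Elo difference d (us minus them)."""
    w = 1.0 / (1.0 + 10.0 ** ((-d + drawelo) / 400.0))
    l = 1.0 / (1.0 + 10.0 ** ((d + drawelo) / 400.0))
    return w, 1.0 - w - l, l

def elo_stderr_per_game(d, drawelo=200.0):
    """Standard error (in Elo) of a rating estimated from a single game
    against an opponent d Elo weaker (negative d = stronger opponent)."""
    w, dr, l = wdl(d, drawelo)
    s = w + 0.5 * dr                     # expected score
    var = w + 0.25 * dr - s * s          # per-game score variance
    # elo = 400*log10(s/(1-s)); propagate the score error through that curve
    delo_ds = 400.0 / (math.log(10.0) * s * (1.0 - s))
    return math.sqrt(var) * delo_ds

for d in (-200, -100, -30, 0, 30, 100, 200):
    se = elo_stderr_per_game(d)
    print(f"offset {d:+4d} Elo: stderr ~{se:6.1f} Elo per game, "
          f"~{se / math.sqrt(1000):4.1f} Elo after 1000 games")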
Rebel
Posts: 7299
Joined: Thu Aug 18, 2011 12:04 pm
Full name: Ed Schröder

Re: Testing and statistics...

Post by Rebel »

Usually I first do self-play, then create a 5-engine Elo pool of -50 / +50 based on the expected Elo rating from the self-play result.
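A minimal sketch of that idea (my reading of it; the offsets and the 3000 Elo figure are only illustrative), using the plain logistic Elo model for the expected score against each pool member:

Code: Select all

# My reading of the procedure (offsets and numbers are illustrative only):
# take the Elo suggested by self-play and pick five opponents spread
# -50..+50 around it, then look at the expected score against each one.
def expected_score(diff):
    """Plain logistic Elo model: expected score when we are `diff` Elo stronger."""
    return 1.0 / (1.0 + 10.0 ** (-diff / 400.0))

selfplay_estimate = 3000                  # Elo suggested by the self-play result
pool = [selfplay_estimate + off for off in (-50, -25, 0, 25, 50)]

for opponent in pool:
    diff = selfplay_estimate - opponent
    print(f"opponent at {opponent}: expected score {expected_score(diff):.3f}")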
90% of coding is debugging, the other 10% is writing bugs.
Frank Quisinsky
Posts: 6888
Joined: Wed Nov 18, 2009 7:16 pm
Location: Gutweiler, Germany
Full name: Frank Quisinsky

Re: Testing and statistics...

Post by Frank Quisinsky »

Hi there,

I did some experiments while the SWCR and FCP rating lists were still running, around the years 2010 to 2015.

- most interesting is many games per match-up (nicer for looking at the direct comparison between engines we like).
- in a group of engines, the engines should be within about 200 Elo of each other (example ... best engine = 2700, weakest = 2500)

That is impossible for bigger tournaments with the many engines I like to play.

But from my stats another view is more interesting when the goal is:
"the greatest possible success (stable ratings) with the smallest effort (quantity of games)".

And here the quantity of games need not be 100, 200, 500 for each match-up.
Much more important: more opponents!

Around 25 is perfect according to my statistical experiments.

Second important point:
Very clearly ...
With more opponents, a lower quantity of games is needed.
With higher time controls, a lower quantity of games is needed.

With 40/05 I need around 2000 games with more than 25 opponents.
With 40/10 I need around 1400 games with more than 25 opponents.
With 40/20 I need around 1200 games with more than 25 opponents.

The differences with the same quantity of games and 20-25 opponents are low.
Very, very small differences with the same quantity of games and more than 25 opponents
... for stable ratings!

Basically, I build my ratings today from those older experiments.
The first ones I made with the Winboard rating list around the year 2000 (before CEGT, CCRL) ... see the EloStat readme.
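A rough simulation sketch of that trade-off (assuming a logistic model with a drawelo parameter, plus a random per-pairing bias of 30 Elo standing in for non-transitive style effects; all numbers are only illustrative): the per-pairing bias is exactly what a larger opponent pool averages out, while the plain game noise depends only on the total number of games.

Code: Select all

# Assumptions (mine, purely illustrative): logistic-with-drawelo W/D/L model,
# plus a random per-pairing bias (sd 30 Elo) standing in for non-transitive
# style effects; that bias is what a larger opponent pool averages out.
import math, random

def wdl(d, drawelo=200.0):
    w = 1.0 / (1.0 + 10.0 ** ((-d + drawelo) / 400.0))
    l = 1.0 / (1.0 + 10.0 ** ((d + drawelo) / 400.0))
    return w, 1.0 - w - l, l

def rating_spread(n_opponents, total_games, bias_sd=30.0, trials=500):
    """Standard deviation of the measured performance rating over many repeats."""
    games_each = total_games // n_opponents
    ratings = []
    for _ in range(trials):
        score = 0.0
        for _ in range(n_opponents):
            w, dr, l = wdl(random.gauss(0.0, bias_sd))   # true edge vs this opponent
            for _ in range(games_each):
                r = random.random()
                score += 1.0 if r < w else (0.5 if r < w + dr else 0.0)
        s = min(max(score / (games_each * n_opponents), 1e-6), 1.0 - 1e-6)
        ratings.append(400.0 * math.log10(s / (1.0 - s)))  # Elo vs. the pool average
    mean = sum(ratings) / len(ratings)
    return math.sqrt(sum((x - mean) ** 2 for x in ratings) / len(ratings))

for k in (5, 10, 25, 40):
    print(f"{k:2d} opponents, 2000 games total: rating sd ~ {rating_spread(k, 2000):.1f} Elo")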

Best
Frank
xr_a_y
Posts: 1872
Joined: Sat Nov 25, 2017 2:28 pm
Location: France

Re: Testing and statistics...

Post by xr_a_y »

Engine A : our base implementation
Engine B : another engine
Engine C : our dev implementation

Let's say we already know diffAB = (EloA - EloB) with good precision.

At a given TC and book, testing engine C versus engine B, we have something like

Code: Select all

draw_ratio = f(elo, elo_diff)
And also

Code: Select all

win_ratio = g(elo, elo_diff)
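For concreteness, one possible shape for f and g (an assumption on my side, not what any particular tool necessarily uses) is the BayesElo-style model, where a single drawelo parameter stands in for the dependence of the draw rate on absolute strength, TC and book:

Code: Select all

# One possible concrete shape for f and g (assumption: a BayesElo-style model;
# in reality the draw rate also depends on absolute strength, TC and book,
# which is folded here into the single drawelo parameter).
def win_ratio(elo_diff, drawelo=200.0):
    return 1.0 / (1.0 + 10.0 ** ((-elo_diff + drawelo) / 400.0))

def loss_ratio(elo_diff, drawelo=200.0):
    return 1.0 / (1.0 + 10.0 ** ((elo_diff + drawelo) / 400.0))

def draw_ratio(elo_diff, drawelo=200.0):
    return 1.0 - win_ratio(elo_diff, drawelo) - loss_ratio(elo_diff, drawelo)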
We test a patch, expecting +3 to +5 Elo between A and C.

So that elo_diff is something like diffAB + 3

If we test a patch with sprt(diffAB, diffAB + 3, 0.05, 0.05), what are the W/D/L values that give the answer the fastest (in the fewest games)?

And how do f, g and this number of games relate? Where is the optimal point in terms of diffAB in order to minimise the computational effort?

Can we do the math analytically? Can it be simulated somehow?
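One way to explore the simulation route (a minimal sketch, assuming the drawelo-style model above for the true W/D/L probabilities and a plain trinomial SPRT whose per-game LLR increments come from the same model at elo0 = diffAB and elo1 = diffAB + 3; this is an illustration, not any framework's actual implementation):

Code: Select all

# Rough SPRT run-length simulation (assumptions: drawelo model for the true
# W/D/L probabilities, plain trinomial SPRT, alpha = beta = 0.05; slow in
# pure Python, numbers are only indicative).
import math, random

def wdl(d, drawelo=200.0):
    w = 1.0 / (1.0 + 10.0 ** ((-d + drawelo) / 400.0))
    l = 1.0 / (1.0 + 10.0 ** ((d + drawelo) / 400.0))
    return w, 1.0 - w - l, l

def sprt_games(true_diff, elo0, elo1, alpha=0.05, beta=0.05, drawelo=200.0):
    """Games played until this simulated SPRT run accepts H0 or H1."""
    upper = math.log((1.0 - beta) / alpha)    # accept H1
    lower = math.log(beta / (1.0 - alpha))    # accept H0
    p_true, p0, p1 = wdl(true_diff, drawelo), wdl(elo0, drawelo), wdl(elo1, drawelo)
    llr, games = 0.0, 0
    while lower < llr < upper and games < 200000:
        r, games = random.random(), games + 1
        i = 0 if r < p_true[0] else (1 if r < p_true[0] + p_true[1] else 2)
        llr += math.log(p1[i] / p0[i])        # per-game log-likelihood ratio
    return games

# Average run length when the patch really is worth +3 Elo, for several diffAB.
for diffAB in (-100, -30, 0, 30, 100):
    runs = [sprt_games(diffAB + 3, diffAB, diffAB + 3) for _ in range(50)]
    print(f"diffAB = {diffAB:+4d}: ~{sum(runs) / len(runs):7.0f} games on average")

For the analytic side, Wald's approximation says the expected number of games is roughly the log boundary divided by the per-game drift of the LLR, i.e. the KL divergence between the W/D/L distributions at elo1 and elo0, so f and g enter exactly through that divergence.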