testing multiple versions & elo calculation


flok

testing multiple versions & elo calculation

Post by flok »

Hi,

While working on Embla, I have multiple branches (apart from trunk) in my svn repo: one that adds this feature, one that adds that feature, and so on. I would like to compare how they play.
I installed cutechess-cli, bayeselo and an opponent that mostly wins against my program (just not all the time).
Now I can indeed run, say, 9 versions for a while and check what the results are. But with 9 versions that is around 3,000 games, so at 40 moves in 300 seconds it takes quite a while to finish.
So my question is: can I add another program to the list halfway through, continue the match, and still get a good elo calculation? Or should I restart every time?
flok

Re: testing multiple versions & elo calculation

Post by flok »

Oh, and also: suppose I replace a branch with a newer version of that branch. Will the ratings then automatically get adjusted if the newer version is better?
nionita
Posts: 175
Joined: Fri Oct 22, 2010 9:47 pm
Location: Austria

Re: testing multiple versions & elo calculation

Post by nionita »

flok wrote:Hi,

While working on Embla, I have multiple branches (apart from trunk) in my svn repo: one that adds this feature, one that adds that feature, and so on. I would like to compare how they play.
I installed cutechess-cli, bayeselo and an opponent that mostly wins against my program (just not all the time).
Now I can indeed run, say, 9 versions for a while and check what the results are. But with 9 versions that is around 3,000 games, so at 40 moves in 300 seconds it takes quite a while to finish.
So my question is: can I add another program to the list halfway through, continue the match, and still get a good elo calculation? Or should I restart every time?
I use a similar method. But indeed, it needs a lot of games.
You can retire a branch if, after a large enough number of games (e.g. when the error bound drops below 10):

elo + bound < elo_best - bound_best

A branch, once released, should never be changed - otherwise all games it played before become useless and you would have to clean up the games database, which is a hassle and loses information. Just make a new branch with the desired correction.
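As a minimal sketch, the retirement test could look like this in Python; the numbers are invented for illustration:

Code: Select all

import math  # not strictly needed here; kept for symmetry with later sketches

def can_retire(elo, bound, elo_best, bound_best, max_bound=10):
    """True when a branch's upper confidence limit lies below the
    best branch's lower limit, and its own bound is tight enough."""
    return bound < max_bound and elo + bound < elo_best - bound_best

# Hypothetical numbers: best branch at 2 +/- 8, candidate at -40 +/- 9.
print(can_retire(-40, 9, 2, 8))   # True -> safe to retire this branch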
Henk
Posts: 7218
Joined: Mon May 27, 2013 10:31 am

Re: testing multiple versions & elo calculation

Post by Henk »

What is wrong with 120 seconds?
flok

Re: testing multiple versions & elo calculation

Post by flok »

nionita wrote:I use a similar method. But indeed, it needs a lot of games.
You can retire a branch if, after a large enough number of games (e.g. when the error bound drops below 10):

elo + bound < elo_best - bound_best

A branch, once released, should never be changed - otherwise all games it played before become useless and you would have to clean up the games database, which is a hassle and loses information. Just make a new branch with the desired correction.
Ok! Thanks!
Henk wrote:What is wrong with 120 seconds ?
Yeah, or 60 seconds.
I chose 300s because that is what we use for quick games.
nionita
Posts: 175
Joined: Fri Oct 22, 2010 9:47 pm
Location: Austria

Re: testing multiple versions & elo calculation

Post by nionita »

By the way, I sometimes have a branch with some improvement over another branch which is not the best one.

Then, if you find out that the improvement gives a better score (either conclusively, through the scores +/- error margins, or by still remaining superior after enough games, within a margin of 2 or 3 elo), you can "transplant" the improvement onto the best branch so far (in git this is a rebase operation; I don't know the svn equivalent). Of course, under another branch name.

This mostly results in more or less the same elo improvement, but now on top of the best branch so far. It must be proven by enough games, though.
tpetzke
Posts: 686
Joined: Thu Mar 03, 2011 4:57 pm
Location: Germany

Re: testing multiple versions & elo calculation

Post by tpetzke »

For testing, your time control seems very slow. For regression tests I use 7 sec + 0.03 sec increment. Even so, 16,000 games still take a while, and 16k is usually not enough to bring the elo difference beyond the error bar.

Sooner or later you have to abandon the idea of testing at the time controls that the rating-list games are later played at.

If I have the choice between a few slow games and a lot of fast games, I take the fast games; this seems more reliable.
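To illustrate why many fast games beat a few slow ones, here is a rough Python sketch of the error bar on a match result, using the usual logistic elo formula and a normal approximation on the mean score; the win/draw/loss counts are invented:

Code: Select all

import math

def elo_from_score(s):
    """Logistic conversion of a score fraction (0 < s < 1) to an elo difference."""
    return -400.0 * math.log10(1.0 / s - 1.0)

def match_elo(wins, draws, losses, z=1.96):
    """Elo difference and approximate 95% error bar for a match,
    via a normal approximation on the mean score."""
    n = wins + draws + losses
    s = (wins + 0.5 * draws) / n
    # Per-game variance of the score, then standard error of the mean.
    var = (wins * (1 - s) ** 2 + draws * (0.5 - s) ** 2 + losses * s ** 2) / n
    se = math.sqrt(var / n)
    return elo_from_score(s), elo_from_score(s + z * se) - elo_from_score(s)

# Same 55% score: 400 games vs 16,000 games (invented counts).
print(match_elo(190, 60, 150))      # ~ +35 elo, error bar ~ +/- 32
print(match_elo(7600, 2400, 6000))  # ~ +35 elo, error bar ~ +/- 5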
Thomas...

=======
http://macechess.blogspot.com - iCE Chess Engine
flok

Re: testing multiple versions & elo calculation

Post by flok »

tpetzke wrote:For testing, your time control seems very slow. For regression tests I use 7 sec + 0.03 sec increment. Even so, 16,000 games still take a while, and 16k is usually not enough to bring the elo difference beyond the error bar.

Sooner or later you have to abandon the idea of testing at the time controls that the rating-list games are later played at.

If I have the choice between a few slow games and a lot of fast games, I take the fast games; this seems more reliable.
I could give it a try and compare the results. IIRC I tried it a while ago and the results were rather different. Maybe it will work better now that I have removed some critical bugs.

First I'll wait until the results make sense (i.e. until the "+" and "-" values are much smaller):

Code: Select all

Rank Name                           Elo    +    - games score oppo. draws 
   1 XboardEngine                   394   49   43   397   93%   -35    7% 
   2 Emblatrunk                       2   27   27   397   50%    22   31% 
   3 Emblatpt_move                   -5   31   31   299   49%    19   33% 
   4 Emblatptmove-pv-qtt            -24   27   27   397   46%    26   34% 
   5 Emblatrunk-pv                  -40   27   27   399   43%    28   36% 
   6 Emblatrunk-pv-tk               -41   27   27   398   43%    23   32% 
   7 Emblar1331_q_prevmove-pv       -49   61   62    78   42%    26   26% 
   8 Emblar1331_q_prevmove          -50   61   62    75   43%     9   31% 
   9 Emblar1331_q_prevmove_fix-pv   -58   30   30   322   40%    28   34% 
  10 Emblatpt_move-pv               -60   27   27   395   39%    32   34% 
  11 Embla_tptmove                  -68   55   56    99   40%    19   25% 
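For reference, a tournament producing the PGN pool behind a table like this could be launched roughly as follows (a Python sketch around cutechess-cli; the engine names and paths are placeholders, and the flags shown should be checked against your installed cutechess-cli version):

Code: Select all

import subprocess

# Placeholder engine binaries; each branch build gets its own name.
engines = ["Emblatrunk", "Emblatpt_move", "XboardEngine"]

cmd = [
    "cutechess-cli",
    "-each", "proto=uci", "tc=40/300",   # 40 moves in 300 seconds
    "-rounds", "100",                    # round-robin rounds over all engines
    "-pgnout", "pool.pgn",               # games accumulate here for rating
    "-concurrency", "4",
]
for name in engines:
    cmd += ["-engine", f"cmd=./{name}", f"name={name}"]

subprocess.run(cmd, check=True)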
Daniel Anulliero
Posts: 759
Joined: Fri Jan 04, 2013 4:55 pm
Location: Nice

Re: testing multiple versions & elo calculation

Post by Daniel Anulliero »

Personally, I run two different tests:
- versions vs. versions at 5'+1" increment, 30 classical openings played with reversed colors = 60 games
- versions against five engines with a CCRL rating (40/4 list):
Tscp 1.81 (elo 1702)
Jars 1.75a (elo 1813) one of my engines :-)
Fairy-Max (elo 1950)
Jabba 1.0 (elo 2035)
Clarabit 1.00 (elo 2100)
Again 30 openings with reversed colors = 60 games each = 300 games
Sven
Posts: 4052
Joined: Thu May 15, 2008 9:57 pm
Location: Berlin, Germany
Full name: Sven Schüle

Re: testing multiple versions & elo calculation

Post by Sven »

flok wrote:Hi,

While working on Embla, I have multiple branches (apart from trunk) in my svn repo: one that adds this feature, one that adds that feature, and so on. I would like to compare how they play.
I installed cutechess-cli, bayeselo and an opponent that mostly wins against my program (just not all the time).
Now I can indeed run, say, 9 versions for a while and check what the results are. But with 9 versions that is around 3,000 games, so at 40 moves in 300 seconds it takes quite a while to finish.
So my question is: can I add another program to the list halfway through, continue the match, and still get a good elo calculation? Or should I restart every time?
"Engine A branch B version X.Y" is different from "Engine A branch B version X.Z" so both should get their own ELO ratings. I would therefore maintain a growing pool of played games of all your different branches and versions, and recalculate the rating list from the whole pool (i.e. a set of PGN files) each time you have played a bunch of new games, or whenever you feel the need to do so. Replaying games of previous versions is not necessary, you only play games with your new versions (against a defined set of opponents, which of course may involve previous versions if you like) and add these games to the pool. This incremental approach drastically reduces the number of games you need to run. It requires, though, to keep your testing conditions (TC, hardware etc.) unchanged. Each time you change the conditions this means you open a new game pool, simply a new directory to store the PGN.

Sometimes it can make sense to remove games played by "obsolete" versions from the pool.

The rating list you have shown does not contain version numbers. To solve the versioning issue, your engine should report its own version number to the UI (under WB using "feature myname", under UCI using "id name") so that it appears in the PGN automatically.
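A minimal sketch of that recalculation step, assuming the pool is simply a directory of PGN files and that bayeselo's usual interactive commands (readpgn, elo, mm, exactdist, ratings) are available on your build:

Code: Select all

import glob
import subprocess

# Re-read every PGN in the pool and recompute the full rating list.
# New games are simply new files dropped into pool/; nothing is replayed.
script = "".join(f"readpgn {p}\n" for p in sorted(glob.glob("pool/*.pgn")))
script += "elo\nmm\nexactdist\nratings\n"

result = subprocess.run(["bayeselo"], input=script, text=True, capture_output=True)
print(result.stdout)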