Hi,
While working on Embla, I have multiple branches (apart from trunk) in my svn repo. One that adds this feature, one that adds that feature, and so on. I would like to compare how they play.
I installed cutechess-cli, bayeselo and an opponent which mostly wins against my program (just not all the time).
Now I can indeed run, say, 9 versions for a while and check what the results are. But with 9 versions this is around 3000 games, so at 40 moves in 300 seconds it takes quite a while to finish.
So my question now is: can I add another program to the list halfway through, continue the match, and still get a good Elo calculation? Or should I restart every time?
testing multiple versions & elo calculation
Re: testing multiple versions & elo calculation
Oh and also: replacing a branch with a newer version of that branch. Will then ratings automatically get adjusted if the newer version is better?
- Posts: 175
- Joined: Fri Oct 22, 2010 9:47 pm
- Location: Austria
Re: testing multiple versions & elo calculation
flok wrote:
Hi,
While working on Embla, I have multiple branches (apart from trunk) in my svn repo. One that adds this feature, one that adds that feature, and so on. I would like to compare how they play.
I installed cutechess-cli, bayeselo and an opponent which mostly wins against my program (just not all the time).
Now I can indeed run, say, 9 versions for a while and check what the results are. But with 9 versions this is around 3000 games, so at 40 moves in 300 seconds it takes quite a while to finish.
So my question now is: can I add another program to the list halfway through, continue the match, and still get a good Elo calculation? Or should I restart every time?

I use a similar method. But indeed, it needs a lot of games.
You can retire a branch after a large number of games (i.e. once the error bound is small, say bound < 10), when:
elo + bound < elobest - boundbest
A branch, once released, should never be changed - otherwise all the games it played before become useless, and you will have to clean up the games database, which is a hassle and loses information. Just make a new branch with the desired correction.
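The retirement criterion above can be sketched in a few lines. This is a minimal illustration, assuming bayeselo-style (elo, bound) pairs; the function name and the example numbers are made up, not from the poster's data.

```python
# Sketch: decide whether a branch can be retired, per the criterion
# "elo + bound < elobest - boundbest" once the bound itself is small.

def can_retire(elo, bound, elo_best, bound_best, max_bound=10):
    """A branch is retired once its error bound is small enough and its
    upper confidence limit lies below the best branch's lower limit."""
    return bound < max_bound and (elo + bound) < (elo_best - bound_best)

# Example: best branch at +2 +/- 8, candidate at -40 +/- 9
print(can_retire(-40, 9, 2, 8))   # True: -31 < -6, and 9 < 10
```

Note the two-part test: a branch whose bound is still wide stays in the pool even if its point estimate looks hopeless, because the estimate is not yet trustworthy.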
- Posts: 7220
- Joined: Mon May 27, 2013 10:31 am
Re: testing multiple versions & elo calculation
What is wrong with 120 seconds?
Re: testing multiple versions & elo calculation
nionita wrote:
I use a similar method. But indeed, it needs a lot of games.
You can retire a branch after a large number of games (i.e. once the error bound is small, say bound < 10), when:
elo + bound < elobest - boundbest
A branch, once released, should never be changed - otherwise all the games it played before become useless, and you will have to clean up the games database, which is a hassle and loses information. Just make a new branch with the desired correction.

Ok! Thanks!
Henk wrote:
What is wrong with 120 seconds?

Yeah, or 60 seconds. I took 300 s because that is what we use for quick games.
- Posts: 175
- Joined: Fri Oct 22, 2010 9:47 pm
- Location: Austria
Re: testing multiple versions & elo calculation
By the way, I sometimes have a branch with some improvement over another branch which is not the best one.
Then, if you find out that the improvement gives a better score (either for sure, through score +/- margin, or by still remaining superior after enough games, within a margin of 2 or 3 Elo), you can "transplant" the improvement onto the best branch so far (in git this is a rebase operation; I don't know the equivalent in svn). Of course, under another branch name.
This mostly results in more or less the same Elo improvement, but now on top of the best branch so far. This must be proved by enough games, though.
- Posts: 686
- Joined: Thu Mar 03, 2011 4:57 pm
- Location: Germany
Re: testing multiple versions & elo calculation
For testing, your time control seems very slow. For regression tests I use 7 sec + 0.03 sec increment. Even so, 16,000 games still takes a while, and 16k is usually not enough to push the Elo difference beyond the error bar.
Sooner or later you have to abandon the idea of testing at the TC levels that the games for the rating lists are later played at.
If I have the choice between a few slow games and a lot of fast games, I take the fast games; this seems more reliable.
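The point about error bars can be made concrete. The sketch below uses the standard score-based confidence estimate, not bayeselo's exact method, and the game counts are illustrative, not the poster's data.

```python
import math

# Sketch: how wide the 95% error bar on an Elo difference is after n games.

def elo_diff(score):
    """Elo difference implied by a score fraction (logistic model)."""
    return -400 * math.log10(1 / score - 1)

def elo_error_95(wins, draws, losses):
    """Approximate 95% confidence half-width of the Elo difference."""
    n = wins + draws + losses
    s = (wins + 0.5 * draws) / n
    # per-game variance of the score
    var = (wins * (1 - s) ** 2 + draws * (0.5 - s) ** 2 + losses * s ** 2) / n
    se = math.sqrt(var / n)
    # convert the score error to Elo via the slope of the logistic curve at s
    slope = 400 / (math.log(10) * s * (1 - s))
    return 1.96 * se * slope

# 16,000 games between near-equal engines with ~30% draws:
print(round(elo_error_95(5600, 4800, 5600), 1))  # about +/- 4.5 Elo
```

So even 16k games only resolve differences of several Elo, which is why small improvements need so many fast games.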
Re: testing multiple versions & elo calculation
tpetzke wrote:
For testing, your time control seems very slow. For regression tests I use 7 sec + 0.03 sec increment. Even so, 16,000 games still takes a while, and 16k is usually not enough to push the Elo difference beyond the error bar.
Sooner or later you have to abandon the idea of testing at the TC levels that the games for the rating lists are later played at.
If I have the choice between a few slow games and a lot of fast games, I take the fast games; this seems more reliable.

I could give it a try and compare the results. IIRC I tried it a while ago and the results were rather different. Maybe it's better now that I have removed some critical bugs.
I'll wait first until the results make sense (i.e. until the "+" and "-" values are much smaller):
Code:
Rank Name                          Elo    +    - games score oppo. draws
   1 XboardEngine                  394   49   43   397   93%   -35    7%
   2 Emblatrunk                      2   27   27   397   50%    22   31%
   3 Emblatpt_move                  -5   31   31   299   49%    19   33%
   4 Emblatptmove-pv-qtt           -24   27   27   397   46%    26   34%
   5 Emblatrunk-pv                 -40   27   27   399   43%    28   36%
   6 Emblatrunk-pv-tk              -41   27   27   398   43%    23   32%
   7 Emblar1331_q_prevmove-pv      -49   61   62    78   42%    26   26%
   8 Emblar1331_q_prevmove         -50   61   62    75   43%     9   31%
   9 Emblar1331_q_prevmove_fix-pv  -58   30   30   322   40%    28   34%
  10 Emblatpt_move-pv              -60   27   27   395   39%    32   34%
  11 Embla_tptmove                 -68   55   56    99   40%    19   25%
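A quick way to read such a list: two entries are only meaningfully separated once their error intervals no longer overlap. The sketch below uses the simple interval test, not bayeselo's LOS computation; values are taken from the table above.

```python
# Sketch: check whether two entries of a bayeselo-style rating list are
# separated beyond their error bars.

def separated(elo_a, minus_a, elo_b, plus_b):
    """True if A's lower bound lies above B's upper bound."""
    return (elo_a - minus_a) > (elo_b + plus_b)

# Emblatrunk (2 +27/-27) vs Emblatrunk-pv (-40 +27/-27):
# -25 > -13 is False, so ~400 games still leave them indistinguishable.
print(separated(2, 27, -40, 27))  # False
```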
- Posts: 759
- Joined: Fri Jan 04, 2013 4:55 pm
- Location: Nice
Re: testing multiple versions & elo calculation
Personally, I run two different tests:
- versions vs versions at 5'+1" increment, 30 classical openings played with reversed colors = 60 games
- versions against five engines with a CCRL Elo (40/4 list):
Tscp 1.81 (Elo 1702)
Jars 1.75a (Elo 1813), one of my engines
Fairy-Max (Elo 1950)
Jabba 1.0 (Elo 2035)
Clarabit 1.00 (Elo 2100)
Again, 30 openings with reversed colors = 60 games each = 300
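A match of this shape can be set up with cutechess-cli roughly as follows. This is a sketch, not the poster's actual command: the engine paths, the opening-book file name and the protocol are placeholders to adapt.

```
# 30 openings, each played twice with colors reversed (-repeat),
# against one reference opponent; paths and names are examples.
cutechess-cli \
  -engine cmd=./myengine name=MyEngine-dev \
  -engine cmd=./tscp181 name=Tscp-1.81 \
  -each proto=xboard tc=5:00+1 \
  -openings file=classical30.pgn format=pgn order=sequential \
  -games 2 -rounds 30 -repeat \
  -pgnout gauntlet.pgn
```

Repeating this per reference engine gives the 5 x 60 = 300 games described above.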
- Posts: 4052
- Joined: Thu May 15, 2008 9:57 pm
- Location: Berlin, Germany
- Full name: Sven Schüle
Re: testing multiple versions & elo calculation
flok wrote:
Hi,
While working on Embla, I have multiple branches (apart from trunk) in my svn repo. One that adds this feature, one that adds that feature, and so on. I would like to compare how they play.
I installed cutechess-cli, bayeselo and an opponent which mostly wins against my program (just not all the time).
Now I can indeed run, say, 9 versions for a while and check what the results are. But with 9 versions this is around 3000 games, so at 40 moves in 300 seconds it takes quite a while to finish.
So my question now is: can I add another program to the list halfway through, continue the match, and still get a good Elo calculation? Or should I restart every time?

"Engine A branch B version X.Y" is different from "Engine A branch B version X.Z", so both should get their own Elo ratings. I would therefore maintain a growing pool of played games across all your different branches and versions, and recalculate the rating list from the whole pool (i.e. a set of PGN files) each time you have played a bunch of new games, or whenever you feel the need to do so.

Replaying games of previous versions is not necessary: you only play games with your new versions (against a defined set of opponents, which of course may include previous versions if you like) and add these games to the pool. This incremental approach drastically reduces the number of games you need to run. It requires, though, that you keep your testing conditions (TC, hardware etc.) unchanged. Each time you change the conditions, you open a new game pool, i.e. simply a new directory to store the PGN.
Sometimes it can make sense to remove games played by "obsolete" versions from the pool.
The rating list you have shown does not contain version numbers. To solve the versioning issue your engine should report its own version number to the UI (under WB using "feature myname", under UCI using "id name") so that it appears in the PGN automatically.
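The incremental-pool workflow comes down to appending each batch of new games to one directory and rebuilding the whole list from it. The concatenation step is just `cat pool/*.pgn > all.pgn`; the rebuild is then a bayeselo session along these lines (a sketch; file names are examples):

```
readpgn all.pgn
elo
mm
exactdist
ratings
x
```

Because bayeselo re-fits all ratings from the full PGN pool every time, adding a new engine or version midway is harmless; only the games actually played need to be new.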