Hi,
While working on Embla, I have multiple branches (apart from trunk) in my svn repo. One that adds this feature, one that adds that feature, and so on. I would like to compare how they play.
I installed cutechess-cli, bayeselo and an opponent which mostly wins against my program (just not all the time).
Now I can indeed run, say, 9 versions for a while and check what the results are. But with 9 versions this is around 3000 games, so at 40 moves in 300 seconds it takes quite a while to finish.
So my question now is: can I add another program to the list halfway through, continue the match, and still get a good Elo calculation? Or should I restart every time?
testing multiple versions & elo calculation
Re: testing multiple versions & elo calculation
Oh and also: replacing a branch with a newer version of that branch. Will then ratings automatically get adjusted if the newer version is better?
- Posts: 175
- Joined: Fri Oct 22, 2010 9:47 pm
- Location: Austria
Re: testing multiple versions & elo calculation
flok wrote:
Hi,
While working on Embla, I have multiple branches (apart from trunk) in my svn repo. One that adds this feature, one that adds that feature, and so on. I would like to compare how they play.
I installed cutechess-cli, bayeselo and an opponent which mostly wins against my program (just not all the time).
Now I can indeed run, say, 9 versions for a while and check what the results are. But with 9 versions this is around 3000 games, so at 40 moves in 300 seconds it takes quite a while to finish.
So my question now is: can I add another program to the list halfway through, continue the match, and still get a good Elo calculation? Or should I restart every time?

I use a similar method. But indeed, it needs a lot of games.
You can retire a branch after a large number of games (i.e. once the error bound is small, say bound < 10), when:
elo + bound < elobest - boundbest
A branch, once released, should never be changed - otherwise all the games it played before become useless, and you will have to clean up the games database, which is a hassle and loses information. Just make a new branch with the desired correction.
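The retirement criterion above can be sketched in a few lines. This is a minimal illustration, assuming bayeselo-style (elo, bound) pairs; the function name and the example numbers are made up, not from the poster's data.

```python
# Sketch: decide whether a branch can be retired, per the criterion
# "elo + bound < elobest - boundbest" once the bound itself is small.

def can_retire(elo, bound, elo_best, bound_best, max_bound=10):
    """A branch is retired once its error bound is small enough and its
    upper confidence limit lies below the best branch's lower limit."""
    return bound < max_bound and (elo + bound) < (elo_best - bound_best)

# Example: best branch at +2 +/- 8, candidate at -40 +/- 9
print(can_retire(-40, 9, 2, 8))   # True: -31 < -6, and 9 < 10
```

Note the two-part test: a branch whose bound is still wide stays in the pool even if its point estimate looks hopeless, because the estimate is not yet trustworthy.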
- Posts: 7220
- Joined: Mon May 27, 2013 10:31 am
Re: testing multiple versions & elo calculation
What is wrong with 120 seconds?
Re: testing multiple versions & elo calculation
nionita wrote:
I use a similar method. But indeed, it needs a lot of games.
You can retire a branch after a large number of games (i.e. once the error bound is small, say bound < 10), when:
elo + bound < elobest - boundbest
A branch, once released, should never be changed - otherwise all the games it played before become useless, and you will have to clean up the games database, which is a hassle and loses information. Just make a new branch with the desired correction.

Ok! Thanks!
Henk wrote:
What is wrong with 120 seconds?

Yeah, or 60 seconds. I took 300 s because that is what we use for quick games.
- Posts: 175
- Joined: Fri Oct 22, 2010 9:47 pm
- Location: Austria
Re: testing multiple versions & elo calculation
By the way, I sometimes have a branch with some improvement over another branch which is not the best one.
Then, if you find out that the improvement gives a better score (either for sure, through score +/- margin, or by still remaining superior after enough games, within a margin of 2 or 3 Elo), you can "transplant" the improvement onto the best branch so far (in git this is a rebase operation; I don't know the equivalent in svn). Of course, under another branch name.
This mostly results in more or less the same Elo improvement, but now on top of the best branch so far. This must be proved by enough games, though.
- Posts: 686
- Joined: Thu Mar 03, 2011 4:57 pm
- Location: Germany
Re: testing multiple versions & elo calculation
For testing, your time control seems very slow. For regression tests I use 7 sec + 0.03 sec increment. Even so, 16,000 games still takes a while, and 16k is usually not enough to push the Elo difference beyond the error bar.
Sooner or later you have to abandon the idea of testing at the TC levels that the games for the rating lists are later played at.
If I have the choice between a few slow games and a lot of fast games, I take the fast games; this seems more reliable.
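The point about error bars can be made concrete. The sketch below uses the standard score-based confidence estimate, not bayeselo's exact method, and the game counts are illustrative, not the poster's data.

```python
import math

# Sketch: how wide the 95% error bar on an Elo difference is after n games.

def elo_diff(score):
    """Elo difference implied by a score fraction (logistic model)."""
    return -400 * math.log10(1 / score - 1)

def elo_error_95(wins, draws, losses):
    """Approximate 95% confidence half-width of the Elo difference."""
    n = wins + draws + losses
    s = (wins + 0.5 * draws) / n
    # per-game variance of the score
    var = (wins * (1 - s) ** 2 + draws * (0.5 - s) ** 2 + losses * s ** 2) / n
    se = math.sqrt(var / n)
    # convert the score error to Elo via the slope of the logistic curve at s
    slope = 400 / (math.log(10) * s * (1 - s))
    return 1.96 * se * slope

# 16,000 games between near-equal engines with ~30% draws:
print(round(elo_error_95(5600, 4800, 5600), 1))  # about +/- 4.5 Elo
```

So even 16k games only resolve differences of several Elo, which is why small improvements need so many fast games.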
Re: testing multiple versions & elo calculation
tpetzke wrote:
For testing, your time control seems very slow. For regression tests I use 7 sec + 0.03 sec increment. Even so, 16,000 games still takes a while, and 16k is usually not enough to push the Elo difference beyond the error bar.
Sooner or later you have to abandon the idea of testing at the TC levels that the games for the rating lists are later played at.
If I have the choice between a few slow games and a lot of fast games, I take the fast games; this seems more reliable.

I could give it a try and compare the results. IIRC I tried it a while ago and the results were rather different. Maybe it's better now that I have removed some critical bugs.
I'll wait first until the results make sense (i.e. until the "+" and "-" values are much smaller):
Code:
Rank Name                          Elo    +    - games score oppo. draws
   1 XboardEngine                  394   49   43   397   93%   -35    7%
   2 Emblatrunk                      2   27   27   397   50%    22   31%
   3 Emblatpt_move                  -5   31   31   299   49%    19   33%
   4 Emblatptmove-pv-qtt           -24   27   27   397   46%    26   34%
   5 Emblatrunk-pv                 -40   27   27   399   43%    28   36%
   6 Emblatrunk-pv-tk              -41   27   27   398   43%    23   32%
   7 Emblar1331_q_prevmove-pv      -49   61   62    78   42%    26   26%
   8 Emblar1331_q_prevmove         -50   61   62    75   43%     9   31%
   9 Emblar1331_q_prevmove_fix-pv  -58   30   30   322   40%    28   34%
  10 Emblatpt_move-pv              -60   27   27   395   39%    32   34%
  11 Embla_tptmove                 -68   55   56    99   40%    19   25%
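A quick way to read such a list: two entries are only meaningfully separated once their error intervals no longer overlap. The sketch below uses the simple interval test, not bayeselo's LOS computation; values are taken from the table above.

```python
# Sketch: check whether two entries of a bayeselo-style rating list are
# separated beyond their error bars.

def separated(elo_a, minus_a, elo_b, plus_b):
    """True if A's lower bound lies above B's upper bound."""
    return (elo_a - minus_a) > (elo_b + plus_b)

# Emblatrunk (2 +27/-27) vs Emblatrunk-pv (-40 +27/-27):
# -25 > -13 is False, so ~400 games still leave them indistinguishable.
print(separated(2, 27, -40, 27))  # False
```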
- Posts: 759
- Joined: Fri Jan 04, 2013 4:55 pm
- Location: Nice
Re: testing multiple versions & elo calculation
Personally, I run two different tests:
- versions vs versions at 5'+1" increment, 30 classical openings played with reversed colors = 60 games
- versions against five engines with a CCRL Elo (40/4 list):
Tscp 1.81 (Elo 1702)
Jars 1.75a (Elo 1813), one of my engines
Fairy-Max (Elo 1950)
Jabba 1.0 (Elo 2035)
Clarabit 1.00 (Elo 2100)
Again, 30 openings with reversed colors = 60 games each = 300
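A match of this shape can be set up with cutechess-cli roughly as follows. This is a sketch, not the poster's actual command: the engine paths, the opening-book file name and the protocol are placeholders to adapt.

```
# 30 openings, each played twice with colors reversed (-repeat),
# against one reference opponent; paths and names are examples.
cutechess-cli \
  -engine cmd=./myengine name=MyEngine-dev \
  -engine cmd=./tscp181 name=Tscp-1.81 \
  -each proto=xboard tc=5:00+1 \
  -openings file=classical30.pgn format=pgn order=sequential \
  -games 2 -rounds 30 -repeat \
  -pgnout gauntlet.pgn
```

Repeating this per reference engine gives the 5 x 60 = 300 games described above.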
- Posts: 4052
- Joined: Thu May 15, 2008 9:57 pm
- Location: Berlin, Germany
- Full name: Sven Schüle
Re: testing multiple versions & elo calculation
flok wrote:
Hi,
While working on Embla, I have multiple branches (apart from trunk) in my svn repo. One that adds this feature, one that adds that feature, and so on. I would like to compare how they play.
I installed cutechess-cli, bayeselo and an opponent which mostly wins against my program (just not all the time).
Now I can indeed run, say, 9 versions for a while and check what the results are. But with 9 versions this is around 3000 games, so at 40 moves in 300 seconds it takes quite a while to finish.
So my question now is: can I add another program to the list halfway through, continue the match, and still get a good Elo calculation? Or should I restart every time?

"Engine A branch B version X.Y" is different from "Engine A branch B version X.Z", so both should get their own Elo ratings. I would therefore maintain a growing pool of played games across all your different branches and versions, and recalculate the rating list from the whole pool (i.e. a set of PGN files) each time you have played a bunch of new games, or whenever you feel the need to do so.

Replaying games of previous versions is not necessary: you only play games with your new versions (against a defined set of opponents, which of course may include previous versions if you like) and add these games to the pool. This incremental approach drastically reduces the number of games you need to run. It requires, though, that you keep your testing conditions (TC, hardware etc.) unchanged. Each time you change the conditions, you open a new game pool, i.e. simply a new directory to store the PGN.
Sometimes it can make sense to remove games played by "obsolete" versions from the pool.
The rating list you have shown does not contain version numbers. To solve the versioning issue your engine should report its own version number to the UI (under WB using "feature myname", under UCI using "id name") so that it appears in the PGN automatically.
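The incremental-pool workflow comes down to appending each batch of new games to one directory and rebuilding the whole list from it. The concatenation step is just `cat pool/*.pgn > all.pgn`; the rebuild is then a bayeselo session along these lines (a sketch; file names are examples):

```
readpgn all.pgn
elo
mm
exactdist
ratings
x
```

Because bayeselo re-fits all ratings from the full PGN pool every time, adding a new engine or version midway is harmless; only the games actually played need to be new.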