Hello all,
I am currently doing some testing with my engine, playing against various opponents in the 3000 to 3100 Elo range.
It's not the first time I've noticed strange Elo gaps in the CCRL list, so I'm curious whether this can be explained in a reasonable way.
Here is an example from the Blitz list:
demolito_20181029-cygwin-64-popcnt.exe
CCRL (rating 3017) 1300 games: 376 wins, 438 losses, 486 draws (37.4%), score: 47.6% (+-16).
Xiphos 0.2 SSE by Milos Tatarevic
CCRL (rating 3063): 1967 games: 629 wins, 496 losses, 842 draws (42.8%), score: 53.4% (+-13)
Against my engine, Demolito did significantly better than Xiphos. I thought Demolito might simply match up well against my engine. To rule this out, I ran a small match between Xiphos and Demolito and stopped after a little more than 2000 games.
Setup: 10s+10ms, Hash 16MB, 1 thread each (cutechess-cli)
Score of Demolito vs Xiphos2: 945 - 506 - 681 [0.603] 2132
ELO difference: 72.58 +/- 12.29
Gap: (3063 - 3017) + 72.58 = 118.58
Although the list says that Xiphos is almost 50 Elo stronger, in my match it is about 70 Elo weaker.
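The Elo figure and error margin quoted above can be reproduced from the raw win/loss/draw counts. The following is a minimal sketch, assuming the usual logistic Elo model and a normal-approximation 95% interval. The function names are illustrative, and the margin may differ from the match tool's output in the last digits.

```python
import math

def elo_from_score(score: float) -> float:
    """Elo difference implied by a score fraction under the logistic model."""
    return -400.0 * math.log10(1.0 / score - 1.0)

def match_elo(wins: int, losses: int, draws: int, z: float = 1.959964):
    """Return (score, elo, margin) as seen from the first engine's side."""
    n = wins + losses + draws
    w, l, d = wins / n, losses / n, draws / n
    score = w + 0.5 * d
    # Per-game variance of the result (1, 0.5 or 0) around the mean score.
    var = w * (1.0 - score) ** 2 + d * (0.5 - score) ** 2 + l * score ** 2
    stderr = math.sqrt(var / n)
    lo, hi = score - z * stderr, score + z * stderr
    # Half-width of the interval after mapping both ends to Elo.
    margin = (elo_from_score(hi) - elo_from_score(lo)) / 2.0
    return score, elo_from_score(score), margin

score, elo, margin = match_elo(945, 506, 681)   # Demolito vs Xiphos
list_gap = 3063 - 3017                          # CCRL Blitz: Xiphos minus Demolito
print(f"score {score:.3f}, Elo {elo:+.2f} +/- {margin:.2f}")
print(f"distance from the list: {list_gap + elo:.2f} Elo")
```

The last line adds the list gap to the measured gap because the two have opposite signs: the list puts Xiphos ahead, the match puts Demolito ahead.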
A difference of this magnitude cannot be explained by the time control or by the number of games.
At first glance, this looks like a systematic error.
So, can someone give me a plausible explanation?
Thanks a lot.
CCRL Testing (@Testers)
-
- Posts: 4606
- Joined: Wed Oct 01, 2008 6:33 am
- Location: Regensburg, Germany
- Full name: Guenther Simon
Re: CCRL Testing (@Testers)
Desperado wrote: ↑Thu Mar 25, 2021 9:39 am [...]
Did you really test against the same version (and binary) of Demolito you named above, or did you compile it yourself?
-
- Posts: 879
- Joined: Mon Dec 15, 2008 11:45 am
Re: CCRL Testing (@Testers)
I used the downloaded executables. I also checked the versions via the UCI option.
-
- Posts: 4606
- Joined: Wed Oct 01, 2008 6:33 am
- Location: Regensburg, Germany
- Full name: Guenther Simon
Re: CCRL Testing (@Testers)
I checked my XB/UCI chronology and it says this is the download for that version and Demolito was compiled by TP = Thomas Poppins at that time.
https://www.mediafire.com/file/47jr2tfe ... p.zip/file
However, I don't see which of the two binaries included in the download was really used at CCRL for its entry. 'popcount' is not mentioned in its name there, so I don't know how you attributed it in your OP.
http://ccrl.chessdom.com/ccrl/404/cgi/e ... -29_64-bit
Only CCRL can say which of the two included versions was used.
-
- Posts: 879
- Joined: Mon Dec 15, 2008 11:45 am
Re: CCRL Testing (@Testers)
Guenther wrote: ↑Thu Mar 25, 2021 10:12 am [...]
The popcnt build does not make a difference of 100 Elo or more. A different number of threads, maybe 4 threads, might have such an effect.
Using a different compiler for the same version won't speed up the engine by a factor of 4 either.
-
- Posts: 4606
- Joined: Wed Oct 01, 2008 6:33 am
- Location: Regensburg, Germany
- Full name: Guenther Simon
Re: CCRL Testing (@Testers)
Desperado wrote: ↑Thu Mar 25, 2021 10:23 am [...]
I agree on this (except that the non-popcount build might perhaps have an issue). Could you measure the speed difference between the two, BTW? (I have no popcount hardware, so I cannot try this.)
I can see only one red entry in the opponents list: the performance vs. Arasan was rather too low (-112).
-
- Posts: 879
- Joined: Mon Dec 15, 2008 11:45 am
Re: CCRL Testing (@Testers)
Guenther wrote: ↑Thu Mar 25, 2021 10:33 am [...]
The speed difference is about 10% (1431 KN/s and 1630 KN/s). So, let's say something between 5% and 15%.
I just looked at what the ChessBase GUI reported for the start position.
When started from the console (Win10 64-bit), both engines exit.
So there is a speed difference that might be measurable in Elo terms at the 10s+10ms time control.
But we agree: not in the range of 100 Elo or more.
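To put a rough number on that, the speed ratio can be converted to Elo with a "gain per doubling of speed" rule of thumb. This is only a sketch: the per-doubling values below are assumptions chosen for illustration, not measurements for Demolito, and the helper name is made up.

```python
import math

def speedup_elo(nps_slow: float, nps_fast: float, elo_per_doubling: float) -> float:
    """Rough Elo gain from a speed-up, scaled by log2 of the speed ratio."""
    return elo_per_doubling * math.log2(nps_fast / nps_slow)

ratio = 1630 / 1431                      # KN/s figures quoted above
for per_doubling in (50, 70, 100):       # ASSUMED values, not measured
    gain = speedup_elo(1431, 1630, per_doubling)
    print(f"{per_doubling} Elo/doubling -> {gain:.1f} Elo for a {ratio:.3f}x speed-up")
```

Even with a generous 100 Elo per doubling, the estimate stays far below the 100+ Elo discrepancy under discussion.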
Well, the difference between the two Demolito versions can be measured. Maybe I can run a little match later...
Note that so far we have ignored Xiphos. And finally, I do not think there is an issue with the binaries.
-
- Posts: 4606
- Joined: Wed Oct 01, 2008 6:33 am
- Location: Regensburg, Germany
- Full name: Guenther Simon
Re: CCRL Testing (@Testers)
Desperado wrote: ↑Thu Mar 25, 2021 10:53 am [...]
I have now checked the 40/15 list, and it is very similar there (actually the same difference as in the Blitz list: +46 in favour of Xiphos):
Xiphos 0.2 64-bit 3022 +18 -18 52.8% -15.2 46.8% 999
Demolito 2018-10-29 64-bit 2976 +15 -15 52.2% -18.4 39.8% 1418
(I assume you have checked your games for time losses of Xiphos?)
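A quick way to check for time losses is to scan the match PGN. This is a sketch under assumptions: it relies on the match runner writing a `Termination` tag with the value `time forfeit` (verify this against your own PGN output), and the sample games and the helper names are made up for illustration.

```python
import re
from collections import Counter

# Tiny made-up sample; in practice read the PGN file written by the match runner.
SAMPLE_PGN = '''[White "Xiphos"]
[Black "Demolito"]
[Result "0-1"]
[Termination "time forfeit"]

1. e4 e5 0-1

[White "Demolito"]
[Black "Xiphos"]
[Result "1/2-1/2"]

1. d4 d5 1/2-1/2
'''

TAG = re.compile(r'\[(\w+)\s+"([^"]*)"\]')

def loser_on_time(tags: dict):
    """Return the engine that lost on time in this game, or None."""
    if tags.get("Termination") != "time forfeit":   # ASSUMED tag value
        return None
    if tags.get("Result") == "1-0":
        return tags.get("Black")
    if tags.get("Result") == "0-1":
        return tags.get("White")
    return None

def time_losses(pgn_text: str) -> Counter:
    """Count games lost on time per engine name."""
    losses, tags, in_moves = Counter(), {}, False
    for line in pgn_text.splitlines():
        m = TAG.match(line.strip())
        if m:
            if in_moves:                 # first tag after movetext: new game
                name = loser_on_time(tags)
                if name:
                    losses[name] += 1
                tags, in_moves = {}, False
            tags[m.group(1)] = m.group(2)
        elif line.strip():
            in_moves = True
    name = loser_on_time(tags)           # do not forget the last game
    if name:
        losses[name] += 1
    return losses

print(time_losses(SAMPLE_PGN))
```

If one engine shows a noticeable number of time forfeits at 10s+10ms, the head-to-head result says more about time management at that control than about playing strength.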
-
- Posts: 41432
- Joined: Sun Feb 26, 2006 10:52 am
- Location: Auckland, NZ
Re: CCRL Testing (@Testers)
Desperado wrote: ↑Thu Mar 25, 2021 9:39 am [...]
We test against many opponents, not just one.
I use the TP builds for that version of Demolito.
Games include both popcount and non-popcount.
Demolito is an open-source UCI engine by Lucas Braesch
source:
https://github.com/lucasart/
Cygwin GCC 7.3.0 64-bit static builds by T. Poppins
64-old needs a Pentium 4 or later
The required cygwin1.dll is included.
gbanksnz at gmail.com
-
- Posts: 879
- Joined: Mon Dec 15, 2008 11:45 am
Re: CCRL Testing (@Testers)
Graham Banks wrote: ↑Thu Mar 25, 2021 11:31 am [...]
Hello Graham.
That's fine, I mean testing against many opponents.
But I am not talking about a 5, 10 or 20 Elo gap caused by some testing inconsistencies; it is more than 100 Elo!
Please don't get me wrong: I don't want to offend anyone or suggest that someone does not know what he is doing as a tester.
I am simply interested in how that can be, because I also know what I am doing.
So let's skip the basics and try to explain how something like this can happen. I think everybody is interested in
having useful information in the rating lists. This kind of information (if my observation is correct) would simply be useless.
Let's assume the binaries are OK. My first thoughts were along these lines (non-technical):
* I was using different thread counts
* time controls
* CCRL scaled the lists somehow
* checking my match setups (e.g. same time for both engines...)
* other obvious ideas ...
On the other hand, there might be a systematic problem such as (as an idea):
* the opponents of both engines do not overlap enough, and Elo is relative to the engine pool,
so the numbers cannot be compared directly.
What I want to say is: once we can exclude simple mistakes, we can look for the real reason. There is one.
If we find the reason, we can improve the quality of the useful information in the list.
An Elo difference of about 100 Elo (with the wrong sign) between two engines should also show up when many engines are used.
Or, the other way around: if an engine is 50 Elo stronger against a pool of different engines, it is very unlikely,
nearly impossible, that it will be 70 Elo weaker in a head-to-head match!
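How unlikely can be sketched with a simple z-score: compare the score the list ratings predict for Demolito with the score observed in the head-to-head match. This assumes that pool ratings carry over to a direct pairing (which is exactly the point in question) and ignores the uncertainty of the list ratings themselves, so read it as an illustration rather than a proper test.

```python
import math

def expected_score(rating_diff: float) -> float:
    """Expected score for the side that is rating_diff Elo ahead (logistic model)."""
    return 1.0 / (1.0 + 10.0 ** (-rating_diff / 400.0))

wins, losses, draws = 945, 506, 681          # Demolito's result vs Xiphos
n = wins + losses + draws
observed = (wins + 0.5 * draws) / n

# CCRL Blitz list: Demolito 3017, Xiphos 3063.
predicted = expected_score(3017 - 3063)

# Per-game variance of the result, then standard error of the mean score.
var = (wins * (1 - observed) ** 2 + draws * (0.5 - observed) ** 2
       + losses * observed ** 2) / n
stderr = math.sqrt(var / n)

z = (observed - predicted) / stderr
print(f"predicted {predicted:.3f}, observed {observed:.3f}, z = {z:.1f}")
```

A z-score this large rules out plain sampling noise in the match. What remains is either a setup difference or ratings that do not transfer from the pool to this particular pairing.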