CCRL Testing (@Testers)

Desperado · Post by **Desperado** » Thu Mar 25, 2021 9:39 am

Hello all,

I am currently doing some testing with my engine, playing against various opponents in the 3000 to 3100 Elo range.
It's not the first time I've noticed strange Elo gaps in the CCRL list, so I'm curious whether this can be explained in a reasonable way.

Here is an example from the Blitz list:

demolito_20181029-cygwin-64-popcnt.exe
CCRL (rating 3017) 1300 games: 376 wins, 438 losses, 486 draws (37.4%), score: 47.6% (+-16).

Xiphos 0.2 SSE by Milos Tatarevic
CCRL (rating 3063): 1967 games: 629 wins, 496 losses, 842 draws (42.8%), score: 53.4% (+-13)

Demolito did much better against my engine, significantly better. I thought it might be because Demolito does particularly well against my engine. To rule this out, I started a small match between Xiphos and Demolito and stopped it after a little more than 2000 games.

Setup: 10s+10ms Hash 16MB (cutechess-cli) (1xThread both)
Score of Demolito vs Xiphos2: 945 - 506 - 681 [0.603] 2132
ELO difference: 72.58 +/- 12.29
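
For anyone who wants to check the figure, the Elo difference can be recomputed from the raw win/loss/draw counts. This is a minimal sketch, assuming the usual logistic Elo formula and a normal-approximation 95% interval on the score; the function names are made up for illustration, and the margin is only expected to land close to what cutechess-cli prints, not to match it digit for digit.

```python
import math
from statistics import NormalDist


def elo_from_score(score: float) -> float:
    """Logistic Elo difference implied by a score fraction in (0, 1)."""
    return -400.0 * math.log10(1.0 / score - 1.0)


def match_elo(wins: int, losses: int, draws: int, confidence: float = 0.95):
    """Return (score, elo_diff, error_margin) for a W-L-D record."""
    n = wins + losses + draws
    score = (wins + 0.5 * draws) / n
    # Per-game variance of the result (1, 0.5 or 0) around the mean score.
    variance = (wins * (1.0 - score) ** 2
                + draws * (0.5 - score) ** 2
                + losses * (0.0 - score) ** 2) / n
    stderr = math.sqrt(variance / n)
    # Two-sided normal quantile, about 1.96 for 95%.
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    low = elo_from_score(score - z * stderr)
    high = elo_from_score(score + z * stderr)
    return score, elo_from_score(score), (high - low) / 2.0


# Demolito's record from the match above.
score, elo, margin = match_elo(wins=945, losses=506, draws=681)
print(f"score={score:.3f}  elo={elo:+.2f} +/- {margin:.2f}")
```

The margin is taken as half the width of the Elo interval, so it is symmetric by construction even though the Elo curve is not.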

Gap: 3063-3017 + 72.58 = 118.58

Although the list says that Xiphos is almost 50 Elo stronger, in my match Xiphos is about 70 Elo weaker.
A difference of this magnitude cannot be explained by the time control or by the number of games.
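
To put a number on "cannot be explained by the number of games", one can compare the score the list ratings would predict with the score actually observed, measured in standard errors of the match. This is only a sketch under simplifying assumptions: it uses the logistic Elo expectation, treats the list gap of 46 as exact (the list's own error bars are ignored), and the variable names are illustrative.

```python
import math


def expected_score(elo_diff: float) -> float:
    """Expected score for a player rated elo_diff above the opponent."""
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))


# Demolito minus Xiphos on the Blitz list (3017 vs. 3063).
list_gap = 3017 - 3063

# Demolito's head-to-head record from the match.
wins, losses, draws = 945, 506, 681
n = wins + losses + draws
observed = (wins + 0.5 * draws) / n
predicted = expected_score(list_gap)

# Standard error of the observed score from the per-game result variance.
variance = (wins * (1.0 - observed) ** 2
            + draws * (0.5 - observed) ** 2
            + losses * observed ** 2) / n
stderr = math.sqrt(variance / n)

z = (observed - predicted) / stderr
print(f"predicted={predicted:.3f} observed={observed:.3f} "
      f"gap={z:.1f} standard errors")
```

A gap of many standard errors says that sampling noise in the match alone is not a plausible cause; it says nothing about which of the other candidate causes is responsible.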

At first glance, this looks like a systematic error.
So, can someone give me a plausible explanation?

Thanks a lot.

Guenther · Post by **Guenther** » Thu Mar 25, 2021 9:54 am

Did you really test against the same version (and binary) of Demolito you named above, or did you compile it yourself?

Desperado · Post by **Desperado** » Thu Mar 25, 2021 9:58 am

I used downloaded executables. I checked the versions via the UCI option too.

Guenther · Post by **Guenther** » Thu Mar 25, 2021 10:12 am

I checked my XB/UCI chronology; it says this is the download for that version, and that Demolito was compiled by TP (= Thomas Poppins) at that time.
https://www.mediafire.com/file/47jr2tfe ... p.zip/file

I can't see, though, which of the two binaries in the download was actually used for the CCRL entry. 'popcount' is not mentioned in its name there, so I don't know how you attributed it in your OP.
http://ccrl.chessdom.com/ccrl/404/cgi/e ... -29_64-bit
Only CCRL can say which of the two included versions was used.

Desperado · Post by **Desperado** » Thu Mar 25, 2021 10:23 am

Popcnt does not make a difference of 100 Elo or more. A different number of threads, say 4 threads, might have such an effect.
Using a different compiler for the same version won't speed up the engine by a factor of 4 either.

Guenther · Post by **Guenther** » Thu Mar 25, 2021 10:33 am

I agree on this (except that the non-popcount build might perhaps have an issue). BTW, could you measure the speed difference between the two? (I have no popcount hardware, so I cannot try this.)
I can see only one red entry in the opponents list: the performance vs. Arasan was too low (-112).

Desperado · Post by **Desperado** » Thu Mar 25, 2021 10:53 am

The speed difference is about 10% (1431 KN/s vs. 1630 KN/s). So let's say somewhere between 5% and 15%.
I just looked at what the ChessBase GUI reported for the start position.
When started from the console (Win10 64-bit), both engines exit.

So there is a speed difference that might be measurable in Elo terms at the time control 10s+10ms.
But we agree, not in the range of 100 Elo or more.
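
As a rough sanity check of that conclusion, the two node rates can be converted into an Elo estimate. This is a back-of-the-envelope sketch only: the 60 Elo per speed doubling is an assumed illustrative figure that does not come from this thread, and the real value depends on the engine and the time control.

```python
import math

# ASSUMPTION: illustrative Elo gain per doubling of speed, not a measured value.
ELO_PER_DOUBLING = 60.0


def speed_gain_elo(slow_nps: float, fast_nps: float,
                   elo_per_doubling: float = ELO_PER_DOUBLING):
    """Return (speed ratio, estimated Elo gain) for two node rates."""
    ratio = fast_nps / slow_nps
    return ratio, math.log2(ratio) * elo_per_doubling


# Node rates reported above, in KN/s.
ratio, elo = speed_gain_elo(1431, 1630)
print(f"speedup {100 * (ratio - 1):.1f}%, "
      f"roughly {elo:.0f} Elo at {ELO_PER_DOUBLING:.0f} Elo per doubling")
```

Even with a more generous assumption per doubling, a speedup of this size stays far below 100 Elo.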

Well, the difference between the two Demolito versions can be measured. Maybe I can run a little match later...
Note that we have ignored Xiphos so far. In the end, I do not think there is an issue with the binaries.

Guenther · Post by **Guenther** » Thu Mar 25, 2021 11:10 am

I have now checked the 40/15 list, and it is very similar (in fact the same difference as in the Blitz list: +46 in favour of Xiphos):

Xiphos 0.2 64-bit             3022  +18  -18  52.8%  -15.2  46.8%   999
Demolito 2018-10-29 64-bit    2976  +15  -15  52.2%  -18.4  39.8%  1418

One reason could be the very different time controls (moves-per-session blitz at CCRL vs. ultra-fast increment in your match)?
(I assume you have checked your games for time losses by Xiphos?)
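
A quick way to check for time losses is to scan the match PGN for games that ended on time. This is a minimal sketch, assuming the PGN carries a `Termination` tag with the value `time forfeit` (the value defined in the PGN standard) and that every game starts with an `Event` tag. The sample games below are made up purely to exercise the function.

```python
import re
from collections import Counter

# Matches a PGN tag pair such as [White "Demolito"].
TAG_RE = re.compile(r'^\[(\w+)\s+"(.*)"\]$')


def count_time_forfeits(pgn_text: str) -> Counter:
    """Count games lost on time, keyed by the losing engine's name."""
    losers = Counter()
    tags = {}

    def record(game_tags):
        if game_tags.get("Termination", "").lower() != "time forfeit":
            return
        if game_tags.get("Result") == "1-0":
            losers[game_tags.get("Black", "?")] += 1
        elif game_tags.get("Result") == "0-1":
            losers[game_tags.get("White", "?")] += 1

    for raw in pgn_text.splitlines():
        match = TAG_RE.match(raw.strip())
        if not match:
            continue  # move text, comments and blank lines
        name, value = match.groups()
        if name == "Event" and tags:
            record(tags)  # a new game begins, so close the previous one
            tags = {}
        tags[name] = value
    if tags:
        record(tags)
    return losers


# Hypothetical sample data, not games from the actual match.
SAMPLE = '''[Event "test"]
[White "Demolito"]
[Black "Xiphos"]
[Result "1-0"]
[Termination "time forfeit"]

1. e4 e5 {Black loses on time} 1-0

[Event "test"]
[White "Xiphos"]
[Black "Demolito"]
[Result "1/2-1/2"]

1. d4 d5 1/2-1/2
'''

forfeits = count_time_forfeits(SAMPLE)
print(forfeits)
```

If one engine shows a noticeable share of its losses here, the time control rather than playing strength would be the first suspect.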

Graham Banks · Post by **Graham Banks** » Thu Mar 25, 2021 11:31 am

We test against many opponents, not just one.
I use the TP builds for that version of Demolito.
Games include both popcount and non-popcount.

Demolito is an open-source UCI engine by Lucas Braesch

source:
https://github.com/lucasart/

Cygwin GCC 7.3.0 64-bit static builds by T. Poppins
64-old needs a Pentium 4 or later
The required cygwin1.dll is included.

Desperado · Post by **Desperado** » Thu Mar 25, 2021 12:03 pm

Hello Graham.

That's fine, I mean testing against many opponents.

But I am not talking about a 5, 10 or 20 Elo gap caused by some testing inconsistencies; it is more than 100 Elo!

Please don't get me wrong: I don't want to offend anyone or suggest that anyone doesn't know what he is doing as a tester.
I am simply interested in how that can be, because I also know what I am doing.

So let's skip the basics and try to explain how something like this can happen. I think everybody is interested in
having useful information in the rating lists. This kind of information (if my observation is correct) would simply be useless.

Well, let's assume the binaries are OK. My first thoughts (non-technical) were, for example:

* I was using different thread counts
* time controls
* CCRL scaled the lists somehow
* checking my match setups (e.g. same time for both engines...)
* other obvious ideas ...

On the other hand, there might be a systematic problem, for example:

* the opponents of both engines do not overlap enough, and Elo is relative to the engine pool,
so the numbers cannot be compared directly.

What I want to say is: once we can exclude simple mistakes, we can look for the real reason. There is one.
If we find the reason, we can improve the quality of the useful information in the list.

An Elo difference of about 100 (with the wrong sign) between two engines should also show up when testing against many engines.
Or, the other way around: if an engine is 50 Elo stronger against a pool of different engines, it is very unlikely,
nearly impossible, that it will be 70 Elo weaker in a head-to-head match!
