CCRL Testing (@Testers)


Desperado
Posts: 879
Joined: Mon Dec 15, 2008 11:45 am

Re: CCRL Testing (@Testers)

Post by Desperado »

Guenther wrote: Fri Mar 26, 2021 7:36 pm Did you really never test this 'standard' tc?
No :-).
... but who needs this ...
OTB, the standard for humans is 40 moves in 2 hours, then a second time control of 1 hour for the rest of the game.

Ok, so in cutechess-cli I do not need to configure a second or third time control separately.
The 40/x will just be repeated.
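To make the difference between the two time-control styles discussed here concrete, the following is a minimal Python sketch. It parses only the two notations used in this thread (`moves/seconds` and `seconds+increment`); the function names are made up for illustration, and the 60-move game length is an arbitrary assumption, not something taken from cutechess-cli.

```python
from typing import NamedTuple, Optional


class TimeControl(NamedTuple):
    moves_per_period: Optional[int]  # None means the base time covers the whole game
    base_seconds: float
    increment_seconds: float


def parse_tc(text: str) -> TimeControl:
    """Parse 'moves/seconds' or 'seconds+increment' (simplified, seconds only)."""
    increment = 0.0
    if "+" in text:
        text, inc_part = text.split("+", 1)
        increment = float(inc_part)
    moves = None
    if "/" in text:
        moves_part, text = text.split("/", 1)
        moves = int(moves_part)
    return TimeControl(moves, float(text), increment)


def average_seconds_per_move(tc: TimeControl, game_moves: int = 60) -> float:
    """Average thinking time per move; game_moves is an assumed game length."""
    if tc.moves_per_period is not None:
        # Repeating control: every period of N moves receives the base time again.
        return tc.base_seconds / tc.moves_per_period + tc.increment_seconds
    # Sudden death with increment: the base time is spread over the whole game.
    return tc.base_seconds / game_moves + tc.increment_seconds


for name in ("40/10", "10+0.1"):
    tc = parse_tc(name)
    print(name, tc, round(average_seconds_per_move(tc), 3))
```

Under these assumptions both controls give a similar average time per move, but they distribute it very differently: a repeating control refills the clock at every period boundary, while an increment control never does.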

I am not sure when I can run the match. In the next few days at the earliest.

Regards.
Guenther
Posts: 4610
Joined: Wed Oct 01, 2008 6:33 am
Location: Regensburg, Germany
Full name: Guenther Simon

Re: CCRL Testing (@Testers)

Post by Guenther »

Desperado wrote: Fri Mar 26, 2021 8:22 pm
OTB, the standard for humans is 40 moves in 2 hours, then a second time control of 1 hour for the rest of the game.

...
Not anymore, we now play with increment too ;-) (if we played at all, but everything is on hold now)
It is now 90 min for 40 moves + 30 s per move, then 30 min + 30 s per move for the rest.

I still remember when we played 150/50, then a 1 hour break, and then 60/20, followed by a 'Hängepartie' (adjourned game). That was long ago, of course.

BTW does this mean you are working on a new Nemo release? :)
https://rwbc-chess.de

trollwatch:
Talkchess nowadays is a joke - it is full of trolls/idiots/people stuck in the pleistocene > 80% of the posts fall into this category...
abulmo2
Posts: 433
Joined: Fri Dec 16, 2016 11:04 am
Location: France
Full name: Richard Delorme

Re: CCRL Testing (@Testers)

Post by abulmo2 »

Desperado wrote: Thu Mar 25, 2021 12:03 pm
Hello Graham.

That's fine, I mean testing against many opponents.

But I am not talking about a 5, 10 or 20 Elo gap with some testing inconsistencies, it is more than 100 Elo!

Please don't get me wrong, I don't want to offend anyone or suggest that someone does not know what he is doing as a tester.
I am simply interested in how that can be, because I also know what I am doing.
You are too confident in the Elo model. It assumes that, in head-to-head matches, if player A > B and player B > C then A > C, but in practice it is quite possible that C > A. So you cannot compare your results to those of CCRL or CEGT, where a gauntlet against many different opponents is played.
Richard Delorme
Desperado
Posts: 879
Joined: Mon Dec 15, 2008 11:45 am

Re: CCRL Testing (@Testers)

Post by Desperado »

abulmo2 wrote: Fri Mar 26, 2021 10:15 pm
You are too confident in the Elo model. It assumes that, in head-to-head matches, if player A > B and player B > C then A > C, but in practice it is quite possible that C > A. So you cannot compare your results to those of CCRL or CEGT, where a gauntlet against many different opponents is played.
Hi, I am aware that Elo is not transitive (the A, B, C example). This is not about Demolito possibly being stronger in the direct match and generally weaker against a pool of different engines. I know that.

It is about the gap of +50 vs -70, which is more than 100 Elo! That is an anomaly (the error bars in the lists are about +-15 Elo).
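To put the +50 vs -70 gap into terms of game results, here is a small Python sketch of the standard logistic Elo expectation formula. The function name is chosen for illustration only; the formula itself is the usual one with a 400-point scale.

```python
def expected_score(elo_diff: float) -> float:
    """Expected score (between 0 and 1) for a player who is elo_diff points stronger."""
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))


for diff in (50, -70):
    print(f"{diff:+d} Elo -> expected score {expected_score(diff):.3f}")
```

A +50 rating edge corresponds to scoring roughly 57%, while -70 corresponds to roughly 40%, so the two measurements disagree about which engine wins the match at all.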

I, for myself, accept the results of CCRL and CEGT, of course, but I also know that my test match was run the same way I have run them for the last decade!
And I also trust my test framework to produce reliable results.

So there is a conflict between the results, and I need/want to figure out the reason. My experience tells me that I must be open-minded about everything: maybe I am doing something wrong, maybe I use outdated binaries, maybe my book has a big influence, maybe over many years a systematic procedure leads to somewhat strange results in the lists. (Fictitious example: engine A plays against B, C, D and engine E plays against F, G, H. Now there would be no relation between the two Elo numbers, although both play against many opponents!)
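The fictitious A-versus-E example can be checked mechanically: if the "who played whom" graph splits into separate components, ratings from different components are not comparable. The sketch below is only an illustration of that idea in Python; the function name and the pairing list are hypothetical and do not describe how CCRL or CEGT actually schedule games.

```python
from collections import defaultdict


def connected_pools(pairings):
    """Group engines into pools that are linked by at least one chain of games."""
    graph = defaultdict(set)
    for a, b in pairings:
        graph[a].add(b)
        graph[b].add(a)

    seen = set()
    pools = []
    for start in sorted(graph):
        if start in seen:
            continue
        # Depth-first search collects everything reachable from `start`.
        pool = set()
        stack = [start]
        while stack:
            node = stack.pop()
            if node in pool:
                continue
            pool.add(node)
            stack.extend(graph[node] - pool)
        seen |= pool
        pools.append(sorted(pool))
    return pools


# The fictitious schedule from the post: A plays B, C, D and E plays F, G, H.
pairings = [("A", "B"), ("A", "C"), ("A", "D"),
            ("E", "F"), ("E", "G"), ("E", "H")]
print(connected_pools(pairings))
```

With this schedule the function returns two separate pools, so the Elo of A and the Elo of E would each be anchored only within their own group.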

There are sooo many possibilities. It is not about two lists giving similar information, or about knowing the transitivity topic.
In the end it will be the sum of 3, 4 or 5 major factors.

My guess at the moment is that several things add up. The time model has more impact than I expected. I can already confirm Guenther's idea (as a guess, it can make a difference of 30 to 40 Elo). So far we have all used different books and combined the binaries in different ways. It goes on: my machine is a TR 2950X, so if an engine is compiled with pext, its speed and strength would drop drastically, which would be another measurable effect...

Many things to look at...

And it is worth examining this subject! I have already updated my engine's time management, which previously used only a fixed MTG constant. Playing MVS/TME, it is already 35 Elo stronger than before! (This is only one positive side effect besides learning and gaining experience.) For now I did not want to talk about technical details like pext and AMD, because things like that might be resolved by using the correct binaries for the test.
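As an illustration of why a fixed MTG constant can hurt in moves/time mode, here is a hedged Python sketch. It is not the code of any engine mentioned in this thread. It assumes MTG means "moves to go", that the GUI reports the remaining moves of the period (as the UCI `go` command can via `movestogo`), and the constants 30 and 20 ms are arbitrary placeholder values.

```python
from typing import Optional


def allocate_ms(time_left_ms: int,
                increment_ms: int = 0,
                movestogo: Optional[int] = None,
                default_mtg: int = 30,
                overhead_ms: int = 20) -> int:
    """Return a thinking-time budget for the next move, in milliseconds.

    If the GUI reports the moves left until the next time control, use that
    number; otherwise fall back to a fixed moves-to-go guess (hypothetical 30).
    """
    mtg = movestogo if movestogo else default_mtg
    budget = time_left_ms / mtg + increment_ms - overhead_ms
    # Never plan to spend more than what is actually on the clock.
    budget = min(budget, time_left_ms - overhead_ms)
    return int(max(1, budget))


# Two moves before the time control with 4 seconds left:
print(allocate_ms(4000, movestogo=2))  # uses the reported moves to go
print(allocate_ms(4000))               # ignores it and assumes 30 moves remain
```

In this toy example the version that ignores `movestogo` leaves most of the remaining 4 seconds unused just before the clock is refilled, which is one plausible way an engine tuned for increment play could lose strength at 40/x.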

So, thx for the hint, but there are real issues involved and nobody is doing anything wrong. It is just a puzzle to be solved.
Desperado
Posts: 879
Joined: Mon Dec 15, 2008 11:45 am

Re: CCRL Testing (@Testers)

Post by Desperado »

Interesting result.

I switched to 40/10 (40 moves in 10 seconds, repeating) and used Guenther's book ("3M_lines.txt").
Before, it was 10+0.1 with my own book, which included 8moves_GM_LB or 8moves_v3.pgn.

I manually stopped the match after 1001 games.

Finished game 1010 (Demolito vs Xiphos2): 0-1 {Black mates}
Score of Xiphos2 vs Demolito: 432 - 234 - 335 [0.599] 1001
ELO difference: 69.64 +/- 17.72

Honestly, I would never have expected that.

I will repeat the test with 8moves_v3.pgn, just to get an idea of how strong the influence of book and time mode is.
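The Elo difference and error margin in the match output above can be reproduced from the win, loss and draw counts. The Python sketch below uses the logistic Elo formula and a normal-approximation 95% interval; it is an assumption that the match tool computes its margin in this way, although the numbers agree closely. The function names are illustrative.

```python
import math


def elo_from_score(score: float) -> float:
    """Convert a score fraction (strictly between 0 and 1) to an Elo difference."""
    return -400.0 * math.log10(1.0 / score - 1.0)


def match_elo(wins: int, losses: int, draws: int, z: float = 1.959964):
    """Return (score, elo, margin) with a roughly 95% margin (z is about 1.96)."""
    n = wins + losses + draws
    w, l, d = wins / n, losses / n, draws / n
    mu = w + d / 2.0
    # Per-game variance of the result (1, 0.5 or 0) around the mean score.
    variance = w * (1.0 - mu) ** 2 + l * (0.0 - mu) ** 2 + d * (0.5 - mu) ** 2
    stdev = math.sqrt(variance / n)
    low, high = mu - z * stdev, mu + z * stdev
    margin = (elo_from_score(high) - elo_from_score(low)) / 2.0
    return mu, elo_from_score(mu), margin


score, elo, margin = match_elo(wins=432, losses=234, draws=335)
print(f"score {score:.3f}, Elo {elo:.2f} +/- {margin:.2f}")
```

For 432 - 234 - 335 this gives a score of about 0.599 and an Elo difference of about 69.6 with a margin close to the reported +/- 17.72.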
Guenther
Posts: 4610
Joined: Wed Oct 01, 2008 6:33 am
Location: Regensburg, Germany
Full name: Guenther Simon

Re: CCRL Testing (@Testers)

Post by Guenther »

Desperado wrote: Sat Mar 27, 2021 8:52 am Interesting result.

Hi Michael, have you already had time to check for time losses by Demolito? Just curious.

BTW I think a big part of this must be due to a weakness in Demolito's time management (at least in this version, 2018-10-29).
I will now check the repo for TM changes concerning mps. I am also curious whether the discrepancy is still noticeable today.
Desperado
Posts: 879
Joined: Mon Dec 15, 2008 11:45 am

Re: CCRL Testing (@Testers)

Post by Desperado »

Now a summary:

Blitz Rating Xiphos 0.2: 3063
Blitz Rating Demolito_20181029: 3017

Setup - TC 10+0.1 Hash/16MB Thread/1x Ponder/off book/own-book (collection of public books like 8moves_v3)
Score of Demolito vs Xiphos2: 945 - 506 - 681 [0.603] 2132
ELO difference: 72.58 +/- 12.29 (attention: the elo advantage is reversed here)

Setup - TC 40/10 Hash/16MB Thread/1x Ponder/off book/3M_lines.txt
Score of Xiphos2 vs Demolito: 432 - 234 - 335 [0.599] 1001
ELO difference: 69.64 +/- 17.72

Setup - TC 40/10 Hash/16MB Thread/1x Ponder/off book/8moves_v30
Finished game 1000 (Demolito vs Xiphos2): 1/2-1/2 {Draw by fifty moves rule}
Score of Xiphos2 vs Demolito: 378 - 279 - 343 [0.549] 1000
ELO difference: 34.51 +/- 17.49
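A rough way to judge whether two of these results really differ is to compare the gap between them with the combined error margin. The Python sketch below assumes the matches are independent and the errors roughly normal, so that the margins add in quadrature; both are simplifying assumptions, and the helper name is illustrative. All values are taken from the three results above and expressed from Xiphos's point of view.

```python
import math


def compare(result_a, result_b):
    """Return (delta, combined_margin, significant) for two (elo, margin) pairs."""
    elo_a, margin_a = result_a
    elo_b, margin_b = result_b
    delta = elo_a - elo_b
    combined = math.hypot(margin_a, margin_b)  # sqrt(a^2 + b^2)
    return delta, combined, abs(delta) > combined


# Elo of Xiphos relative to Demolito, with the reported margins.
tc_10_inc_own_book = (-72.58, 12.29)   # sign flipped: Demolito was ahead here
tc_40_10_3m_lines = (69.64, 17.72)
tc_40_10_8moves = (34.51, 17.49)

print("time control effect:", compare(tc_40_10_8moves, tc_10_inc_own_book))
print("book effect at 40/10:", compare(tc_40_10_3m_lines, tc_40_10_8moves))
```

Under these assumptions the time-control swing is far outside the combined margin, while the book effect at 40/10 is only modestly outside it, so the latter deserves more games before drawing conclusions.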

Ok, the different time control (mode, discipline) is the main reason. After switching to MVS/TME, the results correspond very well with the rating lists.
Whether Xiphos does not work well with 10+0.1 or Demolito with 40/10 is still not clear at this point. Maybe there is no bug at all, and the two disciplines
are simply very different!? The easiest thing would be to check the sources and run more tests.

At least I found a plausible answer. What to make of it is not clear at all. Just to give an idea of what I mean:
a programmer who is interested in reaching a good ranking should use the same time control (mode, discipline) as the testers. For my engine,
this was a simple change in the time management which resulted in +30 Elo (maybe more) when playing in MVS/TME mode.
Especially if there is no bug in the binaries, the result would show that the ratings can depend strongly on the chosen time control (mode, discipline).

ok, enough transparency for me. Thx everybody for your support, especially thx to Guenther.

Regards

P.S.: I was not able to detect any time losses. I scanned the PGN.
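Scanning a PGN for time losses can be automated. The Python sketch below counts games that carry a time-forfeit marker; the two marker strings (a `Termination` tag with the value `time forfeit` and a result comment containing `loses on time`) are assumptions about what the match tool writes and should be adjusted to the actual file. The function name and the sample text are illustrative.

```python
import re

# Assumed markers for a time loss; adapt them to what your GUI actually writes.
TIME_LOSS_PATTERNS = (
    re.compile(r'\[Termination\s+"time forfeit"\]', re.IGNORECASE),
    re.compile(r"\{[^}]*loses on time[^}]*\}", re.IGNORECASE),
)


def count_time_losses(pgn_text: str) -> int:
    """Count games in a PGN string that contain a time-loss marker."""
    # Split in front of every [Event tag so each chunk holds one game.
    games = [g for g in re.split(r"(?=\[Event )", pgn_text) if g.strip()]
    return sum(
        1 for game in games
        if any(pattern.search(game) for pattern in TIME_LOSS_PATTERNS)
    )


sample = (
    '[Event "?"]\n[Result "0-1"]\n[Termination "time forfeit"]\n\n'
    "1. e4 e5 {White loses on time} 0-1\n\n"
    '[Event "?"]\n[Result "1/2-1/2"]\n\n'
    "1. d4 d5 {Draw by fifty moves rule} 1/2-1/2\n"
)
print(count_time_losses(sample))
```

To check a real match file, read it with `open(path, encoding="utf-8").read()` and pass the text to `count_time_losses`.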
lkaufman
Posts: 5960
Joined: Sun Jan 10, 2010 6:15 am
Location: Maryland USA

Re: CCRL Testing (@Testers)

Post by lkaufman »

Desperado wrote: Sat Mar 27, 2021 10:05 am Now a summary:

CCRL blitz switched to increment play (2' + 1" adapted to hardware) about a year ago, so 40/x play should be irrelevant to that list unless perhaps the games were played before the switch.
Komodo rules!
Guenther
Posts: 4610
Joined: Wed Oct 01, 2008 6:33 am
Location: Regensburg, Germany
Full name: Guenther Simon

Re: CCRL Testing (@Testers)

Post by Guenther »

lkaufman wrote: Sat Mar 27, 2021 7:46 pm ...

CCRL blitz switched to increment play (2' + 1" adapted to hardware) about a year ago, so 40/x play should be irrelevant to that list unless perhaps the games were played before the switch.
The thread was concerning Demolito_20181029 :)

BTW are you sure they switched to inc? I thought they switched from 40/4 to 40/2.

Edit:
Oh really, I never looked closely enough at the header since then. I wonder how this works out,
especially with all the older XB programs, in the lower half at least. They often had problems
with inc, or couldn't play that tc at all. Are they no longer tested?