abulmo2 wrote: ↑Fri Mar 26, 2021 10:15 pm
Desperado wrote: ↑Thu Mar 25, 2021 12:03 pm
Hello Graham.
That's fine, I mean testing against many opponents.
But I am not talking about a 5, 10 or 20 Elo gap with some testing inconsistencies; it is more than 100 Elo!
Please don't get me wrong: I don't want to offend anyone or suggest that someone does not know what he is doing as a tester.
I am simply interested in how that can be, because I also know what I am doing.
You are too confident in the Elo model. It supposes that, in head-to-head matches, if player A > B and player B > C then A > C, but in practice it is quite possible that C > A. So you cannot compare your results to the ones from CCRL or CEGT, where a gauntlet against many different opponents is played.
Hi, I am aware that Elo is not transitive (the A, B, C example). It is not about Demolito possibly being stronger in the direct match and generally weaker against a pool of different engines. I know that.
It is about the gap of +50 vs. -70, which is more than 100 Elo! That is an anomaly (while the error bars in the lists are about +-15 Elo).
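To put a gap like that next to the error bars, here is a minimal Python sketch of the usual logistic conversion from match score to Elo difference, with a rough 95% interval taken from the per-game variance. The function names and the win/draw/loss counts are made up for illustration; they are not numbers from any of the tests discussed here, and the rating lists may compute their error bars differently.

```python
import math

def elo_from_score(score):
    """Elo difference implied by a score fraction in (0, 1), logistic model."""
    return -400.0 * math.log10(1.0 / score - 1.0)

def match_elo(wins, draws, losses, z=1.96):
    """Return (elo, low, high) for one head-to-head match.

    Assumes independent games, which is only an approximation when
    openings are played with reversed colours.
    """
    n = wins + draws + losses
    score = (wins + 0.5 * draws) / n
    # Per-game variance of the result, counted as 1, 0.5 or 0 points.
    var = (wins * (1.0 - score) ** 2
           + draws * (0.5 - score) ** 2
           + losses * (0.0 - score) ** 2) / n
    se = math.sqrt(var / n)  # standard error of the mean score
    return (elo_from_score(score),
            elo_from_score(score - z * se),
            elo_from_score(score + z * se))

# Hypothetical match result, for illustration only.
elo, low, high = match_elo(wins=400, draws=300, losses=300)
print(f"{elo:+.1f} Elo, 95% interval [{low:+.1f}, {high:+.1f}]")
```

If two such intervals from different test setups do not even come close to overlapping, something other than statistical noise is probably involved.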
I, for my part, of course accept the results by CCRL and CEGT, but I also know that my test match was done the way I have done it for the last decade!
And I also trust my test framework to produce reliable results.
So there is a conflict between the results, and I want to figure out the reason. My experience tells me that I must be open-minded to everything: maybe I am doing something wrong, maybe I use outdated binaries, maybe my book has a big influence, maybe over many years there is a systematic procedure that leads to somewhat strange results in the lists. (Fictitious example: Engine A plays against B, C, D and Engine E plays F, G, H. Now there would not be any relation between the two Elo numbers, although both play against many opponents!)
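The fictitious example can be checked mechanically. A minimal Python sketch, assuming the pairings are available as a plain list of (engine, opponent) tuples (the function name and the data layout are hypothetical): it groups engines into pools that are linked by at least one chain of games. Ratings in different pools have no common reference, no matter how many games were played inside each pool.

```python
from collections import defaultdict

def connected_pools(pairings):
    """Group engines into pools linked by at least one chain of games."""
    graph = defaultdict(set)
    for a, b in pairings:
        graph[a].add(b)
        graph[b].add(a)

    seen, pools = set(), []
    for start in graph:
        if start in seen:
            continue
        # Depth-first walk over everything reachable from this engine.
        pool, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node in pool:
                continue
            pool.add(node)
            stack.extend(graph[node] - pool)
        seen |= pool
        pools.append(pool)
    return pools

# The fictitious example: A plays B, C, D and E plays F, G, H.
pairings = [("A", "B"), ("A", "C"), ("A", "D"),
            ("E", "F"), ("E", "G"), ("E", "H")]
for pool in connected_pools(pairings):
    print(sorted(pool))
```

A single extra game between the two pools would merge them, but with very few cross-pool games the link stays weak, so the relative offset between the pools remains poorly determined.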
There are so many possibilities; it is not about there being two lists giving similar information, or about knowing the transitivity topic.
In the end it will be the sum of 3, 4 or 5 major factors.
My guess at the moment is that there are several things that add up. The time model has more impact than I expected. I can already confirm Guenther's idea (as a guess, that can make a difference of 30 to 40 Elo). So far we have all used different books and combined the binaries in different ways. It goes on: my machine is a TR 2950X, so if an engine is compiled with pext, its speed and strength would drop drastically, which would be another measurable effect...
Many things to look at...
And it is worth examining this subject! I have already updated my engine's time management, where I previously only used a fixed MTG constant. Playing MVS/TME, it is now already 35 Elo stronger than before! (This is only one positive side effect besides learning and collecting experience.) For now I did not want to talk about technical details like pext and AMD, because things like that can be resolved by using the correct binaries for the test.
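To illustrate why a fixed MTG (moves-to-go) constant can hurt under a moves/time control, here is a minimal Python sketch of a per-move time budget. It is not the actual code of any engine in this thread; the function name, the default of 30 moves and the 50 ms safety margin are assumptions chosen for the example.

```python
def allocate_ms(remaining_ms, increment_ms=0, moves_to_go=None,
                default_mtg=30, safety_ms=50):
    """Time budget for the next move, in milliseconds.

    If the time control reports how many moves remain until the next
    time bonus, use that; otherwise fall back to a fixed constant
    (a moves_to_go of 0 or None is treated as "unknown").
    """
    mtg = moves_to_go if moves_to_go else default_mtg
    budget = remaining_ms / mtg + increment_ms
    # Never plan to use more than what is on the clock minus a margin.
    return max(1, int(min(budget, remaining_ms - safety_ms)))

# Fixed constant: 60 s on the clock is always spread over 30 moves.
print(allocate_ms(60_000))
# Moves/time control with 10 moves left before the next time bonus.
print(allocate_ms(60_000, moves_to_go=10))
```

With the fixed constant the engine spends the same fraction of its clock whether 35 or 5 moves remain before the next time bonus, so it tends to leave time unused near the control. Using the reported moves-to-go removes that waste in this simplified model.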
So, thx for the hint, but there are real issues involved and nobody is doing anything wrong. It is just a puzzle to be solved.