future of top engines:how much more elo?


Ovyron
Posts: 4556
Joined: Tue Jul 03, 2007 4:30 am

Re: future of top engines:how much more elo?

Post by Ovyron »

Zenmastur wrote: Wed Jul 31, 2019 11:42 pm Like Kai said, it's not worth their efforts. Just do the math and be content that it's that easy. If you want more precision you'll have to run the tests yourself!
If I were a tester, I'd have done such a test a long time ago. I can't believe I'm so unique that no other tester would think the same and do the actual testing.

At least, this could be more useful than testing another engine even more just to figure out that its rating error is now ±4 instead of ±5 :roll:
Your beliefs create your reality, so be careful what you wish for.
Dann Corbit
Posts: 12541
Joined: Wed Mar 08, 2006 8:57 pm
Location: Redmond, WA USA

Re: future of top engines:how much more elo?

Post by Dann Corbit »

I don't think that the tests will produce a satisfactory answer even if performed.
What they would answer is "What is the relative strength of SF 9 at (for example) the CCRL 40/40 time control versus the CCRL 40/4 time control?" and so on for the other engines.
But it would only be exactly correct for the given hardware, software, thread count, and time controls specified.

In the real world, we all have different hardware. And probably most of us (when running Stockfish) are running bleeding edge Stockfish or Cfish rather than SF9 or SF10.

The utility of lists like CCRL and CEGT is to see the relative strength of chess engines at several standardized time controls.

If, for instance, the lists had absolutely identical rankings, then we could conclude that all engines scale with the exact same SMP and search losses. Since that won't be the case, we can see (perhaps) which engines tend to scale better with more time and/or more threads.

For pretty much every ranking, the same engines will all be in the top five. So I don't think there are any spooky mysteries to be solved here.
Taking ideas is not a vice, it is a virtue. We have another word for this. It is called learning.
But sharing ideas is an even greater virtue. We have another word for this. It is called teaching.
jp
Posts: 1470
Joined: Mon Apr 23, 2018 7:54 am

Re: future of top engines:how much more elo?

Post by jp »

Ovyron wrote: Thu Aug 01, 2019 12:03 am If I was a tester I'd have done such a test long time ago, I can't believe I'm so unique that it's impossible for another tester to think the same and do the actual testing.

At least, this could be more useful than testing another engine even more just to figure out its rating is now +-4 instead of +-5 :roll:
Even worse is that testers appear happy to test 20 different versions and clones of the same engine, which could also be said to be not the best use of testing time.
Dann Corbit
Posts: 12541
Joined: Wed Mar 08, 2006 8:57 pm
Location: Redmond, WA USA

Re: future of top engines:how much more elo?

Post by Dann Corbit »

jp wrote: Thu Aug 01, 2019 2:01 am
Ovyron wrote: Thu Aug 01, 2019 12:03 am If I was a tester I'd have done such a test long time ago, I can't believe I'm so unique that it's impossible for another tester to think the same and do the actual testing.

At least, this could be more useful than testing another engine even more just to figure out its rating is now +-4 instead of +-5 :roll:
Even worse is that testers appear happy to test 20 different versions and clones of the same engine, which could also be said to be not the best use of testing time.
Of course, there was a great hue and cry when the clones were not tested.
So, I guess, you can't please everybody.
Ovyron
Posts: 4556
Joined: Tue Jul 03, 2007 4:30 am

Re: future of top engines:how much more elo?

Post by Ovyron »

Dann Corbit wrote: Thu Aug 01, 2019 2:28 am Of course, there was a great hue and cry when the clones were not tested.
So, I guess, you can't please everybody.
There are other testers, not from the rating lists, who dedicate all their time to testing Stockfish derivatives exclusively, so I guess what volunteers could do is just test whichever one is already known to be the best.

(because... the clones are a necessary evil, as Stockfish 10 has gotten old and irrelevant, especially after the July release, when Stockfish 10 became significantly weaker than Stockfish dev. So you ought to test some stronger clone, and only Stockfish's devs are to blame for this, with their releases. Stockfish 2 was legendary with its releases; I continued using 2.1.1 up until the Stockfish 8 release. So where's my Stockfish 10.1.1, Stockfish 10.2.2, Stockfish 10.3.1 JA, Stockfish 10 Granseries 2 Persistent Hash? I'm basically being forced to use a clone...)
jp
Posts: 1470
Joined: Mon Apr 23, 2018 7:54 am

Re: future of top engines:how much more elo?

Post by jp »

Dann Corbit wrote: Thu Aug 01, 2019 2:28 am
jp wrote: Thu Aug 01, 2019 2:01 am Even worse is that testers appear happy to test 20 different versions and clones of the same engine, which could also be said to be not the best use of testing time.
Of course, there was a great hue and cry when the clones were not tested.
So, I guess, you can't please everybody.
To be clear, I am not disturbed that they test all the clones if that's what they want to do (though it's strange when CCC etc. choose to play engine tournaments with 6 clones of one engine). It's just that if we start judging what is tested or not, it all becomes subjective.
Ozymandias
Posts: 1535
Joined: Sun Oct 25, 2009 2:30 am

Re: future of top engines:how much more elo?

Post by Ozymandias »

Ovyron wrote: Wed Jul 31, 2019 10:57 pm I guess all these discussions are useless, the rating lists are built from volunteer work and what those volunteers want to test (that's why Stockfish 9 tops the 40/4 list...
I'd say the latter is a different problem which has to do with credibility. I pointed out to Graham from the get go (of SF10 testing) that something was wrong, and yet the more games that are poured into testing, the less reliable the rating becomes. If it were a middle placed engine, very few people would notice, but we're talking about the number one spot; I can hardly imagine this being good news for their reputation.
Robert Pope
Posts: 558
Joined: Sat Mar 25, 2006 8:27 pm

Re: future of top engines:how much more elo?

Post by Robert Pope »

Ozymandias wrote: Thu Aug 01, 2019 9:04 am
Ovyron wrote: Wed Jul 31, 2019 10:57 pm I guess all these discussions are useless, the rating lists are built from volunteer work and what those volunteers want to test (that's why Stockfish 9 tops the 40/4 list...
I'd say the latter is a different problem which has to do with credibility. I pointed out to Graham from the get go (of SF10 testing) that something was wrong, and yet the more games that are poured into testing, the less reliable the rating becomes. If it were a middle placed engine, very few people would notice, but we're talking about the number one spot; I can hardly imagine this being good news for their reputation.
Sorry, I'm not up to speed on this. What is wrong with the SF10 testing?
Ozymandias
Posts: 1535
Joined: Sun Oct 25, 2009 2:30 am

Re: future of top engines:how much more elo?

Post by Ozymandias »

Robert Pope wrote: Thu Aug 01, 2019 3:28 pm What is wrong with the SF10 testing?
Stockfish 9 64-bit 4CPU has a rating of 3547 with error bars of +12/−12 while Stockfish 10 64-bit 4CPU has 3546 with the same error margin. That means that even in the best case scenario, SF10 would rise to 23 Elo points ahead of SF9. That's less than the difference we see in the 40/40 list (29), CEGT 40/4 (45) or CEGT 40/20 (26). Those are the lists with 4CPU rankings.

Is it possible that the situation would rectify itself if millions of games were played under the same conditions? Just barely: it would still be the smallest advantage on any 4CPU listing, and even so, it is hard to believe that it would. I'd rather say there's something wrong with this particular testing.
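
The best-case arithmetic in this post can be checked with a short sketch. It assumes the quoted +12/−12 margins can be treated as simple symmetric bounds on each rating (the thread does not say what confidence level they represent), and the function name `best_case_gap` is made up for illustration.

```python
def best_case_gap(newer, newer_margin, older, older_margin):
    """Largest lead for the newer engine if it sits at the top of its
    quoted error margin and the older engine sits at the bottom of its own."""
    return (newer + newer_margin) - (older - older_margin)


# CCRL 40/4, 4CPU figures quoted in the post above.
sf10, sf9, margin = 3546, 3547, 12

gap = best_case_gap(sf10, margin, sf9, margin)
print("Best-case SF10 lead on CCRL 40/4:", gap)

# Nominal SF10-minus-SF9 differences quoted in the post for the other lists.
other_lists = {"CCRL 40/40": 29, "CEGT 40/4": 45, "CEGT 40/20": 26}

for name, diff in other_lists.items():
    # True means the nominal lead on that list exceeds the CCRL 40/4 best case.
    print(name, diff, diff > gap)
```

With these inputs the best-case lead comes out at 23, and each of the three nominal differences quoted for the other lists is larger than that.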
Dann Corbit
Posts: 12541
Joined: Wed Mar 08, 2006 8:57 pm
Location: Redmond, WA USA

Re: future of top engines:how much more elo?

Post by Dann Corbit »

Here are some rating lists for SF at 40/40, 40/20, and 40/4:


CCRL
40/40
Stockfish 10 64-bit 4CPU	3461	+18	-18
Stockfish  9 64-bit 4CPU	3432	+15	-15

So SF 10 at 40/40 could be as high as 3479 or as low as 3443
So SF 9 at 40/40 could be as high as 3447 or as low as 3417
We are not totally sure which one is stronger

40/4
Stockfish  9 64-bit 4CPU	3547	+12	-12
Stockfish 10 64-bit 4CPU	3546	+12	-12

So SF 9 at 40/4 could be as high as 3559 or as low as 3535
So SF 10 at 40/4 could be as high as 3558 or as low as 3534
We are not totally sure which one is stronger

CEGT 
40/20
Stockfish 10.0 x64 8CPU	3518	+21	-21
Stockfish  9.0 x64 8CPU	3493	+24	-24

So Stockfish 10 at 40/20 could be as high as 3539 or as low as 3497
So Stockfish 9 at 40/20 could be as high as 3517 or as low as 3469
We are not totally sure which one is stronger

40/4
Stockfish 10.0 x64 4CPU	3548	+16	-16
Stockfish  9.0 x64 4CPU	3503	+17	-17

So SF 10 at 40/4 could be as high as 3564 or as low as 3532
So SF 9 at 40/4 could be as high as 3520 or as low as 3486
It appears that SF 10 is stronger, within a couple of standard deviations, but not by much.
Which measurements exactly do you have an issue with, and what evidence do you propose to show that the measurements are wrong?
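
The "as high as / as low as" comparison in the listing can be written out as a small sketch. It treats each rating plus or minus its margin as a plain interval and asks whether the SF10 and SF9 intervals overlap. This is a crude check rather than a proper significance test, and the helper names `interval` and `overlaps` are made up for illustration.

```python
def interval(elo, margin):
    """Return (low, high) for a rating with a symmetric error margin."""
    return elo - margin, elo + margin


def overlaps(a, b):
    """True if two closed intervals (low, high) share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


# ((SF10 rating, margin), (SF9 rating, margin)) as listed in the post above.
lists = {
    "CCRL 40/40": ((3461, 18), (3432, 15)),
    "CCRL 40/4": ((3546, 12), (3547, 12)),
    "CEGT 40/20": ((3518, 21), (3493, 24)),
    "CEGT 40/4": ((3548, 16), (3503, 17)),
}

results = {}
for name, (sf10, sf9) in lists.items():
    sf10_range = interval(*sf10)
    sf9_range = interval(*sf9)
    results[name] = overlaps(sf10_range, sf9_range)
    verdict = "overlap, order uncertain" if results[name] else "no overlap"
    print(f"{name}: SF10 {sf10_range}, SF9 {sf9_range} -> {verdict}")
```

On these figures only CEGT 40/4 shows non-overlapping intervals, which matches the conclusions written under each list.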