Stockfish Dev vs Stockfish 8 (1 Year anniversary)

Discussion of computer chess matches and engine tournaments.

Moderators: hgm, Rebel, chrisw

gogamoga
Posts: 33
Joined: Sat May 21, 2016 9:45 am

Stockfish Dev vs Stockfish 8 (1 Year anniversary)

Post by gogamoga »

One year has passed since the last official release :)

2 threads per engine
GUI: Cutechess-cli
Hash: 256 MB
TC: 180s+1
Opening book: HERT openings
Syzygy: 5-pieces
Games: http://www26.zippyshare.com/v/Gngwx6H4/file.html
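
For anyone wanting to reproduce a similar match, the setup above corresponds roughly to a cutechess-cli invocation along these lines (engine paths, the book file name, and the tablebase path are illustrative, and exact flags may differ between cutechess-cli versions):

```shell
cutechess-cli \
  -engine cmd=./stockfish-dev name="Stockfish 311017" \
  -engine cmd=./stockfish-8   name="Stockfish 8" \
  -each proto=uci tc=180+1 option.Hash=256 option.Threads=2 \
  -openings file=hert.pgn format=pgn \
  -tb /path/to/syzygy \
  -games 2 -rounds 500 -repeat \
  -pgnout games.pgn
```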

Code: Select all

   # PLAYER              :    Elo  Error  POINTS  PLAYED   (%)  CFS(next)    W    D    L  D(%)
   1 Stockfish 311017    :   3218      5   550.0    1000  55.0        100  166  768   66  76.8
   2 Stockfish 8         :   3182      5   450.0    1000  45.0        ---   66  768  166  76.8

White advantage = 31.96 +/- 5.30
Draw rate (equal opponents) = 80.04 % +/- 1.36

Code: Select all

Games        : 1000 (finished)

White Wins   : 161 (16.1 %)
Black Wins   : 71 (7.1 %)
Draws        : 768 (76.8 %)
Unfinished   : 0

White Score  : 54.5 %
Black Score  : 45.5 %

Stockfish 190817 - SF8: +26 ELO (my previous test)
Stockfish 311017 - SF8: +35 ELO
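
As a sanity check (my own addition, not part of the test output): the +35 figure follows directly from the 55% score via the standard logistic Elo formula.

```python
import math

def score_to_elo(score):
    """Elo difference implied by a score fraction under the logistic model."""
    return -400.0 * math.log10(1.0 / score - 1.0)

print(round(score_to_elo(0.550)))  # -> 35, matching the 55.0% overall score
print(round(score_to_elo(0.545)))  # -> 31, close to the white-advantage figure
```

Ordo's white-advantage number (31.96) comes from its own model, so it will not coincide exactly with the naive logistic value.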


est. SF9 release date - May 2018 - ...
Hai
Posts: 598
Joined: Sun Aug 04, 2013 1:19 pm

Re: Stockfish Dev vs Stockfish 8 (1 Year anniversary)

Post by Hai »

gogamoga wrote:One year has passed since the last official release :)

2 threads per engine
GUI: Cutechess-cli
Hash: 256 MB
TC: 180s+1
Opening book: HERT openings
Syzygy: 5-pieces
Games: http://www26.zippyshare.com/v/Gngwx6H4/file.html

Code: Select all

   # PLAYER              :    Elo  Error  POINTS  PLAYED   (%)  CFS(next)    W    D    L  D(%)
   1 Stockfish 311017    :   3218      5   550.0    1000  55.0        100  166  768   66  76.8
   2 Stockfish 8         :   3182      5   450.0    1000  45.0        ---   66  768  166  76.8

White advantage = 31.96 +/- 5.30
Draw rate (equal opponents) = 80.04 % +/- 1.36

Code: Select all

Games        : 1000 (finished)

White Wins   : 161 (16.1 %)
Black Wins   : 71 (7.1 %)
Draws        : 768 (76.8 %)
Unfinished   : 0

White Score  : 54.5 %
Black Score  : 45.5 %

Stockfish 190817 - SF8: +26 ELO (my previous test)
Stockfish 311017 - SF8: +35 ELO


est. SF9 release date - May 2018 - ...
Thx, that means we need ~2 years before Stockfish 9 is released.
Or someone could donate machines to the framework; only +200% more machines are needed.
Jouni
Posts: 3291
Joined: Wed Mar 08, 2006 8:15 pm

Re: Stockfish Dev vs Stockfish 8 (1 Year anniversary)

Post by Jouni »

There are now 8 pull requests waiting, and additionally capture_history. It gives +3 to +5 Elo by changing a "1" to a "2" in the code :!: Exactly the same as 1 gigabyte of Syzygy data.
Jouni
Uri Blass
Posts: 10297
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: Stockfish Dev vs Stockfish 8 (1 Year anniversary)

Post by Uri Blass »

Jouni wrote:There are now 8 pull request waiting. And additionally capture_history. It gives +3 to +5 ELO by changing "1" to "2" in code :!: . Exactly same as 1 gigabyte of syzygy data.
We have no data about the Elo improvement, because the Stockfish team is not interested in it and does not test accepted changes with a fixed number of games to get a good estimate.

In order to know that a change gives +3 to +5 Elo you need a fixed number of games; I think 80,000 games would be good enough, but nobody does it.
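
Uri's 80,000-game figure can be sanity-checked with a little statistics: the Elo resolution of a fixed-games match depends only on the number of games and the draw rate. A back-of-the-envelope Python sketch (my own, assuming balanced wins and losses near a 50% score):

```python
import math

def elo_error(n_games, draw_rate, z=1.96):
    """Half-width of the ~95% Elo confidence interval for a fixed-games match.

    Assumes a score near 50% with balanced wins and losses;
    back-of-the-envelope only.
    """
    var = (1.0 - draw_rate) / 4.0        # per-game score variance
    se_score = math.sqrt(var / n_games)  # standard error of the mean score
    slope = 1600.0 / math.log(10)        # Elo per unit of score at 50%
    return z * slope * se_score

print(round(elo_error(80000, 0.80), 2))  # -> 1.08, about +/-1 Elo
print(round(elo_error(1000, 0.77), 1))   # -> 10.3, a 1000-game match is coarse
```

So 80,000 games at a typical Fishtest draw rate resolves roughly +/-1 Elo, which is indeed what a claim like "+3 to +5 Elo" would require.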
jdart
Posts: 4367
Joined: Fri Mar 10, 2006 5:23 am
Location: http://www.arasanchess.org

Re: Stockfish Dev vs Stockfish 8 (1 Year anniversary)

Post by jdart »

I have a question: I don't follow Stockfish testing very closely, but it appears to me that sometimes multiple variants of a patch are tested, with the goal of finding the best one.

So if they don't have an absolute Elo measure, how is this done?

Does the patch that produces a positive SPRT with the fewest games get chosen? Or is a "green" patch committed and then variants tested against that?

--Jon
User avatar
Eelco de Groot
Posts: 4567
Joined: Sun Mar 12, 2006 2:40 am
Full name:   

Re: Stockfish Dev vs Stockfish 8 (1 Year anniversary)

Post by Eelco de Groot »

The regression tests are simply a fixed number of games, so those are 'traditional' tests that measure Elo. Only the SPRT tests are stopped early, because their purpose is to reduce the number of games where possible. But Michel van den Berg has a script to calculate Elo with statistically sound margins from the SPRT runs as used by Stockfish's Fishtest. I can't say I've studied it, but with it you can calculate numbers that everybody can compare.

I don't know if there are rules for comparing different versions. Often, if something passes STC you still don't know whether it will pass LTC. It takes some discipline to start comparing different versions at STC when you don't yet know for certain that one will pass LTC, but it has been done like this: after one version passed LTC, go back to STC and test the different versions of the patch, I think not with SPRT but just with a fixed number of games. Then the best version is tested with an LTC SPRT again.

An alternative is to simply commit once something has passed and try to improve on that version with [0,4] SPRT bounds. But testing with a fixed number of games at STC only goes much faster. With the alternative of passing a [0,4] SPRT you are (a bit) more certain that a new version is really better at LTC, unless the fixed-games run showed a very clear pattern.
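
To make the [0,4] SPRT bounds concrete, here is a simplified sketch of the log-likelihood-ratio computation (the common logistic approximation; Fishtest's actual implementation is more refined, so treat the numbers as illustrative):

```python
import math

def elo_to_score(elo):
    """Expected score for an Elo advantage under the logistic model."""
    return 1.0 / (1.0 + 10.0 ** (-elo / 400.0))

def sprt_llr(wins, draws, losses, elo0=0.0, elo1=4.0):
    """Approximate SPRT log-likelihood ratio for H1 (elo1) vs H0 (elo0)."""
    n = wins + draws + losses
    mean = (wins + 0.5 * draws) / n
    var = (wins + 0.25 * draws) / n - mean * mean  # per-game score variance
    s0, s1 = elo_to_score(elo0), elo_to_score(elo1)
    return n * (s1 - s0) * (2.0 * mean - s0 - s1) / (2.0 * var)

# With alpha = beta = 0.05, the test stops once the LLR leaves
# (log(0.05/0.95), log(0.95/0.05)) = (-2.94, +2.94).
upper = math.log(0.95 / 0.05)
print(round(sprt_llr(166, 768, 66), 2), round(upper, 2))  # 4.89 vs 2.94
```

Fed the W/D/L totals from the match above, the LLR lands well past the upper bound, i.e. such a result would comfortably pass a [0,4] SPRT.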
Debugging is twice as hard as writing the code in the first
place. Therefore, if you write the code as cleverly as possible, you
are, by definition, not smart enough to debug it.
-- Brian W. Kernighan
jdart
Posts: 4367
Joined: Fri Mar 10, 2006 5:23 am
Location: http://www.arasanchess.org

Re: Stockfish Dev vs Stockfish 8 (1 Year anniversary)

Post by jdart »

That sort of makes sense, although I have found that STC results are a bad predictor of LTC results, so using them to select a patch is iffy.

By the way it appears regression tests are no longer done?
At least http://tests.stockfishchess.org/regression has no recent tests (nothing past 2015).

--Jon
User avatar
Eelco de Groot
Posts: 4567
Joined: Sun Mar 12, 2006 2:40 am
Full name:   

Re: Stockfish Dev vs Stockfish 8 (1 Year anniversary)

Post by Eelco de Groot »

jdart wrote:That sort of makes sense, although I have found that STC results are a bad predictor of LTC results, so using them to select a patch is iffy.
Hi Jon,

Yes, basically I agree: you have to be aware of scaling. Just as an example, take the recent test of asmFish by Stefan Pohl. The pure speed advantage of asmFish has less Elo impact under his new test conditions: longer time controls, but I think the book was different as well. If we ignore the book, I think it is true that any patch that differs only in speed would have more trouble passing LTC.

In the case of testing several versions that I remember, it was done by Marco to tune some numerical parameters of the search patch that was later called 'improving'. There was a clear pattern visible in the STC results, and you can play many more games in the same time at STC. Another property of Fishtest is that it runs on many different machines, so testing at STC actually covers a range of time controls; that reduces possible artifacts of one very specific time control just reaching, or not reaching, a certain depth.

Statistically, the time control should not matter as far as the confidence intervals are concerned, but as the number of draws increases, test results, at least under the theory of BayesElo, which puts more value on draws, profit from that increase in draws. But that is more theoretical, and some people would question it as a reason to trust long time controls more. I just thought this up; I'm still not clear myself on what is involved here (other than some things scaling or not scaling, of course).
By the way it appears regression tests are no longer done?
At least http://tests.stockfishchess.org/regression has no recent tests (nothing past 2015).
--Jon
Well, under that tab there are mainly tests from Jean-Francois Romang, but the number of games he could play on his own computer was not enough. Later Stefan Pohl came along with his very valuable testing!

I was referring to the regression tests run against the official version of Stockfish, now Stockfish 8. The idea is that it is a bit weaker, so it can simulate weaker engines, and it acts as a fixed point in time. The test results are a bit difficult to find, but they are run regularly. Testing against other engines is not done on the framework, one reason being that people object to having other programs, whose internal code is not very clear, running on their machines.
Debugging is twice as hard as writing the code in the first
place. Therefore, if you write the code as cleverly as possible, you
are, by definition, not smart enough to debug it.
-- Brian W. Kernighan