Stockfish now benefits from hyperthreading

Gusev · Post by **Gusev** » Thu Nov 19, 2015 11:11 pm

The reported NPS did increase, but I recall that also to be the case with good old Houdini, as once reported by Robert Houdart, except it didn't buy Houdini any Elo strength. Ditto Firenzina, ditto Stockfish before LazySMP, quote from [url]http://Do not post this link here![/url]

I tested the impact of hyperthreading and 32-bit v. 64-bit on my new research desktop with an i7-5960x 8-core processor.
Test 1. Firenzina 2.4.1 x64 (8 threads) - Firenzina 2.4.1 x64 (16 threads), 509.5:490.5 (+7 Elo), 1'+1", 32M hash, GUI: ArenaBlitzer.
Test 2. Firenzina 2.4.1 x64 (8 threads) - Firenzina 2.4.1 x64 (12 threads), 501.5:498.5 (+124-121=755, +1 Elo), 1'+1", 32M hash, GUI: ArenaBlitzer.
Test 3. Firenzina 2.4.1 x64 (8 threads) - Firenzina 2.4.1 x64 (4 threads), 1066:982 (+403-319=1326, +14 Elo), 2"+0.25", 256M hash, GUI: ArenaBlitzer. (It turned out that ArenaBlitzer cannot play more than 2048 games.)
Test 4. Stockfish 14102815 modern (8 threads) - Stockfish 14102815 modern (16 threads), 1034:966 (+303-235=1462, +12 Elo), 2"+0.25", 256M hash, GUI: ArenaBlitzer.
Test 5. Firenzina 2.4.1 x64 (8 threads) - Firenzina 2.4.1 x64 (16 threads), 389.5:380.5 (+107-98=565, +4 Elo), 1'+1", 256M hash, GUI: Deep Fritz 14.
Test 6. Firenzina 2.3.1 WinXP x32 (8 threads) - Firenzina 2.3.1 Win7 x32 (8 threads), 1038:962 (+13 Elo), 2"+0.25", 256M hash, GUI: ArenaBlitzer.
Test 7. Firenzina 2.3.1 x64 (05/23/2013, 8 threads) - Firenzina 2.3.1 WinXP x32 (8 threads), 1331.5:668.5 (+805-142=1053, +120 Elo), 2"+0.25", 256M hash, GUI: ArenaBlitzer.
OS: Windows 8.1 Enterprise.
Conclusions: Neither Firenzina nor Stockfish benefits from hyperthreading. The 64-bit compile easily beats the 32-bit compile on the modern 64-bit computer, as expected.
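
For readers who want to check the Elo figures quoted in the tests above, here is a minimal sketch of the usual conversion from match score to Elo difference, with an approximate 95% interval. It assumes the standard logistic Elo model and independent games; the function names are made up for illustration and are not taken from ArenaBlitzer or any other tool mentioned here.

```python
import math

def elo_diff(score_fraction):
    """Elo difference implied by a score fraction under the logistic model."""
    return -400.0 * math.log10(1.0 / score_fraction - 1.0)

def elo_with_margin(wins, losses, draws, z=1.96):
    """Return (elo, low, high) using a normal approximation of the mean score."""
    n = wins + losses + draws
    mean = (wins + 0.5 * draws) / n
    # Per-game variance of a result that is worth 1, 0.5 or 0 points
    var = (wins * (1.0 - mean) ** 2
           + draws * (0.5 - mean) ** 2
           + losses * (0.0 - mean) ** 2) / n
    se = math.sqrt(var / n)  # standard error of the mean score
    return elo_diff(mean), elo_diff(mean - z * se), elo_diff(mean + z * se)

# Test 4: Stockfish 8 threads vs. 16 threads, +303-235=1462
elo4, low4, high4 = elo_with_margin(303, 235, 1462)
print(f"Test 4: {elo4:+.1f} Elo, approx. 95% interval [{low4:+.1f}, {high4:+.1f}]")

# Test 2: Firenzina 8 threads vs. 12 threads, +124-121=755
elo2, low2, high2 = elo_with_margin(124, 121, 755)
print(f"Test 2: {elo2:+.1f} Elo, approx. 95% interval [{low2:+.1f}, {high2:+.1f}]")
```

The interval only reflects sampling noise under the independence assumption. It says nothing about timing variability or non-deterministic parallel search.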

bob · Post by **bob** » Fri Nov 20, 2015 1:07 am

Have you run the SAME test twice to see the volatility? This is not exactly a large sample size...

Gusev · Post by **Gusev** » Sat Nov 21, 2015 4:53 am

Robert,

Prior to the latest test results, the matter of how well different chess engines can exploit hyperthreading seemed like a non-issue to me. This was part of my preparation for FOSCEC Season 3, which is coming next semester. And now, all of a sudden, it tentatively looks like setting the number of threads to the same value may not always be the fairest way to compare two engines running on the same HT-capable system.

If you can suggest a good venue for publishing the results, then I'd be happy to explore this matter together with you.

On the sample size, I see the subtle difference between running the test twice and doubling the size of the sample, but the two would be almost the same thing. ArenaBlitzer allows matches of up to 2048 games.

History knows published research in other areas, such as color science and computational linguistics, where conclusions were drawn from smaller samples. It's not unusual to run a psychophysical experiment in color science on as few as 16 or 64 human subjects and publish a paper. Using 1000 human observers may be unfeasible and/or prohibitively expensive, difficult to justify to the study's sponsors.

Likewise, genealogy of languages was studied using databases containing the words for 35 (Yakhontov), 50 (Starostin), 100, 200, 207 (Swadesh), or 421 (Ringe, Warnow, Taylor 2002) basic meanings, such as 'I', 'water', 'walk', 'star', etc. https://en.wikipedia.org/wiki/Swadesh_list. To put things in perspective, I would liken these meanings to chess openings in a test suite. Except with languages, there are just two outcomes of a "vote", where a meaning acts as a "voter". Either two languages are judged to have a common origin, say, based on the measured similarity of "water" (English) and "voda" (Russian), or they are judged to be unrelated from "mountain" (English, borrowed from French) being dissimilar to "gora" (Russian). If a "voter" is a chess game, we have three outcomes: Engine A is stronger than Engine B (1:0), Engine B is stronger than Engine A (0:1), or the two engines are equally strong (1/2:1/2). All "voters" are biased in favor of the engine playing the white pieces, and we know how this is remedied. Somewhat similarly, in comparative linguistics there are methods that assign more weight to the meanings that are more "stable" relative to others, such as 'louse'. The resulting genealogical trees of languages are a lot like genealogical trees of chess engines built using Don Dailey's similarity tool.

Speaking of voters, it's common to see published opinion poll results on political matters based on a statistical sample of 1000 respondents. Nate Silver recently showed that he can do better by cleverly processing more data. With likely voters and two candidates, we see direct similarity to chess: Will vote for Candidate A (1:0), will vote for Candidate B (0:1), undecided (1/2:1/2). While it looks like in chess we're better off when we can run very many games using a large, well-balanced mix of openings, the question will remain how the result would scale for longer TCs.

Best regards,
Dmitri

bob · Post by **bob** » Sat Nov 21, 2015 7:10 pm

Most base their sample size on the standard deviation between samples. That's where parallel search can be extremely messy to measure.

If you play games between two equal programs, seeing significant variance is quite normal in 1000-game matches. There is too much even in 30K-game matches, which give a measurement margin of about +/- 3 Elo...

Political polls are as accurate as the pollsters want to be. If the sample is truly random, you can get reasonably accurate results. But if there is bias in the sampling (say only calling those with cell phones, or something similar) it won't be so good...

I published (here) a ton of results several years ago showing just how variable games can be between two equal opponents, much less when using two equal opponents both with a non-deterministic parallel search component added in. Time variability is quite enough to screw the measurements up with smallish sample sizes (1000 games is a very small sample size...)
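
To make the sample-size point concrete, here is a rough sketch of how the error margin shrinks with the number of games when two equal engines play each other. The draw ratio of 0.6 is an assumption chosen for illustration, not a figure from this thread, and the games are treated as independent, which real matches with a parallel search may not satisfy. All names are hypothetical.

```python
import math
import random

def elo_from_score(score):
    """Elo difference implied by a score fraction (logistic model)."""
    return 400.0 * math.log10(score / (1.0 - score))

def margin_95(games, draw_ratio):
    """Approximate 95% half-width, in Elo, for a match between equal engines."""
    # With equal engines a game is worth 1 or 0 with probability (1 - d) / 2
    # each and 0.5 with probability d, so the per-game variance is (1 - d) / 4.
    se = math.sqrt((1.0 - draw_ratio) / (4.0 * games))
    return elo_from_score(0.5 + 1.96 * se)

def simulate_match(games, draw_ratio, rng):
    """Measured Elo of one simulated match between two equal engines."""
    win_p = (1.0 - draw_ratio) / 2.0
    points = 0.0
    for _ in range(games):
        r = rng.random()
        if r < win_p:
            points += 1.0          # win
        elif r < win_p + draw_ratio:
            points += 0.5          # draw
    return elo_from_score(points / games)

DRAW_RATIO = 0.6  # assumed, for illustration only

for n in (1000, 2000, 30000):
    print(f"{n:>6} games: about +/- {margin_95(n, DRAW_RATIO):.1f} Elo")

# Repeat the "same" 1000-game match ten times to see the volatility
rng = random.Random(2015)
repeats = [simulate_match(1000, DRAW_RATIO, rng) for _ in range(10)]
print("ten repeated 1000-game matches:", [round(e, 1) for e in repeats])
```

A higher draw ratio narrows the margin, so the numbers depend on the engines and time control being tested.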

Gusev · Post by **Gusev** » Sat Nov 21, 2015 10:02 pm

Thank you!! Let's not take the current, limited results too seriously. Which chess interface do you usually use for testing?

Gusev · Post by **Gusev** » Mon Nov 30, 2015 5:59 pm

As a follow-up, I ran a 2000-game match of SF 16 threads vs. 8 threads with the TC of 1'+1" over the Thanksgiving break on the 8-core i7-5960X, HT on. The result is in favor of 16 threads,
-----------------Stockfish-----------------
Stockfish - Stockfish_2015-11-03_x64_bmi2_MinGW : 1041.0/2000 216-134-1650 (==1===101===0==0=0===========010==11===========0===========1===========0====10====10================1=1=1===11====1===1========0========1=1=========1
============010==============1======0=====0======1====0=0==============0=============1=========1====1======1======0===============0===============
=1=====10==1=========0=00===========0============
==1=============10========1=======1===========1=============1==0===0======================1======
=======10==1========0==1===========1===11=1=====0
========1====01=====1==0======1====0==========0=1==================1=0=========================0=======0================1=======10====10========1=
==1===1==0=1=01==0===========0===0===========1==1===10====1==1==10===0========0=10===================1==============1====11=1=1====0====1=1=========
======1======0=========01======0==11=01=1=================1===========0=11======1=1====01=1====10===0=1=====1=1==1=================================
=======1====1=======1=======0=====0=====0=====0=0=01=1=10====0===101=1=1==========01=1=====1====1==========1====================0==1===1=1=1===1=====
=======1====1======0====11========1010====1===========================10==1======0=0=======1===0=======0101=====1======1======1=1======0======1====
==========0===========11===========1============
0======1=110===01==01====1==1==1=1========1=0======01=========1===========1========0===1=1==0=====0=
====0=1======11=======0==0==============0==1=====
===========1=1==1=1=1=1===0=====0=1===1=======1===
===1=0====01==========110============1===========1===100=1============0=======0===================
===01=1===0=1==11===========1====1=========1=1====
1===1=1=====1============0==================1==========0======1==========0==1=1==0101=====0=1=====0=====================0==0=0=====1============1=01
0===========1===========1=1======01=10========0====0===0==================1=======1===0=======10==
======================1===1=====================10====1==0=1=====0====10=0=====01=1=======10====1=1
0====1===10=110==1========1=1==1=) 52% +14
-----------------Stockfish_2015-11-03_x64_bmi2_MinGW-----------------
Stockfish_2015-11-03_x64_bmi2_MinGW - Stockfish : 959.0/2000 134-216-1650 (==0===010===1==1=1===========101==00===========1===========0===========1====01====01================0=0=0===00====0===0========1========0=0=========0
============101==============0======1=====1======0====1=1==============1=============0=========0====0======0======1===============1===============
=0=====01==0=========1=11===========1==============0=============01========0=======0===========0=============0==1===1======================0======
=======01==0========1==0===========0===00=0=====1========0====10=====0==1======0====1==========1=0
==================0=1=========================1=
======1================0=======01====01========0===0===0==1=0=10==1===========1===1===========0==0=
==01====0==0==01===1========1=01=================
==0==============0====00=0=0====1====0=0===============0======1=========10======1==00=10=0=================0===========1=00======0=0====10=0====01==
=1=0=====0=0==0========================================0====0=======0=======1=====1=====1=====1=1=10=0=01====1===010=0=0==========10=0=====0====0===
=======0====================1==0===0=0=0===0============0====0======1====00========0101====0===========================01==0======1=1=======0===1==
=====1010=====0======0======0=0======1======0==============1===========00===========0============1======0=001===10==10====0==0==0=0========0=1======
10=========0===========0========1===0=0==1=====1=====1=0======00=======1==1==============1==0================0=0==0=0=0=0===1=====1=0===0=======0===
===0=1====10==========001============0===========0===011=0============1=======1======================10=0===1=0==00===========0====0=========0=0====
0===0=0=====0============1==================0==========1======0==========1==0=0==1010=====1=0=====1=====================1==1=1=====0============0=10
1===========0===========0=0======10=01========1====1===1==================0=======0===1=======01========================0===0=====================
01====0==1=0=====1====01=1=====10=0=======01====0=01====0===01=001==0========0=0==0=) 48% -14
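
As a rough check on the three-number summary (216 wins, 134 losses, 1650 draws for 16 threads), here is a sketch of the likelihood of superiority (LOS) commonly used in engine testing. It relies on a normal approximation, assumes independent games, and ignores draws because they do not favor either side. The function name is hypothetical.

```python
import math

def likelihood_of_superiority(wins, losses):
    """Approximate probability that the first engine is the stronger one.

    Normal approximation: LOS = Phi((W - L) / sqrt(W + L)).
    """
    if wins + losses == 0:
        return 0.5  # no decisive games, no evidence either way
    return 0.5 * (1.0 + math.erf((wins - losses) / math.sqrt(2.0 * (wins + losses))))

# Follow-up match: Stockfish 16 threads vs. 8 threads
los = likelihood_of_superiority(216, 134)
print(f"LOS for 16 threads: {los:.4f}")
```

A high LOS only says that the score difference is unlikely to be sampling noise under these assumptions. It does not rule out a procedural error or a systematic effect of the test setup.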

AlvaroBegue · Post by **AlvaroBegue** » Mon Nov 30, 2015 7:42 pm

The long sequences of "0", "1" and "=" add nothing of value to your posts, and they make them hard to read. There is no state that transfers from game to game, so all the information is contained in the three-number summary.

Gusev · Post by **Gusev** » Tue Dec 01, 2015 2:28 am

They show that there weren't any suspiciously long streaks that might indicate malfunction. They also illustrate how stable the performance was through the set of openings.

bob · Post by **bob** » Tue Dec 01, 2015 11:42 pm

I am not sure that conclusion is supported by the data. Have you ever tested PRNGs, i.e. with things like the poker test, the runs test, etc.? NOT having a streak looks more problematic than having one; streaks are natural. Humans make VERY POOR judges of randomness in a string of numbers...

I have this problem with students all the time..
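
To put the point about natural streaks in numbers, here is a small sketch that shuffles a sequence with the same composition as the 2000-game match above (216 wins, 134 losses, 1650 draws) and records the longest run of each symbol. Shuffling is an assumption that models the games as independent and exchangeable. It is not a formal runs test, and the helper names are made up.

```python
import random
from itertools import groupby

def longest_run(results, symbol):
    """Length of the longest unbroken run of `symbol` in `results`."""
    return max((len(list(group)) for key, group in groupby(results) if key == symbol),
               default=0)

def simulated_longest_runs(wins, losses, draws, symbol, trials, rng):
    """Longest run of `symbol` in each of `trials` random orderings."""
    games = list("1" * wins + "0" * losses + "=" * draws)
    runs = []
    for _ in range(trials):
        rng.shuffle(games)  # random order, same win/loss/draw counts
        runs.append(longest_run(games, symbol))
    return runs

rng = random.Random(0)
draw_runs = simulated_longest_runs(216, 134, 1650, "=", 200, rng)
win_runs = simulated_longest_runs(216, 134, 1650, "1", 200, rng)

print("longest draw streak over 200 shuffles: min", min(draw_runs), "max", max(draw_runs))
print("longest win streak over 200 shuffles:  min", min(win_runs), "max", max(win_runs))
```

Running `longest_run` on the actual result string and comparing it with the simulated range is one way to judge whether an observed streak, or the lack of one, is unusual.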

Dann Corbit · Post by **Dann Corbit** » Tue Dec 01, 2015 11:55 pm

At 2000 games, unless there is a procedural error of some sort, this is enough games to show that hyperthreading is probably not a detriment for Lazy SMP.

In itself, that is incredibly surprising.
As we add more and more threads, we should have more and more SMP loss.
But we don't.

I would love to see a logical explanation for that.
I can't think of any.
