New AMD Ryzen™ Threadripper™ PRO 3995WX (Windows and Multithreading Problem)

Joerg Oster
Posts: 937
Joined: Fri Mar 10, 2006 4:29 pm
Location: Germany

Re: New AMD Ryzen™ Threadripper™ PRO 3995WX (Windows and Multithreading Problem)

Post by Joerg Oster »

Zerbinati wrote: Mon Apr 26, 2021 6:12 pm
Joerg Oster wrote: Mon Apr 26, 2021 5:52 pm
syzygy wrote: Mon Apr 26, 2021 5:15 pm
Zerbinati wrote: Mon Apr 26, 2021 8:41 am Ronald, does this mean that in each group I have 32 physical and 32 logical threads?

Code:

Cfish 110421 64 AVX2 NUMA by Syzygy based on Stockfish
info string NUMA enabled.
setoption name Threads value 64
go depth 10
info string Binding thread 0 to node 0 in group 0.
info string Binding thread 1 to node 0 in group 0.
info string Binding thread 2 to node 0 in group 0.
info string Binding thread 3 to node 0 in group 0.
info string Binding thread 4 to node 0 in group 0.
info string Binding thread 5 to node 0 in group 0.
info string Binding thread 6 to node 0 in group 0.
info string Binding thread 7 to node 0 in group 0.
info string Binding thread 8 to node 0 in group 0.
info string Binding thread 9 to node 0 in group 0.
info string Binding thread 10 to node 0 in group 0.
info string Binding thread 11 to node 0 in group 0.
info string Binding thread 12 to node 0 in group 0.
info string Binding thread 13 to node 0 in group 0.
info string Binding thread 14 to node 0 in group 0.
info string Binding thread 15 to node 0 in group 0.
info string Binding thread 16 to node 0 in group 0.
info string Binding thread 17 to node 0 in group 0.
info string Binding thread 18 to node 0 in group 0.
info string Binding thread 19 to node 0 in group 0.
info string Binding thread 20 to node 0 in group 0.
info string Binding thread 21 to node 0 in group 0.
info string Binding thread 22 to node 0 in group 0.
info string Binding thread 23 to node 0 in group 0.
info string Binding thread 24 to node 0 in group 0.
info string Binding thread 25 to node 0 in group 0.
info string Binding thread 26 to node 0 in group 0.
info string Binding thread 27 to node 0 in group 0.
info string Binding thread 28 to node 0 in group 0.
info string Binding thread 29 to node 0 in group 0.
info string Binding thread 30 to node 0 in group 0.
info string Binding thread 31 to node 0 in group 0.
info string Binding thread 32 to node 1 in group 1.
info string Binding thread 33 to node 1 in group 1.
info string Binding thread 34 to node 1 in group 1.
info string Binding thread 35 to node 1 in group 1.
info string Binding thread 36 to node 1 in group 1.
info string Binding thread 37 to node 1 in group 1.
info string Binding thread 38 to node 1 in group 1.
info string Binding thread 39 to node 1 in group 1.
info string Binding thread 40 to node 1 in group 1.
info string Binding thread 41 to node 1 in group 1.
info string Binding thread 42 to node 1 in group 1.
info string Binding thread 43 to node 1 in group 1.
info string Binding thread 44 to node 1 in group 1.
info string Binding thread 45 to node 1 in group 1.
info string Binding thread 46 to node 1 in group 1.
info string Binding thread 47 to node 1 in group 1.
info string Binding thread 48 to node 1 in group 1.
info string Binding thread 49 to node 1 in group 1.
info string Binding thread 50 to node 1 in group 1.
info string Binding thread 51 to node 1 in group 1.
info string Binding thread 52 to node 1 in group 1.
info string Binding thread 53 to node 1 in group 1.
info string Binding thread 54 to node 1 in group 1.
info string Binding thread 55 to node 1 in group 1.
info string Binding thread 56 to node 1 in group 1.
info string Binding thread 57 to node 1 in group 1.
info string Binding thread 58 to node 1 in group 1.
info string Binding thread 59 to node 1 in group 1.
info string Binding thread 60 to node 1 in group 1.
info string Binding thread 61 to node 1 in group 1.
info string Binding thread 62 to node 1 in group 1.
info string Binding thread 63 to node 1 in group 1.
It means that 32 threads are assigned to group 0 and 32 threads are assigned to group 1. This should be fine.

With 128 threads, it is 64 and 64 threads assigned to groups 0 and 1.

So as far as I can tell, Cfish on your machine is not hindered by the limitations of Windows.
Stockfish doesn't use the search threads to clear the Hash Table.
You probably know this, of course.
Am I right that the thread-binding inside the TT.clear() method destroys all binding of the search threads?

Code:

/// TranspositionTable::clear() initializes the entire transposition table to zero,
//  in a multi-threaded way.

void TranspositionTable::clear() {

  std::vector<std::thread> threads;

  for (size_t idx = 0; idx < Options["Threads"]; ++idx)
  {
      threads.emplace_back([this, idx]() {

          // Thread binding gives faster search on systems with a first-touch policy
          if (Options["Threads"] > 8)
              WinProcGroup::bindThisThread(idx);

          // Each thread will zero its part of the hash table
          const size_t stride = size_t(clusterCount / Options["Threads"]),
                       start  = size_t(stride * idx),
                       len    = idx != Options["Threads"] - 1 ?
                                stride : clusterCount - start;

          std::memset(&table[start], 0, len * sizeof(Cluster));
      });
  }

  for (std::thread& th : threads)
      th.join();
}
Yes, Joerg.

So some modification would be necessary
for the engine to take advantage of the increase from 64 to 128 threads
with a corresponding increase in nodes?
Well, if the answer is really "Yes" (I am not sure it is),
then you can simply delete the two lines of thread-binding code and retry.
Jörg Oster
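For reference, the slicing arithmetic in the quoted TranspositionTable::clear() can be looked at in isolation. The following is a minimal sketch, not Stockfish code: the names Slice and partitionClusters are hypothetical, and the thread binding and the memset itself are left out. It only shows that every worker gets an equal stride of clusters and that the last worker also picks up the remainder of the integer division.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper type (not part of Stockfish): one contiguous
// slice of the hash table, measured in clusters.
struct Slice {
    std::size_t start;  // index of the first cluster in the slice
    std::size_t len;    // number of clusters in the slice
};

// Split clusterCount clusters across threadCount workers using the same
// arithmetic as the quoted clear(): equal strides, with the last worker
// also taking whatever the integer division left over.
std::vector<Slice> partitionClusters(std::size_t clusterCount,
                                     std::size_t threadCount) {
    std::vector<Slice> slices;
    if (threadCount == 0)
        return slices;  // nothing to split; also avoids dividing by zero

    const std::size_t stride = clusterCount / threadCount;
    for (std::size_t idx = 0; idx < threadCount; ++idx) {
        const std::size_t start = stride * idx;
        const std::size_t len =
            (idx != threadCount - 1) ? stride : clusterCount - start;
        slices.push_back({start, len});
    }
    return slices;
}
```

Calling partitionClusters(clusterCount, 128) returns 128 slices that cover the table exactly once. Because each slice is zeroed by the thread with the same index, removing the bindThisThread() call changes which NUMA node first touches each slice, which may be why the experiment is worth a retry rather than a guaranteed win.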
Zerbinati
Posts: 122
Joined: Mon Aug 18, 2014 7:12 pm
Location: Trento (Italy)

Re: New AMD Ryzen™ Threadripper™ PRO 3995WX (Windows and Multithreading Problem)

Post by Zerbinati »

Joerg Oster wrote: Mon Apr 26, 2021 6:21 pm
Do you mean this only?

Well, if the answer is really "Yes" (I am not sure it is),
then you can simply delete the two lines of thread-binding code and retry.

Code:

          // Thread binding gives faster search on systems with a first-touch policy
          if (Options["Threads"] > 8)
              WinProcGroup::bindThisThread(idx);
Joerg Oster
Posts: 937
Joined: Fri Mar 10, 2006 4:29 pm
Location: Germany

Re: New AMD Ryzen™ Threadripper™ PRO 3995WX (Windows and Multithreading Problem)

Post by Joerg Oster »

Exactly.
Jörg Oster
Zerbinati
Posts: 122
Joined: Mon Aug 18, 2014 7:12 pm
Location: Trento (Italy)

Re: New AMD Ryzen™ Threadripper™ PRO 3995WX (Windows and Multithreading Problem)

Post by Zerbinati »

I removed thread binding by commenting out the call to bindThisThread(), but that turned out to be a disaster with 128 threads: for some reason the console output was very slow and the NPS was clearly bad, so I reverted the change.

A friend of mine told me that hash size could be the problem. I tested that theory and I believe the findings are at least interesting (all my earlier tests used 16 GB).

To test different hash sizes, I ran the following commands for ThreadCount = 64 and 128, and HashSize = 1, 2, 4, 8 GB:

Code:

setoption name Threads value [ThreadCount]
setoption name Hash value [HashSize]
setoption name Clear Hash
go depth 32
I recorded the number of nodes searched as well as the NPS for each run. Here are my findings:

64 Threads :
1GB -> nodes 386108962 nps 75781935
2GB -> nodes 412854391 nps 74522453
4GB -> nodes 446810263 nps 73718901
8GB -> nodes 299798467 nps 72925922

128 Threads :
1GB -> nodes 641860549 nps 100747221 -> Speed boost: 132%
2GB -> nodes 431677082 nps 84692384 -> Speed boost: 114%
4GB -> nodes 532605248 nps 78451207 -> Speed boost: 106%
8GB -> nodes 450780836 nps 74251496 -> Speed boost: 102%

It is clear that with a small hash size, 128 threads are much faster than 64 threads. However, when the hash size is increased to 2/8/16 GB, the speed boost becomes less significant (while the number of searched nodes is significantly higher).

I wonder if this is expected, or if the hash table logic is not thread-friendly, or if something else is going on?
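The "Speed boost" figures above appear to be the 128-thread NPS expressed as a percentage of the 64-thread NPS at the same hash size, so 102% would mean only about 2% faster, not double. A minimal sketch of that arithmetic follows, assuming this reading is correct; the names Run, kRuns and npsRatioPercent are hypothetical, and the numbers are copied from the post.

```cpp
#include <array>
#include <cstdint>

// One row of the reported measurements (values copied from the post above).
struct Run {
    int hashGB;            // hash size in GB
    std::uint64_t nps64;   // nodes per second with 64 threads
    std::uint64_t nps128;  // nodes per second with 128 threads
};

const std::array<Run, 4> kRuns{{
    {1, 75781935ULL, 100747221ULL},
    {2, 74522453ULL, 84692384ULL},
    {4, 73718901ULL, 78451207ULL},
    {8, 72925922ULL, 74251496ULL},
}};

// 128-thread NPS as a percentage of the 64-thread NPS.
// 100.0 means no change; 150.0 would mean 50% more nodes per second.
double npsRatioPercent(std::uint64_t nps128, std::uint64_t nps64) {
    return 100.0 * static_cast<double>(nps128) / static_cast<double>(nps64);
}
```

Looping over kRuns and printing npsRatioPercent(r.nps128, r.nps64) for each row gives values that fall steadily as the hash size grows, which is the trend the post describes.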
Joerg Oster
Posts: 937
Joined: Fri Mar 10, 2006 4:29 pm
Location: Germany

Re: New AMD Ryzen™ Threadripper™ PRO 3995WX (Windows and Multithreading Problem)

Post by Joerg Oster »

I see.
It is probably advisable to limit the number of threads in this case to 16 or 32 at most.
Jörg Oster
Zerbinati
Posts: 122
Joined: Mon Aug 18, 2014 7:12 pm
Location: Trento (Italy)

Re: New AMD Ryzen™ Threadripper™ PRO 3995WX (Windows and Multithreading Problem)

Post by Zerbinati »

Thanks so much Joerg!
MikeB
Posts: 4889
Joined: Thu Mar 09, 2006 6:34 am
Location: Pen Argyl, Pennsylvania

Re: New AMD Ryzen™ Threadripper™ PRO 3995WX (Windows and Multithreading Problem)

Post by MikeB »

Have you tried turning large pages on?

3970x

stockfish bench 2048 64 16 >/dev/null
===========================
Total time (ms) : 1675
Nodes searched : 143354067
Nodes/second : 85584517

There is a $4k premium on the 3995WX over the 3970x, so I would hope you can get more NPS...
MikeB
Posts: 4889
Joined: Thu Mar 09, 2006 6:34 am
Location: Pen Argyl, Pennsylvania

Re: New AMD Ryzen™ Threadripper™ PRO 3995WX (Windows and Multithreading Problem)

Post by MikeB »

I just saw this PM right after I posted to your thread.

You have to turn large pages on:
https://docs.microsoft.com/en-us/sql/da ... rver-ver15

This prevents any disk swapping, and is needed for the following:

https://docs.microsoft.com/en-us/window ... ge-support

Always use hash sizes in multiples of 2048 MB: 2048, 4096, 8192, etc.

on my 3970x

stockfish bench 2048 64 16 >/dev/null
===========================
Total time (ms) : 1675
Nodes searched : 143354067
Nodes/second : 85584517

keep me posted ...
Zerbinati
Posts: 122
Joined: Mon Aug 18, 2014 7:12 pm
Location: Trento (Italy)

Re: New AMD Ryzen™ Threadripper™ PRO 3995WX (Windows and Multithreading Problem)

Post by Zerbinati »

Hi Michael,
thanks for stepping in.
Yes, Large Pages are enabled on my system.
A friend of mine who owns a similar processor (3990x) has the same problem: the gain in nodes depends on the amount of hash assigned. Evidently there is some correlation specific to processors with 64 physical cores. On all my other dual-socket Xeon systems this annoying problem is not present.

However, it is a problem I have encountered only in Stockfish.
With all other engines, increasing hash does not penalize nodes in any way.
Joerg Oster
Posts: 937
Joined: Fri Mar 10, 2006 4:29 pm
Location: Germany

Re: New AMD Ryzen™ Threadripper™ PRO 3995WX (Windows and Multithreading Problem)

Post by Joerg Oster »

Zerbinati wrote: Mon Apr 26, 2021 10:21 pm Thanks so much Joerg!
Marco, do you get better performance now?
Jörg Oster