TalkChess.com

Posted: **Sun Sep 16, 2018 2:55 am**

CMCanavessi wrote: ↑Sat Sep 15, 2018 11:52 pm
chessdev wrote: ↑Sat Sep 15, 2018 10:49 pm Hello all. Update time. After extensive discussions, we have decided to turn Ponder OFF at this time (for Stage 2). We recognize that there are some irregularities with how threads are being allocated in some engines and situations, and until we figure that out, we are going to play it safe.

That said, we believe the games have been incredibly entertaining, and the outcome is more or less expected, which leads us to believe that this was not a meaningful issue in the scheme of things. But we do want to find solutions going forward. We are looking at options for things we can do on our end to ensure that each engine gets the processing power it deserves. That may mean that Ponder will be turned back ON for Stage 3. We will see.

Additionally, we will also be providing the Arena logs for Stage 2 and beyond. Links will be provided when Stage 2 begins. As for Stage 1 logs, some have been already provided, and we may reconsider releasing all of Stage 1 in the future. Right now we need to be fully focused on the rest of the event.

Thank you all again for your insights and patience!
Just to clarify, ponder will be off and thread count for each engine will double, right? Otherwise it makes little sense.

I don't think so. The jump from 46 threads to using 46 real cores is massive. The jump from 46 real cores to 92 threads is < 30 elo.

That jump assumes that all engines have NUMA support. It is my understanding that only Stockfish, Houdini, Texel, and Ethereal have such support. I'm tempted to add Komodo into that too, based on some chats with Mark, but I don't know that first hand.

Posted: **Sun Sep 16, 2018 2:57 am**

AndrewGrant wrote: ↑Sun Sep 16, 2018 2:55 am That jump assumes that all engines have NUMA support. It is my understanding that only Stockfish, Houdini, Texel, and Ethereal have such support. I'm tempted to add Komodo into that too, based on some chats with Mark, but I don't know that first hand.

Crafty has NUMA support

Posted: **Sun Sep 16, 2018 3:19 am**

AndrewGrant wrote: ↑Sun Sep 16, 2018 2:55 am That jump assumes that all engines have NUMA support. It is my understanding that only Stockfish, Houdini, Texel, and Ethereal have such support. I'm tempted to add Komodo into that too, based on some chats with Mark, but I don't know that first hand.

Adding NUMA support is pretty trivial. I can't believe that only SF, you and RH copied it so far from Peter?

Posted: **Sun Sep 16, 2018 3:41 am**

Milos wrote: ↑Sun Sep 16, 2018 3:19 am
AndrewGrant wrote: ↑Sun Sep 16, 2018 2:55 am That jump assumes that all engines have NUMA support. It is my understanding that only Stockfish, Houdini, Texel, and Ethereal have such support. I'm tempted to add Komodo into that too, based on some chats with Mark, but I don't know that first hand.
Adding NUMA support is pretty trivial. I can't believe that only SF, you and RH copied it so far from Peter?

It's a fairly low priority for most. And seeing as a large chunk of engines in TCEC can't even run well with 43 cores, most authors have bigger fish to fry. Why support 64+ threads when your engine is not able to make use of 32?

Posted: **Sun Sep 16, 2018 4:03 am**

AndrewGrant wrote: ↑Sun Sep 16, 2018 2:55 am
That jump assumes that all engines have NUMA support. It is my understanding that only Stockfish, Houdini, Texel, and Ethereal have such support. I'm tempted to add Komodo into that too, based on some chats with Mark, but I don't know that first hand.

The key to decent NUMA support is to have each thread allocate as much as possible of the memory it uses itself, to ensure the memory is on the NUMA node the thread is placed on. This can be done without resorting to special NUMA calls. But in Windows, running on more than 64 threads does require special code, since Windows only deals with 64-core "processor groups", which requires these calls. I have always wondered how Windows would deal with more than 64 CPUs on one processor chip... would it split them into two processor groups?

Although I have tried special NUMA calls to do things like locking a thread to a NUMA node, they have never paid off. The OS seems to have the best knowledge of where a thread should go, and I have not found a way to get access to things like which CPU is busy. I suppose the OS thinks that, for security, this is none of my business!
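
To put numbers on the processor-group limit described in the post above, here is a minimal sketch in Python. It assumes the documented cap of 64 logical processors per Windows processor group. The function names and the constant are hypothetical and belong to no engine or Windows API. The result is only a lower bound: Windows may create smaller groups (for example, aligned to NUMA nodes), so an engine can need group-aware code at lower thread counts than this suggests.

```python
import math

# Assumption: a Windows processor group holds at most 64 logical processors.
MAX_LOGICAL_PER_GROUP = 64


def min_processor_groups(logical_processors: int) -> int:
    """Lower bound on the number of processor groups for this many logical CPUs."""
    if logical_processors < 1:
        raise ValueError("need at least one logical processor")
    return math.ceil(logical_processors / MAX_LOGICAL_PER_GROUP)


def exceeds_single_group(threads: int) -> bool:
    """True if the threads cannot fit in one group even in the best case."""
    return min_processor_groups(threads) > 1


if __name__ == "__main__":
    # Thread counts discussed in this thread: 46 real cores vs 92 hyperthreads.
    for threads in (46, 92):
        print(threads, min_processor_groups(threads), exceeds_single_group(threads))
```

Under these assumptions, 46 threads fit in a single group while 92 do not, which is why doubling the thread count is where engines without processor-group support would be affected.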

Posted: **Sun Sep 16, 2018 4:08 am**

mjlef wrote: ↑Sun Sep 16, 2018 4:03 am
AndrewGrant wrote: ↑Sun Sep 16, 2018 2:55 am
That jump assumes that all engines have NUMA support. It is my understanding that only Stockfish, Houdini, Texel, and Ethereal have such support. I'm tempted to add Komodo into that too, based on some chats with Mark, but I don't know that first hand.
The key to decent NUMA support is to have each thread allocate as much as possible of the memory it uses itself, to ensure the memory is on the NUMA node the thread is placed on. This can be done without resorting to special NUMA calls. But in Windows, running on more than 64 threads does require special code, since Windows only deals with 64-core "processor groups", which requires these calls. I have always wondered how Windows would deal with more than 64 CPUs on one processor chip... would it split them into two processor groups?

Although I have tried special NUMA calls to do things like locking a thread to a NUMA node, they have never paid off. The OS seems to have the best knowledge of where a thread should go, and I have not found a way to get access to things like which CPU is busy. I suppose the OS thinks that, for security, this is none of my business!

Yes. In this whole thread, when I talk about NUMA support, I'm really referring to the NUMA Windows Processor Group problem.

Posted: **Sun Sep 16, 2018 12:06 pm**

mjlef wrote: ↑Sun Sep 16, 2018 4:03 am
AndrewGrant wrote: ↑Sun Sep 16, 2018 2:55 am
That jump assumes that all engines have NUMA support. It is my understanding that only Stockfish, Houdini, Texel, and Ethereal have such support. I'm tempted to add Komodo into that too, based on some chats with Mark, but I don't know that first hand.
The key to decent NUMA support is to have each thread allocate as much as possible of the memory it uses itself, to ensure the memory is on the NUMA node the thread is placed on. This can be done without resorting to special NUMA calls. But in Windows, running on more than 64 threads does require special code, since Windows only deals with 64-core "processor groups", which requires these calls. I have always wondered how Windows would deal with more than 64 CPUs on one processor chip... would it split them into two processor groups?

Although I have tried special NUMA calls to do things like locking a thread to a NUMA node, they have never paid off. The OS seems to have the best knowledge of where a thread should go, and I have not found a way to get access to things like which CPU is busy. I suppose the OS thinks that, for security, this is none of my business!

Cfish binds threads to nodes, so they are always "close" to their thread-specific data. Assuming a non-defective OS, I wonder how often a non-bound thread would be run on a node other than its "home" node? I suppose there are NUMA-related profiling tools that would answer this? Running on a "non-home" node makes fetching thread-specific data slower, but that might still be better than waiting for a chance to run on the "home" node.

I wonder how much testing Ronald was able to do (or get help with) for this aspect of Cfish?
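
For readers unfamiliar with what binding a thread involves, here is a rough Python sketch, not Cfish's actual code (Cfish is written in C). It uses `os.sched_setaffinity`, which exists only on some platforms such as Linux, where passing pid 0 applies to the calling thread. The helper name `bind_current_thread` is hypothetical, and which CPU ids belong to which NUMA node is an assumption the caller has to supply.

```python
import os


def bind_current_thread(cpus) -> bool:
    """Try to restrict the calling thread to the given CPU ids.

    Returns True on success, and False where the call is unavailable
    (for example on non-Linux platforms) or rejected by the OS.
    """
    if not hasattr(os, "sched_setaffinity"):
        return False
    try:
        # pid 0 means "the caller"; on Linux this affects the calling thread.
        os.sched_setaffinity(0, set(cpus))
        return True
    except (OSError, ValueError):
        return False


if __name__ == "__main__":
    # Hypothetical example: treat the currently allowed CPUs as one "node".
    allowed = os.sched_getaffinity(0) if hasattr(os, "sched_getaffinity") else {0}
    print("bound:", bind_current_thread(allowed))
```

Once a thread is bound like this, the trade-off raised above applies: the thread stays near its data, but it may have to wait for a CPU on that node even when CPUs elsewhere are idle.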

Posted: **Sun Sep 16, 2018 6:58 pm**

AndrewGrant wrote: ↑Sun Sep 16, 2018 2:55 am
I don't think so. The jump from 46 threads to using 46 real cores is massive. The jump from 46 real cores to 92 threads is < 30 elo.

That jump assumes that all engines have NUMA support. It is my understanding that only Stockfish, Houdini, Texel, and Ethereal have such support. I'm tempted to add Komodo into that too, based on some chats with Mark, but I don't know that first hand.

I completely agree (with one difference). On the fastgm.de site, Andreas did experiments doubling threads and doubling time. The threads results are here:

http://www.fastgm.de/schach/SMP-scaling-SF8-K10.4.pdf

As shown, Komodo only gained 12 elo going from 16 to 32 threads, and Stockfish gained 6 elo. Barring some clever scheme to better use threads, further doublings would be worth even less (you can see the trend in the data).

But doubling time, which is about the same as the rough doubling of CPU speed you get by using half the cores (even with hyperthreading off, the OS will pair threads to two hyperthreads when available), is worth a lot more. For example:

http://fastgm.de/time-control4.html

Even the 5120 vs 2560 (plus 1%) time doubling shows a 41 elo gain, way above any likely elo gain going from 44 real cores to 88 hyperthreads (or whatever they decide for keeping some threads free for Arena and the OS, etc.). From limited tests I have done, hyperthreading off helps a little, probably by forcing the OS to always assign the threads properly. But it is a small thing.

So in summary, I think the best option is to use one thread per real core, with no pondering, ideally with hyperthreading off. Do not double the threads.

Mark
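
To see what the elo figures quoted above mean as match results, the standard logistic Elo formula converts a rating difference into an expected score. The sketch below is only an illustration. The function name is hypothetical, and the three inputs are the gains quoted in the post, not new measurements.

```python
def expected_score(elo_diff: float) -> float:
    """Expected score (between 0 and 1) for the side rated elo_diff higher."""
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))


if __name__ == "__main__":
    # Gains quoted in the post above.
    quoted_gains = {
        "Stockfish, 16 -> 32 threads": 6,
        "Komodo, 16 -> 32 threads": 12,
        "time doubling, 2560 -> 5120": 41,
    }
    for label, gain in quoted_gains.items():
        print(f"{label}: +{gain} elo = {expected_score(gain):.1%} expected score")
```

With these inputs the expected scores come out at roughly 51% for 6 elo, 52% for 12 elo, and 56% for 41 elo.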

Posted: **Sun Sep 16, 2018 9:46 pm**

mjlef wrote: ↑Sun Sep 16, 2018 6:58 pm
AndrewGrant wrote: ↑Sun Sep 16, 2018 2:55 am
I don't think so. The jump from 46 threads to using 46 real cores is massive. The jump from 46 real cores to 92 threads is < 30 elo.

That jump assumes that all engines have NUMA support. It is my understanding that only Stockfish, Houdini, Texel, and Ethereal have such support. I'm tempted to add Komodo into that too, based on some chats with Mark, but I don't know that first hand.
I completely agree (with one difference). On the fastgm.de site, Andreas did experiments doubling threads and doubling time. The threads results are here:

http://www.fastgm.de/schach/SMP-scaling-SF8-K10.4.pdf

As shown, Komodo only gained 12 elo going from 16 to 32 threads, and Stockfish gained 6 elo. Barring some clever scheme to better use threads, further doublings would be worth even less (you can see the trend in the data).

But doubling time, which is about the same as the rough doubling of CPU speed you get by using half the cores (even with hyperthreading off, the OS will pair threads to two hyperthreads when available), is worth a lot more. For example:

http://fastgm.de/time-control4.html

Even the 5120 vs 2560 (plus 1%) time doubling shows a 41 elo gain, way above any likely elo gain going from 44 real cores to 88 hyperthreads (or whatever they decide for keeping some threads free for Arena and the OS, etc.). From limited tests I have done, hyperthreading off helps a little, probably by forcing the OS to always assign the threads properly. But it is a small thing.

So in summary, I think the best option is to use one thread per real core, with no pondering, ideally with hyperthreading off. Do not double the threads.

Mark

Interesting. I wonder how Intel hyperthreads compare to AMD hyperthreads. As far as I know, AMD ones are way more efficient and work way better, while Intel ones kinda suck, at least for chess. Is there some kind of study about this?

Posted: **Sun Sep 16, 2018 11:05 pm**

mjlef wrote: ↑Sun Sep 16, 2018 4:03 am
AndrewGrant wrote: ↑Sun Sep 16, 2018 2:55 am
That jump assumes that all engines have NUMA support. It is my understanding that only Stockfish, Houdini, Texel, and Ethereal have such support. I'm tempted to add Komodo into that too, based on some chats with Mark, but I don't know that first hand.
The key to decent NUMA support is to have each thread allocate as much as possible of the memory it uses itself, to ensure the memory is on the NUMA node the thread is placed on. This can be done without resorting to special NUMA calls. But in Windows, running on more than 64 threads does require special code, since Windows only deals with 64-core "processor groups", which requires these calls. I have always wondered how Windows would deal with more than 64 CPUs on one processor chip... would it split them into two processor groups?

Although I have tried special NUMA calls to do things like locking a thread to a NUMA node, they have never paid off. The OS seems to have the best knowledge of where a thread should go, and I have not found a way to get access to things like which CPU is busy. I suppose the OS thinks that, for security, this is none of my business!

Hi Mark,

Do you think it's worthwhile to duplicate global variables that might be used a lot, such as pre-calculated bitboards of moves on an empty board, so that each thread has its own copy? Do you know if global constants are handled differently by the compiler/OS such that they might have faster memory access?

John

Chess.com 2018 computer chess championship