Current data - threads-nps efficiency up to 32 threads
Posted: Fri Oct 24, 2014 9:31 pm
Computer Chess Club
https://talkchess.com/
No, TTD is not an universal measure. Andreas to 16 threads already performed the REAL strength measure with 3,000 _games_ each datapoint. Now he is looking for the limiting value of effective speed-up for these engines. Effective speed-up will never be larger than NPS speed-up.matthewlai wrote:I would recommend looking at time to fixed depth instead (across a few different positions, and take the average).
Parallel searches make the search tree bigger.
2x NPS is very bad if the search tree is 3x the size (to reach the same depth).
Laskos wrote:No, TTD is not an universal measure. Andreas to 16 threads already performed the REAL strength measure with 3,000 _games_ each datapoint. Now he is looking for the limiting value of effective speed-up for these engines. Effective speed-up will never be larger than NPS speed-up.matthewlai wrote:I would recommend looking at time to fixed depth instead (across a few different positions, and take the average).
Parallel searches make the search tree bigger.
2x NPS is very bad if the search tree is 3x the size (to reach the same depth).
Something strange happens with Crafty.
Absolutely unknown. I have one 24 core box around that I tested on and could not produce that effect at all. The cores add to the NPS in a pretty linear way, although the parallel speedup doesn't climb as quickly as the NPS.Mark wrote:What in the world happened to Crafty >16 cores??
4 nodes looks wrong. A node = one physical core with one shared memory (local) bank. Is your bios set to NUMA or SMP? Or whatever they call it now. Most AMD systems with more than one chip allow you to allocate memory with consecutive addresses on a single node. IE if you have 4 nodes and 4 gigs, node 0 gets addresses 0-1gig, node 1 gets addresses 1-2 gigs and so forth. If you put it in SMP mode, then node 0 gets page 0, node 1 gets page 1, interleaving the pages across all the nodes. Idea here is that for non-numa-aware programs, that spreads memory addresses (like the hash table) uniformly across the nodes, where it would be better to have consecutive addresses on a single node if the program knows how to allocate and use memory correctly. I'll try to look up that CPU and MB to see exactly what it does, NUMA-wise. But 4 nodes seems a bit odd, every AMD box I have used reported nodes = chips, which for some machines was nodes = cores when we had 1 core per chip.fastgm wrote:Hello, Mr. Hyatt,
yes, the behavior of Crafty is really strange!
The System is a 32-way dual 16 core AMD Opteron 6376, Mainboard ASUS KGPE-D16 with 8 x 4 GB 1600 MHz DDR3.
OS is Windows 7 Professional 64 Bit.
Crafty reports: System is NUMA. 4 nodes reported by windows
For the test I used the "official" Crafty version "crafty-24.1-x64-sse3.exe" from http://www.kikrtech.com/
Best regards,
Andreas Strangmüller
If you ever have time, could you run crafty with 1 cpu, 2, 4, 8, 16 and 32 and use the bench command, and send me the log file? And the final question, how long did the test you used run? Fractional second tests can certainly produce all sorts of weird numbers.fastgm wrote:Hello, Mr. Hyatt,
yes, the behavior of Crafty is really strange!
The System is a 32-way dual 16 core AMD Opteron 6376, Mainboard ASUS KGPE-D16 with 8 x 4 GB 1600 MHz DDR3.
OS is Windows 7 Professional 64 Bit.
Crafty reports: System is NUMA. 4 nodes reported by windows
For the test I used the "official" Crafty version "crafty-24.1-x64-sse3.exe" from http://www.kikrtech.com/
Best regards,
Andreas Strangmüller
It is AMD running at a low clock of 2.3GHz. With this generation of AMD CPU, performance per core is quite low, but you have lots of them. On the desktop they run at 4GHz+ and people overclock beyond that, and then performance is good, but on a server you can't do that.bob wrote:Also, what kind of machine does he have. Those NPS numbers look REALLY low for Crafty. My 2 year old macbook pro with a dual i7 at 2.0ghz runs Crafty at 5M nodes per second on one CPU.