nps scaling

Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

nps scaling

Post by Daniel Shawul »

Recently I had a chance to run Scorpio on an 8-processor machine.
Unfortunately it did not seem to do any better than the 4-processor version! All the CPUs show as busy (100%), but they are obviously sitting idle. The work allocation scheme is pretty straightforward, but I don't know if it is good enough to keep the processors busy. The 4-processor version scales almost 4x nps-wise as expected, but the 8-CPU run does not increase the nps at all! Bewildered, I took Crafty and ran it there, and it swept to 10 Mnps with no problem! I do not get many chances to run on that machine, but I can test on 4 CPUs as much as I want, which I hope will help me figure out the problem. I have heard of problems of this kind happening to others here, so if you could point me to things to watch out for, it would be much appreciated.

Daniel
hgm
Posts: 27789
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: nps scaling

Post by hgm »

What was the architecture of the 8-core machine?

It sounds like you are running into a bottleneck other than CPU. It could be that you saturate the memory bandwidth with hash probes. (This is likely if all cores share the same memory controller.) It could also be that you have different memory controllers, and the hash table is divided over the CPUs in such a way that they have to access each other's memory too often.

Note that Crafty does not probe or store hash during QS, so it is less likely to run into any of these memory bottlenecks.
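
A rough way to sanity-check the bandwidth hypothesis is a back-of-envelope estimate of hash traffic. The sketch below assumes that every probe misses the cache and fetches exactly one 64-byte cache line; both the line size and the helper name `hash_traffic_bytes_per_sec` are illustrative assumptions, not measurements or code from Scorpio or Crafty.

```cpp
#include <cassert>
#include <cstdint>

// Assumed cache-line size in bytes (common on x86, but an assumption here).
constexpr std::uint64_t kCacheLineBytes = 64;

// Estimate memory traffic caused by hash probes, assuming every probe
// misses the cache and fetches exactly one cache line.
// nodes_per_sec:   total search speed summed over all threads
// probes_per_node: average hash probes per node (1.0 if every node probes)
double hash_traffic_bytes_per_sec(double nodes_per_sec, double probes_per_node) {
    assert(nodes_per_sec >= 0.0 && probes_per_node >= 0.0);
    return nodes_per_sec * probes_per_node * static_cast<double>(kCacheLineBytes);
}
```

Under these assumptions, 10 Mnps with one probe per node comes to 640 MB/s. Comparing such a figure with the machine's measured random-access throughput gives a first hint as to whether hash traffic alone could plausibly be the limit.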

It could also be that you are being cheated, and the '8-core machine' is really a 4-core machine using hyperthreading.
krazyken

Re: nps scaling

Post by krazyken »

If you want more chances to test, I have an 8-core Mac I can get you ssh access to for testing.
frankp
Posts: 228
Joined: Sun Mar 12, 2006 3:11 pm

Re: nps scaling

Post by frankp »

Glass half empty?

I have just gotten my new threaded search 'working' (i.e. not crashing or locking up), but I am getting little more than 2x on a 4-core box. I don't think I am searching the same moves repeatedly.
Daniel Shawul

Re: nps scaling

Post by Daniel Shawul »

The nps scaling for 30-second runs on the initial position is: 2 CPUs 1.85, 4 CPUs 3.32, 8 CPUs 3.37. The nps scaling is also bad at 4 CPUs compared to others (a maximum speedup of 3.3 is not good, I guess). It is probably a real 8-CPU machine, as Crafty showed a doubling of nps going from 2 to 4 to 8 CPUs, where it reached 10 Mnps. Some time ago I improved the nps scaling significantly by removing unnecessary 'volatile' variables and fixing other memory problems. So I am pretty sure there is still work to do there, but I honestly expected some increase in nps on the 8-CPU run, however bad the code is. FYI, I don't probe the hash table in quiescence, I don't split with fewer than 3 plies left, I use YBW after just one move at every node, etc.
I suspect that the data copied during a split may be too much in my case, with all the linked lists and every square being copied.
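
To put the figures above in perspective, the speedups can be turned into a parallel efficiency (speedup divided by CPU count). This is a minimal sketch; the helper names `nps_efficiency` and `marginal_gain` are made up for illustration.

```cpp
#include <cassert>

// Parallel efficiency: measured nps speedup divided by the number of CPUs.
// 1.0 means perfect scaling; lower values mean CPU time is partly wasted.
double nps_efficiency(double speedup, int cpus) {
    assert(cpus > 0);
    return speedup / cpus;
}

// Marginal gain when moving from a smaller to a larger CPU count,
// expressed as the ratio of the two measured speedups.
double marginal_gain(double speedup_small, double speedup_large) {
    assert(speedup_small > 0.0);
    return speedup_large / speedup_small;
}
```

With the numbers from this post, efficiency is about 0.93 on 2 CPUs (1.85/2), 0.83 on 4 CPUs (3.32/4) and 0.42 on 8 CPUs (3.37/8). Going from 4 to 8 CPUs gives 3.37/3.32, roughly 1.5% more nps.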

A friend of mine wondered why his IBM machine cost something like $6000, so I checked how many processors it had and saw 8 there. To convince him of its benefit I ran Scorpio with 4 CPUs, which was OK, but I was embarrassed by the 8-CPU run. That is when I took Crafty and showed him :)

Daniel
Daniel Shawul

Re: nps scaling

Post by Daniel Shawul »

I guess you are talking about real speedup (time scaling). Searching moves repeatedly would not affect your nps scaling, and I also think it won't affect your real speedup much, because you probably get an immediate hash table cutoff.

Daniel
Daniel Shawul

Re: nps scaling

Post by Daniel Shawul »

Thanks. I will let you know if I don't get my friend to lend me some cpu time.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: nps scaling

Post by bob »

Daniel Shawul wrote:
Recently I had a chance to run Scorpio on an 8-processor machine.
Unfortunately it did not seem to do any better than the 4-processor version! All the CPUs show as busy (100%), but they are obviously sitting idle. The work allocation scheme is pretty straightforward, but I don't know if it is good enough to keep the processors busy. The 4-processor version scales almost 4x nps-wise as expected, but the 8-CPU run does not increase the nps at all! Bewildered, I took Crafty and ran it there, and it swept to 10 Mnps with no problem! I do not get many chances to run on that machine, but I can test on 4 CPUs as much as I want, which I hope will help me figure out the problem. I have heard of problems of this kind happening to others here, so if you could point me to things to watch out for, it would be much appreciated.

Daniel

This sounds like a memory issue. If your program is not carefully optimized to maximize cache usage (for example, any variables used close together in time need to be close together in memory, so that a single cache fill gets them all at once), then you run into a bottleneck when you reach maximum memory bandwidth. At that point additional processors don't help, because they interfere with each other when accessing memory.
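
One common way to apply this advice is to group the data each thread touches at every node into a single structure and align it to a cache line, so that a single fill brings it all in and no two threads write to the same line. The sketch below is hypothetical: the 64-byte line size, the field names and `ThreadHot` itself are assumptions for illustration, not the layout used by Crafty or Scorpio.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumed cache-line size; 64 bytes is common on x86 but is an assumption here.
constexpr std::size_t kCacheLine = 64;

// Hypothetical per-thread search state. Fields touched together at every
// node sit next to each other, and alignas makes each instance start on its
// own cache line, so two threads never write to the same line.
struct alignas(kCacheLine) ThreadHot {
    std::uint64_t nodes;   // node counter, bumped at every node
    std::uint64_t qnodes;  // quiescence node counter
    std::int32_t  ply;     // current ply in this thread's search
    std::int32_t  depth;   // remaining depth at the current node
};

// One slot per thread; sizeof(ThreadHot) is rounded up to a multiple of the
// alignment, so neighbouring array elements land on different cache lines.
constexpr int kMaxThreads = 8;
ThreadHot g_thread_hot[kMaxThreads];

// Sum the per-thread counters when a total node count is needed.
std::uint64_t total_nodes(int threads) {
    assert(threads >= 0 && threads <= kMaxThreads);
    std::uint64_t sum = 0;
    for (int i = 0; i < threads; ++i) sum += g_thread_hot[i].nodes;
    return sum;
}
```

The design choice here is to trade a little memory (padding up to a full line per thread) for the guarantee that per-thread counters never bounce between caches.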
bob

Re: nps scaling

Post by bob »

Daniel Shawul wrote:
The nps scaling for 30-second runs on the initial position is: 2 CPUs 1.85, 4 CPUs 3.32, 8 CPUs 3.37. The nps scaling is also bad at 4 CPUs compared to others (a maximum speedup of 3.3 is not good, I guess). It is probably a real 8-CPU machine, as Crafty showed a doubling of nps going from 2 to 4 to 8 CPUs, where it reached 10 Mnps. Some time ago I improved the nps scaling significantly by removing unnecessary 'volatile' variables and fixing other memory problems. So I am pretty sure there is still work to do there, but I honestly expected some increase in nps on the 8-CPU run, however bad the code is. FYI, I don't probe the hash table in quiescence, I don't split with fewer than 3 plies left, I use YBW after just one move at every node, etc.
I suspect that the data copied during a split may be too much in my case, with all the linked lists and every square being copied.

A friend of mine wondered why his IBM machine cost something like $6000, so I checked how many processors it had and saw 8 there. To convince him of its benefit I ran Scorpio with 4 CPUs, which was OK, but I was embarrassed by the 8-CPU run. That is when I took Crafty and showed him :)

Daniel

Do you limit splitting? If your cost for splitting is fairly high, you can help limit the effect by not splitting within N plies of the tree tips...
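
A split limit of this kind usually boils down to a small predicate evaluated before a split is attempted. The following is only a sketch under stated assumptions: `should_split`, `kMinSplitDepth` and the parameter names are hypothetical, and the value 3 is simply the figure mentioned in this thread, not a recommendation.

```cpp
#include <cassert>

// Hypothetical tuning constant: do not split when fewer than this many
// plies remain. The thread mentions 3; the right value is engine-specific.
constexpr int kMinSplitDepth = 3;

// Decide whether a node is worth splitting.
// depth_left:     remaining search depth in plies
// moves_searched: moves already searched serially at this node
// idle_threads:   helper threads currently waiting for work
bool should_split(int depth_left, int moves_searched, int idle_threads) {
    assert(depth_left >= 0 && moves_searched >= 0 && idle_threads >= 0);
    return depth_left >= kMinSplitDepth  // stay away from the tree tips
        && moves_searched >= 1           // young brothers wait: first move done
        && idle_threads > 0;             // someone is available to help
}
```

Raising `kMinSplitDepth` reduces the number of splits (and the copying each one costs) at the price of leaving helper threads idle more often, so it has to be tuned against both nps and time-to-depth.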
Daniel Shawul

Re: nps scaling

Post by Daniel Shawul »

Yes, no splits with fewer than 3 plies left. Last time I tried 4 plies, the idle time of the processors increased (lower nps), but I never checked whether the real speedup was better. Without going into low-level cache optimization, are there any general guidelines for avoiding such problems before they crop up? I remember you gave such advice before, but I cannot find it. If possible, could you repeat it here?