AMD Phenom Hex core (SMP performance problem)

Posted: Mon Apr 04, 2011 11:50 pm
by michiguel
Gaviota gets a speed-up of ~1.7x (in both nps and time to ply, roughly) on an AMD dual running two threads. I tested it on an AMD hexacore 1090T, and the speed-up (running two threads to make it comparable) is no more than 1.2x in nodes per second. Awful.
Does anybody have any idea why this could happen? What did I set up wrong with the hardware? Any hint? I cannot imagine it's a software problem... or is it?

Miguel

Re: AMD Phenom Hex core (SMP performance problem)

Posted: Mon Apr 04, 2011 11:57 pm
by rbarreira
michiguel wrote: Gaviota gets a speed-up of ~1.7x (in both nps and time to ply, roughly) on an AMD dual running two threads. I tested it on an AMD hexacore 1090T, and the speed-up (running two threads to make it comparable) is no more than 1.2x in nodes per second. Awful.
Does anybody have any idea why this could happen? What did I set up wrong with the hardware? Any hint? I cannot imagine it's a software problem... or is it?

Miguel
Does your motherboard support the Phenom II X6? I had to update my BIOS to get it working correctly. As long as all the cores are detected by the OS, you should be fine.

I don't see any such problem with my CPU (Phenom II X6 1055T). NPS with six cores is not 6x, which is to be expected given other bottlenecks, but with two cores the speed-up is certainly close to 2x. Turbo mode should be in use as long as <= 3 cores are busy, so turbo shouldn't be the reason for what you're seeing.

How can I run your test here to compare?

Re: AMD Phenom Hex core (SMP performance problem)

Posted: Tue Apr 05, 2011 12:12 am
by michiguel
rbarreira wrote:
michiguel wrote: Gaviota gets a speed-up of ~1.7x (in both nps and time to ply, roughly) on an AMD dual running two threads. I tested it on an AMD hexacore 1090T, and the speed-up (running two threads to make it comparable) is no more than 1.2x in nodes per second. Awful.
Does anybody have any idea why this could happen? What did I set up wrong with the hardware? Any hint? I cannot imagine it's a software problem... or is it?

Miguel
Does your motherboard support the Phenom II X6? I had to update my BIOS to get it working correctly. As long as all the cores are detected by the OS, you should be fine.
When I run all 6 threads for 2-3 minutes, I see all of the cores being detected and busy, so I think the cores are detected.

I don't see any such problem with my CPU (Phenom II X6 1055T). NPS with six cores is not 6x, which is to be expected given other bottlenecks, but with two cores the speed-up is certainly close to 2x. Turbo mode should be in use as long as <= 3 cores are busy, so turbo shouldn't be the reason for what you're seeing.

How can I run your test here to compare?
Thanks, that would be fantastic. Tonight when I get home I will post a binary for you to download.

Linux? Windows? 32 or 64 bits?

I observed the problems with Linux 64.
Miguel

Re: AMD Phenom Hex core (SMP performance problem)

Posted: Tue Apr 05, 2011 12:17 am
by rbarreira
Linux 64.

Re: AMD Phenom Hex core (SMP performance problem)

Posted: Tue Apr 05, 2011 12:23 am
by marcelk
michiguel wrote: Gaviota gets a speed-up of ~1.7x (in both nps and time to ply, roughly) on an AMD dual running two threads. I tested it on an AMD hexacore 1090T, and the speed-up (running two threads to make it comparable) is no more than 1.2x in nodes per second. Awful.
Does anybody have any idea why this could happen? What did I set up wrong with the hardware? Any hint? I cannot imagine it's a software problem... or is it?

Miguel
I had terrible speedups with my program until I disabled the Cool'n'Quiet in BIOS.

This feature throttles a core down when it is not being used. If your processes/threads idle whenever there is no work for them, this will hurt, because there is a delay between throttling down and ramping back up. That was the case with my program: it was effectively running at the throttled-down speed all the time.

As a bonus, when you disable Cool'n'Quiet, each core gets locked at its "turbo core" speed (which is 3.6GHz for the 1090T and 3.7GHz for the 1100T). This is a slight overclock, because not all 6 cores are supposed to run at this speed simultaneously, but I have had no problems running it 24/7 for a few months now.
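
On Linux, one way to see whether the cores are being throttled while the engine runs is to read the kernel's cpufreq files. The following is only a minimal sketch, assuming the cpufreq sysfs interface is present on the machine; the helper names read_first_line and report_cpufreq are made up for this example, and the files may be missing depending on the kernel and driver.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Read the first line of a small text file into buf and strip the newline.
   Returns 0 on success, -1 if the file cannot be opened or is empty. */
int read_first_line(const char *path, char *buf, size_t len)
{
    assert(buf != NULL && len > 0);
    FILE *f = fopen(path, "r");
    if (f == NULL)
        return -1;
    if (fgets(buf, (int)len, f) == NULL) {
        fclose(f);
        return -1;
    }
    fclose(f);
    buf[strcspn(buf, "\n")] = '\0';
    return 0;
}

/* Print the scaling governor and current frequency of cpu0..cpu(n_cpus-1).
   Returns the number of cores whose governor file could be read.
   The sysfs paths are an assumption about the system being inspected. */
int report_cpufreq(FILE *out, int n_cpus)
{
    char path[128], governor[64], khz[64];
    int readable = 0;

    for (int cpu = 0; cpu < n_cpus; cpu++) {
        snprintf(path, sizeof path,
                 "/sys/devices/system/cpu/cpu%d/cpufreq/scaling_governor", cpu);
        if (read_first_line(path, governor, sizeof governor) != 0) {
            fprintf(out, "cpu%d: cpufreq not available\n", cpu);
            continue;
        }
        snprintf(path, sizeof path,
                 "/sys/devices/system/cpu/cpu%d/cpufreq/scaling_cur_freq", cpu);
        if (read_first_line(path, khz, sizeof khz) != 0)
            strcpy(khz, "?");
        fprintf(out, "cpu%d: governor=%s cur_freq=%s kHz\n", cpu, governor, khz);
        readable++;
    }
    return readable;
}
```

Calling report_cpufreq(stdout, 6) from main a few times during a two-thread search should show whether the busy cores actually sit at full clock. If they keep reporting a low frequency, that would be consistent with the throttling described above.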

Re: AMD Phenom Hex core (SMP performance problem)

Posted: Tue Apr 05, 2011 6:52 am
by bob
michiguel wrote: Gaviota gets a speed-up of ~1.7x (in both nps and time to ply, roughly) on an AMD dual running two threads. I tested it on an AMD hexacore 1090T, and the speed-up (running two threads to make it comparable) is no more than 1.2x in nodes per second. Awful.
Does anybody have any idea why this could happen? What did I set up wrong with the hardware? Any hint? I cannot imagine it's a software problem... or is it?

Miguel
It is possibly a cache issue. You have to be _very_ careful what is shared. Remember that memory is fetched in 64-byte blocks. If you have two adjacent 4-byte or 8-byte values, each being updated by a different thread, your goose is cooked. That is sometimes called "false sharing". The caches transfer that block back and forth between the cores, killing performance...

As a simple test, run two separate instances of your program in two different windows and see if each runs at its normal nps. If so, then you likely have a cache issue as above. If not, then you have something going on with the hardware settings...

Make sure you are not using the Intel compiler if you are running on AMD of course.
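
To make the false-sharing point concrete, here is a small sketch of a packed and a padded layout for per-thread node counters. It is not taken from any engine: the struct names, MAX_THREADS and cache_line_of are invented for illustration, and the 64-byte line size is the figure mentioned above, so check it for your CPU.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE  64   /* assumed line size, per the 64-byte blocks above */
#define MAX_THREADS 8    /* hypothetical thread limit for this sketch */

/* Packed layout: eight 8-byte counters fit in one 64-byte block, so
   threads updating their own counter still fight over the same line. */
struct packed_counters {
    uint64_t nodes[MAX_THREADS];
};

/* Padded layout: alignas forces each counter onto its own line, and the
   compiler pads the struct out to a full CACHE_LINE bytes. */
struct padded_counter {
    alignas(CACHE_LINE) uint64_t nodes;
};

struct padded_counters {
    struct padded_counter per_thread[MAX_THREADS];
};

/* Compile-time check that the padding really happened. */
static_assert(sizeof(struct padded_counter) == CACHE_LINE,
              "each per-thread counter should occupy exactly one cache line");

/* Index of the cache line that an address falls into. */
uintptr_t cache_line_of(const void *p)
{
    return (uintptr_t)p / CACHE_LINE;
}
```

With the padded layout, thread i would update per_thread[i].nodes and never write to a line owned by another thread. The price is memory: 64 bytes per counter instead of 8.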

Re: AMD Phenom Hex core (SMP performance problem)

Posted: Tue Apr 05, 2011 6:58 am
by Joost Buijs
marcelk wrote: I had terrible speedups with my program until I disabled the Cool'n'Quiet in BIOS.
In fact there is a 'Turbo Core' compatibility problem with older Linux kernels, look at - for instance - the article here: http://www.h-online.com/open/news/item/ ... 93127.html

Re: AMD Phenom Hex core (SMP performance problem)

Posted: Tue Apr 05, 2011 7:33 am
by Joost Buijs
bob wrote: It is possibly a cache issue. You have to be _very_ careful what is shared. Remember that memory is fetched in 64-byte blocks. If you have two adjacent 4-byte or 8-byte values, each being updated by a different thread, your goose is cooked.
I had cache thrashing when I first moved from 2 to many threads.
In my case the distance between the board structs in a splitpoint was too small.

The speedup on 6 cores for my program is now around 4 to 4.75 depending upon the position. And there is still room for improvement because I didn't implement 'the helpful master concept' yet.

Re: AMD Phenom Hex core (SMP performance problem)

Posted: Tue Apr 05, 2011 8:28 am
by michiguel
Joost Buijs wrote:
marcelk wrote: I had terrible speedups with my program until I disabled the Cool'n'Quiet in BIOS.
In fact there is a 'Turbo Core' compatibility problem with older Linux kernels, look at - for instance - the article here: http://www.h-online.com/open/news/item/ ... 93127.html
Thanks Joost and Marcel, it looks like something like this was the problem!

I disabled both Turbo Core and Cool'n'Quiet, and now I get a speed-up of 1.7x using two threads, similar to what I get on my dual.

I probably do not need to disable Turbo Core (as Marcel seems to indicate), but this was my first test. I had the newest kernel, so that should not be a problem. I had the gut feeling that this may happen with programs that use mutexes or semaphores, which put the cores to rest and wake them up.

This was driving me nuts. THANKS!

Miguel

Re: AMD Phenom Hex core (SMP performance problem)

Posted: Tue Apr 05, 2011 8:31 am
by michiguel
bob wrote:
michiguel wrote: Gaviota gets a speed-up of ~1.7x (in both nps and time to ply, roughly) on an AMD dual running two threads. I tested it on an AMD hexacore 1090T, and the speed-up (running two threads to make it comparable) is no more than 1.2x in nodes per second. Awful.
Does anybody have any idea why this could happen? What did I set up wrong with the hardware? Any hint? I cannot imagine it's a software problem... or is it?

Miguel
It is possibly a cache issue. You have to be _very_ careful what is shared. Remember that memory is fetched in 64-byte blocks. If you have two adjacent 4-byte or 8-byte values, each being updated by a different thread, your goose is cooked. That is sometimes called "false sharing". The caches transfer that block back and forth between the cores, killing performance...

As a simple test, run two separate instances of your program in two different windows and see if each runs at its normal nps. If so, then you likely have a cache issue as above. If not, then you have something going on with the hardware settings...

Make sure you are not using the Intel compiler if you are running on AMD of course.
I built a version that saved almost nothing to memory (counters, hash table, killers, etc.) and still had the problem. I used gcc. I also ran two instances and they were fine (not 100%, but close enough). Apparently, it was this Cool'n'Quiet thing...

Miguel