AMD Phenom Hex core (SMP performance problem)
Gaviota has a speedup of ~1.7x (both nps and time to ply, roughly) on an AMD dual-core running two threads. I tested it on an AMD hexacore 1090T and the speedup (running two threads to make it comparable) is not more than 1.2x in nodes per second. Awful.
Does anybody have any idea why this could be possible? What did I set up wrong with the hardware? Any hint? I cannot imagine it's a software problem... or is it?
Miguel
Re: AMD Phenom Hex core (SMP performance problem)
michiguel wrote:
> Gaviota has a speedup of ~1.7x (both nps and time to ply, roughly) on an AMD dual-core running two threads. I tested it on an AMD hexacore 1090T and the speedup (running two threads to make it comparable) is not more than 1.2x in nodes per second. Awful.
> Does anybody have any idea why this could be possible? What did I set up wrong with the hardware? Any hint? I cannot imagine it's a software problem... or is it?
> Miguel
Does your motherboard support the Phenom II X6? I had to update my BIOS to get it working correctly. As long as all the cores are detected by the OS, you should be fine.
I don't see any such problem with my CPU (Phenom II X6 1055T). NPS with six cores is not 6x, which is to be expected given other bottlenecks, but with two cores the speedup is certainly close to 2x. Turbo mode should be in use as long as <= 3 cores are being used, so turbo shouldn't be the reason for what you're seeing.
How can I run your test here to compare?
Re: AMD Phenom Hex core (SMP performance problem)
rbarreira wrote:
> Does your motherboard support the Phenom II X6? I had to update my BIOS to get it working correctly. As long as all the cores are detected by the OS, you should be fine.
When I run all 6 threads for 2-3 minutes, I see all the cores detected and busy, so I think the cores are detected.
rbarreira wrote:
> I don't see any such problem with my CPU (Phenom II X6 1055T). NPS with six cores is not 6x, but that's to be expected with other bottlenecks, but with two cores the speed-up is certainly close to 2x. The turbo mode should be in use as long as <= 3 cores are being used, so turbo shouldn't be the reason for what you're seeing.
> How can I run your test here to compare?
Thanks, that would be fantastic. Tonight when I get home I will post a binary for you to download.
Linux? Windows? 32 or 64 bits?
I observed the problems with Linux 64.
Miguel
Re: AMD Phenom Hex core (SMP performance problem)
michiguel wrote:
> Gaviota has a speedup of ~1.7x (both nps and time to ply, roughly) on an AMD dual-core running two threads. I tested it on an AMD hexacore 1090T and the speedup (running two threads to make it comparable) is not more than 1.2x in nodes per second. Awful.
> Does anybody have any idea why this could be possible? What did I set up wrong with the hardware? Any hint? I cannot imagine it's a software problem... or is it?
> Miguel
I had terrible speedups with my program until I disabled Cool'n'Quiet in the BIOS.
What this function does is throttle down a core when it is not being used. If your processes/threads are idling when there is no work for them, this will hurt because there is a delay between throttling down and ramping back up. This was the case with my program: it was effectively operating at the throttled-down speed all the time.
As a bonus, when you disable Cool'n'Quiet, each core gets locked at its "turbo core" speed (which is 3.6GHz for the 1090T and 3.7GHz for the 1100T). This is a slight overclock because not all 6 cores are supposed to run at this speed simultaneously, but I have had no problems running it 24/7 for a few months now.
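For anyone who wants to check from the OS side whether frequency scaling is in play before rebooting into the BIOS, here is a minimal C sketch. It assumes a Linux system that exposes the cpufreq sysfs interface under /sys/devices/system/cpu; the function names (cpufreq_path, read_first_line, print_cpufreq_state) are made up for this illustration and are not part of any engine mentioned in the thread.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Build the sysfs path for one cpufreq attribute of one CPU,
 * e.g. cpu 0 + "scaling_governor".
 * Returns 0 on success, -1 if the path does not fit in the buffer. */
static int cpufreq_path(int cpu, const char *attr, char *out, size_t len)
{
    assert(attr != NULL && out != NULL);
    int n = snprintf(out, len, "/sys/devices/system/cpu/cpu%d/cpufreq/%s",
                     cpu, attr);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Read the first line of a file into buf, stripping the trailing newline.
 * Returns 0 on success, -1 if the file cannot be opened or is empty. */
static int read_first_line(const char *path, char *buf, size_t len)
{
    assert(path != NULL && buf != NULL && len > 0);
    FILE *f = fopen(path, "r");
    if (f == NULL)
        return -1;
    char *ok = fgets(buf, (int)len, f);
    fclose(f);
    if (ok == NULL)
        return -1;
    buf[strcspn(buf, "\n")] = '\0';
    return 0;
}

/* Print the governor and current frequency (kHz) for CPUs 0..ncpus-1.
 * CPUs without a cpufreq directory are reported as unavailable. */
static void print_cpufreq_state(int ncpus)
{
    char path[128], gov[64], freq[64];
    for (int cpu = 0; cpu < ncpus; cpu++) {
        if (cpufreq_path(cpu, "scaling_governor", path, sizeof path) != 0 ||
            read_first_line(path, gov, sizeof gov) != 0) {
            printf("cpu%d: cpufreq not available\n", cpu);
            continue;
        }
        if (cpufreq_path(cpu, "scaling_cur_freq", path, sizeof path) != 0 ||
            read_first_line(path, freq, sizeof freq) != 0)
            strcpy(freq, "?");
        printf("cpu%d: governor=%s cur_freq=%s kHz\n", cpu, gov, freq);
    }
}
```

Calling print_cpufreq_state(6) from a small main() while the engine searches with two threads should show whether the busy cores are reported at a reduced clock. Whether this reflects exactly what the BIOS Cool'n'Quiet switch controls on a given board is an assumption worth verifying.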
Re: AMD Phenom Hex core (SMP performance problem)
michiguel wrote:
> Gaviota has a speedup of ~1.7x (both nps and time to ply, roughly) on an AMD dual-core running two threads. I tested it on an AMD hexacore 1090T and the speedup (running two threads to make it comparable) is not more than 1.2x in nodes per second. Awful.
> Does anybody have any idea why this could be possible? What did I set up wrong with the hardware? Any hint? I cannot imagine it's a software problem... or is it?
> Miguel
It is possibly a cache issue. You have to be _very_ careful what is shared. Remember that memory is fetched in 64-byte blocks. If you have two adjacent 4-byte or 8-byte values, each being updated by a different thread, your goose is cooked. That is sometimes called "false sharing". The caches transfer that block back and forth between the cores, killing performance...
As a simple test, run two separate instances of your program in two different windows and see if each runs at its normal nps. If so, then you likely have a cache issue as above. If not, then you have something going on with the hardware settings...
Make sure you are not using the Intel compiler if you are running on AMD, of course.
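To make the false-sharing point concrete, here is a minimal C11 sketch of the usual remedy: give each thread's frequently written data its own 64-byte line. The type and function names (packed_counters_t, padded_counter_t, same_cache_line) and the MAX_THREADS bound are hypothetical, and the 64-byte line size is taken from the figure quoted above rather than queried from the hardware.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE  64  /* assumed line size, per the 64-byte blocks above */
#define MAX_THREADS 8   /* hypothetical upper bound on search threads */

/* Risky layout: eight 8-byte counters packed into 64 bytes, so
 * neighbouring threads' counters will usually share a cache line. */
typedef struct {
    uint64_t nodes[MAX_THREADS];
} packed_counters_t;

/* Safer layout: alignas pads each counter out to a full cache line,
 * so a write by one thread does not invalidate another thread's line. */
typedef struct {
    alignas(CACHE_LINE) uint64_t nodes;
} padded_counter_t;

typedef struct {
    padded_counter_t per_thread[MAX_THREADS];
} padded_counters_t;

/* Compile-time check that the padding really took effect. */
static_assert(sizeof(padded_counter_t) == CACHE_LINE,
              "each per-thread counter should fill exactly one cache line");

/* Returns nonzero if two addresses fall within the same cache line. */
static int same_cache_line(const void *a, const void *b)
{
    return ((uintptr_t)a / CACHE_LINE) == ((uintptr_t)b / CACHE_LINE);
}
```

The same idea applies to any per-thread structure that is written often (node counters, killer tables, board copies): the cost is some wasted memory, which is normally negligible next to the traffic saved.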
Re: AMD Phenom Hex core (SMP performance problem)
marcelk wrote:
> I had terrible speedups with my program until I disabled the Cool'n'Quiet in BIOS.
In fact there is a 'Turbo Core' compatibility problem with older Linux kernels; see, for instance, the article here: http://www.h-online.com/open/news/item/ ... 93127.html
Re: AMD Phenom Hex core (SMP performance problem)
bob wrote:
> It is possibly a cache issue. You have to be _very_ careful what is shared. Remember that memory is fetched in 64-byte blocks. If you have two adjacent 4-byte or 8-byte values, each being updated by a different thread, your goose is cooked.
I had cache thrashing when I first moved from 2 to many threads.
In my case the distance between the board structs in a splitpoint was too small.
The speedup on 6 cores for my program is now around 4 to 4.75, depending on the position. And there is still room for improvement because I haven't implemented the 'helpful master' concept yet.
Re: AMD Phenom Hex core (SMP performance problem)
Joost Buijs wrote:
> marcelk wrote:
> > I had terrible speedups with my program until I disabled the Cool'n'Quiet in BIOS.
> In fact there is a 'Turbo Core' compatibility problem with older Linux kernels; see, for instance, the article here: http://www.h-online.com/open/news/item/ ... 93127.html
Thanks Joost and Marcel, it looks like something like this was the problem!
I disabled both Turbo Core and Cool'n'Quiet and now I have a speedup of 1.7 using two threads, similar to what I have on my dual.
Probably I do not need to disable Turbo Core (as Marcel seems to indicate), but this was my first test. I had the newest kernel, so that should not be a problem. I had the gut feeling that this may happen with programs that use mutexes or semaphores that put the cores to rest and wake them up.
This was driving me nuts. THANKS!
Miguel
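The gut feeling about mutexes and semaphores can be illustrated with a small C sketch of a "spin briefly, then block" wait, which keeps a helper's core busy for a short while before letting it sleep. This is only an illustration of the idea and was not tried in this thread; work_signal_t and its functions are hypothetical names, a single waiting helper per signal is assumed, and a suitable spin limit would have to be found by measurement.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* One signal per helper thread (single waiter assumed). */
typedef struct {
    atomic_bool     ready;  /* true while posted work has not been taken */
    pthread_mutex_t lock;
    pthread_cond_t  cond;
} work_signal_t;

static void work_signal_init(work_signal_t *s)
{
    assert(s != NULL);
    atomic_init(&s->ready, false);
    pthread_mutex_init(&s->lock, NULL);
    pthread_cond_init(&s->cond, NULL);
}

/* Master side: publish work and wake the helper if it is sleeping. */
static void work_signal_post(work_signal_t *s)
{
    pthread_mutex_lock(&s->lock);
    atomic_store(&s->ready, true);
    pthread_cond_signal(&s->cond);
    pthread_mutex_unlock(&s->lock);
}

/* Helper side: poll the flag up to spin_limit times, then fall back to
 * blocking on the condition variable. The flag is consumed either way.
 * Returns true if the work was picked up while spinning, false if the
 * helper had to go through the blocking path. */
static bool work_signal_wait(work_signal_t *s, long spin_limit)
{
    for (long i = 0; i < spin_limit; i++) {
        if (atomic_exchange(&s->ready, false))
            return true;
    }
    pthread_mutex_lock(&s->lock);
    while (!atomic_load(&s->ready))
        pthread_cond_wait(&s->cond, &s->lock);
    atomic_store(&s->ready, false);
    pthread_mutex_unlock(&s->lock);
    return false;
}
```

The trade-off is that spinning burns CPU time and power while waiting, so this is a sketch of one possible experiment rather than a recommendation; with frequency scaling disabled, as reported above, plain blocking waits were apparently good enough.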
Re: AMD Phenom Hex core (SMP performance problem)
bob wrote:
> It is possibly a cache issue. You have to be _very_ careful what is shared. Remember that memory is fetched in 64-byte blocks. If you have two adjacent 4-byte or 8-byte values, each being updated by a different thread, your goose is cooked. That is sometimes called "false sharing". The caches transfer that block back and forth between the cores, killing performance...
> As a simple test, run two separate instances of your program in two different windows and see if each runs at its normal nps. If so, then you likely have a cache issue as above. If not, then you have something going on with the hardware settings...
> Make sure you are not using the Intel compiler if you are running on AMD, of course.
I built a version that stored almost nothing in memory (counters, hashtable, killers, etc.) and still had the problem. I used gcc. I also ran two instances and they were fine (not 100% but close enough). Apparently, it was this Cool'n'Quiet thing...
Miguel