hi John,

johnhamlen wrote:
Gerd, Sven, Edmund and Daniel have been kind enough to be giving me advice on SIMD-friendly move generation over on this thread:
http://www.talkchess.com/forum/viewtopic.php?t=43971
A good piece of advice from Daniel was that I should be "...first picking up one of better gpus with the larger cache size. It is hard to fit everything on the L1."
I promised that I'd start a separate thread, so ... any thoughts on:
1) AMD vs Nvidia for compute devices. I've found this rather wonderful web-site on my travels: http://www.clbenchmark.com/result.jsp . It seems AMD's 7000 series has taken back the compute crown, even though Nvidia's 600 series is winning all the frames-per-second gaming reviews.
2) It seems many AMD cards report 32KB of local memory while many Nvidia cards report 48KB. Is this available to every processing element? e.g. the Nvidia GTX 550 Ti has 192 "CUDA cores" — does that mean every one of them has 48KB of local memory, i.e. a total of 9MB of fast local memory split evenly across the device?
Many thanks in advance,
John
Things work differently on AMD versus Nvidia.
On AMD you have OpenCL, and every SIMD must execute the same instructions at the same time (roughly).
On Nvidia every SIMD can execute a different instruction stream, so SMP-style parallelism is easier there.
For Nvidia there is also the Tesla series. Get a Fermi-type Tesla and of course you don't need much RAM.
Executing different instruction streams is a massive advantage for computer chess.
As for cache sizes, forget the caches; they work differently from what you're used to.
Nvidia's (Fermi) L1 is 64KB per multiprocessor, and you can choose how to split it between shared memory and L1 data cache. It is not a data cache in the way you'd guess it works.
AMD has separate caches there: an 8KB L1 plus a local shared memory of 32KB per compute unit.
8KB is a huge limitation for computer chess, if I may say so.
If you don't want to get a Tesla, get an Nvidia 580 to toy with.
Note there is a way to get things working on AMD, but it is going to complicate things a lot for you: you need to run several wavefronts after each other to get anything done, and that is very complicated.
As for the GTX 680, I'm not so sure you'd like it as a GPGPU platform; you'd need to research that. For prime-number work that 680 is really slow, but I didn't figure out the details of why and how. You'll want to look into that yourself.
For initial experiments I'd suggest picking up a cheap second-hand Fermi card from eBay.
Make sure it's a Fermi with 32 CUDA cores in each SIMD. Note that all kinds of names get mixed up everywhere:
in OpenCL the same thing is called a compute unit, and AMD has 64 cores in each compute unit.
The problem with GPGPU is the time it costs you to figure out how fast every individual instruction runs and how well the compiler is doing things for you.
The GPUs in fact have a lot of hardware instructions for conditional moves that replace simple branches, but figuring out which branches the compiler knows how to replace is going to eat most of your time, as branches are too slow.