TalkChess.com

Posted: **Sun Jun 10, 2012 4:52 pm**

Gerd, Sven, Edmund and Daniel have been kind enough to give me advice on SIMD-friendly move generation over on this thread:
http://www.talkchess.com/forum/viewtopic.php?t=43971

A good piece of advice from Daniel was that I should be:

...first picking up one of better gpus with the larger cache size. It is hard to fit everything on the L1.

I promised that I'd start a separate thread, so... any thoughts on:

1) AMD vs Nvidia for compute devices. I've found this rather wonderful website on my travels: http://www.clbenchmark.com/result.jsp . It seems like AMD's 7000 series has taken back the compute crown, even though Nvidia's 600 series is winning all the frames-per-second gaming reviews.

2) It seems like many AMD cards report 32 KB of local memory whilst many Nvidia cards report 48 KB. Is this available to every processing element? E.g. the Nvidia GTX 550 Ti has 192 "CUDA cores"; does that mean that every one of them has 48 KB of local memory, i.e. a total of 9 MB of fast, local memory split evenly across the device?

Many thanks in advance,
John

Posted: **Sun Jun 10, 2012 8:12 pm**

Okay, I think I'm getting a bit more of a handle on things.

Here is an overview of AMD's latest GCN architecture. This is used in its 7xxx series of GPUs, is a big departure from their previous designs, and is much better suited to "general purpose" computing, which had previously been almost exclusively Nvidia's domain:
http://www.anandtech.com/show/4455/amds ... -compute/4
Here the GPUs are partitioned into Compute Units (CUs) of 64 ALUs, which share:
256 KB register file => 4 KB each
16 KB "Data L1" => 0.25 KB each

With Nvidia's latest "Kepler" architecture (if I'm reading this correctly: http://www.anandtech.com/show/5699/nvid ... 0-review/2 ), the GPUs are partitioned into SMX units of 192 cores, which share:
256 KB register file => 1.3 KB each
64 KB "Shared Mem/L1 Cache" => 0.3 KB each
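The per-ALU shares above are just an even split of each shared resource across the ALUs of one unit. A minimal sketch of that arithmetic follows; the figures are the ones quoted in this post, while the dictionary layout and the names `UNITS`, `per_alu_kb` and `summarise` are made up for illustration.

```python
# Figures are copied from the post above; names and layout are illustrative only.
UNITS = {
    "AMD GCN CU": {"alus": 64, "register_file_kb": 256, "l1_kb": 16},
    "Nvidia Kepler SMX": {"alus": 192, "register_file_kb": 256, "l1_kb": 64},
}


def per_alu_kb(total_kb, alus):
    """Even split of a shared resource across the ALUs of one unit, in KB."""
    return total_kb / alus


def summarise(units=UNITS):
    """Return the per-ALU register and L1 share for every unit."""
    rows = {}
    for name, unit in units.items():
        rows[name] = {
            "registers_kb_per_alu": per_alu_kb(unit["register_file_kb"], unit["alus"]),
            "l1_kb_per_alu": per_alu_kb(unit["l1_kb"], unit["alus"]),
        }
    return rows


if __name__ == "__main__":
    for name, row in summarise().items():
        print(
            f"{name}: {row['registers_kb_per_alu']:.2f} KB registers, "
            f"{row['l1_kb_per_alu']:.2f} KB L1 per ALU"
        )
```

An even split is only a rough yardstick, since the hardware does not actually reserve a fixed slice per ALU.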

Posted: **Sun Jun 10, 2012 8:56 pm**

2) It seems like many AMD cards report 32 KB of local memory whilst many Nvidia cards report 48 KB. Is this available to every processing element? E.g. the Nvidia GTX 550 Ti has 192 "CUDA cores"; does that mean that every one of them has 48 KB of local memory, i.e. a total of 9 MB of fast, local memory split evenly across the device?

No. Shared memory (or local memory, as AMD calls it) is allocated per multiprocessor, not per CUDA core. Each MP consists of a group of 8 CUDA cores, and the latest ones using the Fermi architecture have 32 CUDA cores per MP. In fact, the 48 KB figure is for Fermi, so you get very little per core.
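To make the difference between the two readings concrete, here is a rough sketch of the arithmetic. The 192 cores and 48 KB come from the question; the 32 cores per multiprocessor is the Fermi figure given in the reply and is only an assumption here, since the actual grouping on a particular card may differ. The function names are hypothetical.

```python
def total_if_per_core_kb(cores, shared_kb):
    """The optimistic reading: every core owns its own block of shared memory."""
    return cores * shared_kb


def total_if_per_mp_kb(cores, cores_per_mp, shared_kb):
    """The reading given in the reply: one block of shared memory per multiprocessor."""
    multiprocessors = cores // cores_per_mp
    return multiprocessors * shared_kb


def share_per_core_kb(cores_per_mp, shared_kb):
    """Even split of one multiprocessor's shared memory across its cores."""
    return shared_kb / cores_per_mp


if __name__ == "__main__":
    cores, shared_kb = 192, 48      # figures from the question
    cores_per_mp = 32               # assumption: Fermi figure quoted in the reply
    print(total_if_per_core_kb(cores, shared_kb), "KB if allocated per core")
    print(total_if_per_mp_kb(cores, cores_per_mp, shared_kb), "KB if allocated per multiprocessor")
    print(share_per_core_kb(cores_per_mp, shared_kb), "KB per core")
```

Under those assumptions the device-wide total drops from about 9 MB to a few hundred KB.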

Posted: **Mon Jun 11, 2012 1:25 am**

Daniel Shawul wrote: No. Shared memory (or local memory, as AMD calls it) is allocated per multiprocessor, not per CUDA core. Each MP consists of a group of 8 CUDA cores, and the latest ones using the Fermi architecture have 32 CUDA cores per MP. In fact, the 48 KB figure is for Fermi, so you get very little per core.

Cheers Daniel. I was being a little optimistic! It's interesting to note from http://www.anandtech.com/show/5699/nvid ... 0-review/2 that going from Fermi to Kepler Nvidia has halved the amount of memory available to each CUDA core. The greater number of cores on the die obviously helps the graphics performance, but sacrificing the memory sure isn't going to help general purpose computing.

Posted: **Mon Jun 11, 2012 2:28 am**

johnhamlen wrote:
Daniel Shawul wrote: No. Shared memory (or local memory, as AMD calls it) is allocated per multiprocessor, not per CUDA core. Each MP consists of a group of 8 CUDA cores, and the latest ones using the Fermi architecture have 32 CUDA cores per MP. In fact, the 48 KB figure is for Fermi, so you get very little per core.
Cheers Daniel. I was being a little optimistic! It's interesting to note from http://www.anandtech.com/show/5699/nvid ... 0-review/2 that going from Fermi to Kepler Nvidia has halved the amount of memory available to each CUDA core. The greater number of cores on the die obviously helps the graphics performance, but sacrificing the memory sure isn't going to help general purpose computing.

Well, I am not that familiar with Kepler, as it is brand new. It seems there are far more cores per SM now after some innovations: for Fermi it was 32, now it is 192. Anyway, you have to remember that an SM runs a lot more threads than it has cores. For example, Fermi can run up to 1024 threads in an SM (at 100% occupancy). If you divide 48 KB by 1024 you get a very, very small amount per thread. When doing my MC simulation, I didn't even have space to store the generated moves temporarily for move sorting and other stuff. So you can understand why I am strongly opposed to a piece-list board representation, which at least in my case would have consumed more than 2 KB per thread. Simply unacceptable.
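As a rough sketch of the budget described above: the 48 KB of shared memory, the 1024 threads per SM and the 2 KB per-thread footprint are the figures from this post, and the helper names are hypothetical. It assumes the per-thread data would have to live entirely in shared memory, which is a simplification.

```python
KB = 1024


def shared_bytes_per_thread(shared_kb, resident_threads):
    """Even split of one SM's shared memory across its resident threads."""
    return (shared_kb * KB) // resident_threads


def max_threads_for_footprint(shared_kb, footprint_bytes):
    """How many threads fit if each one needs footprint_bytes of shared memory."""
    return (shared_kb * KB) // footprint_bytes


if __name__ == "__main__":
    shared_kb, full_occupancy = 48, 1024       # Fermi figures from the post
    piece_list_bytes = 2 * KB                  # lower bound quoted in the post

    print(shared_bytes_per_thread(shared_kb, full_occupancy), "bytes per thread at full occupancy")
    fit = max_threads_for_footprint(shared_kb, piece_list_bytes)
    print(fit, "threads fit with a 2 KB board each,",
          f"i.e. {100 * fit / full_occupancy:.1f}% of full occupancy")
```

Since the post says "more than 2 KB", the thread count this prints should be read as an upper bound.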

Posted: **Mon Jun 11, 2012 3:06 pm**

Daniel Shawul wrote: Well, I am not that familiar with Kepler, as it is brand new. It seems there are far more cores per SM now after some innovations: for Fermi it was 32, now it is 192. Anyway, you have to remember that an SM runs a lot more threads than it has cores. For example, Fermi can run up to 1024 threads in an SM (at 100% occupancy). If you divide 48 KB by 1024 you get a very, very small amount per thread. When doing my MC simulation, I didn't even have space to store the generated moves temporarily for move sorting and other stuff. So you can understand why I am strongly opposed to a piece-list board representation, which at least in my case would have consumed more than 2 KB per thread. Simply unacceptable.

Yes, it's an interesting move by Nvidia. It seemed to have the GPGPU/supercomputer accelerator market sewn up, but its new generation of cards is slower than the last. Maybe this doesn't matter, as most of its revenue probably comes from the gaming market.
Yes, I wasn't thinking of multiple threads per ALU. Still got my single-threaded hat on! Though even with a single thread, the memory allocation isn't generous, is it? What a shame.

BTW - I found this very interesting link that goes into some of these issues. There are also some recent posts on GPU chess. It's not your blog is it?!
http://parallelis.com/k10-why-nvidia-had-to-do-it/

Posted: **Mon Jun 11, 2012 4:53 pm**

I am casually reading about the Kepler architecture now. Despite the huge increase in the number of cores per SMX, the number of registers has barely increased, while the shared memory and cache are kept the same at 64 KB, now serving even more threads.

Well, it seems it is not going to help us chess programmers a lot, but high-performance linear algebra computations would go much faster. With this trend, it seems that doing other things, i.e. parallel move generation, evaluation etc., is a more promising way to harness the full power of the device than trying to use each core to do its own search...

Posted: **Mon Jun 11, 2012 4:56 pm**

http://zeta-chess.blogspot.de/2012/03/n ... -7750.html

http://zeta-chess.blogspot.de/2012/03/a ... mance.html

--
Srdja

Posted: **Tue Jun 12, 2012 1:14 am**

Daniel Shawul wrote: I am casually reading about the Kepler architecture now. Despite the huge increase in the number of cores per SMX, the number of registers has barely increased, while the shared memory and cache are kept the same at 64 KB, now serving even more threads. Well, it seems it is not going to help us chess programmers a lot, but high-performance linear algebra computations would go much faster. With this trend, it seems that doing other things, i.e. parallel move generation, evaluation etc., is a more promising way to harness the full power of the device than trying to use each core to do its own search...

Yes, it all looks rather stark when laid out here: http://blog.cuvilib.com/2012/03/28/nvid ... hitecture/ 6x the number of cores per SM, but the same amount of L1 cache. Kepler is definitely optimised for gaming, but sadly not computer chess gaming.

Cheers
John

Posted: **Sat Jul 14, 2012 4:06 pm**

Working out the integer throughput of different GPUs is a bit complicated...

Meanwhile, I could imagine that Nvidia's Fermi cards (4xx, 5xx) have more integer throughput than AMD's 7xxx...

http://zeta-chess.blogspot.de/2012/07/i ... -gpus.html

http://zeta-chess.blogspot.de/2012/07/a ... gpgpu.html

--
Srdja


Choosing a GPU platform: AMD and Nvidia
