hi John,

johnhamlen wrote:
Gerd, Sven, Edmund and Daniel have been kind enough to be giving me advice on SIMD-friendly move generation over on this thread:
http://www.talkchess.com/forum/viewtopic.php?t=43971
A good piece of advice from Daniel was that I should be "...first picking up one of better gpus with the larger cache size. It is hard to fit everything on the L1."
I promised that I'd start a separate thread, so ... any thoughts on:
1) AMD vs Nvidia for compute devices. I've found this rather wonderful web-site on my travels: http://www.clbenchmark.com/result.jsp . It seems AMD's 7000 series has taken back the compute crown, even though Nvidia's 600 series is winning all the frames-per-second gaming reviews.
2) It seems many AMD cards report 32KB of local memory while many Nvidia cards report 48KB. Is this available to every processing element? e.g. the Nvidia GTX 550 Ti has 192 "CUDA cores" — does that mean every one of them has 48KB of local memory, i.e. a total of 9MB of fast local memory split evenly across the device?
Many thanks in advance,
John
Things work differently on AMD versus Nvidia.
On AMD you have OpenCL, and every SIMD must execute the same instructions at the same time (roughly).
On Nvidia every SIMD can execute a different instruction stream, so SMP-style parallelism is easier there.
For Nvidia there is also the Tesla series. Get a Fermi-type Tesla and of course you don't need much RAM.
Executing different instruction streams is a massive advantage for computer chess.
As for cache sizes, forget the caches; they work differently from what you're used to.
Nvidia's (Fermi) L1 is 64KB per multiprocessor, and you can choose how to split it between shared memory and L1 data cache. It is not a data cache in the way you'd guess it works.
AMD has separate caches there: an 8KB L1 plus a local shared memory of 32KB per compute unit.
8KB is a huge limitation for computer chess, if I may say so.
If you don't want to get a Tesla, get an Nvidia 580 to toy with.
Note there is a way to get things working on AMD, but it is going to complicate things a lot for you: you need to run several wavefronts after each other to get anything done, and that is very complicated.
As for the GTX 680, I'm not so sure you'd like it as a GPGPU platform; you'd need to research that. For prime-number work that 680 is really slow, but I didn't figure out the details of why and how. You'll want to look into that yourself.
For initial experiments I'd suggest picking up a cheap second-hand Fermi card from eBay.
Make sure it's a Fermi with 32 CUDA cores in each SIMD. Note that all kinds of names get mixed up everywhere:
in OpenCL the same thing is called a compute unit, and AMD has 64 cores in each compute unit.
The problem with GPGPU is the time it costs you to figure out how fast every individual instruction runs and how well the compiler is doing things for you.
The GPUs in fact have a lot of hardware instructions for conditional moves that replace simple branches, but figuring out which branches the compiler knows how to replace is going to eat most of your time, as branches are too slow.