Nvidia's K20 with Recursion

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

smatovic
Posts: 2645
Joined: Wed Mar 10, 2010 10:18 pm
Location: Hamburg, Germany
Full name: Srdja Matovic

Nvidia's K20 with Recursion

Post by smatovic »

Heyho,

Nvidia CUDA 5 has recursion support for the K20 GPU, so one CUDA kernel is able to launch another kernel (dynamic parallelism).

So what do you think:
is it possible to run YBWC efficiently with more than 10,000 threads at about 10 Knps per thread?

--
Srdja
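A quick back-of-envelope check of what those numbers would add up to (both figures are the assumptions from the question, not measurements):

```python
# Back-of-envelope: the aggregate speed implied by the question's numbers.
threads = 10_000         # hypothetical GPU thread count from the question
nps_per_thread = 10_000  # assumed 10 Knps per thread
total_nps = threads * nps_per_thread
print(f"{total_nps:,} nps")  # 100,000,000 nps = 100 Mnps aggregate
```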
diep
Posts: 1822
Joined: Thu Mar 09, 2006 11:54 pm
Location: The Netherlands

Re: Nvidia's K20 with Recursion

Post by diep »

smatovic wrote:Heyho,

Nvidia CUDA 5 has recursion support for the K20 GPU, so one CUDA kernel is able to launch another kernel (dynamic parallelism).

So what do you think:
is it possible to run YBWC efficiently with more than 10,000 threads at about 10 Knps per thread?

--
Srdja
With today's search depths that's possible.

However, you probably want a layered approach on Nvidia.

With so many threads you cannot avoid a layered approach.

Note I don't see why you would need that many threads, as the thing has 2496 cores.

For the 1.15 GHz Teslas I have here, I had in fact also calculated getting to 10k nps per core, by the way.

So that would result in 5000 threads for a single K20 GPU, or 992 on the C2075s I've got here.

In the layered approach you want to distinguish between the last few plies and the inner nodes of the normal search.

The majority of the search you can run without referencing a global hashtable, which will speed up the nps considerably. You could use the 16 kilobytes of L1 (or so) as a temporary hash.
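As a rough illustration of such a per-SIMD scratch hash (the entry size and the replace-always scheme are my assumptions, not from the post): 16 KB holds about 1024 entries of 16 bytes each.

```python
# Toy sketch of a small, fixed-size "scratch" transposition table that
# would fit in ~16 KB of fast local memory (sizes are assumptions).
ENTRY_BYTES = 16
TABLE_BYTES = 16 * 1024
SLOTS = TABLE_BYTES // ENTRY_BYTES  # 1024 slots

table = [None] * SLOTS

def tt_store(zobrist_key: int, depth: int, score: int) -> None:
    # Replace-always: the newest entry wins; good enough for a scratch table
    # that gets flushed to device RAM when the SIMD finishes its subtree.
    table[zobrist_key % SLOTS] = (zobrist_key, depth, score)

def tt_probe(zobrist_key: int):
    entry = table[zobrist_key % SLOTS]
    if entry is not None and entry[0] == zobrist_key:
        return entry  # hit: (key, depth, score)
    return None       # miss, or slot overwritten by a different position

tt_store(0xDEADBEEF, 3, 42)
print(tt_probe(0xDEADBEEF))  # (3735928559, 3, 42)
```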

Then you run different code for generating the splitpoints. So you follow YBW, but basically you always let other SIMDs take over the last few plies of search.

When a SIMD is done, you migrate its 16 KB of L1 into the device RAM, referencing the root position you searched there. So that block of 16 KB then belongs to a specific position.

This allows a single block copy.

The device RAM's hashtable is therefore totally different from a normal hashtable.

You just want the current/next iteration to profit from the fact that you already searched it. This is a bunch of plies closer to the root than the last few plies are.

Only for the positions a lot closer to the root can you use the traditional concept.

Of course, to get a better speedup you probably want a different layer of SMP between the SIMDs, as you can get a pretty good speedup from it.

So within one GPU you have two layers of SMP and quite a lot of different sorts of code.

By not doing global stores during the last few plies + qsearch, you can run 90% of all calculations within each compute unit.

Only on Nvidia can you easily do all this - on AMD it's not possible to run different codes at the same time (despite lots of promises).

So your YBW speedup across the SIMDs should be great. Of course, within each SIMD it's a lot tougher.

By doing this you already limit the YBW problem you describe significantly.

Note that Nvidia has no official PDF on their website yet describing the K20 exactly.

Yet if we divide 2496 by 13 we get roughly 192 stream cores per unit.

So your YBW never deals with 10k threads at all; it's basically 13 SIMDs.

Now between the SIMDs you ideally have your layer of SMP code. And out of those 13 you should see a very good speedup.

Then within each SIMD you have up to 192 cores. That's quite a lot actually.
More than you would ideally like to put to work within a single SMP search.

So the second layer of SMP search is within each SIMD with 192 cores, and you can bet that will be a lot less efficient than the speedup you get in the first layer.
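The K20 arithmetic above, spelled out (core and unit counts as stated in the post):

```python
# K20 layout as described above: 2496 cores spread over 13 SMX units.
total_cores = 2496
smx_count = 13
cores_per_smx = total_cores // smx_count
print(cores_per_smx)  # 192

# So a YBW layer across SMXes coordinates 13 workers, not 10,000 threads;
# the 192 cores inside each SMX form a separate (harder) SMP layer.
```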

Note you also want to search in a different manner. You'll reduce less in the search than the LMR in most modern programs does, as that is very dependent upon the hashtable.

Basically you want to reduce in the part of the search between the SIMDs, where you always have good hashtable information. This search will miss nothing tactical. For the first 35 plies you see everything, with no failure of super-nullmove with super-reductions stacked on top of each other.

That'll give +200 Elo that today's PC software doesn't have.
smatovic
Posts: 2645
Joined: Wed Mar 10, 2010 10:18 pm
Location: Hamburg, Germany
Full name: Srdja Matovic

Re: Nvidia's K20 with Recursion

Post by smatovic »

note i don't see why you would need that many threads, as the thing has 2496 cores.
I thought Fermi needs 4 clock cycles per warp computation,
so you will need to run 4x as many threads as you have cores... not sure about Kepler.
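A rough occupancy estimate under that assumption (the 4-cycles-per-warp figure is the post's claim, not verified against Nvidia's documentation):

```python
# If each core needs ~4 threads in flight to hide issue latency (the
# post's assumption for Fermi-style scheduling), the thread count needed
# to keep a K20's 2496 cores busy is:
cores = 2496
threads_per_core = 4   # assumed latency-hiding factor
resident_threads = cores * threads_per_core
print(resident_threads)  # 9984 -- close to the 10,000 threads in the question
```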
So within 1 gpu you have 2 layers of SMP and quite a lot of different sorts of codes.
I understand:
YBW would not scale well with so many threads, so you have to build layers,
and this kind of design would not be very portable between different GPU architectures.

--
Srdja
diep
Posts: 1822
Joined: Thu Mar 09, 2006 11:54 pm
Location: The Netherlands

Re: Nvidia's K20 with Recursion

Post by diep »

smatovic wrote:
note i don't see why you would need that many threads, as the thing has 2496 cores.
I thought Fermi needs 4 clock cycles per warp computation,
so you will need to run 4x as many threads as you have cores... not sure about Kepler.
So within 1 gpu you have 2 layers of SMP and quite a lot of different sorts of codes.
I understand:
YBW would not scale well with so many threads, so you have to build layers,
and this kind of design would not be very portable between different GPU architectures.

--
Srdja
Two threads per core is OK if you program well. Yet I'm the first to admit that things are not so transparent in the GPGPU world... because of lack of information. AMD is the worst there, but Nvidia also doesn't give out lots of technical data.

Indeed you need complex SMP code for a GPU. In fact you need 4 layers to do it well.

Of course it is portable within Nvidia. There is only Nvidia for computer chess. In theory things are possible on AMD as well - but forget it - Nvidia is easier there thanks to the possibility of running different instruction streams at the same time on different SIMDs. On AMD the entire GPU (or GPUs) has to execute the same instruction at the same time.

The Fermi architecture would probably be the easiest to start with.

If you get a cheap Fermi card with around 448 cores, you already have enough to toy with for a cheap price. For computer chess there is not a big difference between the Teslas and the gamer cards. That difference exists especially for double precision performance, which you do not need.

Other manufacturers you could take a look at are even less attractive. For example Xeon Phi (Larrabee) has vectors of 512 bits or so. That's not interesting to program a chess program for. If you treat each 'integer' in such a vector as a 'core' you play chess with, then you will run into trouble on Larrabee, as treating them independently carries a huge latency price (the instructions to do it are there) of 7+ cycles. So you directly lose the advantage of doing things vectorized on Xeon Phi.

In short, the only cheap, attractive architecture to do GPGPU on is Nvidia.

AMD has the huge disadvantage that all 2048 cores execute the same instruction at the same time, and if you run several different codes after each other, the small L1i of 8 KB will simply kill you.

So from my viewpoint AMD's biggest disadvantage is that there is no hope of ever building a chess program with a decent evaluation function on it, provided you get it going at all.

Even then it would be possible to program for AMD, as you can nowadays also program in assembler for AMD. GPU assembler, yeah. You also have to write your own 'compiler' and binary producer (assembler + linker, huh) if I understand well. Lots of extra work that Nvidia doesn't have. For some major organisations, NCSA type, it's good that AMD provides this assembler route. OpenCL is simply too limited as a programming language compared to what the GPUs have to offer, not to mention OpenCL's other problems. Yet for a small guy like me it is not so interesting to program in it unpaid.

On the Nvidia K20 the problems are of course already tougher than on Nvidia Fermi, as you move up to 192 cores within a single SIMD. Not to mention the 2048 you have to deal with on AMD.

So Nvidia is the only architecture to take into account.

On the SMP layers: you actually need 4 layers, though you can get away with 3 by putting 1 GPU per machine.

a) First you need an SMP layer over the cluster. So for MPI.
b) The second layer is SMP between different GPUs, through the RAM.

Now you might want to kick out the above 2 SMPs, as those are maybe the last thing for you to parallelize, but I will get back to why soon.

c) An SMP layer between the SIMDs, or in the case of the K20 what they call SMX.
d) Vectorized SMP within 1 SIMD.

This is 4 layers of SMP.
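A sketch of how those four layers multiply out (all counts here are hypothetical, just to show the bookkeeping; only the 13 SMX / 192 cores figures come from the thread):

```python
# Hypothetical counts per layer, multiplied into total parallel workers.
layers = [
    ("a) cluster nodes (MPI)",    16),  # assumed cluster size
    ("b) GPUs per node",           1),  # the "1 GPU per machine" setup
    ("c) SMX/SIMD units per GPU", 13),  # K20 as described above
    ("d) cores per SIMD",        192),
]
total = 1
for name, count in layers:
    total *= count
    print(f"{name:28s} x{count:4d} -> {total} workers")
# 16 * 1 * 13 * 192 = 39936 workers in total
```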

Now compare it with the current Diep. It has 2 layers:

a) An MPI layer (still in progress).
b) SMP between cores - comparable with SMP between SIMDs for a GPU.

We already see how complex a) and b) are.

But now the main problem. Suppose you make a supreme effort to parallelize things and get it done. A GPU is very powerful then.

How does it compare if I simply add a few nodes to my cluster?

So in short, a single GPU can of course never compete with an entire cluster. You want *at least* several GPUs.

Now some will say: "no problem".

But I disagree. See how bad the speedup of Deep Blue was, to give one example. Historically they all used pathetic SMP algorithms over clusters / supercomputers.

The first one not to be satisfied with a bad implementation there, that was me.

There are 2 ways to add GPUs: 1 GPU per machine, or several GPUs per machine.

The last is relatively easy. The best solution however is 1 GPU per machine, and just cluster a lot.

Latency between nodes is better anyway than latency from the motherboard to the GPU.

In this manner you would get away with 3 layers of SMP.

So it would be heavenly for me to create such a program, as I'm a big expert there, yet I wouldn't do it unpaid :)

I'm sure I speak for others there as well - though most will not admit this publicly.

It is totally full-time work to do it, let's face it.

A great, interesting challenge - yet such full-time work, and one only some experts will manage to get going well, that you really need to pay a guy full-time for it.

It is possible that with Go you get away with it in a much easier way, as they are still living in the 1970s from a search viewpoint.

Of course you want a similar system to chess programs in Go, just more selectivity - you then search 30-40 plies hands down in Go.

If you do that - of course you have the same problems in search as a chess program. Yet no one has a professional search so far in Go. What happens there is not very professional.

We saw how some programmers who did do well in computer Go in the end didn't get paid for their big efforts. So it's not attractive for Westerners to try just that. It will need to be a Japanese or Chinese guy who is good in game tree search, who some years from now might build not only a good evaluation function, but also a good search.

He'll stumble upon the same problems then that computer chess has right now in SMP search.

But for now it would be easy to parallelize the Monte Carlo type searches they do in computer Go.

Yet for a professional search on a GPU - it's difficult to avoid programming 3 layers of SMP.

I hope I made myself clear there. It's pretty useless to build just 2 layers, as I already have Diep working on a CPU and I already have it working SMP.

So we shouldn't compare a single GPU with a single CPU. My cluster has a switch with 36 ports and I've got 16 network cards.

I've got another network with 16 network cards and a switch with 16 ports. A bit older, but still OK latency.

Still just a few microseconds.

The latency from the GPUs to the CPU is best case 250 microseconds or so.
Bandwidth is a lot better, of course.
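To see why that latency gap matters for splitpoint traffic, a rough comparison under the post's numbers (the 375k nps per-core figure is the estimate from later in the thread; the latencies are the ones quoted above):

```python
# Nodes a single core could have searched while waiting out one round trip.
nps_per_core = 375_000   # assumed per-core speed from the thread's estimate
cluster_latency_us = 3   # "a few microseconds" between cluster nodes
gpu_latency_us = 250     # best-case motherboard <-> GPU round trip

def nodes_during(latency_us: int) -> int:
    """Nodes one core searches during the given latency (integer math)."""
    return nps_per_core * latency_us // 1_000_000

print(nodes_during(cluster_latency_us))  # 1 node per round trip
print(nodes_during(gpu_latency_us))      # 93 nodes per round trip
```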

So the forms of SMP you have to create are also totally different forms of SMP search.

A very interesting challenge. If someone would pay me, I am sure I would show up with something great there.

Yet realize that you can never sell a chess program on a GPU.

It's too big as a hobbyist project. That was my conclusion in 2007, and talking to AMD and Nvidia, it appeared they weren't very interested back then in sponsoring GPUs for it.

That changed later on at Nvidia - yet I got the Teslas for parameter tuning, not to run Diep on them :)

GPUs are very interesting things however. For crunching they are the cheapest solution, and not by a little.

For computer chess - the real problem there is that sometimes some organisations pay the chess programmers - but they would never pay the programmers who would be best at getting things to work on a GPU to do GPU work with them.

As the project in itself is not a commercial project - no good programmer will be interested in carrying out such a project without decent payment.

Only some students would - and the odds they get something OK working are rather small - it already takes a lot of time to learn how to program low-level. GPGPU is yet another challenge on top of that.

One of the biggest problems is getting good, accurate technical information about the GPUs. You don't get that by just reading the PUBLIC documentation.

This is what I concluded after some months of research in 2007, and having a lot of extra information on GPUs now, and having done more with GPUs, both AMD (OpenCL) as well as Nvidia (CUDA), the conclusion is still the same.

It's full-time work, and this is a project only for genius guys, and 0 of the ones I know there would do it unpaid.
diep
Posts: 1822
Joined: Thu Mar 09, 2006 11:54 pm
Location: The Netherlands

Re: Nvidia's K20 with Recursion

Post by diep »

All that said in my previous posting - realize the huge potential of a GPGPU chess program. It's really difficult to build. Yet it will win hands down, of course.

We know reality however. If someone manages to get paid, the odds he is also an expert on SMP search are rather small. And not being an expert there - forget it - it will fail hopelessly.

There are a lot of pitfalls in such a project that can lose you a factor 10 here, a factor n there, and so on, totally wasting any advantage GPUs offer.

That said - it's possible on Nvidia to split the L1 cache such that you can really throw a lot of instructions into the L1 cache. More than on any other manycore architecture.

It's possible to run different codes on different SIMDs - a big advantage.

So the total nps you can reach is really huge. A couple of billion nps is not a problem at all. And that in SOFTWARE.

So without the algorithmic disadvantage that hardware implementations so far had, like Deep Blue and Hydra.

Both were pioneering projects - yet they lost too much to the simple advantages that software had.

In my calculations and experiments, also from others who tried, it is possible to get things going at under a factor 5 cost overhead compared to a CPU.

If we look at other codes, the really good GPGPU coders in fact lose only 50% or so after lots of toying.

If I factor in a factor 5, that's *way* too much, yet let's calculate with it. I had done my calculations *before* Fermi was released - do not forget that.

So using a factor 4 nowadays is more realistic. A factor 4 overhead that GPUs have which you do not have on a CPU in terms of nps speed.

Let's take a Fermi core now - as I don't have experience on the K20 yet.

The 512-core GPUs are up for grabs on eBay - so pricewise there are 0 problems. Just of course a small power usage problem of a few kilowatts.

We take as an example 512 cores at 1.5 GHz.

Instead of the CPU's 1000 clocks per node we now use 4000 clocks per node.

A single core then delivers: 1.5 GHz / 4000 clocks = 375k nps per core.

Of course this is without storing things globally in the hashtable during the last few plies, so you'll lose some to not doing that as well.

In fact, in my calculation I factored in a loss of factor 2.5 here, which is equivalent to using 10k clocks per node.

Yet we were busy with raw NPS speed.

Now we take 375k * 512 cores = 192 million nps.

In my worst-case calculation at the time - again, that wasn't the superior Fermi core back then - I got down to 44 million nps for the Tesla @ 448 cores and 1.15 GHz.

Now it's a matter of throwing a few GPUs into a few machines. Normally spoken it's pretty interesting to build, say, a 16-node cluster.

16 * 192 mln nps = 3 billion nps raw speed.

Of course that's raw speed. If we then go back to the possible losses, those are there of course. In the last plies you'll effectively lose around factor 3 of the efficiency that hashtables give the CPUs, which you do not have there.

Still, that's an effective speed of 1 billion nps.
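The whole estimate above as one calculation (every figure is the post's own assumption, not a measurement):

```python
# Reproducing the back-of-envelope chain from the text.
clock_hz = 1_500_000_000  # 1.5 GHz Fermi-class core
clocks_per_node = 4_000   # assumed 4x overhead vs ~1000 clocks on a CPU
cores = 512
nodes_in_cluster = 16
hashtable_loss = 3        # assumed effective loss near the leaves

nps_per_core = clock_hz // clocks_per_node
raw_gpu_nps = nps_per_core * cores
raw_cluster_nps = raw_gpu_nps * nodes_in_cluster
effective_nps = raw_cluster_nps // hashtable_loss

print(nps_per_core)      # 375_000
print(raw_gpu_nps)       # 192_000_000
print(raw_cluster_nps)   # 3_072_000_000 (~3 billion raw)
print(effective_nps)     # 1_024_000_000 (~1 billion effective)
```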

So the *only problem* is an efficient SMP implementation, as I already explained several times here in this forum.

It's all about how good the SMP implementation is, and there are about half a dozen experts on that on planet Earth, of which I'm one.

With that many cores, your GPGPU program could effectively play like a total beginner, or it could win every single game in the hands of an expert.

That's entirely up to how well the SMP job has been carried out.

We see how clumsily it was done for Hydra and Deep Blue and all the other hybrid SMP type hardware.

Yet realize this GPGPU cluster - *it could be very powerful*.

If we'd use the same number of nodes that Hydra used - 64 cards -
we speak of an amount of nps that overflows calculators that only go up to 32 bits :)

64 * 375 million nps = 24 billion nps.
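The 32-bit overflow remark checks out (375 Mnps per machine is the post's figure here):

```python
# 64 machines at 375 Mnps each vs. the unsigned 32-bit ceiling.
total_nps = 64 * 375_000_000
uint32_max = 2**32 - 1
print(total_nps)               # 24_000_000_000
print(total_nps > uint32_max)  # True: 24e9 doesn't fit in 32 bits
```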

It's no problem to scale the above thing up to 64 machines, by the way - realize this is pretty peanuts for a good SMP search on a cluster.

Yet those 24 billion nps are worth nothing if you don't get a good SMP speedup out of them.

It's just an SMP search problem, you know. Plenty of guys out there to get this massive nps.

Realize also the low hardware price of it.

Suppose you'd do a 36-node cluster. Sure, you need a bit of power to run it.
Count on 500 watts per machine. So 18 kilowatts in total. That's peanuts for some companies, Eastern Europe, and some other areas on this planet, by the way, if the thing is doing well.

The hardware cost is $200 per node or so. $200 for each GPU off eBay, and then a few cables and a second-hand switch, also off eBay for peanuts.

It's a total peanuts price per node.

Now Hydra of course, as far as I can see, was also a cheapskate project if we look at price per node.

Deep Blue on the other hand was a pretty expensive project - producing real CPUs.

Now if you say: "I want it at home, but 1 kilowatt is the max I can draw here".

Then you can use 4 GPUs or so, with a tad fewer cores.

Or maybe just 2x GTX 590 cards.

You'll then get 1 billion nps on a single machine.

So all this is really powerful.

A shame it isn't gonna happen, huh?

Today's society is only about good software, you know.

In case you have such software - then GPUs kick butt.
smatovic
Posts: 2645
Joined: Wed Mar 10, 2010 10:18 pm
Location: Hamburg, Germany
Full name: Srdja Matovic

Re: Nvidia's K20 with Recursion

Post by smatovic »

As you said, building such an SMP GPU solution would be a full-time job...
so I will keep the best-first search I already implemented ;)

--
Srdja
diep
Posts: 1822
Joined: Thu Mar 09, 2006 11:54 pm
Location: The Netherlands

Re: Nvidia's K20 with Recursion

Post by diep »

smatovic wrote:As you said, building such an SMP GPU solution would be a full-time job...
so I will keep the best-first search I already implemented ;)

--
Srdja
If you aren't gonna try to optimize everything full-time, building software like this for GPUs is pretty much nonsense.

p.s. I write this while busy with some GPGPU code - not for chess though :)
smatovic
Posts: 2645
Joined: Wed Mar 10, 2010 10:18 pm
Location: Hamburg, Germany
Full name: Srdja Matovic

Re: Nvidia's K20 with Recursion

Post by smatovic »

p.s. i write that while busy with some gpgpu code - not for chess though :)
...there can be only two topics you are working on:
Bitcoin mining or password cracking ;)

Have fun,
Srdja

ps: the best-first search is not thaaat bad...
diep
Posts: 1822
Joined: Thu Mar 09, 2006 11:54 pm
Location: The Netherlands

Re: Nvidia's K20 with Recursion

Post by diep »

smatovic wrote:
p.s. i write that while busy with some gpgpu code - not for chess though :)
...there can be only two topics you are working on:
Bitcoin mining or password cracking ;)

Have fun,
Srdja

ps: the best-first search is not thaaat bad...
Factorisation code actually.

Bitcoins are a dead end, and password cracking is only for hackers - I'm not a hacker.