Daniel Shawul wrote:diep wrote:Daniel Shawul wrote:bob wrote:Daniel Shawul wrote:bob wrote:Daniel Shawul wrote:
I wasn't thinking GPU at all because the original post was not about a massively-parallel GPU search but was, instead, about the Stockfish search.
GPUs have enough bottlenecks that I have simply not been that interested, other than in casually keeping up with what is being done. But they have a very defined target, and are being optimized toward that target, as opposed to true general-purpose computing.
Wasn't your DTS algorithm originally meant for the Cray, which is a vector processor? Just admit you are too lazy or too busy to do GPU computing.
A vector machine like the Cray has nothing to do with GPUs. The Cray was a true parallel SMP box with up to 32 CPUs and shared, high-bandwidth, low-latency memory.
No time for bullshit. Nobody said they are exactly the same. They certainly share a lot, both being vector processors. You have an algorithm that depends on a certain feature of the hardware (i.e. high bandwidth), just like GPUs do, so it is very appropriate to discuss the bandwidth aspect, which is what we did.
Let's simply not "go there". The Cray and the GPU boxes have next to nothing in common. The GPUs have "no bandwidth" when compared to the Cray machines. The GPUs can not do things like 32 128-bit reads per clock cycle. Chess on a GPU and chess on a Cray really have nothing in common, unless you want to claim that chess on a GPU is equivalent to chess on an N-core PC. The Cray was not SIMD...
Again, nobody said it is SIMD. If it were, it would mean you did a sort of GPU chess in the 70s, which you didn't. So enough with the straw men.
My point is that your algorithm, as you stated many times in your paper, is meant for special hardware, the Cray. It is pure hypocrisy to call GPUs specialized while your own algorithm is itself specialized for something else. AFAIK people make modifications to DTS to make it suitable for current hardware...
The GPU serves thousands of threads, compared to the 4 or 8 threads that CPUs have, but the maximum theoretical bandwidth for GPUs is hundreds of GB/s.
Daniel - if you google you'll find all the old Cray manuals. It's a totally different architecture from today's hardware.
On today's hardware it's all about the weak latencies to the RAM and to the other cores.
When I designed Diep in 2002/2003 for the SGI supercomputer, the Origin 3800 with 512 processors, the guess back then was that I wouldn't succeed in scaling well, just like all the other supercomputer chess programs before Diep; they simply lost a factor 40 to 50 on the NUMA machines. That's when I started to realize how ugly the latency from CPU to CPU on the 512-cpu Origin 3800 was, as I had written a special test program for that; the supercomputer hadn't been tested at all before by the Dutch government - they hadn't yet figured out at the government that one should first test whether something delivers what you paid those dozens of millions for, prior to using it.
Also it's possible I wrote the first program on the planet to test, on a NUMA system, the latency from one CPU to a remote CPU with all CPUs doing that at the same time.
As I found out, only some one-way ping-pong tests had been performed, with the entire box idle; the routers in fact even optimized their caches for such a test, causing a lower latency - totally useless of course if you want to use all cores.
So I first wrote a bunch of tests - or rather, I had to 'waste' on such tests the few runs I could do at the supercomputer.
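For what it's worth, here is a minimal sketch of what such an "all CPUs hammer a remote node at the same time" latency test can look like on today's hardware; pthreads, the first-touch trick and the thread/buffer counts are my own assumptions for illustration, not the actual test program from back then (a real test would also pin each thread to its own node):

[code]
/* Sketch only: every thread pointer-chases through a buffer that was
   first-touched by a DIFFERENT thread, so under a first-touch NUMA policy
   most loads hit a remote node, and all threads stress the interconnect
   at the same time. */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define NTHREADS 8
#define NODES    (1u << 20)          /* pointer-chase entries per buffer  */
#define HOPS     (1u << 22)          /* dependent loads timed per thread  */

static size_t *chain[NTHREADS];      /* chain[i] is first-touched by thread i */
static pthread_barrier_t barrier;

static void *worker(void *arg)
{
    long id = (long)arg;
    unsigned seed = (unsigned)id * 2654435761u + 1;

    /* Build "my" buffer as one big random cycle (Sattolo's algorithm),
       so the hardware prefetcher cannot hide the memory latency. */
    size_t *c = malloc(NODES * sizeof *c);
    for (size_t i = 0; i < NODES; i++) c[i] = i;
    for (size_t i = NODES - 1; i > 0; i--) {
        size_t j = rand_r(&seed) % i, t = c[i];
        c[i] = c[j]; c[j] = t;
    }
    chain[id] = c;

    pthread_barrier_wait(&barrier);  /* wait until every buffer exists */

    /* Chase the chain owned by the NEXT thread: likely remote memory. */
    size_t *remote = chain[(id + 1) % NTHREADS];
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    size_t p = 0;
    for (size_t i = 0; i < HOPS; i++) p = remote[p];
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("thread %ld: %.1f ns per dependent remote load (check %zu)\n",
           id, ns / HOPS, p);
    return NULL;
}

int main(void)
{
    pthread_t tid[NTHREADS];
    pthread_barrier_init(&barrier, NULL, NTHREADS);
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&tid[i], NULL, worker, (void *)i);
    for (long i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);
    return 0;
}
[/code]

The point of the random cycle is that every load depends on the previous one, so the number you measure is the real latency under load and not something the prefetcher or an idle-box ping-pong test hides.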
The situation didn't get better over the past few years. CPUs got a lot faster, yet the latency from one CPU to another node didn't really improve much.
To give an example calculation: each L5420 core I've got here at home in the cluster today is exactly a factor 10 faster than the R14000 CPUs in the Origin 3800.
Yet the blocked read latency to the hashtable on the Origin 3800 was 5.8 microseconds. I found that really bad back then.
Scaled by that factor 10, that would be 580 nanoseconds now, and I can assure you no cluster is going to deliver it to you in 580 ns. It's going to be several microseconds, and you bet I'll write another benchmark to measure that exactly.
So I'll have exact numbers there soon.
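As a sketch only - the message size, repetition count and the use of plain MPI_Send/MPI_Recv are assumptions of mine, not Diep's actual benchmark - measuring such a blocked remote probe on a cluster boils down to timing a request/reply round trip between two ranks on different nodes:

[code]
/* Sketch: rank 0 does a blocking request/reply to rank 1, i.e. the pattern
   of a blocked hashtable probe on a remote node.  Run with at least 2 ranks,
   e.g. mpirun -np 2 across two nodes. */
#include <mpi.h>
#include <stdio.h>

#define REPS  100000
#define ENTRY 16                       /* bytes of one hash entry, assumed */

int main(int argc, char **argv)
{
    int rank;
    char buf[ENTRY] = {0};

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();

    for (int i = 0; i < REPS; i++) {
        if (rank == 0) {               /* "probe": ask, then wait for entry */
            MPI_Send(buf, ENTRY, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, ENTRY, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {        /* "owner": return the entry */
            MPI_Recv(buf, ENTRY, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, ENTRY, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }

    double us = (MPI_Wtime() - t0) / REPS * 1e6;
    if (rank == 0)
        printf("blocked remote probe: %.2f us round trip (%.2f us one way)\n",
               us, us / 2.0);
    MPI_Finalize();
    return 0;
}
[/code]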
But from your previous posts I'm sure you realize the latencies of today's hardware. Around 2005-2007 at Nvidia, when we tested the latency for the first time, the RAM <=> GPU latency wasn't real good either. Actually one would write it using 0.x milliseconds, not microseconds. FPGAs don't do that much better there by the way; Chrilly couldn't easily get under 10 microseconds. Over PCI-e this would be slightly better, but even then it's roughly a microsecond to get through the PCI-e and back.
0.3 - 0.6 microseconds is what you hear from most of the manufacturers.
But on top of that come the card and the switch, and sometimes other routers as well (level 2 and level 3). So I doubt it's much better than that 5.8 us even now.

Vincent, I only commented on what Bob stated about the hardware used in the DTS paper mentioned. Copying the search stack to shared memory every time a processor asks for "HELP" is a bandwidth consumer. The solution for that is a hardware one (the Cray). Obviously that is not common nowadays. The rest is Bob putting words in my mouth (e.g. that Cray Blitz is SIMD) or a straw man. Obviously it couldn't be SIMD or SIMT, since Crafty is neither of those, but it does share the peculiarity of vector processors, the huge bandwidth requirement...
Look Daniel, copying the search stack is not needed on today's hardware.
In Diep it's just setting a few pointers - the search stack is in shared memory.
So your criticism that, for DTS, copying the stack would be a bandwidth consumer on today's PC hardware is not correct. The proof for this is simple - it has worked like this in Diep for 14+ years now, and I've never made a secret of it.
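A minimal sketch of the difference being argued here; the struct layout and function names are invented for illustration, not DTS's or Diep's real data structures:

[code]
#include <string.h>

#define MAXPLY 128

typedef struct {
    int board[64];                 /* position, simplified                 */
    int moves[MAXPLY][256];        /* move lists etc.: the bulky part      */
    int ply, alpha, beta;
} SearchState;

/* DTS-as-in-the-paper style: when a processor asks for HELP, the owner's
   whole search state gets copied into a shared split-point block, so the
   memory traffic grows with the size of the state and the split rate.   */
void attach_by_copy(SearchState *shared_block, const SearchState *owner)
{
    memcpy(shared_block, owner, sizeof *owner);
}

/* Diep style: the search state already lives in shared memory, so a
   helper only stores a pointer to it - a few words of traffic per split,
   independent of how big the search stack is.                            */
void attach_by_pointer(SearchState **helper_view, SearchState *owner)
{
    *helper_view = owner;
}
[/code]

What crosses the interconnect at split time is a whole state in the first case and a few words in the second; how the helper later reads that shared state is a separate question.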
My intention is, and always was, to make an SMP algorithm for Diep that I can use for years to come. Now, it might not be easy to add more cores to future CPUs because of cache coherency and other, mostly software scaling, reasons, yet I don't want to make an SMP algorithm that's outdated in advance a few years from now.
My criticism of DTS I can give in examples, as of course we can never rule out that there are alternative solutions to such problems.
My simple criticism is that there are basically 2 models to search efficiently with DTS: keep all info in some sort of centralized manner, or broadcast everything. The broadcasting I don't see scaling very well, as every core sending its state to every other core is an O(n^2) type solution, which because of its O(n^2) cost qualifies as a "Nimzo-Workstatt-Cheapskate-Big-NPS-Low-Performance type solution".
In the case of some sort of centralization, one can't avoid the problem that all cores will nonstop hammer the centralized data structure asking for a job.
This is a big bandwidth waste on modern hardware - yet a different type of bandwidth waste than the one you referred to.
Now for a 2-socket box @ 32 threads you can still limit this problem somehow, we can't rule that out.
Yet cores hammering a centralized data structure is of course a bandwidth problem that grows rapidly with the number of cores; furthermore, the short-term problem I foresee is that one would need to design it without spinlocks, as those will totally kill you.
This while the algorithm is totally based upon the idea of having every idle thread spin around looking for a job for itself.
That doesn't scale.
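To make the "hammering" concrete, this is roughly the pattern meant (my reading of it, names invented, illustration only): every idle thread keeps grabbing one global lock to peek at a central job board, so the cache line behind that lock bounces between all idle cores nonstop.

[code]
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    pthread_mutex_t lock;
    void *pending_job;               /* NULL means "nothing to split off" */
} JobBoard;

/* The pattern that doesn't scale: spin, lock, peek, unlock, retry.
   Every probe forces an exclusive acquisition of the same cache line,
   so the interconnect traffic grows with the number of idle cores.     */
void *spin_for_job(JobBoard *board, volatile bool *stop)
{
    for (;;) {
        if (*stop)
            return NULL;
        pthread_mutex_lock(&board->lock);
        void *job = board->pending_job;
        if (job)
            board->pending_job = NULL;   /* claim it */
        pthread_mutex_unlock(&board->lock);
        if (job)
            return job;                  /* otherwise: immediately retry */
    }
}
[/code]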