Daniel Shawul wrote:diep wrote:Daniel Shawul wrote:bob wrote:Daniel Shawul wrote:bob wrote:Daniel Shawul wrote:
I wasn't thinking GPU at all because the original post was not about a massively-parallel GPU search but was, instead, about the Stockfish search.
GPUs have enough bottlenecks that I have simply not been that interested, other than in casually keeping up with what is being done. But they have a very defined target, and are being optimized toward that target, as opposed to true general-purpose computing.
Wasn't your DTS algorithm originally meant for the Cray, which is a vector processor? Just admit you are too lazy or too busy to do GPU computing.
A vector machine like the Cray has nothing to do with GPUs. The Cray was a true parallel SMP box with up to 32 CPUs and shared, high-bandwidth, low-latency memory.
No time for bullshit. Nobody said they are exactly the same. They certainly share a lot, both being vector processors. You have an algorithm that depends on a certain feature of the hardware (i.e. high bandwidth), just like GPUs do, so it is very appropriate to discuss the bandwidth aspect, which is what we did.
Let's simply not "go there". The Cray and the GPU boxes have next to nothing in common. The GPUs have "no bandwidth" when compared to the Cray machines. The GPUs can not do things like 32 128-bit reads per clock cycle. Chess on a GPU and chess on a Cray really have nothing in common, unless you want to claim that chess on a GPU is equivalent to chess on an N-core PC. The Cray was not SIMD...
Again, nobody said it is SIMD. If it were, it would mean you did a sort of GPU chess in the 70s, which you didn't. So enough with the straw men.
My point is that your algorithm, as you stated many times in your paper, is meant for special hardware, the Cray. It is pure hypocrisy to call GPUs specialized while your own algorithm is itself specialized for something else. AFAIK people make modifications to DTS to make it suitable for current hardware...
The GPU serves thousands of threads, compared to the 4 or 8 threads that CPUs have, but the maximum theoretical bandwidth for GPUs is hundreds of GB/s.
Daniel - if you google you'll find all the old Cray manuals. It's a totally different architecture from today's hardware.
On today's hardware it's all about the weak latencies to the RAM and to the other cores.
When I designed Diep in 2002/2003 for the SGI supercomputer, the Origin 3800 with 512 processors, the guess back then was that I wouldn't succeed in scaling well, just like all the other supercomputer chess programs before Diep; they simply lost a factor 40 to 50 on the NUMA machines. That's when I started to realize how ugly the latency from CPU to CPU on the 512-cpu Origin 3800 was, as I had written a special test program for that; the supercomputer hadn't been tested at all before by the Dutch government - they hadn't yet figured out at the government that one should first test whether something delivers what you paid those dozens of millions for, prior to using it.
Also it's possible I wrote the first program on the planet to test, on a NUMA system, the latency from one CPU to a remote CPU with all CPUs doing that at the same time.
As I found out, only some one-way ping-pong tests had been performed, with the entire box idle; the routers in fact even optimized their caches for such a test, causing a lower latency - totally useless of course if you want to use all cores.
So I first wrote a bunch of tests - or rather, I had to 'waste' on such tests the few runs I could do at the supercomputer.
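For what it's worth, here is a minimal sketch of what such an "all CPUs hammer a remote node at the same time" latency test can look like on today's hardware; pthreads, the first-touch trick and the thread/buffer counts are my own assumptions for illustration, not the actual test program from back then (a real test would also pin each thread to its own node):

[code]
/* Sketch only: every thread pointer-chases through a buffer that was
   first-touched by a DIFFERENT thread, so under a first-touch NUMA policy
   most loads hit a remote node, and all threads stress the interconnect
   at the same time. */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define NTHREADS 8
#define NODES    (1u << 20)          /* pointer-chase entries per buffer  */
#define HOPS     (1u << 22)          /* dependent loads timed per thread  */

static size_t *chain[NTHREADS];      /* chain[i] is first-touched by thread i */
static pthread_barrier_t barrier;

static void *worker(void *arg)
{
    long id = (long)arg;
    unsigned seed = (unsigned)id * 2654435761u + 1;

    /* Build "my" buffer as one big random cycle (Sattolo's algorithm),
       so the hardware prefetcher cannot hide the memory latency. */
    size_t *c = malloc(NODES * sizeof *c);
    for (size_t i = 0; i < NODES; i++) c[i] = i;
    for (size_t i = NODES - 1; i > 0; i--) {
        size_t j = rand_r(&seed) % i, t = c[i];
        c[i] = c[j]; c[j] = t;
    }
    chain[id] = c;

    pthread_barrier_wait(&barrier);  /* wait until every buffer exists */

    /* Chase the chain owned by the NEXT thread: likely remote memory. */
    size_t *remote = chain[(id + 1) % NTHREADS];
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    size_t p = 0;
    for (size_t i = 0; i < HOPS; i++) p = remote[p];
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("thread %ld: %.1f ns per dependent remote load (check %zu)\n",
           id, ns / HOPS, p);
    return NULL;
}

int main(void)
{
    pthread_t tid[NTHREADS];
    pthread_barrier_init(&barrier, NULL, NTHREADS);
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&tid[i], NULL, worker, (void *)i);
    for (long i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);
    return 0;
}
[/code]

The point of the random cycle is that every load depends on the previous one, so the number you measure is the real latency under load and not something the prefetcher or an idle-box ping-pong test hides.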
The situation didn't get better over the past few years. CPUs got a lot faster, yet the latency from one CPU to another node didn't really improve much.
To give an example calculation: each L5420 core I've got here at home in the cluster today is exactly a factor 10 faster than the R14000 CPUs in the Origin 3800.
Yet the blocked read latency to the hashtable on the Origin 3800 was 5.8 microseconds. I found that really bad back then.
Scaled by that factor 10, that would be 580 nanoseconds now, and I can assure you no cluster is going to deliver it to you in 580 ns. It's going to be several microseconds, and you bet I'll write another benchmark to measure that exactly.
So I'll have exact numbers there soon.
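As a sketch only - the message size, repetition count and the use of plain MPI_Send/MPI_Recv are assumptions of mine, not Diep's actual benchmark - measuring such a blocked remote probe on a cluster boils down to timing a request/reply round trip between two ranks on different nodes:

[code]
/* Sketch: rank 0 does a blocking request/reply to rank 1, i.e. the pattern
   of a blocked hashtable probe on a remote node.  Run with at least 2 ranks,
   e.g. mpirun -np 2 across two nodes. */
#include <mpi.h>
#include <stdio.h>

#define REPS  100000
#define ENTRY 16                       /* bytes of one hash entry, assumed */

int main(int argc, char **argv)
{
    int rank;
    char buf[ENTRY] = {0};

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();

    for (int i = 0; i < REPS; i++) {
        if (rank == 0) {               /* "probe": ask, then wait for entry */
            MPI_Send(buf, ENTRY, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, ENTRY, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {        /* "owner": return the entry */
            MPI_Recv(buf, ENTRY, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, ENTRY, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }

    double us = (MPI_Wtime() - t0) / REPS * 1e6;
    if (rank == 0)
        printf("blocked remote probe: %.2f us round trip (%.2f us one way)\n",
               us, us / 2.0);
    MPI_Finalize();
    return 0;
}
[/code]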
But from your previous posts I'm sure you realize the latencies of today's hardware. Around 2005-2007 at Nvidia, when we tested the latency for the first time, the RAM <=> GPU latency wasn't real good either. Actually one would write it using 0.x milliseconds, not microseconds. FPGAs don't do that much better there by the way; Chrilly couldn't easily get under 10 microseconds. Over PCI-e this would be slightly better, but even then it's roughly a microsecond to get through the PCI-e and back.
0.3 - 0.6 microseconds is what you hear from most of the manufacturers.
But on top of that come the card and the switch, and sometimes other routers as well (level 2 and level 3). So I doubt it's much better than that 5.8 us even now.

Vincent, I only commented on what Bob stated about the hardware used in the DTS paper mentioned. Copying the search stack to shared memory every time a processor asks for "HELP" is a bandwidth consumer. The solution for that is a hardware one (the Cray). Obviously that is not common nowadays. The rest is Bob putting words in my mouth (e.g. that Cray Blitz is SIMD) or a straw man. Obviously it couldn't be SIMD or SIMT, since Crafty is neither of those, but it does share the peculiarity of vector processors, the huge bandwidth requirement...
Look Daniel, copying the search stack is not needed on today's hardware.
In Diep it's just setting a few pointers - the search stack is in shared memory.
So your criticism that, for DTS, copying the stack would be a bandwidth consumer on today's PC hardware is not correct. The proof for this is simple - it has worked like this in Diep for 14+ years now, and I've never made a secret of it.
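A minimal sketch of the difference being argued here; the struct layout and function names are invented for illustration, not DTS's or Diep's real data structures:

[code]
#include <string.h>

#define MAXPLY 128

typedef struct {
    int board[64];                 /* position, simplified                 */
    int moves[MAXPLY][256];        /* move lists etc.: the bulky part      */
    int ply, alpha, beta;
} SearchState;

/* DTS-as-in-the-paper style: when a processor asks for HELP, the owner's
   whole search state gets copied into a shared split-point block, so the
   memory traffic grows with the size of the state and the split rate.   */
void attach_by_copy(SearchState *shared_block, const SearchState *owner)
{
    memcpy(shared_block, owner, sizeof *owner);
}

/* Diep style: the search state already lives in shared memory, so a
   helper only stores a pointer to it - a few words of traffic per split,
   independent of how big the search stack is.                            */
void attach_by_pointer(SearchState **helper_view, SearchState *owner)
{
    *helper_view = owner;
}
[/code]

What crosses the interconnect at split time is a whole state in the first case and a few words in the second; how the helper later reads that shared state is a separate question.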
My intention is, and always was, to make an SMP algorithm for Diep that I can use for years to come. Now, it might not be easy to add more cores to future CPUs because of cache coherency and other, mostly software scaling, reasons, yet I don't want to make an SMP algorithm that's outdated in advance a few years from now.
My criticism of DTS I can give in examples, as of course we can never rule out that there are alternative solutions to such problems.
My simple criticism is that there are basically 2 models to search efficiently with DTS: keep all info in some sort of centralized manner, or broadcast everything. The broadcasting I don't see scaling very well, as every core sending its state to every other core is an O(n^2) type solution, which because of its O(n^2) cost qualifies as a "Nimzo-Workstatt-Cheapskate-Big-NPS-Low-Performance type solution".
In the case of some sort of centralization, one can't avoid the problem that all cores will nonstop hammer the centralized data structure asking for a job.
This is a big bandwidth waste on modern hardware - yet a different type of bandwidth waste than the one you referred to.
Now for a 2-socket box @ 32 threads you can still limit this problem somehow, we can't rule that out.
Yet cores hammering a centralized data structure is of course a bandwidth problem that grows rapidly with the number of cores; furthermore, the short-term problem I foresee is that one would need to design it without spinlocks, as those will totally kill you.
This while the algorithm is totally based upon the idea of having every idle thread spin around looking for a job for itself.
That doesn't scale.
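To make the "hammering" concrete, this is roughly the pattern meant (my reading of it, names invented, illustration only): every idle thread keeps grabbing one global lock to peek at a central job board, so the cache line behind that lock bounces between all idle cores nonstop.

[code]
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    pthread_mutex_t lock;
    void *pending_job;               /* NULL means "nothing to split off" */
} JobBoard;

/* The pattern that doesn't scale: spin, lock, peek, unlock, retry.
   Every probe forces an exclusive acquisition of the same cache line,
   so the interconnect traffic grows with the number of idle cores.     */
void *spin_for_job(JobBoard *board, volatile bool *stop)
{
    for (;;) {
        if (*stop)
            return NULL;
        pthread_mutex_lock(&board->lock);
        void *job = board->pending_job;
        if (job)
            board->pending_job = NULL;   /* claim it */
        pthread_mutex_unlock(&board->lock);
        if (job)
            return job;                  /* otherwise: immediately retry */
    }
}
[/code]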