Trying to improve lazy smp

cdani · Post by **cdani** » Sun Apr 12, 2015 9:12 pm

New improved version:

NewDepth = Depth + (((thread_id + 1) & 1) ^ 1) + (thread_id > 2) + (thread_id > 4) + (thread_id > 6) + (thread_id > 7)

8 threads: +142 (from +134)

I updated the zip file only with the x64 popcnt version.

Dann Corbit · Post by **Dann Corbit** » Mon Apr 13, 2015 1:42 am

cdani wrote:New improved version:

NewDepth = Depth + (((thread_id + 1) & 1) ^ 1) + (thread_id > 2) + (thread_id > 4) + (thread_id > 6) + (thread_id > 7)

8 threads: +142 (from +134)

I updated the zip file only with the x64 popcnt version.

This is something completely new in search. I would like to see the theory that accounts for it.

In some way it makes sense, in that we are verifying/trusting forward moves. Something like guessing on the ponder node, assuming that the opponent will make the move that you guess.

But we are also putting less effort at the root.

So the big question is why does this change make for better chess play?

cdani · Post by **cdani** » Mon Apr 13, 2015 7:57 am

Dann Corbit wrote: This is something completely new in search. I would like to see the theory that accounts for it.

In some way it makes sense, in that we are verifying/trusting forward moves. Something like guessing on the ponder node, assuming that the opponent will make the move that you guess.

But we are also putting less effort at the root.

So the big question is why does this change make for better chess play?

My motivation was that 8 threads at the same depth or depth+1 were doing badly, so I thought, why not try more.

I can think, for example, that it helps when at depth 2 thread 0 thinks that a move is good, but thread 7 discovers that it is a bad move. Then at the next iteration thread 0, which will work at depth 3, will benefit from this previous search and find immediately that the move is bad, a lot earlier than it normally would have. Thread 7 can sometimes find very quickly that the move is bad because it is following a long tactical line that was simply cut early at previous depths.

Following the example, all the other threads will also know this very soon, because the time to go from depth 2 to depth 3 at the root, which sets the pace for all the threads, is very short compared to the time the other threads normally need for their own depths; remember that they are working at higher depths.

Of course, the threads other than the root one do not have enough time to finish their iterations most of the time, but statistically they are seeing the most probable moves in their limited time, the first ones in their move lists, and this will save some time everywhere.

Also, the known randomness of the MP search surely helps.

Just my guesses.

cdani · Post by **cdani** » Mon Apr 13, 2015 8:43 am

I guess that the optimal distribution of depths between threads will be something like more threads near the base depth and fewer far from it. It will be very interesting to see what slowly takes shape.

Also, I must try to restart immediately a thread that finishes quickly. Right now it just waits for the next iteration.

matthewlai · Post by **matthewlai** » Mon Apr 13, 2015 10:44 am

cdani wrote: After those attempts, I will try to obtain access to a 12 or 16 core machine, to see how this must be modified to scale well on those machines. Maybe some of you have experience with an ISP that offers such services. I will pay just for a month, because I suppose it will not be cheap, but I think it will be enough.

I would recommend Amazon EC2 for that.

With cc2.8xlarge, for example, you can get an entire dual socket Intel Xeon E5-2670 machine (2x8 cores, 32 hyperthreads) for about $2/hour, rounded up to the hour.

Or, if you don't mind your instance (VM) being shut down randomly (in my experience, that very rarely happens), you can get an instance from Amazon's spare capacity for about 1/10 to 1/5 the price.

However, that's now a NUMA system (since it's dual socket), so that makes things a little more complicated, if your engine doesn't already support NUMA.

AlvaroBegue · Post by **AlvaroBegue** » Mon Apr 13, 2015 11:01 am

cdani wrote:New improved version:

NewDepth = Depth + (((thread_id + 1) & 1) ^ 1) + (thread_id > 2) + (thread_id > 4) + (thread_id > 6) + (thread_id > 7)

What's the point of this complicated formula? To begin with, isn't `(((thread_id + 1) & 1) ^ 1)' the same thing as `(thread_id & 1)'? Also, if you are testing with 8 threads `(thread_id > 7)' is always 0.

Finally, since the only feature of this mapping that can possibly matter is how many threads add how much depth, wouldn't it make more sense to describe this particular setting (for 8 threads) as {2, 2, 2, 1, 1} (2 threads add 0 plies, 2 threads add 1 ply, 2 threads add 2 ply, 1 thread adds 3 ply, 1 thread adds 4 ply)?
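
Both observations above can be checked mechanically. The following is a minimal sketch, not code from the engine: the function names `posted_offset`, `simplified_offset` and `offset_histogram` are made up for illustration, and thread ids are assumed to be 0-based, as the formula implies.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Depth offset exactly as posted in the thread (thread_id is 0-based).
int posted_offset(int thread_id) {
    return (((thread_id + 1) & 1) ^ 1) + (thread_id > 2) + (thread_id > 4)
         + (thread_id > 6) + (thread_id > 7);
}

// Same formula with the first term reduced to (thread_id & 1).
int simplified_offset(int thread_id) {
    return (thread_id & 1) + (thread_id > 2) + (thread_id > 4)
         + (thread_id > 6) + (thread_id > 7);
}

// Count how many of the first `threads` threads receive each offset.
// The formula can never yield more than 5, so six buckets are enough.
std::array<int, 6> offset_histogram(int threads) {
    std::array<int, 6> counts{};
    for (int id = 0; id < threads; ++id) {
        int offset = posted_offset(id);
        assert(offset >= 0 && offset < 6);
        ++counts[static_cast<std::size_t>(offset)];
    }
    return counts;
}
```

For thread ids 0 to 7 the posted formula gives the offsets 0, 1, 0, 2, 1, 3, 2, 4, so `offset_histogram(8)` counts 2, 2, 2, 1, 1 threads at +0 to +4 plies, which matches the {2, 2, 2, 1, 1} description.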

cdani · Post by **cdani** » Mon Apr 13, 2015 3:07 pm

matthewlai wrote:
cdani wrote: After those attempts, I will try to obtain access to a 12 or 16 core machine, to see how this must be modified to scale well on those machines. Maybe some of you have experience with an ISP that offers such services. I will pay just for a month, because I suppose it will not be cheap, but I think it will be enough.
I would recommend Amazon EC2 for that.

With cc2.8xlarge, for example, you can get an entire dual socket Intel Xeon E5-2670 machine (2x8 cores, 32 hyperthreads) for about $2/hour, rounded up to the hour.

Or, if you don't mind your instance (VM) being shut down randomly (in my experience, that very rarely happens), you can get an instance from Amazon's spare capacity for about 1/10 to 1/5 the price.

However, that's now a NUMA system (since it's dual socket), so that makes things a little more complicated, if your engine doesn't already support NUMA.

Thanks!! I will review this. I have never worked with NUMA, so I will start now.

cdani · Post by **cdani** » Mon Apr 13, 2015 3:13 pm

AlvaroBegue wrote:
cdani wrote:New improved version:

NewDepth = Depth + (((thread_id + 1) & 1) ^ 1) + (thread_id > 2) + (thread_id > 4) + (thread_id > 6) + (thread_id > 7)
What's the point of this complicated formula? To begin with, isn't `(((thread_id + 1) & 1) ^ 1)' the same thing as `(thread_id & 1)'? Also, if you are testing with 8 threads `(thread_id > 7)' is always 0.

You are right. Initially I wanted to work with 1..8 instead of 0..7, and at some point I just got it wrong. I will review this. Thanks!

AlvaroBegue wrote: Finally, since the only feature of this mapping that can possibly matter is how many threads add how much depth, wouldn't it make more sense to describe this particular setting (for 8 threads) as {2, 2, 2, 1, 1} (2 threads add 0 plies, 2 threads add 1 ply, 2 threads add 2 ply, 1 thread adds 3 ply, 1 thread adds 4 ply)?

Sure, I plan to simplify this into a more readable form. Until now I was just adding a change each time.

In my next tries, I will do it better.

bob · Post by **bob** » Mon Apr 13, 2015 10:55 pm

cdani wrote:
AlvaroBegue wrote:
cdani wrote:New improved version:

NewDepth = Depth + (((thread_id + 1) & 1) ^ 1) + (thread_id > 2) + (thread_id > 4) + (thread_id > 6) + (thread_id > 7)
What's the point of this complicated formula? To begin with, isn't `(((thread_id + 1) & 1) ^ 1)' the same thing as `(thread_id & 1)'? Also, if you are testing with 8 threads `(thread_id > 7)' is always 0.
You are right. Initially I wanted to work with 1..8 instead of 0..7, and at some point I just got it wrong. I will review this. Thanks!

AlvaroBegue wrote: Finally, since the only feature of this mapping that can possibly matter is how many threads add how much depth, wouldn't it make more sense to describe this particular setting (for 8 threads) as {2, 2, 2, 1, 1} (2 threads add 0 plies, 2 threads add 1 ply, 2 threads add 2 ply, 1 thread adds 3 ply, 1 thread adds 4 ply)?
Sure, I plan to simplify this into a more readable form. Until now I was just adding a change each time.

In my next tries, I will do it better.

Make a note: "quick and dirty" is often "quick and buggy". Done that too many times, haven't done it in quite a while however, as I simply always "measure twice, cut once" nowadays. My dad had it right.

cdani · Post by **cdani** » Tue Apr 14, 2015 12:17 am

bob wrote: Make a note: "quick and dirty" is often "quick and buggy". Done that too many times, haven't done it in quite a while however, as I simply always "measure twice, cut once" nowadays. My dad had it right.

Yes

Well. I will try to improve.

New best version at 4 threads: +122 Elo (from +117)

Not published for the moment; I'm trying it at 8 threads.

Depth added to each of the four threads, in readable format: 0 (obviously), +1, +2, +3.

Yes, it is the one I wanted to try first, but out of prudence I arrived at it slowly.
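
One readable way to express such a setting is a per-thread offset table instead of a boolean formula. This is only a sketch under assumptions: the names `kOffsets4`, `kOffsets8` and `thread_depth` are hypothetical, the 4-thread row is the 0, +1, +2, +3 setting from this post, and the 8-thread row is what the earlier formula evaluates to for thread ids 0 to 7. It is not the engine's actual code.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical tables: depth added per thread, indexed by 0-based thread id.
// 4 threads: the setting described in the post (0, +1, +2, +3).
constexpr std::array<int, 4> kOffsets4 = {0, 1, 2, 3};
// 8 threads: values the earlier posted formula produces for ids 0..7.
constexpr std::array<int, 8> kOffsets8 = {0, 1, 0, 2, 1, 3, 2, 4};

// Depth a given thread searches at, for a given root iteration depth.
template <std::size_t N>
int thread_depth(int base_depth, const std::array<int, N>& offsets,
                 std::size_t thread_id) {
    assert(thread_id < N);  // the table must cover every running thread
    return base_depth + offsets[thread_id];
}
```

With a table like this, trying a new distribution means editing one row of numbers, for example `thread_depth(10, kOffsets4, 3)` evaluates to 13.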
