Trying to improve lazy smp

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

User avatar
cdani
Posts: 2204
Joined: Sat Jan 18, 2014 10:24 am
Location: Andorra

Re: Trying to improve lazy smp

Post by cdani »

New improved version:

NewDepth = Depth + (((thread_id + 1) & 1) ^ 1) + (thread_id > 2) + (thread_id > 4) + (thread_id > 6) + (thread_id > 7)

8 threads: +142 (from +134)

I updated the zip file only with the x64 popcnt version.
Dann Corbit
Posts: 12538
Joined: Wed Mar 08, 2006 8:57 pm
Location: Redmond, WA USA

Re: Trying to improve lazy smp

Post by Dann Corbit »

cdani wrote:New improved version:

NewDepth = Depth + (((thread_id + 1) & 1) ^ 1) + (thread_id > 2) + (thread_id > 4) + (thread_id > 6) + (thread_id > 7)

8 threads: +142 (from +134)

I updated the zip file only with the x64 popcnt version.
This is something completely new in search. I would like to see the theory that accounts for it.

In some way it makes sense, in that we are verifying/trusting forward moves. Something like guessing on the ponder node, assuming that the opponent will make the move that you guess.

But we are also putting less effort at the root.

So the big question is why does this change make for better chess play?
User avatar
cdani
Posts: 2204
Joined: Sat Jan 18, 2014 10:24 am
Location: Andorra

Re: Trying to improve lazy smp

Post by cdani »

Dann Corbit wrote: This is something completely new in search. I would like to see the theory that accounts for it.

In some way it makes sense, in that we are verifying/trusting forward moves. Something like guessing on the ponder node, assuming that the opponent will make the move that you guess.

But we are also putting less effort at the root.

So the big question is why does this change make for better chess play?
My motivation was that 8 threads at the same depth or depth+1 was doing bad, so I thought why not try more.

I can think for example that it helps when at depth 2 at thread 0 thinks that a move is good, but the thread 7 discovers that is bad move, so at next iteration the thread 0, that will work at depth 3, will benefit of this previous search and find immediately that this move is bad, a lot earlier that it would have been normally. This thread 7 sometimes can find very quick that the move is bad because it is following a long tactical line that was simply cut early in previous depths.

Following the example, all the other threads will also know this very soon, as the time that go from depth 2 to depth 3 at root, the time that sets the pace in all the threads, is very quick compared to the time that takes normally the depth of other threads, that remember that are working at higher depths.

Of course those other threads than the root one does not have enough time to finish their iterations most times, but statistically they are seeing the most probable moves with his limited time, the first ones in his search list, and this will save some time everywhere.

Also the known randomness of the MP search sure helps.

Just my guesses.
User avatar
cdani
Posts: 2204
Joined: Sat Jan 18, 2014 10:24 am
Location: Andorra

Re: Trying to improve lazy smp

Post by cdani »

I guess that the optimal distribution of depths between threads will be something like more threads near the basic depth, and less far from this basic depth. Will be very interesting to see what is slowly being shaped.

Also I must try to restart immediately one thread that finishes quickly. Now it just waits to the next iteration.
matthewlai
Posts: 793
Joined: Sun Aug 03, 2014 4:48 am
Location: London, UK

Re: Trying to improve lazy smp

Post by matthewlai »

cdani wrote: After those attempts, I will try to obtain access to a 12 or 16 core machine, to see how this must be modified to scale well at those machines. May be some of you have experience on an ISP that offers such services. I will pay just for a month, because I suppose it will not be any cheap, but I think it will be enough.
I would recommend Amazon EC2 for that.

With cc2.8xlarge for example, you can get an entire dual socket Intel Xeon E5-2670 machine (2x8 cores, 32 hyperthreads) for about $2/hour, rounded up to hour.

Or, if you don't mind your instance (VM) being shut down randomly (in my experience, that very rarely happens), you can get an instance from Amazon's spare capacity for about 1/10 to 1/5 the price.

However, that's now a NUMA system (since it's dual socket), so that makes things a little more complicated, if your engine doesn't already support NUMA.
Disclosure: I work for DeepMind on the AlphaZero project, but everything I say here is personal opinion and does not reflect the views of DeepMind / Alphabet.
AlvaroBegue
Posts: 931
Joined: Tue Mar 09, 2010 3:46 pm
Location: New York
Full name: Álvaro Begué (RuyDos)

Re: Trying to improve lazy smp

Post by AlvaroBegue »

cdani wrote:New improved version:

NewDepth = Depth + (((thread_id + 1) & 1) ^ 1) + (thread_id > 2) + (thread_id > 4) + (thread_id > 6) + (thread_id > 7)
What's the point of this complicated formula? To begin with, isn't `(((thread_id + 1) & 1) ^ 1)' the same thing as `(thread_id & 1)'? Also, if you are testing with 8 threads `(thread_id > 7)' is always 0.

Finally, since the only feature of this mapping that can possibly matter is how many threads add how much depth, wouldn't it make more sense to describe this particular setting (for 8 threads) as {2, 2, 2, 1, 1} (2 threads add 0 plies, 2 threads add 1 ply, 2 threads add 2 ply, 1 thread adds 3 ply, 1 thread adds 4 ply)?
User avatar
cdani
Posts: 2204
Joined: Sat Jan 18, 2014 10:24 am
Location: Andorra

Re: Trying to improve lazy smp

Post by cdani »

matthewlai wrote:
cdani wrote: After those attempts, I will try to obtain access to a 12 or 16 core machine, to see how this must be modified to scale well at those machines. May be some of you have experience on an ISP that offers such services. I will pay just for a month, because I suppose it will not be any cheap, but I think it will be enough.
I would recommend Amazon EC2 for that.

With cc2.8xlarge for example, you can get an entire dual socket Intel Xeon E5-2670 machine (2x8 cores, 32 hyperthreads) for about $2/hour, rounded up to hour.

Or, if you don't mind your instance (VM) being shut down randomly (in my experience, that very rarely happens), you can get an instance from Amazon's spare capacity for about 1/10 to 1/5 the price.

However, that's now a NUMA system (since it's dual socket), so that makes things a little more complicated, if your engine doesn't already support NUMA.
Thanks!! I will review this. I never worked with NUMA, so I will start.
User avatar
cdani
Posts: 2204
Joined: Sat Jan 18, 2014 10:24 am
Location: Andorra

Re: Trying to improve lazy smp

Post by cdani »

AlvaroBegue wrote:
cdani wrote:New improved version:

NewDepth = Depth + (((thread_id + 1) & 1) ^ 1) + (thread_id > 2) + (thread_id > 4) + (thread_id > 6) + (thread_id > 7)
What's the point of this complicated formula? To begin with, isn't `(((thread_id + 1) & 1) ^ 1)' the same thing as `(thread_id & 1)'? Also, if you are testing with 8 threads `(thread_id > 7)' is always 0.
You have reason. Initially I wanted to work with 1..8 instead of 0..7, and at some point I just did it bad. I will review this. Thanks!

AlvaroBegue wrote: Finally, since the only feature of this mapping that can possibly matter is how many threads add how much depth, wouldn't it make more sense to describe this particular setting (for 8 threads) as {2, 2, 2, 1, 1} (2 threads add 0 plies, 2 threads add 1 ply, 2 threads add 2 ply, 1 thread adds 3 ply, 1 thread adds 4 ply)?
Sure I plan to simplify this in a more readable way. Until now I was adding a change every time.

In my nexts tries, I will do it better :-)
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Trying to improve lazy smp

Post by bob »

cdani wrote:
AlvaroBegue wrote:
cdani wrote:New improved version:

NewDepth = Depth + (((thread_id + 1) & 1) ^ 1) + (thread_id > 2) + (thread_id > 4) + (thread_id > 6) + (thread_id > 7)
What's the point of this complicated formula? To begin with, isn't `(((thread_id + 1) & 1) ^ 1)' the same thing as `(thread_id & 1)'? Also, if you are testing with 8 threads `(thread_id > 7)' is always 0.
You have reason. Initially I wanted to work with 1..8 instead of 0..7, and at some point I just did it bad. I will review this. Thanks!

AlvaroBegue wrote: Finally, since the only feature of this mapping that can possibly matter is how many threads add how much depth, wouldn't it make more sense to describe this particular setting (for 8 threads) as {2, 2, 2, 1, 1} (2 threads add 0 plies, 2 threads add 1 ply, 2 threads add 2 ply, 1 thread adds 3 ply, 1 thread adds 4 ply)?
Sure I plan to simplify this in a more readable way. Until now I was adding a change every time.

In my nexts tries, I will do it better :-)
Make a note: "quick and dirty" is often "quick and buggy". Done that too many times, haven't done it in quite a while however, as I simply always "measure twice, cut once" nowadays. My dad had it right. :)
User avatar
cdani
Posts: 2204
Joined: Sat Jan 18, 2014 10:24 am
Location: Andorra

Re: Trying to improve lazy smp

Post by cdani »

bob wrote: Make a note: "quick and dirty" is often "quick and buggy". Done that too many times, haven't done it in quite a while however, as I simply always "measure twice, cut once" nowadays. My dad had it right. :)
Yes :oops: Well. I will try to improve.

New best version at 4 threads: +122 elo (from +117)

Not published for the moment, I'm triying it at 8 threads.

Added to depths, in readable format, of the four threads: 0 (obviously), +1, +2, +3.

Yes, is the one I wanted to try first, but by prudence I arrived to it slowly.