Hi,
does anyone have experience with what happens when, in an SMP search, all threads analyse the same position?
I've read that one engine uses this with success: because the other processes fill the shared hash table with evaluated positions, this is supposed to speed up the main process.
Has anyone tested this? How do you avoid different processes doing redundant work and analysing the same position at the same time, before it is stored in the hash?
What happens when the hash table gets full? Is there a slowdown or not?
How does it work with many (e.g. 16) CPUs? Does it scale better than normal SMP?
SMP: on same branch instead splitting?
Moderators: hgm, Rebel, chrisw
-
- Posts: 8
- Joined: Fri Jun 28, 2013 9:03 am
- Location: Germany
Re: SMP: on same branch instead splitting?
In the original it was written:
"So, it seems instead of parking threads when they have no work to do, it's better to spin them on the same search tree, since the hashtable was already mostly filled out."
This may improve SMP performance.
Perhaps this can be generalized further, in some way I don't know yet. Maybe somebody will find a clever trick to do SMP other than classical splitting. Maybe searching some kilonodes again is faster than the splitting overhead.
Komodo 8 has a very effective SMP implementation. Everyone is wondering how they do it.
Frank
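The idea quoted above can be sketched in code. This is a hypothetical illustration, not any engine's actual implementation: an otherwise idle helper searches the same position and fills a shared transposition table, so the main search then gets instant hash hits instead of re-searching subtrees. For determinism the "helper" pass runs before the "main" pass here; in a real engine both would run concurrently on separate threads, and the table would be keyed by Zobrist hashes rather than the toy keys used below.

```python
def negamax(node, depth, tt, counter):
    """Toy negamax over nested lists (ints are leaf scores) with a
    shared transposition table. Keys here are (str(node), depth) for
    simplicity; a real engine would use Zobrist hashes and a
    depth-aware replacement scheme."""
    counter[0] += 1                      # count visited nodes
    key = (str(node), depth)
    if key in tt:
        return tt[key]                   # hash hit: no re-search needed
    if isinstance(node, int):
        score = node                     # leaf score
    elif depth == 0:
        score = 0                        # static-eval placeholder at the horizon
    else:
        score = max(-negamax(c, depth - 1, tt, counter) for c in node)
    tt[key] = score
    return score

tt = {}                                  # shared transposition table
tree = [[3, 5], [2, 9]]                  # tiny 2-ply game tree

helper_nodes = [0]
negamax(tree, 3, tt, helper_nodes)       # "helper" fills the table (7 nodes)

main_nodes = [0]
negamax(tree, 3, tt, main_nodes)         # "main" hits the root entry (1 node)
```

The main pass visits only one node because the helper already stored the root, which illustrates the claimed speedup mechanism; it says nothing about the redundant work the helper itself did, which is exactly the question raised in the opening post.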
-
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: SMP: on same branch instead splitting?
Speedup is lousy for more than 2 processors. In fact, it is lousy for 2 as well... there are lots of old threads on this topic.

Highluder wrote:
Hi,
does anyone have experience with what happens when, in an SMP search, all threads analyse the same position?
I've read that one engine uses this with success: because the other processes fill the shared hash table with evaluated positions, this is supposed to speed up the main process.
Has anyone tested this? How do you avoid different processes doing redundant work and analysing the same position at the same time, before it is stored in the hash?
What happens when the hash table gets full? Is there a slowdown or not?
How does it work with many (e.g. 16) CPUs? Does it scale better than normal SMP?
-
- Posts: 2559
- Joined: Fri Nov 26, 2010 2:00 pm
- Location: Czech Republic
- Full name: Martin Sedlak
Re: SMP: on same branch instead splitting?
Hard to say. I'm using lazy SMP as described by Dan Homan, i.e. every helper thread starts crunching at depth+1.
Whenever one of the helpers (or slaves, if you wish) finishes the iteration, all the others are aborted immediately.
The good thing is that there is zero synchronization/copying overhead except at the start of each iteration, no need to specify a minimum split depth, and, most importantly, the implementation is trivial compared to YBW.
I don't have enough data on this, but judging from CCRL I get about 100 Elo for 4 cores vs. 1.
When I look at YBW engines, they get about the same.
Also, from what I've seen in TCEC, it had no problems competing with state-of-the-art SMP implementations (could have been luck, of course).
I also suspect that (some?) YBW engines don't scale at all above 8 cores, but I have no data on how lazy SMP scales (if at all) above 4 cores.
I certainly don't plan to switch to anything else.
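The scheme described above can be sketched roughly as follows. This is an assumed illustration of the general lazy-SMP idea (main thread at depth d, helpers at depth d+1, first finisher aborts the rest), not Martin's or Dan Homan's actual code, and it omits the shared transposition table that does the real work in an engine:

```python
import threading

ABORT = None  # negamax returns None when the stop flag cut it short

def negamax(node, depth, stop):
    """Toy negamax over nested lists (ints are leaf scores) that
    checks a shared stop flag so it can be aborted mid-search."""
    if stop.is_set():
        return ABORT
    if isinstance(node, int):
        return node                      # leaf score
    if depth == 0:
        return 0                         # static-eval placeholder
    best = None
    for child in node:
        v = negamax(child, depth - 1, stop)
        if v is ABORT:
            return ABORT
        best = -v if best is None else max(best, -v)
    return best

def lazy_smp_search(root, depth, n_helpers=3):
    """Main thread searches `depth`, each helper searches depth+1;
    the first thread to finish its iteration aborts all the others."""
    stop = threading.Event()
    results = []
    lock = threading.Lock()

    def worker(d):
        score = negamax(root, d, stop)
        with lock:
            if score is not ABORT and not results:
                results.append(score)    # keep the first finished result
        stop.set()                       # abort everyone else

    threads = [threading.Thread(target=worker,
                                args=(depth if i == 0 else depth + 1,))
               for i in range(1 + n_helpers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results[0]
```

Note how little coordination there is: one event, one lock around the result, and nothing else. That matches the claim that the only overhead sits at iteration boundaries, which is what makes this so much simpler than YBW-style splitting.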