Implementation of multithreaded search in Jazz

hgm · Post by **hgm** » Fri Apr 26, 2013 8:05 pm

The way I implemented it had the threads just walking the currently active tree as outlined in the hash table until they arrived in a node where they could grab a move of their own (i.e. a node marked as split point). That way they never had to pass each other moves or positions at all.

Daniel Shawul · Post by **Daniel Shawul** » Fri Apr 26, 2013 9:43 pm

hgm wrote:The way I implemented it had the threads just walking the currently active tree as outlined in the hash table until they arrived in a node where they could grab a move of their own (i.e. a node marked as split point). That way they never had to pass each other moves or positions at all.

By that do you mean moves are exclusively owned by one processor? This is one way to reduce redundancy that the ABDADA algorithm uses. I had a look into it now since i am interested in implementing some kind of TT driven parallelization. The algorithm tries to avoid redundancy using a two pass algorithm. An exclusivity flag is set and cleared in the TT, and also number of processors incremented/decremented when starting/finishing evaluation of a position. The three main steps are:

First move is searched by all processors (YBW)
A first pass over the rest of the moves where moves are assigned exclusively to one processor.
A second pass where a processor can search moves that are being searched.

It is a bit odd that the algorithm is presented as a distributed algorithm even though it relies on a shared TT. It does requests to it asynchronously though (i.e. while generating moves), so i guess it is similar to transposition driven scheduling in that aspect.

hgm · Post by **hgm** » Fri Apr 26, 2013 11:03 pm

I would have to check that, as this was done in 2010 and I only worked a few days on it. But it sounds about right. When the first N moves in a node were searched (and the thread had seen their score), and when it returned itself with a score for move M, it would then start probing the hash for move N+1, N+2, ..., M, ... to see if there were results, and start searching the first one that was not busy and did not yet contain a satisfactory result. The idea was that this would be very fast, as all these entries would be in the thread's cache (unless they had actually been changed). If probe N+1 returned a result, N would be incremented. (The result could give a beta cutoff, just like when the thread would have searched it by itself.) The 'second pass' would start when no 'idle' move was found. That would set a 'force' flag in the node, which would cause the thread to enter it even if it was 'busy', and start probing again at N+1.

I don't think threads needed to know what they 'owned'; first one to have seen all the daughter scores would write the hash entry with the result. If more would write it, they would write the same. I did not put any locks on declaring the node 'busy'. In the rare case that two threads would enter an 'idle' node at exactly the same time, then two threads would be cooperating on the same sub-tree although one of them could have been searching another sub-tree. That did not seem a disaster, perhaps only a small loss in efficiency because they split closer to the leaves, and thus more often.

Daniel Shawul · Post by **Daniel Shawul** » Sat Apr 27, 2013 3:53 am

Ok you have clearly thought about an advanced work sharing scheme despite my thought that you simply started a separate search and shared the TT. Many do that btw. I tried this one now by hacking my YBW code so that all threads are attached at the root, and then letting them pick root moves from their own move stack. The effect is the same as a fork() and shmget(). Anyway surprisingly I got a slowdown in search i.e given i didn't make a mistake. It is expected that most of the work will be redundant but don't know why the nps decreased. Later.

Code: Select all

D&#58;\Nebiyu\Release>nebiyu mt 2 go quit
processors &#91;1&#93;
+ Thread 1 started.
processors &#91;2&#93;
&#91;search_time = 5557ms, max_time = 29250ms , moves_left 10, max_nodes 0&#93;
2 -28 0 44  a2a3  g8f6  EBF=6.08
2 -18 0 68  a2a4  g8f6  EBF=7.70
2 0 0 120  b1c3  g8f6  EBF=10.42
3 34 1 263  b1c3  g8f6  g1f3  EBF=6.03
4 0 1 400  b1c3  g8f6  g1f3  b8c6  EBF=4.18
5 16 1 584  b1c3  g8f6  g1f3  b8c6  a2a4  EBF=3.33
6 0 1 1163  b1c3  g8f6  g1f3  b8c6  a2a4  a7a5  EBF=3.03
7 22 2 4727  b1c3  g8f6  g1f3  b8c6  a2a4  e7e5  a1a2  EBF=3.17
8 0 3 10402  b1c3  g8f6  g1f3  b8c6  d2d4  d7d5  c1e3  c8e6  EBF=3.02
9 16 5 29240  b1c3  g8f6  g1f3  b8c6  e2e4  e7e6  f1e2  f8b4  e2d3  EBF=3.00
10 4 8 79269  b1c3  g8f6  g1f3  b8c6  e2e4  e7e5  f1d3  f8d6  c3b5  d6c5  EBF=2.97
11 6 12 129298  b1c3  g8f6  g1f3  b8c6  e2e4  d7d5  f1d3  c8e6  e1e2  c6b4  d3b5  c7c6  EBF=2.80
11 14 22 286605  d2d4  g8f6  g1f3  b8c6  a2a4  a7a5  c1e3  f6d5  b1d2  d5e3  f2e3  EBF=3.02
12 12 33 458207  d2d4  g8f6  g1f3  b8c6  a2a4  h7h5  c1f4  f6d5  f4d2  a7a5  b1c3  c6b4  c3d5  b4d5  EBF=2.86
13 10 44 627515  d2d4  g8f6  g1f3  b8c6  a2a4  d7d5  c1e3  c8e6  b1d2  d8d6  a1a3  f6g4  e3g5  EBF=2.69
14 10 93 1512396  d2d4  g8f6  g1f3  b8c6  a2a4  d7d5  c1f4  a7a5  b1d2  c8f5  a1a3  a8a6  a3e3  d8d7  EBF=2.67
15 16 160 2646999  d2d4  g8f6  g1f3  d7d5  d1d3  b8c6  c1f4  a7a5  a2a3  e7e6  b1d2  f8d6  f4d6  d8d6  a1d1  EBF=2.59
16 12 266 4441048  d2d4  g8f6  g1f3  d7d5  b1c3  b8c6  c1f4  c8f5  a1c1  e7e6  f3h4  f6h5  h4f5  h5f4  d1d2  d8g5  EBF=2.52
nodes = 9130226 <68 qnodes> time = 5568ms nps = 1639767
splits = 0 badsplits = 0
move d2d4
Bye Bye

D&#58;\Nebiyu\Release>nebiyu go quit
processors &#91;1&#93;
&#91;search_time = 5557ms, max_time = 29250ms , moves_left 10, max_nodes 0&#93;
2 -28 0 46  a2a3  g8f6  EBF=6.23
2 -18 0 80  a2a4  b8c6  EBF=8.40
2 0 0 209  b1c3  g8f6  EBF=13.93
3 34 0 361  b1c3  g8f6  g1f3  EBF=6.75
4 0 1 564  b1c3  g8f6  g1f3  b8c6  EBF=4.58
5 16 1 1772  b1c3  g8f6  g1f3  b8c6  a2a4  EBF=4.23
6 0 1 2624  b1c3  g8f6  g1f3  b8c6  a2a4  d7d5  EBF=3.51
7 4 2 6531  b1c3  g8f6  g1f3  d7d5  f3d4  b8c6  c3b5  c6d4  b5d4  EBF=3.33
7 14 2 11316  d2d4  g8f6  c1e3  b8c6  g1f3  f6d5  b1d2  d5e3  f2e3  EBF=3.62
7 16 3 21279  e2e4  e7e5  a2a4  b8c6  b1c3  g8e7  g1e2  EBF=3.98
8 18 4 33255  e2e4  e7e5  b1c3  g8e7  g1f3  b8c6  f1d3  d7d5  EBF=3.52
9 16 5 46414  e2e4  e7e5  b1c3  b8c6  f1d3  g8f6  a2a4  c6b4  d3c4  EBF=3.16
10 14 7 67696  e2e4  e7e5  b1c3  b8c6  f1d3  g8f6  a2a4  c6b4  g1e2  b4d3  c2d3  EBF=2.92
11 14 11 139057  e2e4  e7e5  b1c3  b8c6  f1d3  g8f6  a2a4  c6b4  g1e2  b4d3  c2d3  EBF=2.82
12 16 22 319256  e2e4  e7e5  b1c3  b8c6  g1f3  g8f6  f1b5  f8b4  d1e2  b4d6  c3d5  f6d5  e4d5  EBF=2.77
13 16 35 564827  e2e4  e7e5  b1c3  b8c6  g1f3  g8f6  f1d3  f8d6  c3b5  d6c5  d1e2  a7a6  b5c3  EBF=2.67
14 16 53 886612  e2e4  e7e5  b1c3  b8c6  g1f3  g8f6  f1d3  f8d6  e1e2  a7a6  a2a4  c6b4  h1e1  b4d3  c2d3  EBF=2.57
14 18 72 1259050  g1f3  g8f6  d2d4  d7d5  b1c3  c8f5  c1f4  g7g6  e2e3  b8d7  f1d3  a7a5  d1e2  a8a7  EBF=2.64
15 22 133 2427997  g1f3  g8f6  d2d4  d7d5  b1c3  c8f5  c1f4  g7g6  e2e3  b8c6  f1d3  f6h5  f4g5  f7f6  g5h4  EBF=2.58
16 18 217 4034789  g1f3  g8f6  d2d4  d7d5  b1c3  c8f5  c1f4  g7g6  e2e3  b8c6  f1d3  f6h5  f4g5  f7f6  d3f5  g6f5  EBF=2.51
17 14 413 7846457  g1f3  g8f6  d2d4  d7d5  b1c3  c8f5  c1f4  b8c6  a1c1  f6h5  f4e3  d8d7  a2a4  a7a5  h2h4  a8a6  e1d2  EBF=2.47
nodes = 10564187 <69 qnodes> time = 5574ms nps = 1895261
splits = 0 badsplits = 0
move g1f3
Bye Bye

Daniel Shawul · Post by **Daniel Shawul** » Sat Apr 27, 2013 4:30 am

I forgot to update nps counts from the second thread. So now the nps has almost doubled as shown below. Note there are no splits / bad splits so it is working as intended. The output from the 2nd thread is suppressed but it can print pv too so everything is working as expected. Next I will check if this gives any speed up , i.e. without a work sharing scheme in TT, by running it on 30 test positions. The threads synchronize at the end of each iteration but there is really no need for it as they can do their own iterative deepening. But that requires me to hack more so i will leave that one for now, plus synchronization penalty should be minimal since threads will pass through searched moves quickly by looking at TT as you pointed out.

Code: Select all

D&#58;\Nebiyu\Release>nebiyu mt 2 go quit
processors &#91;1&#93;
+ Thread 1 started.
processors &#91;2&#93;
&#91;search_time = 5557ms, max_time = 29250ms , moves_left 10, max_nodes 0&#93;
2 -28 0 46  a2a3  g8f6  EBF=15.22
3 34 0 555  b1c3  g8f6  g1f3  EBF=7.85
4 0 1 688  b1c3  g8f6  g1f3  b8c6  EBF=6.20
5 16 1 2185  b1c3  g8f6  g1f3  b8c6  h2h4  EBF=4.42
6 0 1 2912  b1c3  g8f6  g1f3  b8c6  h2h4  a7a5  EBF=4.02
7 12 2 12684  b1c3  h7h5  b2b4  g8f6  g1f3  b8c6  a1b1  EBF=4.02
8 6 3 25330  b1c3  h7h5  b2b4  g8f6  a1b1  g7g6  g1f3  b8c6  EBF=3.40
8 26 4 41847  e2e4  a7a5  f1d3  g8f6  g1e2  h7h5  b1c3  b8c6  EBF=3.89
9 12 6 86063  e2e4  d7d5  e4d5  d8d5  b1c3  d5e5  g1e2  b8d7  h2h4  EBF=3.40
10 2 10 156033  e2e4  d7d5  b1c3  g8f6  e4d5  f6d5  f1b5  c8d7  g1e2  a7a5  EBF=3.18
10 4 13 210201  b1c3  b8c6  d2d4  d7d5  h2h4  g8f6  d1d3  c6b4  d3b5  b4c6  EBF=3.28
11 12 18 405640  b1c3  d7d5  d2d4  c8f5  g1f3  g8f6  c1e3  f6g4  d1d2  g4e3  f2e3  EBF=3.12
12 8 24 594808  b1c3  d7d5  d2d4  c8f5  g1f3  g8f6  c1f4  a7a5  c3b5  b8a6  d1d2  e8d7  d2a5  f5c2  EBF=2.92
12 10 35 783188  e2e4  d7d5  e4d5  d8d5  b1c3  d5e6  g1e2  g8f6  d2d4  b8d7  c1e3  a7a5  EBF=3.00
13 10 47 1295102  e2e4  b8c6  g1f3  g8f6  b1c3  d7d5  e4d5  f6d5  c3d5  d8d5  c2c4  d5e4  f1e2  EBF=2.86
14 6 131 3415778  e2e4  e7e5  g1f3  b8c6  f1d3  f8c5  b1c3  g8f6  e1e2  e8e7  h2h4  c6d4  f3d4  c5d4  EBF=2.84
14 16 148 3723370  b1c3  d7d5  d2d4  b8c6  g1f3  g8f6  c1f4  f6h5  f4e3  c8f5  d1d2  c6b4  a1c1  d8d6  EBF=2.86
15 18 193 5983855  b1c3  d7d5  d2d4  b8c6  g1f3  g8f6  c1f4  f6h5  f4e3  c8f5  d1d2  h5f6  a2a3  f6e4  c3e4  f5e4  EBF=2.75
16 16 315 9215509  b1c3  d7d5  d2d4  b8c6  c1f4  c8e6  g1f3  g8f6  h2h4  a7a5  a1c1  d8d7  d1d3  e8c8  f3e5  c6e5  f4e5  EBF=2.64
17 16 455 14500878  b1c3  d7d5  d2d4  b8c6  c1f4  c8e6  g1f3  g8f6  h2h4  a7a5  a1c1  d8d7  d1d3  e8c8  f3e5  c6e5  f4e5  EBF=2.56
nodes = 19474619 <68 qnodes> time = 5531ms nps = 3520994
splits = 0 badsplits = 0
move b1c3
Bye Bye

Daniel Shawul · Post by **Daniel Shawul** » Sat Apr 27, 2013 6:05 pm

I tested the speed up and there is none. It is all wasted search overhead, except for some positions where speedup looks good for other reasons.
Edit I turned off fail highs at the root which was causing problems before and now has all the 30 positions. The result is a slowdown on average.

Average overhead, speedup and nps scaling resp:

Code: Select all

Avg   2.134033652	0.949268578	1.738188119

Per position:

Code: Select all

Pos	Overhead	Speedu	NpsScaling
1	1.124	1.569	1.763
2	1.950	0.805	1.570
3	1.242	1.514	1.882
4	1.475	1.142	1.684
5	3.477	0.517	1.797
6	2.369	0.725	1.717
7	1.910	0.961	1.836
8	2.672	0.657	1.755
9	0.629	2.829	1.781
10	2.506	0.716	1.795
11	1.591	1.103	1.756
12	2.701	0.669	1.807
13	2.579	0.711	1.835
14	1.914	0.889	1.702
15	2.175	0.772	1.679
16	2.030	0.869	1.763
17	1.779	0.898	1.599
18	2.504	0.647	1.620
19	1.887	0.825	1.553
20	1.588	1.034	1.641
21	1.765	0.949	1.677
22	3.389	0.543	1.838
23	1.994	0.903	1.801
24	1.859	0.953	1.771
25	1.890	0.890	1.681
26	2.310	0.723	1.670
27	1.177	1.548	1.821
28	1.940	0.882	1.712
29	1.903	0.893	1.701
30	5.691	0.341	1.938

Implementation of multithreaded search in Jazz

Re: Implementation of multithreaded search in Jazz

Re: Implementation of multithreaded search in Jazz

Re: Implementation of multithreaded search in Jazz

Re: Implementation of multithreaded search in Jazz

Re: Implementation of multithreaded search in Jazz

Re: Implementation of multithreaded search in Jazz