Implementation of multithreaded search in Jazz
Moderators: hgm, Rebel, chrisw
-
- Posts: 27828
- Joined: Fri Mar 10, 2006 10:06 am
- Location: Amsterdam
- Full name: H G Muller
Re: Implementation of multithreaded search in Jazz
The way I implemented it had the threads just walking the currently active tree as outlined in the hash table until they arrived in a node where they could grab a move of their own (i.e. a node marked as split point). That way they never had to pass each other moves or positions at all.
-
- Posts: 4185
- Joined: Tue Mar 14, 2006 11:34 am
- Location: Ethiopia
Re: Implementation of multithreaded search in Jazz
By that do you mean moves are exclusively owned by one processor? This is one way to reduce redundancy that the ABDADA algorithm uses. I had a look into it now since i am interested in implementing some kind of TT driven parallelization. The algorithm tries to avoid redundancy using a two pass algorithm. An exclusivity flag is set and cleared in the TT, and also number of processors incremented/decremented when starting/finishing evaluation of a position. The three main steps are:hgm wrote:The way I implemented it had the threads just walking the currently active tree as outlined in the hash table until they arrived in a node where they could grab a move of their own (i.e. a node marked as split point). That way they never had to pass each other moves or positions at all.
- First move is searched by all processors (YBW)
- A first pass over the rest of the moves where moves are assigned exclusively to one processor.
- A second pass where a processor can search moves that are being searched.
-
- Posts: 27828
- Joined: Fri Mar 10, 2006 10:06 am
- Location: Amsterdam
- Full name: H G Muller
Re: Implementation of multithreaded search in Jazz
I would have to check that, as this was done in 2010 and I only worked a few days on it. But it sounds about right. When the first N moves in a node were searched (and the thread had seen their score), and when it returned itself with a score for move M, it would then start probing the hash for move N+1, N+2, ..., M, ... to see if there were results, and start searching the first one that was not busy and did not yet contain a satisfactory result. The idea was that this would be very fast, as all these entries would be in the thread's cache (unless they had actually been changed). If probe N+1 returned a result, N would be incremented. (The result could give a beta cutoff, just like when the thread would have searched it by itself.) The 'second pass' would start when no 'idle' move was found. That would set a 'force' flag in the node, which would cause the thread to enter it even if it was 'busy', and start probing again at N+1.
I don't think threads needed to know what they 'owned'; first one to have seen all the daughter scores would write the hash entry with the result. If more would write it, they would write the same. I did not put any locks on declaring the node 'busy'. In the rare case that two threads would enter an 'idle' node at exactly the same time, then two threads would be cooperating on the same sub-tree although one of them could have been searching another sub-tree. That did not seem a disaster, perhaps only a small loss in efficiency because they split closer to the leaves, and thus more often.
I don't think threads needed to know what they 'owned'; first one to have seen all the daughter scores would write the hash entry with the result. If more would write it, they would write the same. I did not put any locks on declaring the node 'busy'. In the rare case that two threads would enter an 'idle' node at exactly the same time, then two threads would be cooperating on the same sub-tree although one of them could have been searching another sub-tree. That did not seem a disaster, perhaps only a small loss in efficiency because they split closer to the leaves, and thus more often.
-
- Posts: 4185
- Joined: Tue Mar 14, 2006 11:34 am
- Location: Ethiopia
Re: Implementation of multithreaded search in Jazz
Ok you have clearly thought about an advanced work sharing scheme despite my thought that you simply started a separate search and shared the TT. Many do that btw. I tried this one now by hacking my YBW code so that all threads are attached at the root, and then letting them pick root moves from their own move stack. The effect is the same as a fork() and shmget(). Anyway surprisingly I got a slowdown in search i.e given i didn't make a mistake. It is expected that most of the work will be redundant but don't know why the nps decreased. Later.
Code: Select all
D:\Nebiyu\Release>nebiyu mt 2 go quit
processors [1]
+ Thread 1 started.
processors [2]
[search_time = 5557ms, max_time = 29250ms , moves_left 10, max_nodes 0]
2 -28 0 44 a2a3 g8f6 EBF=6.08
2 -18 0 68 a2a4 g8f6 EBF=7.70
2 0 0 120 b1c3 g8f6 EBF=10.42
3 34 1 263 b1c3 g8f6 g1f3 EBF=6.03
4 0 1 400 b1c3 g8f6 g1f3 b8c6 EBF=4.18
5 16 1 584 b1c3 g8f6 g1f3 b8c6 a2a4 EBF=3.33
6 0 1 1163 b1c3 g8f6 g1f3 b8c6 a2a4 a7a5 EBF=3.03
7 22 2 4727 b1c3 g8f6 g1f3 b8c6 a2a4 e7e5 a1a2 EBF=3.17
8 0 3 10402 b1c3 g8f6 g1f3 b8c6 d2d4 d7d5 c1e3 c8e6 EBF=3.02
9 16 5 29240 b1c3 g8f6 g1f3 b8c6 e2e4 e7e6 f1e2 f8b4 e2d3 EBF=3.00
10 4 8 79269 b1c3 g8f6 g1f3 b8c6 e2e4 e7e5 f1d3 f8d6 c3b5 d6c5 EBF=2.97
11 6 12 129298 b1c3 g8f6 g1f3 b8c6 e2e4 d7d5 f1d3 c8e6 e1e2 c6b4 d3b5 c7c6 EBF=2.80
11 14 22 286605 d2d4 g8f6 g1f3 b8c6 a2a4 a7a5 c1e3 f6d5 b1d2 d5e3 f2e3 EBF=3.02
12 12 33 458207 d2d4 g8f6 g1f3 b8c6 a2a4 h7h5 c1f4 f6d5 f4d2 a7a5 b1c3 c6b4 c3d5 b4d5 EBF=2.86
13 10 44 627515 d2d4 g8f6 g1f3 b8c6 a2a4 d7d5 c1e3 c8e6 b1d2 d8d6 a1a3 f6g4 e3g5 EBF=2.69
14 10 93 1512396 d2d4 g8f6 g1f3 b8c6 a2a4 d7d5 c1f4 a7a5 b1d2 c8f5 a1a3 a8a6 a3e3 d8d7 EBF=2.67
15 16 160 2646999 d2d4 g8f6 g1f3 d7d5 d1d3 b8c6 c1f4 a7a5 a2a3 e7e6 b1d2 f8d6 f4d6 d8d6 a1d1 EBF=2.59
16 12 266 4441048 d2d4 g8f6 g1f3 d7d5 b1c3 b8c6 c1f4 c8f5 a1c1 e7e6 f3h4 f6h5 h4f5 h5f4 d1d2 d8g5 EBF=2.52
nodes = 9130226 <68 qnodes> time = 5568ms nps = 1639767
splits = 0 badsplits = 0
move d2d4
Bye Bye
D:\Nebiyu\Release>nebiyu go quit
processors [1]
[search_time = 5557ms, max_time = 29250ms , moves_left 10, max_nodes 0]
2 -28 0 46 a2a3 g8f6 EBF=6.23
2 -18 0 80 a2a4 b8c6 EBF=8.40
2 0 0 209 b1c3 g8f6 EBF=13.93
3 34 0 361 b1c3 g8f6 g1f3 EBF=6.75
4 0 1 564 b1c3 g8f6 g1f3 b8c6 EBF=4.58
5 16 1 1772 b1c3 g8f6 g1f3 b8c6 a2a4 EBF=4.23
6 0 1 2624 b1c3 g8f6 g1f3 b8c6 a2a4 d7d5 EBF=3.51
7 4 2 6531 b1c3 g8f6 g1f3 d7d5 f3d4 b8c6 c3b5 c6d4 b5d4 EBF=3.33
7 14 2 11316 d2d4 g8f6 c1e3 b8c6 g1f3 f6d5 b1d2 d5e3 f2e3 EBF=3.62
7 16 3 21279 e2e4 e7e5 a2a4 b8c6 b1c3 g8e7 g1e2 EBF=3.98
8 18 4 33255 e2e4 e7e5 b1c3 g8e7 g1f3 b8c6 f1d3 d7d5 EBF=3.52
9 16 5 46414 e2e4 e7e5 b1c3 b8c6 f1d3 g8f6 a2a4 c6b4 d3c4 EBF=3.16
10 14 7 67696 e2e4 e7e5 b1c3 b8c6 f1d3 g8f6 a2a4 c6b4 g1e2 b4d3 c2d3 EBF=2.92
11 14 11 139057 e2e4 e7e5 b1c3 b8c6 f1d3 g8f6 a2a4 c6b4 g1e2 b4d3 c2d3 EBF=2.82
12 16 22 319256 e2e4 e7e5 b1c3 b8c6 g1f3 g8f6 f1b5 f8b4 d1e2 b4d6 c3d5 f6d5 e4d5 EBF=2.77
13 16 35 564827 e2e4 e7e5 b1c3 b8c6 g1f3 g8f6 f1d3 f8d6 c3b5 d6c5 d1e2 a7a6 b5c3 EBF=2.67
14 16 53 886612 e2e4 e7e5 b1c3 b8c6 g1f3 g8f6 f1d3 f8d6 e1e2 a7a6 a2a4 c6b4 h1e1 b4d3 c2d3 EBF=2.57
14 18 72 1259050 g1f3 g8f6 d2d4 d7d5 b1c3 c8f5 c1f4 g7g6 e2e3 b8d7 f1d3 a7a5 d1e2 a8a7 EBF=2.64
15 22 133 2427997 g1f3 g8f6 d2d4 d7d5 b1c3 c8f5 c1f4 g7g6 e2e3 b8c6 f1d3 f6h5 f4g5 f7f6 g5h4 EBF=2.58
16 18 217 4034789 g1f3 g8f6 d2d4 d7d5 b1c3 c8f5 c1f4 g7g6 e2e3 b8c6 f1d3 f6h5 f4g5 f7f6 d3f5 g6f5 EBF=2.51
17 14 413 7846457 g1f3 g8f6 d2d4 d7d5 b1c3 c8f5 c1f4 b8c6 a1c1 f6h5 f4e3 d8d7 a2a4 a7a5 h2h4 a8a6 e1d2 EBF=2.47
nodes = 10564187 <69 qnodes> time = 5574ms nps = 1895261
splits = 0 badsplits = 0
move g1f3
Bye Bye
-
- Posts: 4185
- Joined: Tue Mar 14, 2006 11:34 am
- Location: Ethiopia
Re: Implementation of multithreaded search in Jazz
I forgot to update nps counts from the second thread. So now the nps has almost doubled as shown below. Note there are no splits / bad splits so it is working as intended. The output from the 2nd thread is suppressed but it can print pv too so everything is working as expected. Next I will check if this gives any speed up , i.e. without a work sharing scheme in TT, by running it on 30 test positions. The threads synchronize at the end of each iteration but there is really no need for it as they can do their own iterative deepening. But that requires me to hack more so i will leave that one for now, plus synchronization penalty should be minimal since threads will pass through searched moves quickly by looking at TT as you pointed out.
Code: Select all
D:\Nebiyu\Release>nebiyu mt 2 go quit
processors [1]
+ Thread 1 started.
processors [2]
[search_time = 5557ms, max_time = 29250ms , moves_left 10, max_nodes 0]
2 -28 0 46 a2a3 g8f6 EBF=15.22
3 34 0 555 b1c3 g8f6 g1f3 EBF=7.85
4 0 1 688 b1c3 g8f6 g1f3 b8c6 EBF=6.20
5 16 1 2185 b1c3 g8f6 g1f3 b8c6 h2h4 EBF=4.42
6 0 1 2912 b1c3 g8f6 g1f3 b8c6 h2h4 a7a5 EBF=4.02
7 12 2 12684 b1c3 h7h5 b2b4 g8f6 g1f3 b8c6 a1b1 EBF=4.02
8 6 3 25330 b1c3 h7h5 b2b4 g8f6 a1b1 g7g6 g1f3 b8c6 EBF=3.40
8 26 4 41847 e2e4 a7a5 f1d3 g8f6 g1e2 h7h5 b1c3 b8c6 EBF=3.89
9 12 6 86063 e2e4 d7d5 e4d5 d8d5 b1c3 d5e5 g1e2 b8d7 h2h4 EBF=3.40
10 2 10 156033 e2e4 d7d5 b1c3 g8f6 e4d5 f6d5 f1b5 c8d7 g1e2 a7a5 EBF=3.18
10 4 13 210201 b1c3 b8c6 d2d4 d7d5 h2h4 g8f6 d1d3 c6b4 d3b5 b4c6 EBF=3.28
11 12 18 405640 b1c3 d7d5 d2d4 c8f5 g1f3 g8f6 c1e3 f6g4 d1d2 g4e3 f2e3 EBF=3.12
12 8 24 594808 b1c3 d7d5 d2d4 c8f5 g1f3 g8f6 c1f4 a7a5 c3b5 b8a6 d1d2 e8d7 d2a5 f5c2 EBF=2.92
12 10 35 783188 e2e4 d7d5 e4d5 d8d5 b1c3 d5e6 g1e2 g8f6 d2d4 b8d7 c1e3 a7a5 EBF=3.00
13 10 47 1295102 e2e4 b8c6 g1f3 g8f6 b1c3 d7d5 e4d5 f6d5 c3d5 d8d5 c2c4 d5e4 f1e2 EBF=2.86
14 6 131 3415778 e2e4 e7e5 g1f3 b8c6 f1d3 f8c5 b1c3 g8f6 e1e2 e8e7 h2h4 c6d4 f3d4 c5d4 EBF=2.84
14 16 148 3723370 b1c3 d7d5 d2d4 b8c6 g1f3 g8f6 c1f4 f6h5 f4e3 c8f5 d1d2 c6b4 a1c1 d8d6 EBF=2.86
15 18 193 5983855 b1c3 d7d5 d2d4 b8c6 g1f3 g8f6 c1f4 f6h5 f4e3 c8f5 d1d2 h5f6 a2a3 f6e4 c3e4 f5e4 EBF=2.75
16 16 315 9215509 b1c3 d7d5 d2d4 b8c6 c1f4 c8e6 g1f3 g8f6 h2h4 a7a5 a1c1 d8d7 d1d3 e8c8 f3e5 c6e5 f4e5 EBF=2.64
17 16 455 14500878 b1c3 d7d5 d2d4 b8c6 c1f4 c8e6 g1f3 g8f6 h2h4 a7a5 a1c1 d8d7 d1d3 e8c8 f3e5 c6e5 f4e5 EBF=2.56
nodes = 19474619 <68 qnodes> time = 5531ms nps = 3520994
splits = 0 badsplits = 0
move b1c3
Bye Bye
-
- Posts: 4185
- Joined: Tue Mar 14, 2006 11:34 am
- Location: Ethiopia
Re: Implementation of multithreaded search in Jazz
I tested the speed up and there is none. It is all wasted search overhead, except for some positions where speedup looks good for other reasons.
Edit I turned off fail highs at the root which was causing problems before and now has all the 30 positions. The result is a slowdown on average.
Average overhead, speedup and nps scaling resp:
Per position:
Edit I turned off fail highs at the root which was causing problems before and now has all the 30 positions. The result is a slowdown on average.
Average overhead, speedup and nps scaling resp:
Code: Select all
Avg 2.134033652 0.949268578 1.738188119
Code: Select all
Pos Overhead Speedu NpsScaling
1 1.124 1.569 1.763
2 1.950 0.805 1.570
3 1.242 1.514 1.882
4 1.475 1.142 1.684
5 3.477 0.517 1.797
6 2.369 0.725 1.717
7 1.910 0.961 1.836
8 2.672 0.657 1.755
9 0.629 2.829 1.781
10 2.506 0.716 1.795
11 1.591 1.103 1.756
12 2.701 0.669 1.807
13 2.579 0.711 1.835
14 1.914 0.889 1.702
15 2.175 0.772 1.679
16 2.030 0.869 1.763
17 1.779 0.898 1.599
18 2.504 0.647 1.620
19 1.887 0.825 1.553
20 1.588 1.034 1.641
21 1.765 0.949 1.677
22 3.389 0.543 1.838
23 1.994 0.903 1.801
24 1.859 0.953 1.771
25 1.890 0.890 1.681
26 2.310 0.723 1.670
27 1.177 1.548 1.821
28 1.940 0.882 1.712
29 1.903 0.893 1.701
30 5.691 0.341 1.938