ABDADA speedup results

Daniel Shawul · Post by **Daniel Shawul** » Wed May 01, 2013 5:54 pm

I implemented the transposition table based parallelization scheme ABDADA. I got a speedup of 1.64X from two processors which is encouraging. There are a few positions where the speedup was magnified because the parallel search searched few nodes but even taking those out while keeping positions it performed badly (less than 1x speedup) gives a minimum of 1.4x.
Average result:

Code: Select all

1.469	1.639	1.890

The first is the search overhead, followed by speed up and nps scaling. For comparison here are the numbers when just running two threads without coordinating parallel search using TT like ABDADA does.

Code: Select all

1.963	1.189	1.914

The speedup from a YBW splitting algorithm is 2 from two processors and infact I get super-linear speed up. In any case ABDADA seems to be a good way to get parallel search working quickly.
The modifications I needed to make was
a) Surprizingly I didn't need to make modifications of my TT since I had a flag that called 'CRAP' besides LOWER,UPPER,EXACT which I used to indicate that another processor is working on the position.
b) With incremental move generator i had a bit of a problem with doing a two pass search but i am sure for most this will be very easy.

------------
Full result:

Code: Select all

Pos	Overhead	Time	Nps
1	1.206	1.293	1.560
2	1.331	1.463	1.948
3	1.363	1.428	1.949
4	0.581	3.492	2.027
5	0.528	3.433	1.812
6	1.263	1.519	1.919
7	1.159	1.649	1.911
8	1.057	1.809	1.912
9	1.913	0.804	1.539
10	1.414	1.247	1.764
11	1.364	1.406	1.918
12	3.225	0.615	1.982
13	1.305	1.511	1.973
14	5.601	0.376	2.103
15	0.868	2.189	1.901
16	2.337	0.759	1.773
17	1.451	1.288	1.871
18	1.880	0.982	1.843
19	1.500	1.319	1.974
20	1.544	1.284	1.983
21	1.306	1.615	2.110
22	1.101	1.694	1.865
23	0.746	2.715	2.027
24	1.176	1.712	2.011
25	1.160	1.520	1.763
26	1.818	1.010	1.837
27	0.568	3.227	1.832
28	0.625	2.991	1.869
29	1.521	1.266	1.926
30	1.159	1.572	1.822

Michel · Post by **Michel** » Wed May 01, 2013 7:20 pm

Thanks for sharing this.

I am trying to understand the ABDADA algorithm. Do I understand correctly that processors searching a node do not communicate a beta cutoff to each other?

Neither do they seem to communicate the value of alpha.

This would seem quite strange.

Daniel Shawul · Post by **Daniel Shawul** » Wed May 01, 2013 7:54 pm

Michel wrote:Thanks for sharing this.

I am trying to understand the ABDADA algorithm. Do I understand correctly that processors searching a node do not communicate a beta cutoff to each other?

Neither do they seem to communicate the value of alpha.

This would seem quite strange.

Yes you are right but i don't think that hurts performance a lot. All the moves start working on the first move right away so they will have the first's moves score for a pretty good bound, and also most fail highs occur there. With negascout YBW , I only take care of fail highs since there are no alpha updates. I am trying to find out which algorithm is the best for the amount of effort put on to it, and so far ABDADA seems to be the best. The implementation may be much easier with a separate process and IPC shared memory but i used threads since I had a YBW framework. The other alternatives are root splitting, un-synchronized iterative deepening, shared TT and ABDADA.

Michel · Post by **Michel** » Wed May 01, 2013 7:55 pm

It seems to me that if the threads shared the tt record (through a pointer) rather than retrieving and storing it, they could update and consult the best value in real time (all properly synchronized of course). Wouldn't that be better?

Daniel Shawul · Post by **Daniel Shawul** » Wed May 01, 2013 8:15 pm

Michel wrote:It seems to me that if the threads shared the tt record (through a pointer) rather than retrieving and storing it, they could update and consult the best value in real time (all properly synchronized of course). Wouldn't that be better?

I guess that would be a TT entry serving as a split point. Then there would come locking and other complicated implementation which is probably not worth it. The other nice thing about abdada is it is fault tolerant. The N-1 processors could fail and play would still go on. If you have to do it in Gnuchess, try the separate process approach as that is simple and will serve as an example implementation. I am not sure what toga does but it uses threads.

Michel · Post by **Michel** » Wed May 01, 2013 8:21 pm

I guess that would be a TT entry serving as a split point.

Probably. But it would still be vastly simpler than the usual YBW algorithm since no copying of state seems to be involved and no separate split point stack has to be maintained.

You are guessing right. I am looking for the cheapest possible way to implement MP in GNU Chess. The time I can spend on it is unfortunately infinitesimal.

EDIT. Perhaps using the TT entry itself as a split point would create too much cache pressure since a thread has no a priori way of knowing that there are no other threads searching the current node. So it would have to consult the tt between every make/unmake which is probably expensive.

jdart · Post by **jdart** » Thu May 02, 2013 2:52 am

I tried this ages ago, and the problem I had was that the hash table filled up too quickly with "CRAP" as you rightly put it, and that impacted overall performance. With much larger hash sizes possible today maybe that is less of an issue. Still, there is also the point that the hashtable is the main shared data structure across threads and adding more memory traffic to that is going to hurt multi-threaded performance, especially as >2 threads are involved, because x86 processors will serialize access if cores hit the same cache lines.

--Jon

Daniel Shawul · Post by **Daniel Shawul** » Thu May 02, 2013 1:20 pm

jdart wrote:I tried this ages ago, and the problem I had was that the hash table filled up too quickly with "CRAP" as you rightly put it, and that impacted overall performance. With much larger hash sizes possible today maybe that is less of an issue. Still, there is also the point that the hashtable is the main shared data structure across threads and adding more memory traffic to that is going to hurt multi-threaded performance, especially as >2 threads are involved, because x86 processors will serialize access if cores hit the same cache lines.

--Jon

Yes it does a maximum of twice hash probes a normal search does. Luckily on SMP systems most of it is probably cached. In any case after a thread puts in CRAP in TT entry i.e. to prevent other threads from searching same move ,it will replace it with proper results after it is done. So there is not much hurt done. Also I use a depth>=4 for ABDADA as well thinking wrongly that it might be costly to do it at the leaves like YBW splitting is. This is unnecessary but i see that it can help with filling up the TT entries. The number of candidate nodes will be much less.

Another improvement i do that was not suggested in the paper is on the second pass I check only the moves that were flagged as CRAP. To do that one doesn't have to make a move and ask the TT for each move. I store a flag in the 'score stack' of each move to indicate it was not searched, so it picks only those who are not searched making it same as the sequential search costwise except for the additional cost of having a loop which is insignificant.Edit This latter statement is incorrect because we did the hash probe in the first pass any way. The cost of probes is somewhere in between 1x and 2x. In practice i don't see this slowing down the search significantly maybe be because of bigger TT size as you pointed out.

Michel · Post by **Michel** » Thu May 02, 2013 6:46 pm

I find it really difficult to understand this algorithm. So any explanations are appreciated.

It seems when the threads are finished searching a node they all write to the ttentry for that node. So they are overwriting each other's results. Is this correct?

Is this based on the idea that in normal circumstances all threads will calculate the same value for the node?

Daniel Shawul · Post by **Daniel Shawul** » Thu May 02, 2013 7:38 pm

Michel wrote:I find it really difficult to understand this algorithm. So any explanations are appreciated.

It seems when the threads are finished searching a node they all write to the ttentry for that node. So they are overwriting each other's results. Is this correct?

Is this based on the idea that in normal circumstances all threads will calculate the same value for the node?

Indeed it is kind of confusing the way it is presented in the paper as a distributed algorithm. But it is a neat algorithm that incorporates YBW in it without splitting and other complex stuff. Here is how I understood it.

a) You start two or more threads (processes are better) from the root to do independent searches. With this simple scheme most of the work will be redundant. Even if one thread is started later than the other, it will quickly catch up to searching same part of the tree because of hash cutoffs from result of the first thread. This is important because we will try to exploit it by devising a work sharing scheme as in (b). You can see there will be some speed up even without any enhancements, and indeed i get a speed up of about 1.1x with it.

b) We know the threads spend their time more or less in the same part of the tree so we plan to divide work by a two pass scheme. We can arbitrarily let one thread do the even moves and the other to do the odd numbered moves. But we need a second pass to complete the evaluation of the position with the roles reversed. But that is not how it is actually done in ABDADA. It uses an 'exclusiveP' flag to indicate moves to be searched exclusively. Threads can pick moves as they get them (i.e not even/odd manner) but the effect is the same. To get the YBW effect a two pass search is made as follows:

The first move is always searched by both threads non-exclusively. So both threads will have a proper bound to begin with.
On the first pass the threads search moves exclusively. If a thread picks a move, it will store a flag CRAP in TT entry. The paper increments and decrement number of processors (nprocs) but i didn't need that. Anyway when the other thread sees this flag, it skips searching that move and picks another which is how the work is shared.
On the second pass, the threads search the moves they skipped in the first pass to complete their solution. Hopefully the other thread has searched the skipped moves so this pass will be pretty fast. If not it will search them and complete by itself.

So exclusiveP=false for the first move and the second pass, while it is set to true on the first pass where moves are searched exclusively. There is a pseudo code in chesswiki by Gerd that is easier to read than the one in the paper.

ABDADA speedup results

ABDADA speedup results

Re: ABDADA speedup results

Re: ABDADA speedup results

Re: ABDADA speedup results

Re: ABDADA speedup results

Re: ABDADA speedup results

Re: ABDADA speedup results

Re: ABDADA speedup results

Re: ABDADA speedup results

Re: ABDADA speedup results