Laskos wrote:bob wrote:I have not finished this, but I am making progress. I am running the new test with 64gb of hash, since that gave the best 1-cpu performance at these tree sizes. It has hurt NPS and speedup just a bit. Linux seamlessly supports 2mb large pages with no program changes needed. Using 1gb pages is not as user-friendly, since for the time being you have to reserve them at boot. So these results are with 2mb large pages in use, which turns the 4-probe virtual-to-physical mapping into a 3-probe walk, removing one level from the usual PGD->PUD->PMD->PTE hierarchy.
Results (I will make the spreadsheets available when everything is done): I lack one more 4-cpu run, and then four 2-cpu runs, which will take a while (probably 10 hours each or so, since the 1-cpu run takes around 18 hours to complete).
4cpus
-------
mean speedups (vs 1 cpu) = 3.83, 3.32, 4.05
geometric mean speedups = 3.76, 3.25, 3.80
8cpus
-------
mean = 6.49, 6.42, 6.81, 7.05 (mean of mean = 6.69)
gmean = 6.28, 6.17, 6.35, 6.57 (gmean of gmean = 6.34)
16cpus
--------
mean = 12.09, 11.81, 12.29, 12.09 (mean of mean = 12.07)
gmean = 11.70, 10.92, 11.74, 11.70 (gmean of gmean = 11.51)
20cpus
--------
mean = 12.95, 11.44, 14.21, 13.23 (mean of mean = 12.96)
gmean = 12.24, 11.62, 13.64, 12.39 (gmean of gmean = 12.45)
I'll skip all the nps and nodes stuff, those will be in the spreadsheets I will put on my web page once everything is done.
If we stick with the geometric mean, we get
4p=~3.5, 8p=6.3, 16p=11.5 and 20p=12.45
My very old linear approximation predicts
4p=3.1, 8p=5.9, 16p=11.5, 20p=14.3
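For what it's worth, the four predicted values are consistent with a simple linear model where each added processor contributes about 70% of a cpu's worth of useful work. A throwaway sketch (the 0.7 slope is inferred from the numbers above, not quoted from any source):

```c
/* Linear speedup approximation: each added cpu contributes ~70% of a
 * cpu's worth of useful work, the rest being parallel overhead.
 * The 0.7 slope is inferred from the predictions in the post. */
static double predict(int cpus)
{
    return 1.0 + 0.7 * (cpus - 1);
}
```

predict(4), predict(8), predict(16), predict(20) reproduce 3.1, 5.9, 11.5, 14.3 exactly.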
The 16p and 20p numbers are down a bit. But the NPS scaling is also down, thanks to the huge memory size used for hash: for 20 processors, the NPS scaling dropped from 16.5x to 14.6x. I am likely going to re-run this one more time using real 1gb huge pages. It only requires changing my memory allocation procedure to use mmap() rather than malloc(), which is easy enough. That should pretty well kill the TLB issues for the hash table, and should let the rest of the data migrate into 2mb large pages as well.
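The malloc()-to-mmap() change might look something like this sketch, assuming a Linux kernel with 1gb hugetlb pages reserved at boot (e.g. hugepagesz=1G hugepages=64 on the kernel command line); this is illustrative, not Crafty's actual allocation code:

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>

/* Older headers may lack these; values match current Linux. */
#ifndef MAP_HUGE_SHIFT
#define MAP_HUGE_SHIFT 26
#endif
#ifndef MAP_HUGE_1GB
#define MAP_HUGE_1GB (30 << MAP_HUGE_SHIFT)
#endif

/* Try explicitly reserved 1gb huge pages first; fall back to normal
 * pages, where the kernel's transparent 2mb huge pages can still
 * take over with no further effort. Returns NULL on failure. */
static void *alloc_hash(size_t bytes)
{
    void *p = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB | MAP_HUGE_1GB,
                   -1, 0);
    if (p == MAP_FAILED)
        p = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}
```

The fallback matters: if the boot-time reservation is missing, the MAP_HUGETLB request fails outright rather than silently using small pages.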
More on that once the current test finishes. I have been looking carefully, but it appears that there is no current kernel support for transparent 1gb huge pages. I'm going to look through some of the development trees. Right now all I see is code to handle PMD splits, but nothing at the PUD level, which would be the next step required to handle 1gb pages transparently, which is the way this ought to work eventually.
At least the transparent 2mb pages work as expected with zero programming changes required anywhere.
Very high speedups, higher than I would expect. What is the 2-cpu speedup? In the first run, the 1->2 speedup was lower than I expected, and 2->4 was higher, even superlinear.
These take forever to run. But I do have the first two complete (2 cpus).
mean speedup (all positions count equally): 1.71 1.84
geometric mean: 1.64 1.65
NPS speedup: 1.93 1.92
tree size change: 12.6% 4.3%
The speedups are better than I expected in one way, worse in another.
Better in that the search overhead is nowhere near as high as the last time I did an exhaustive analysis (probably 10 years ago or longer). I used to see about 30% overhead per cpu added, which is what led to my simple linear approximation: with 2 cpus, the tree size grew by about 15% (30% of the second cpu's nodes were overhead). Why that has dropped, I do not yet understand, but I am looking. Vincent Diepeveen used to postulate that the more aggressive the pruning and such, the harder it became to search the tree efficiently (more overhead introduced). I never understood that reasoning, and I am now convinced the opposite might be true: all the reductions and such may actually be helping overhead rather than hurting, because later moves get searched much less thoroughly, so it is not costly to change your mind in the middle of a search.
The speedup is worse than I expected in that, if I can get the NPS scaling close to 100% (i.e. close to 20x with 20 cpus, not the current 14.5x), the pure speedup ought to rise by about 1/3 as well. I can get another 2x from NPS by going to the manual 1gb huge pages, but I want to finish these last two runs first, so I have a complete set of homogeneous data with everything exactly the same except for the number of threads used.
The major issue I see right now: 64 gigs of hash = 64 * 512 = 32K 2mb pages. The typical TLB can only map around 10 of those, so it is getting thrashed badly. 32K pages means 256KB of page-table entries (PGD, PUD, PMD). That won't fit in L2 alongside everything else, so virtual-to-real translation runs at L3 speeds, not TLB speed. Going to 1gb pages turns this into 64 1gb pages, whose entries will easily fit in L2 (64 entries * 8 bytes, only 8 cache blocks).
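The arithmetic above can be checked mechanically. A throwaway sketch (assuming the usual 8-byte x86-64 page-table entries; function names are mine, purely illustrative):

```c
/* How many pages of size 'page' are needed to map 'bytes' of hash,
 * and how much page-table space those mappings consume at the leaf
 * level, assuming 8-byte entries as on x86-64. */
static unsigned long long pages_needed(unsigned long long bytes,
                                       unsigned long long page)
{
    return bytes / page;
}

static unsigned long long pte_bytes(unsigned long long npages)
{
    return npages * 8ULL;  /* one 8-byte entry per mapped page */
}
```

With a 64gb hash, pages_needed(64gb, 2mb) gives 32768 pages and pte_bytes of 256KB; with 1gb pages it is 64 pages and only 512 bytes of entries, which is why the page walk stops being a cache problem.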
But I seem to have other issues that need addressing. Global locks are almost non-existent now. But there is shared data that is written to, and I am now starting to look at that, since it causes lots of cache-invalidate / cache-line-forwarding traffic. I suppose I am going to have to start playing with the hardware performance-monitor counters (MSRs) to see exactly what is causing the bottleneck. But clearly, with NPS scaling at 14.5x, if I can take it to the 20x it should be, then the SMP search speedup will also scale upward by that almost-33% value, which is more than a little significant.
If you want the raw data, I'll copy all the spreadsheets to my web page. Just be aware that only the first two 2-processor columns are any good. I just copied the original spreadsheet (modified to add your geometric mean in addition to the arithmetic mean), so the third and fourth columns have data from either the 16 or 20 cpu tests until I get the current results.
I did add two more columns, although I am not sure how helpful they are. They give the standard deviation for the horizontal rows. For speedup, they show how variable each position was, and then at the bottom, it shows the variability for the mean/geomean for each test.
Note that these positions are not searched to the same depth as in the first run. Increasing the hash size from 1gb to 64gb reduces the tree sizes significantly, forcing me to increase some of the depths, since I wanted to keep the 20-cpu time per position somewhere in the 2-6 minute range, which is a normal time for real tournament games.
The pre-25.0 parallel search wasn't bad, but it was nowhere near as good as what I currently have running. This is really clean and direct, and as I mentioned, almost all global locks are gone. I am looking at the last one, as I am not even sure it is needed, but that area of the search is extremely difficult to visualize. One golden rule: when you acquire multiple locks, you MUST acquire them in the same order, or you risk deadlock. When threads are aborting other threads, I have to be careful, since I need to lock a split block before telling the threads there to stop searching due to an unexpected fail-high. Since everybody helps everybody, two different threads could fail high at the same time, and since the split blocks are not necessarily linked in the same order for both, the deadlock potential is there. I don't want to tell threads to stop in general; I only want to tell them to stop working on THIS split block, or any split block hierarchically below it (a re-split done below this split point). Messy. Very messy. The only saving grace is that such aborts are not that common, so I don't know that changing this will have any significant impact anyway.
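The usual way to honor that golden rule when two locks must be held at once is to impose a total order on the locks, e.g. by address, so any two threads that need the same pair always acquire them the same way. A minimal sketch (the split_block layout and names here are hypothetical, not Crafty's actual structures):

```c
#include <pthread.h>

/* Hypothetical split block carrying its own lock and a stop flag
 * that tells the threads working there to abort. */
typedef struct split_block {
    pthread_mutex_t lock;
    int stop;
} split_block;

/* Acquire two split-block locks in canonical (address) order, so two
 * threads failing high in different blocks can never deadlock by
 * grabbing the same pair in opposite orders. */
static void lock_pair(split_block *a, split_block *b)
{
    if (a > b) { split_block *t = a; a = b; b = t; }
    pthread_mutex_lock(&a->lock);
    if (a != b)
        pthread_mutex_lock(&b->lock);
}

/* Release order does not matter for deadlock avoidance. */
static void unlock_pair(split_block *a, split_block *b)
{
    if (a != b)
        pthread_mutex_unlock(&b->lock);
    pthread_mutex_unlock(&a->lock);
}
```

This only linearizes pairs, of course; aborting a whole subtree of re-splits below a split point still needs a consistent traversal order of the hierarchy, which is where the messiness described above comes from.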