Laskos wrote:bob wrote:I have not finished this, but I am making progress. I am running the new test with 64gb of hash, since that gave the best 1-cpu performance at these tree sizes. It has hurt NPS and speedup just a bit. Linux seamlessly supports 2mb large pages with no program changes needed. Using 1gb pages is not as user-friendly, since for the time being you have to reserve them at boot. So these results are with 2mb large pages in use, which turns the 4-probe virtual-to-physical mapping into a 3-probe walk, removing one level from the usual PGD->PUD->PMD->PTE hierarchy.
Results (I will make the spreadsheets available when everything is done): I lack one more 4-cpu run, and then four 2-cpu runs, which will take a while (probably 10 hours each or so, since the 1-cpu run takes around 18 hours to complete).
4cpus
-------
mean speedups (vs 1 cpu) = 3.83, 3.32, 4.05
geometric mean speedups = 3.76, 3.25, 3.80
8cpus
-------
mean = 6.49, 6.42, 6.81, 7.05 (mean of mean = 6.69)
gmean = 6.28, 6.17, 6.35, 6.57 (gmean of gmean = 6.34)
16cpus
--------
mean = 12.09, 11.81, 12.29, 12.09 (mean of mean = 12.07)
gmean = 11.70, 10.92, 11.74, 11.70 (gmean of gmean = 11.51)
20cpus
--------
mean = 12.95, 11.44, 14.21, 13.23 (mean of mean = 12.96)
gmean = 12.24, 11.62, 13.64, 12.39 (gmean of gmean = 12.45)
I'll skip all the nps and nodes stuff, those will be in the spreadsheets I will put on my web page once everything is done.
If we stick with the geometric mean, we get
4p=~3.5, 8p=6.3, 16p=11.5 and 20p=12.45
My very old linear approximation predicts
4p=3.1, 8p=5.9, 16p=11.5, 20p=14.3
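For what it's worth, the four predicted values are consistent with a simple linear model where each added processor contributes about 70% of a cpu's worth of useful work. A throwaway sketch (the 0.7 slope is inferred from the numbers above, not quoted from any source):

```c
/* Linear speedup approximation: each added cpu contributes ~70% of a
 * cpu's worth of useful work, the rest being parallel overhead.
 * The 0.7 slope is inferred from the predictions in the post. */
static double predict(int cpus)
{
    return 1.0 + 0.7 * (cpus - 1);
}
```

predict(4), predict(8), predict(16), predict(20) reproduce 3.1, 5.9, 11.5, 14.3 exactly.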
The 16p and 20p numbers are down a bit. But the NPS scaling is also down, thanks to the huge memory size used for hash: for 20 processors, the NPS scaling dropped from 16.5x to 14.6x. I am likely going to re-run this one more time using real 1gb huge pages. It only requires changing my memory allocation procedure to use mmap() rather than malloc(), which is easy enough. That should pretty well kill the TLB issues for the hash table, and should let the rest of the data migrate into 2mb large pages as well.
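The malloc()-to-mmap() change might look something like this sketch, assuming a Linux kernel with 1gb hugetlb pages reserved at boot (e.g. hugepagesz=1G hugepages=64 on the kernel command line); this is illustrative, not Crafty's actual allocation code:

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>

/* Older headers may lack these; values match current Linux. */
#ifndef MAP_HUGE_SHIFT
#define MAP_HUGE_SHIFT 26
#endif
#ifndef MAP_HUGE_1GB
#define MAP_HUGE_1GB (30 << MAP_HUGE_SHIFT)
#endif

/* Try explicitly reserved 1gb huge pages first; fall back to normal
 * pages, where the kernel's transparent 2mb huge pages can still
 * take over with no further effort. Returns NULL on failure. */
static void *alloc_hash(size_t bytes)
{
    void *p = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB | MAP_HUGE_1GB,
                   -1, 0);
    if (p == MAP_FAILED)
        p = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}
```

The fallback matters: if the boot-time reservation is missing, the MAP_HUGETLB request fails outright rather than silently using small pages.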
More on that once the current test finishes. I have been looking carefully, but it appears that there is no current kernel support for transparent 1gb huge pages. I'm going to look through some of the development trees. Right now all I see is code to handle PMD splits, but nothing at the PUD level, which would be the next step required to handle 1gb pages transparently, which is the way this ought to work eventually.
At least the transparent 2mb pages work as expected with zero programming changes required anywhere.
Very high speedups, higher than I would expect. What is the 2-cpu speedup? In the first run, the 1->2 speedup was lower than I expected, and 2->4 was higher, even superlinear.
These take forever to run. But I do have the first two complete (2 cpus).
mean speedup (all positions count equally): 1.71 1.84
geometric mean: 1.64 1.65
NPS speedup: 1.93 1.92
tree size change: 12.6% 4.3%
The speedups are better than I expected in one way, worse in another.
Better in that the search overhead is nowhere near as high as the last time I did an exhaustive analysis (probably 10 years ago or longer). I used to see about 30% overhead per cpu added, which is what led to my simple linear approximation: with 2 cpus, the tree size grew by about 15% (30% of the second cpu's nodes were overhead). Why that has dropped, I do not yet understand, but I am looking. Vincent Diepeveen used to postulate that the more aggressive the pruning and such, the harder it became to search the tree efficiently (more overhead introduced). I never understood that reasoning, and I am now convinced the opposite might be true: all the reductions and such may actually be helping overhead rather than hurting, because later moves get searched much less thoroughly, so it is not costly to change your mind in the middle of a search.
The speedup is worse than I expected in that, if I can get the NPS scaling close to 100% (i.e. close to 20x with 20 cpus, not the current 14.5x), the pure speedup ought to rise by about 1/3 as well. I can get another 2x from NPS by going to the manual 1gb huge pages, but I want to finish these last two runs first, so I have a complete set of homogeneous data with everything exactly the same except for the number of threads used.
The major issue I see right now: 64 gigs of hash = 64 * 512 = 32K 2mb pages. The typical TLB can only map around 10 of those, so it is getting thrashed badly. 32K pages means 256KB of page-table entries (PGD, PUD, PMD). That won't fit in L2 alongside everything else, so virtual-to-real translation runs at L3 speeds, not TLB speed. Going to 1gb pages turns this into 64 1gb pages, whose entries will easily fit in L2 (64 entries * 8 bytes, only 8 cache blocks).
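The arithmetic above can be checked mechanically. A throwaway sketch (assuming the usual 8-byte x86-64 page-table entries; function names are mine, purely illustrative):

```c
/* How many pages of size 'page' are needed to map 'bytes' of hash,
 * and how much page-table space those mappings consume at the leaf
 * level, assuming 8-byte entries as on x86-64. */
static unsigned long long pages_needed(unsigned long long bytes,
                                       unsigned long long page)
{
    return bytes / page;
}

static unsigned long long pte_bytes(unsigned long long npages)
{
    return npages * 8ULL;  /* one 8-byte entry per mapped page */
}
```

With a 64gb hash, pages_needed(64gb, 2mb) gives 32768 pages and pte_bytes of 256KB; with 1gb pages it is 64 pages and only 512 bytes of entries, which is why the page walk stops being a cache problem.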
But I seem to have other issues that need addressing. Global locks are almost non-existent now. But there is shared data that is written to, and I am now starting to look at that, since it causes lots of cache-invalidate / cache-line-forwarding traffic. I suppose I am going to have to start playing with the hardware performance-monitor counters (MSRs) to see exactly what is causing the bottleneck. But clearly, with NPS scaling at 14.5x, if I can take it to the 20x it should be, then the SMP search speedup will also scale upward by that almost-33% value, which is more than a little significant.
If you want the raw data, I'll copy all the spreadsheets to my web page. Just be aware that only the first two 2-processor columns are any good. I just copied the original spreadsheet (modified to add your geometric mean in addition to the arithmetic mean), so the third and fourth columns have data from either the 16 or 20 cpu tests until I get the current results.
I did add two more columns, although I am not sure how helpful they are. They give the standard deviation for the horizontal rows. For speedup, they show how variable each position was, and then at the bottom, it shows the variability for the mean/geomean for each test.
Note that these positions are not searched to the same depth as in the first run. Increasing the hash size from 1gb to 64gb reduces the tree sizes significantly, forcing me to increase some of the depths, since I wanted to keep the 20-cpu time per position somewhere in the 2-6 minute range, which is a normal time for real tournament games.
The pre-25.0 parallel search wasn't bad, but it was nowhere near as good as what I currently have running. This is really clean and direct, and as I mentioned, almost all global locks are gone. I am looking at the last one, as I am not even sure it is needed, but that area of the search is extremely difficult to visualize. One golden rule: when you acquire multiple locks, you MUST acquire them in the same order, or you risk deadlock. When threads are aborting other threads, I have to be careful, since I need to lock a split block before telling the threads there to stop searching due to an unexpected fail-high. Since everybody helps everybody, two different threads could fail high at the same time, and since the split blocks are not necessarily linked in the same order for both, the deadlock potential is there. I don't want to tell threads to stop in general; I only want to tell them to stop working on THIS split block, or any split block hierarchically below it (a re-split done below this split point). Messy. Very messy. The only saving grace is that such aborts are not that common, so I don't know that changing this will have any significant impact anyway.
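The usual way to honor that golden rule when two locks must be held at once is to impose a total order on the locks, e.g. by address, so any two threads that need the same pair always acquire them the same way. A minimal sketch (the split_block layout and names here are hypothetical, not Crafty's actual structures):

```c
#include <pthread.h>

/* Hypothetical split block carrying its own lock and a stop flag
 * that tells the threads working there to abort. */
typedef struct split_block {
    pthread_mutex_t lock;
    int stop;
} split_block;

/* Acquire two split-block locks in canonical (address) order, so two
 * threads failing high in different blocks can never deadlock by
 * grabbing the same pair in opposite orders. */
static void lock_pair(split_block *a, split_block *b)
{
    if (a > b) { split_block *t = a; a = b; b = t; }
    pthread_mutex_lock(&a->lock);
    if (a != b)
        pthread_mutex_lock(&b->lock);
}

/* Release order does not matter for deadlock avoidance. */
static void unlock_pair(split_block *a, split_block *b)
{
    if (a != b)
        pthread_mutex_unlock(&b->lock);
    pthread_mutex_unlock(&a->lock);
}
```

This only linearizes pairs, of course; aborting a whole subtree of re-splits below a split point still needs a consistent traversal order of the hierarchy, which is where the messiness described above comes from.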