SMP speedup discussion

Discussion of anything and everything relating to chess playing software and machines.

Moderators: hgm, Rebel, chrisw

bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

SMP speedup discussion

Post by bob »

I have started re-running my SMP tests yet again after carefully going through the SMP code looking for any possible problematic race conditions (some race conditions are perfectly acceptable, some are not).

But I've been debating one particular issue for a while and I do not see anything that particularly favors one way of reporting speedup over another. Here goes.

(1) Take a long game played with max cores. This gives a target time per move for each test position using max cores, whatever that is (obviously something you actually have, not a wish-list). This gives an accurate speedup estimate over running on a single core. The downside is that the single-core test case takes forever. The upside is that the speedup estimate is quite accurate for the time control you choose and the hardware platform you test on.

(2) Take a long game played with one core, then test up to max cores. This is a much more computationally palatable test, since the 1-core test now runs in reasonable time. But it will certainly underestimate the speedup if you move to the N-core box, because many tests have shown that the deeper (or longer) you search, the better the speedup. I have never studied this carefully, and it might be yet another interesting research project to see if there is a reasonable curve that depicts speedup vs time limit.

(3) A sort of corollary of (1) would show up if the time control you are interested in is very fast. Then the speedup will certainly be worse, but the testing is more palatable since you are now not looking at such a huge 1 core test duration.
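
The bookkeeping is the same in all three cases -- run every position to the same depth with 1 core and with N cores and compare times -- so it boils down to something like this (a throwaway sketch with made-up numbers, not anything from Crafty):

Code: Select all

/* Simplified sketch (made-up numbers and names, not Crafty code): given the
   per-position times for the 1-core and the N-core runs to the same depth,
   print the per-position speedups and the overall speedup from total times. */
#include <stdio.h>

int main(void) {
  double t1[]  = { 1800.0, 2400.0, 1500.0 };   /* 1-core seconds per position  */
  double t20[] = {  115.0,  160.0,   90.0 };   /* 20-core seconds, same depth  */
  int n = 3;
  double sum1 = 0.0, sum20 = 0.0;

  for (int i = 0; i < n; i++) {
    printf("pos %d: speedup %.2f\n", i + 1, t1[i] / t20[i]);
    sum1  += t1[i];
    sum20 += t20[i];
  }
  printf("overall speedup (ratio of total times): %.2f\n", sum1 / sum20);
  return 0;
}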

The reason I raise this question is that it is very difficult, if not impossible, to compare apples to apples today, when everybody who produces SMP speedup data tests in a different way. Yes, the most relevant test would probably be games, but that is beyond unpalatable for obvious reasons.

Ergo, my question. Right now I'm shooting for an average time of 3-4 minutes per move with 20 cores. But that's really annoying when you make the 1-core run: the 20-core test takes maybe 1 1/2 hours, so the 1-core run can take up to 30 hours (assuming one gets a 20x speedup, which is not very likely).

Thoughts? Not just for programmers either, which is why I posted this here. And no, it really is NOT practical to produce that speedup curve with many different time limits. Nobody will run such a test if it takes months to produce all the data...

So what is a good number? Or perhaps even three good numbers (e.g. 5 seconds a la speed chess, 30 seconds for fairly fast games, and 3-4 minutes for tournament-type long time control games).
jdart
Posts: 4366
Joined: Fri Mar 10, 2006 5:23 am
Location: http://www.arasanchess.org

Re: SMP speedup discussion

Post by jdart »

Depends on your hardware, I think, and also on your NPS.

With Crafty, which I have seen hit 50M NPS pretty easily, I expect it doesn't take long at all to start seeing whatever effects you are going to see with SMP. For example, a 1-minute search at 50M NPS is 3 billion nodes. So a longer search is probably not going to stress the CPU and memory much more than a 1-minute search does.

--Jon
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: SMP speedup discussion

Post by bob »

jdart wrote:Depends on your hardware, I think, and also on your NPS.

With Crafty, which I have seen hit 50M NPS pretty easily, I expect it doesn't take long at all to start seeing whatever effects you are going to see with SMP. For example, a 1-minute search at 50M NPS is 3 billion nodes. So a longer search is probably not going to stress the CPU and memory much more than a 1-minute search does.

--Jon
I tend to see the NPS (using 20 cores) still climbing beyond at least the one-minute mark. I might try to run a test with a couple of positions, gathering exactly this data. But in general, over the years, the deeper it goes, the better the speedup, probably primarily related to the "min_split_depth" limit that prevents it from splitting too near the tips.

One other piece of code I am going to write before 25.0 is released is an "autotune" option, something like "autotune #cores time", where time is the time per move you want to optimize for. None of my SMP search data so far has been optimized at all, and there are now about 10 adjustable parameters. I know how to test each one; I just need to automate the process. But this is probably going to take at least an hour or two for every one minute of the basic time per move.
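
To be clear about what I mean by automating it, the skeleton is just coordinate descent over the parameters, re-running the test positions at the target time per move for each candidate value. Something like this (names, limits and the bench() routine are placeholders, not the real Crafty code):

Code: Select all

#include <stdio.h>

static int min_split_depth = 5;        /* two example tunables, made-up defaults */
static int max_split_threads = 4;

typedef struct { const char *name; int *value; int lo, hi, step; } tunable_t;

static tunable_t tunables[] = {
  { "min_split_depth",   &min_split_depth,   3, 10, 1 },
  { "max_split_threads", &max_split_threads, 2,  8, 1 },
};

/* Placeholder: run the test positions with <cores> threads at roughly
   <time> seconds per move and return the average time-to-depth.       */
static double bench(int cores, double time) {
  (void)cores; (void)time;
  return 0.0;                          /* the real thing searches, of course */
}

void autotune(int cores, double time) {
  for (unsigned i = 0; i < sizeof(tunables) / sizeof(tunables[0]); i++) {
    tunable_t *p = &tunables[i];
    int best_val = *p->value;
    double best = bench(cores, time);  /* baseline with the current setting */
    for (int v = p->lo; v <= p->hi; v += p->step) {
      *p->value = v;
      double t = bench(cores, time);
      if (t < best) { best = t; best_val = v; }
    }
    *p->value = best_val;              /* keep whatever tested fastest */
    printf("%s -> %d\n", p->name, best_val);
  }
}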

I am also going to experiment with the time limit. I have tuned all the positions so that the 1-cpu search takes close to 30 minutes, nothing more than 40 or less than 15, trying to target the 20-cpu time at 2 minutes per move or so. I'm going to add a modification to the test so that the depth can be reduced by a constant for all positions being tested, backing off the 1-cpu time a good bit to test faster games...

On this 20-core box, 90M nodes per second seems to be the norm unless I go to the 64 GB hash size, which tends to crush the TLB even with 2 MB pages.
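
For a sense of scale (back-of-the-envelope arithmetic, not a measurement): a 64 GB hash spans 64*1024/2 = 32,768 distinct 2 MB pages, far more translations than the TLB holds, so random hash probes miss it almost every time; with 1 GB pages the same table is only 64 pages, so far fewer translations are in play and each miss costs a shorter page walk.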

And finally, even a one-minute (all-cores) search target turns into a painful test once you repeat it with one core. And it is only going to get worse as the core counts climb... I am not sure how I would approach testing with 64 cores. Either the 1-core test would be painful, or else the 64-core runs would be over very quickly...
jdart
Posts: 4366
Joined: Fri Mar 10, 2006 5:23 am
Location: http://www.arasanchess.org

Re: SMP speedup discussion

Post by jdart »

I have some auto-adjustment for split depth that adjusts it up and down depending on the observed NPS during a search, within some limits. This is pretty crude and it can bounce the values around quite a lot. Still, with the wide range of hardware people have, some kind of tuning is a good idea.
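
Roughly the shape of it, though this is a simplification and not the actual Arasan code; the thresholds and even the direction of the nudge here are just for illustration:

Code: Select all

/* Crude feedback loop: compare the NPS observed during the search against a
   running baseline and nudge the minimum split depth within [lo, hi].  The
   assumption in this sketch is that sagging NPS means too much splitting
   overhead, so split deeper only; with loose thresholds like these the
   value can indeed bounce around quite a bit.                              */
void adjust_split_depth(double observed_nps, double baseline_nps,
                        int *split_depth, int lo, int hi) {
  if (observed_nps < 0.90 * baseline_nps && *split_depth < hi)
    (*split_depth)++;                  /* back off: split only at deeper nodes */
  else if (observed_nps > 1.05 * baseline_nps && *split_depth > lo)
    (*split_depth)--;                  /* headroom: allow splits nearer the tips */
}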

--Jon
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: SMP speedup discussion

Post by Laskos »

Maybe just to compute the effective speedup for doubling the number of cores?
Say, 1 core vs 2 cores, with the 2-core side at 1 minute/position,
2 cores vs 4 cores, with the 4-core side at 1 minute/position,
...
8 cores vs 16 cores, with the 16-core side at 1 minute/position.

On my 4-core i7, I found that after ~1 minute of search for the faster of the two configurations being compared (the one with more cores), the speedup flattens; it doesn't go significantly higher on longer searches.
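
For instance (numbers purely for illustration): if the measured doubling ratios came out as 1.9, 1.8, 1.7 and 1.6, the implied 1-to-16-core speedup is just their product, 1.9 * 1.8 * 1.7 * 1.6, about 9.3.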

On 24 positions each doubling in cores would take at most 72 minutes, and in total you need 4 doublings to reach 16 cores, or about 5 hours for the whole test. I would include more positions to get a reliable result, or, as you do, re-run the test 4 or more times. Another source of error comes from non-zero widening or narrowing. It's not so easy to extrapolate from, say, 10 Elo points of widening at a fixed depth of 10. The real depths used in a time-to-depth test are some 20-30, and the widening at those depths is impossible to measure Elo-wise in games. One can, however, see the total node count in the TTD test. The usual extrapolation is that 10 Elo points of widening contribute some 5% to 10% to the effective speedup, so the effective speedup equals, say, 1.05 to 1.10 times the TTD speedup. But this estimate introduces an error of several percent. It would be nice if the widening or narrowing were exactly 0 Elo points; in that case the TTD speedup equals the total effective speedup.
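
As a worked example with made-up numbers: a measured TTD speedup of 9.3 with +10 Elo of widening would give an effective speedup of roughly 9.3 * 1.05 = 9.8 up to 9.3 * 1.10 = 10.2, and that 5%-10% factor is exactly where the extra error comes in.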
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: SMP speedup discussion

Post by bob »

Laskos wrote:Maybe just to compute the effective speedup for doubling the number of cores?
Say, 1 core vs 2 cores, with the 2-core side at 1 minute/position,
2 cores vs 4 cores, with the 4-core side at 1 minute/position,
...
8 cores vs 16 cores, with the 16-core side at 1 minute/position.

On my 4-core i7, I found that after ~1 minute of search for the faster of the two configurations being compared (the one with more cores), the speedup flattens; it doesn't go significantly higher on longer searches.

On 24 positions each doubling in cores would take at most 72 minutes, and in total you need 4 doublings to reach 16 cores, or about 5 hours for the whole test. I would include more positions to get a reliable result, or, as you do, re-run the test 4 or more times. Another source of error comes from non-zero widening or narrowing. It's not so easy to extrapolate from, say, 10 Elo points of widening at a fixed depth of 10. The real depths used in a time-to-depth test are some 20-30, and the widening at those depths is impossible to measure Elo-wise in games. One can, however, see the total node count in the TTD test. The usual extrapolation is that 10 Elo points of widening contribute some 5% to 10% to the effective speedup, so the effective speedup equals, say, 1.05 to 1.10 times the TTD speedup. But this estimate introduces an error of several percent. It would be nice if the widening or narrowing were exactly 0 Elo points; in that case the TTD speedup equals the total effective speedup.
That's another idea, but it again introduces some distortion, since you now compare different search times. It would certainly simplify the testing, however, since the 1-cpu test would be palatable. I am fixing to re-run my old long test. I think I will run that to completion, and then run as you suggest: 1 cpu at 2 minutes, where 2 cores should come in fairly close to 1 minute; then 2 cpus for 2 minutes to see what 4 does, repeating until it is done. I can post both sets of data and see if it works. If it tracks pretty closely, give yourself a raise, because right now the 1-cpu run takes around 30 hours non-stop, where the 20-core runs are taking 2 hours or a little less. I believe I finally have a solid version, and I'll try to report all the data (1 vs 8 self-play to see how that compares). I am also a little beyond that "narrowing" stuff. I suspect the overhead for 20 cpus is going to be around 20-25% based on testing so far, which is more natural-looking... More probably on Thursday, as I have 4x 20-core runs set up, followed by a single 1-core run, followed by 4x 16, 8, 4 and 2.
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: SMP speedup discussion

Post by Laskos »

bob wrote:
That's another idea, but it again introduces some distortion, since you now compare different search times.
It wouldn't be a problem as long as the search time is sufficient to have achieved the maximum speedup. If the parallel search kicked in instantly, then even ultra-short time controls would be fine, but as we see, we need billions of nodes per position to get a reliable result, a result very close to the maximum speedup. What I saw is that the 1-core NPS ramps up quickly, depending on the hash size. 2 cores, because of the parallel search, are slower to reach the peak and need at least a minute. 4 and 8 cores also need at least a minute. So fixed time is what matters when computing effective speedup, not fixed nodes or depth, since 8 cores search many more nodes in 1 minute than 1 core does in the same amount of time. I would guess your very long tests, going rigorously to the same depth and needing maybe 3 days to compute, will be close in result to my proposal of a much faster test.
mhull
Posts: 13447
Joined: Wed Mar 08, 2006 9:02 pm
Location: Dallas, Texas
Full name: Matthew Hull

Re: SMP speedup discussion

Post by mhull »

bob wrote:On this 20-core box, 90M nodes per second seems to be the norm unless I go to the 64 GB hash size, which tends to crush the TLB even with 2 MB pages.
Any progress on using 1 GB huge pages?
Matthew Hull
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: SMP speedup discussion

Post by bob »

Laskos wrote:
bob wrote:
That's another idea, but it again introduces some distortion, since you now compare different search times.
It wouldn't be a problem as long as the search time is sufficient to have achieved the maximum speedup. If the parallel search kicked in instantly, then even ultra-short time controls would be fine, but as we see, we need billions of nodes per position to get a reliable result, a result very close to the maximum speedup. What I saw is that the 1-core NPS ramps up quickly, depending on the hash size. 2 cores, because of the parallel search, are slower to reach the peak and need at least a minute. 4 and 8 cores also need at least a minute. So fixed time is what matters when computing effective speedup, not fixed nodes or depth, since 8 cores search many more nodes in 1 minute than 1 core does in the same amount of time. I would guess your very long tests, going rigorously to the same depth and needing maybe 3 days to compute, will be close in result to my proposal of a much faster test.
I am running my tests with the latest code, which finally seems to be looking reasonable. I have finished the 20-cpu tests, with an average speedup of 15.3. I have two of the 16-cpu runs done, with a speedup in the 10-13x range depending on whether I use the raw mean (lower) or the geometric mean (which is actually higher here). Once the others run, I will go back and try the same thing with 2 minutes on 1 core, shooting for 1 minute on 2 cores. Then two minutes on 2 cores vs 1 minute on 4 cores, etc. That would make this a "one hour @ two minutes" process. Each pair of comparisons ought to take 3 hours, and 5 comparisons = 15 hours. About half the time the current 1-cpu run takes, in fact. :)
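
To make the raw vs. geometric distinction concrete with made-up numbers: two positions with 1-core times of 100 and 100 seconds and 20-core times of 20 and 5 seconds have per-position speedups of 5 and 20. The ratio of total times is 200/25 = 8, the geometric mean is sqrt(5*20) = 10, and a straight average of the ratios would be 12.5, so the "raw" number (ratio of totals) can easily sit below the geometric mean when the parallel times are lopsided.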

Be interesting to compare the two results. If they are close, I'd certainly prefer the latter.

BTW, I am now convinced this new code has dethroned DTS, but there is a caveat. When the DTS experiments were done, the typical search depth was 10-11 plies at best. I think the shallowest search in these tests is maybe 24 plies. That extra depth would have been beneficial to DTS as well, so it is not quite an apples-to-apples comparison. But it seems clear that the 1988 speedup numbers have now been eclipsed. If it weren't so much trouble, I would try to fix the parallel code in Cray Blitz so the experiments could be re-run after a few search tweaks to bring it into the 21st century. :)
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: SMP speedup discussion

Post by bob »

mhull wrote:
bob wrote:On this 20-core box, 90M nodes per second seems to be the norm unless I go to the 64 GB hash size, which tends to crush the TLB even with 2 MB pages.
Any progress on using 1 GB huge pages?
It is both easy and painful. I made it work with mmap(). But the downside is that (a) you have to allocate the 1 GB pages at boot time, and they are then only available through the mmap() trick; and (b) you have to set up the hugetlbfs stuff, which creates a bogus filesystem that connects to the huge pages. Then mmap()'ing a file on that filesystem into your address space gives you 1 GB pages.
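
Stripped of error handling, the mmap() trick looks roughly like this (a sketch, not the exact Crafty code; the mount point is whatever you picked when setting up hugetlbfs, e.g. booting with hugepagesz=1G hugepages=64 and mounting with "mount -t hugetlbfs -o pagesize=1G none /mnt/huge1g"):

Code: Select all

#include <sys/mman.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>

/* Map <bytes> (a multiple of 1 GB) from a file on the hugetlbfs mount,
   which hands back memory backed by the boot-time-reserved 1 GB pages. */
static void *alloc_huge(size_t bytes) {
  int fd = open("/mnt/huge1g/hash", O_CREAT | O_RDWR, 0600);
  if (fd < 0) return NULL;
  void *p = mmap(NULL, bytes, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
  close(fd);                            /* the mapping survives the close */
  return (p == MAP_FAILED) ? NULL : p;
}

int main(void) {
  void *hash = alloc_huge(64ULL << 30); /* 64 GB of hash on 1 GB pages */
  printf(hash ? "mapped\n" : "failed\n");
  return 0;
}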

The 2 MB pages, on the other hand, are completely transparent, and there are Linux plans to do the same for 1 GB pages. How soon is unknown. I'm undecided on whether to include this in the 25.0 version, since it is likely going to generate a ton of questions. On this 128 GB box I am using, reserving 64 1 GB pages is not exactly onerous. And that is getting into the hash table sizes where 1 GB pages make a significant difference. No idea how to do this on Windows, or if it is even possible. There are also lots of deadlock possibilities, since these monster pages are locked in memory.