parallel speedup and assorted trivia

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

parallel speedup and assorted trivia

Post by bob »

Since we installed the first node of the 2x10-core cluster we have, I have been playing with it. I've run the same positions on this box as I used in my dissertation and in the DTS paper published in the ICGA Journal. I'm going to provide a summary here, and if anyone wants the "semi-raw" data I'll be happy to provide it. I have both log files and an Excel spreadsheet; the spreadsheet is obviously much easier to read. First, a quick explanation of the test.

This box is a 2 x 10-core Intel Xeon E5-2660 v3 running at 2.6GHz (not exactly accurate, more on that in a minute), with 24MB of L3 cache and 24GB of DRAM. When it first arrived I started testing and found some interesting things.

It has both turbo boost and hyper-threading, obviously. If I run Crafty on this box using 40 threads, the cores all run at 2.6GHz non-stop, never changing. If I disable hyper-threading (which is how it is configured now, in fact) and run Crafty using 20 threads, it runs at 2.9GHz continuously, never going up or down. (I actually wrote a small program to continuously check the frequencies and report any instance of more than +/- 50MHz variance, and ran it for 12 hours with no "warnings".) So far, so good. 2.9GHz x 20 is not bad. But the problems start soon after. If I run fewer than 20 threads, the processors will run at up to 3.2GHz. I did not take the time to figure out exactly how they vary; the fact that they were varying at all is a problem if one wants to do SMP speedup measurements. I have seen this mistake made here multiple times in the past and didn't want it in my data.
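
The monitor is nothing fancy; something along the lines of this minimal sketch will do (not the exact program I ran, and it assumes the cpufreq sysfs interface, so the path may differ with the kernel/driver):

Code:

/* freqwatch.c - warn if any core drifts more than 50 MHz from a reference.
   Minimal sketch; assumes /sys/devices/system/cpu/cpuN/cpufreq is present. */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define NCPUS   20
#define REF_KHZ 2900000        /* expected 2.9 GHz */
#define TOL_KHZ 50000          /* +/- 50 MHz */

static long cur_khz(int cpu) {
    char path[128];
    long khz = -1;
    FILE *f;

    snprintf(path, sizeof(path),
             "/sys/devices/system/cpu/cpu%d/cpufreq/scaling_cur_freq", cpu);
    if ((f = fopen(path, "r"))) {
        if (fscanf(f, "%ld", &khz) != 1)
            khz = -1;
        fclose(f);
    }
    return khz;
}

int main(void) {
    int cpu;

    while (1) {
        for (cpu = 0; cpu < NCPUS; cpu++) {
            long khz = cur_khz(cpu);
            if (khz > 0 && labs(khz - REF_KHZ) > TOL_KHZ)
                printf("warning: cpu%d at %ld kHz\n", cpu, khz);
        }
        sleep(1);
    }
}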

What I decided to do was to run this set of positions with 1, 2, 4, 8, 16 and 20 cores. When I kicked off one of those tests, the script actually ran two instances of Crafty. It would first start a dummy run using 20 - N threads, where N is the number of threads I wanted to test. It would let that get going for a few seconds (just running one position to infinite search depth) and then start the actual test run using N threads. This way I always had 20 threads running, all processors locked at 2.9GHz, and I could directly compare 1 vs 2, up to 1 vs 20, without introducing the error where 1 thread runs at 3.2GHz while 20 run at 2.9GHz.

My next step is going to be to see if I can't modify the Linux kernel "governor" that controls the cpu speed, and just lock 'em at 2.9GHz no matter what, since the current setup really burns a lot of unnecessary compute cycles. This test takes about 200,000 elapsed seconds, which is getting close to 2.5 days, and I'd like to be able to run without soaking up 20 cores when I'm only testing on one. I do run the "dummy" at nice +19, so once I get past the 20-core run, others can actually do stuff on the machine without hurting my accuracy significantly.
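
If the stock cpufreq sysfs knobs turn out to be honored, a pure user-space approach (rather than modifying the kernel) might be as simple as the sketch below, which just collapses the min/max frequency range to one value. This assumes the scaling driver exposes scaling_min_freq/scaling_max_freq and that it runs as root:

Code:

/* pinfreq.c - pin every core to one frequency by collapsing the cpufreq
   min/max range.  Sketch only; assumes the scaling driver honors
   scaling_min_freq/scaling_max_freq and that this runs as root. */
#include <stdio.h>

#define NCPUS      20
#define TARGET_KHZ "2900000"   /* 2.9 GHz, in kHz as sysfs expects */

static void write_knob(int cpu, const char *knob, const char *value) {
    char path[160];
    FILE *f;

    snprintf(path, sizeof(path),
             "/sys/devices/system/cpu/cpu%d/cpufreq/%s", cpu, knob);
    if ((f = fopen(path, "w"))) {
        fprintf(f, "%s\n", value);
        fclose(f);
    }
}

int main(void) {
    int cpu;

    for (cpu = 0; cpu < NCPUS; cpu++) {
        write_knob(cpu, "scaling_min_freq", TARGET_KHZ);
        write_knob(cpu, "scaling_max_freq", TARGET_KHZ);
    }
    return 0;
}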

I'm going to provide three sets of summary data, for 1, 2, 4, 8, 16 and 20 processors: pure speedup (time to depth; all runs go to the same depth, but each position has its own depth), NPS speedup (which gives an idea of the maximum theoretical speedup, since if the NPS doesn't run 20x faster on 20 cores, the time to depth can't be 20x faster either), and finally the total nodes searched for each run, to see how the tree size changes (interesting data here, by the way).
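
Just so there is no confusion about those columns, these are the definitions I'm using (nothing Crafty-specific, simply the arithmetic applied to each position's measurements):

Code:

/* Definitions behind the summary table below; not Crafty source, just how
   the per-position measurements are turned into the summary numbers. */
double speedup(double time_1cpu, double time_ncpu) {
    return time_1cpu / time_ncpu;   /* time to reach the same depth */
}

double nps_speedup(double nps_ncpu, double nps_1cpu) {
    return nps_ncpu / nps_1cpu;     /* an upper bound on the speedup above */
}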

This is run with the NEW version of Crafty, which has a remarkably lightweight splitting algorithm that completely avoids any sort of central locking strategy, and a local split block is held for maybe a dozen instructions (asm level) max, which is far more efficient than in previous versions. NPS scaling is still a ways from perfect here, as the data will show. Here it is:

Code:

                  1p      2p      4p      8p      16p      20p
speedup           -      1.64    3.68    6.64    10.76    13.17
NPS speedup       -      1.94    3.83    7.25    12.27    14.00
avg nodes       18.3B   21.7B   19.1B   20.3B    21.3B    25.2B

Observations:

(1) I have not yet tweaked anything to see if I can improve that 75% NPS scaling (16 cpus run 12x faster). I'd expect that can be better.

(2) The speedup numbers are VERY good when compared to the NPS speedup, better than I had expected and better than I had seen previously. But as I said, this parallel search is completely new and has been in development for going on a year now, and the departure from the old approach has worked out better than I expected. I have a couple more ideas to work on, with the goal of using a VERY minimal locking strategy, since locks kill performance.

(3) The average nodes per position doesn't grow any faster than it did with Cray Blitz. There has been speculation for years that all the pruning, reductions and such have been making parallel speedups drop; I'm no longer buying that argument.

(4) In looking at these numbers, if the NPS can be driven to near-optimal scaling, then the parallel search is going to be "right there" as well. For 20 cores the speedup was 13.2, while the NPS speedup was 14.0, which is an absolute upper bound on the speedup. My old linear approximation would predict 1 + 19*0.7 for 20 cores, or 14.3x faster; the actual figure is 13.2, so 0.65 rather than 0.7 is the better "approximation" here. For 16 cores, the old formula would predict 11.5, which is a little high compared to the actual 10.8.
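
In code form, that back-of-envelope model is just the following (a sketch of the old approximation, not anything Crafty actually computes):

Code:

/* Old rule-of-thumb linear model for parallel speedup; f is the efficiency
   credited to each core beyond the first (about 0.65-0.70 per the data). */
double predicted_speedup(int ncpus, double f) {
    return 1.0 + (ncpus - 1) * f;
}
/* predicted_speedup(20, 0.70) = 14.3  vs. measured 13.2
   predicted_speedup(16, 0.70) = 11.5  vs. measured 10.8
   predicted_speedup(20, 0.65) = 13.4  -- a better fit for this data */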

I am going to work on this NPS issue for a bit and then re-run, but at least these results provide some interesting information. I will note that I normalized on the 20-cpu run: I wanted a search that took long enough at 20 cores to be meaningful, so I targeted 3-6 minutes per position and, for each position, twiddled with the depth until Crafty was averaging somewhere in that range. That makes the 1-cpu run take absolutely forever.

If anyone is interested, send me an email and I'll be happy to send you the actual spreadsheet, which gives all the above data both position by position and as an average (which is all I presented above). In the spreadsheet I even compute the speedup going from N to 2N cores, in addition to the speedup from 1 to N cores, so I could see how the search efficiency changed moving from 4 to 8 cores, then from 8 to 16. Of course, you can calculate that from the above data by working backward. If you'd prefer the raw data, I have log files for each run, but it IS a lot of data. I have a simple tool that eats a log file and gives me the time, NPS and nodes searched for each position, which I can cut and paste directly into Excel.

Finally, this data lacks one sanity test that I will do soon: I want to run the test again and average things. You can look at the spreadsheet and see anomalies, with super-linear speedups on some positions and horrible speedups on others. At least a couple of the positions are problematic in that they produce poor speedup across the board, for example 1.3x on 2 cores, 2x on 4, 3x on 8, 3x on 16 and 5x on 20.

But this is at least a set of carefully measured results. One thing I did do differently is that after each position (these are consecutive positions from a game between MChess Pro and Cray Blitz, somewhere around 1992 or so) I cleared the hash table. When trying to set the depths to take 3-6 minutes, it was just about impossible to do so otherwise, because if I changed the depth on an earlier position, later positions would suddenly go faster or slower due to the changed hash hits. So actual game speedups should be a bit better than these numbers, since in a game the hash table carries over from position to position, producing greater depth (and greater depth improves speedup).

If you have questions, feel free. If you want raw data or the spreadsheet, ask via email (my UAB email).
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: parallel speedup and assorted trivia

Post by Laskos »

Very interesting. Higher effective speed-up and lower NPS speed-up than I expected. How many positions did you use? The effective speed-up on 2 cores is a bit weird: it seems too low in itself, and per core it is lower than the per-core effective speed-ups on 4 and 8 cores. Total nodes searched looks weird on 2 cores as well, higher than on 4 and 8 cores.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: parallel speedup and assorted trivia

Post by bob »

As a follow-up, I have been looking at the data (the above results just finished this morning). One thing that IS hurting NPS scaling is wait time (wait as in young brothers wait, where threads are waiting on someone to offer them a split point to help with).

The "waiting time" is averaging about 8% on the test, which means 0.08 * 20 cores = 1.6 cores are lost waiting. So there is work to do. I have played around with "gratuitous split points" where a thread can split even when no other threads are idle, and I have played with doing it near the root, etc, but it did not originally pay out very well, but I am experimenting with it again since my entire split approach now is something like the DTS "late join" idea except now "late join" is the ONLY way to help.

More later.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: parallel speedup and assorted trivia

Post by bob »

Laskos wrote: Very interesting. Higher effective speed-up and lower NPS speed-up than I expected. How many positions did you use? The effective speed-up on 2 cores is a bit weird: it seems too low in itself, and per core it is lower than the per-core effective speed-ups on 4 and 8 cores. Total nodes searched looks weird on 2 cores as well, higher than on 4 and 8 cores.
This is a total of 26 positions. Not the complete test. I run it in pieces to avoid totally saturating the machine for weeks.

Not sure about the NPS issue yet, but I suspect some cache funny-business. I am losing about 1.5 cpus to waiting, which is not that unexpected with a recursive YBW algorithm, so the question is, where are the other 4.5 cpus going? (NPS speedup is about 14x out of 20x.)

Obviously this will be high on my list, but the performance overall does not look half-bad considering that 4.5 processors are MIA, and 1.5 are wasted by waiting on work...

I have not tuned this search much, yet. All the work has been rewriting it and making it work with an absolute minimum number of lock acquisitions.

These positions were from a real game, as I mentioned, and several are quite problematic in that things change: in one, the best score changes by 0.01, which is a lot of work (and extra nodes) for practically no real benefit. The trees for some are pretty volatile because of this. If you want the spreadsheet, drop me an email and you will see what I mean. For example, there is one position that shows an 11x speedup with 4 processors, while with 2 processors it only ran 1.3x faster. I see this kind of stuff all the time, however.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: parallel speedup and assorted trivia

Post by bob »

I'm still looking at this data, and I have located another issue hurting parallel performance ... move ordering. I looked at the games played Saturday in HGM's tournament and the fail-high-on-first-move statistic stays at 90+%, which is normal. But in the parallel tests above it is hanging at 80% or so. I ran this test using 1GB of memory for hash, and I've run the 20-cpu test 4 times with similar results. I am going to rerun with 16GB (that's the max; the machine has 24GB of RAM) to see what effect (if any) that has. The raw data shows that the average search examined 25 billion nodes, with the largest at 77B. A 1GB hash table (16 bytes per entry) gives only 64M hash entries. Pretty small considering 25B nodes were searched on average.

This might open up a whole new set of issues for parallel search on such machines.
Joost Buijs
Posts: 1563
Joined: Thu Jul 16, 2009 10:47 am
Location: Almere, The Netherlands

Re: parallel speedup and assorted trivia

Post by Joost Buijs »

bob wrote: I am going to rerun with 16GB (that's the max; the machine has 24GB of RAM) to see what effect (if any) that has.
A 20-core machine with only 24GB of RAM is a very strange combination.
These types of machines are usually sold with 128 or 256 GB of RAM.
At least that is what I see at our university.

When you are going to use a larger hash, more in line with the speed of the machine, you probably have to switch to large-page memory, otherwise the heavy load on the TLB will noticeably diminish performance.
I already see an improvement of about 10% with large pages and only 4 GB of hash.
Maybe it won't make a very big difference for Crafty, because I think you're not using the transposition table in quiescence.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: parallel speedup and assorted trivia

Post by bob »

Joost Buijs wrote:
bob wrote: I am going to rerun with 16GB (that's the max; the machine has 24GB of RAM) to see what effect (if any) that has.
A 20-core machine with only 24GB of RAM is a very strange combination.
These types of machines are usually sold with 128 or 256 GB of RAM.
At least that is what I see at our university.

When you are going to use a larger hash, more in line with the speed of the machine, you probably have to switch to large-page memory, otherwise the heavy load on the TLB will noticeably diminish performance.
I already see an improvement of about 10% with large pages and only 4 GB of hash.
Maybe it won't make a very big difference for Crafty, because I think you're not using the transposition table in quiescence.
This is a cluster. Space is limited to fit things into 1U, and $$ also becomes an issue when you buy N of these things.

I am looking at kernel source at the moment, because 2MB pages don't work well beyond 4GB. I need to step up to the 1GB huge page size, which the hardware supports; I just have to figure out how to make the kernel go for that as well.
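
For what it's worth, the allocation side would look something like the sketch below, assuming a kernel recent enough to know the MAP_HUGE_1GB flag and 1GB pages reserved at boot (e.g. "default_hugepagesz=1G hugepagesz=1G hugepages=16" on the kernel command line). This is only a sketch, not Crafty's actual allocation code:

Code:

/* Allocate a block on 1GB huge pages, falling back to normal pages if the
   reservation or kernel support is missing.  Sketch only. */
#include <stddef.h>
#include <sys/mman.h>

#ifndef MAP_HUGE_SHIFT
#define MAP_HUGE_SHIFT 26
#endif
#ifndef MAP_HUGE_1GB
#define MAP_HUGE_1GB (30 << MAP_HUGE_SHIFT)   /* log2(1GB) = 30 */
#endif

void *alloc_huge(size_t bytes) {
    void *mem = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB | MAP_HUGE_1GB,
                     -1, 0);
    if (mem == MAP_FAILED)        /* no 1GB pages reserved, or old kernel */
        mem = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return mem == MAP_FAILED ? NULL : mem;
}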

I am not sure where the 24GB comes from. These processors have 4 memory channels, which would make me think a power of 2 would be the norm.
Joost Buijs
Posts: 1563
Joined: Thu Jul 16, 2009 10:47 am
Location: Almere, The Netherlands

Re: parallel speedup and assorted trivia

Post by Joost Buijs »

bob wrote:
Joost Buijs wrote:
bob wrote: I am going to rerun with 16GB (that's the max; the machine has 24GB of RAM) to see what effect (if any) that has.
A 20-core machine with only 24GB of RAM is a very strange combination.
These types of machines are usually sold with 128 or 256 GB of RAM.
At least that is what I see at our university.

When you are going to use a larger hash, more in line with the speed of the machine, you probably have to switch to large-page memory, otherwise the heavy load on the TLB will noticeably diminish performance.
I already see an improvement of about 10% with large pages and only 4 GB of hash.
Maybe it won't make a very big difference for Crafty, because I think you're not using the transposition table in quiescence.
This is a cluster. Space is limited to fit things into 1U, and $$ also becomes an issue when you buy N of these things.

I am looking at kernel source at the moment, because 2MB pages don't work well beyond 4GB. I need to step up to the 1GB huge page size, which the hardware supports; I just have to figure out how to make the kernel go for that as well.

I am not sure where the 24GB comes from. These processors have 4 memory channels, which would make me think a power of 2 would be the norm.
I can imagine that cost becomes an issue when you want to build a cluster of these machines.

The memory configuration is very flexible; you can use almost any combination.
They probably used 4 DIMMs of 4GB and 4 DIMMs of 2GB, but 6 DIMMs of 4GB is also possible with these boards.

Under Windows large pages are very easy to use; the only drawback is that you have to run your program with admin rights.
About Linux I'm not sure, but I think large page support is available in most Linux distributions without the need to recompile the kernel.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: parallel speedup and assorted trivia

Post by bob »

Joost Buijs wrote:
bob wrote:
Joost Buijs wrote:
bob wrote: I am going to rerun with 16GB (that's the max; the machine has 24GB of RAM) to see what effect (if any) that has.
A 20-core machine with only 24GB of RAM is a very strange combination.
These types of machines are usually sold with 128 or 256 GB of RAM.
At least that is what I see at our university.

When you are going to use a larger hash, more in line with the speed of the machine, you probably have to switch to large-page memory, otherwise the heavy load on the TLB will noticeably diminish performance.
I already see an improvement of about 10% with large pages and only 4 GB of hash.
Maybe it won't make a very big difference for Crafty, because I think you're not using the transposition table in quiescence.
This is a cluster. Space is limited to fit things into 1U, and $$ also becomes an issue when you buy N of these things.

I am looking at kernel source at the moment, because 2MB pages don't work well beyond 4GB. I need to step up to the 1GB huge page size, which the hardware supports; I just have to figure out how to make the kernel go for that as well.

I am not sure where the 24GB comes from. These processors have 4 memory channels, which would make me think a power of 2 would be the norm.
I can imagine that cost becomes an issue when you want to build a cluster of these machines.

The memory configuration is very flexible; you can use almost any combination.
They probably used 4 DIMMs of 4GB and 4 DIMMs of 2GB, but 6 DIMMs of 4GB is also possible with these boards.

Under Windows large pages are very easy to use; the only drawback is that you have to run your program with admin rights.
About Linux I'm not sure, but I think large page support is available in most Linux distributions without the need to recompile the kernel.
2MB pages are easy. 1GB pages I don't know about yet...
Joost Buijs
Posts: 1563
Joined: Thu Jul 16, 2009 10:47 am
Location: Almere, The Netherlands

Re: parallel speedup and assorted trivia

Post by Joost Buijs »

bob wrote: 2MB pages are easy. 1GB pages I don't know about yet...
2MB pages are what I meant; they are already a big improvement over the default 4KB pages.
Windows doesn't have support for 1GB pages yet, and I don't think it will be any different with Linux.
I read something about it here:

http://stackoverflow.com/questions/2795 ... s-on-linux