This box is a 2 x 10 core Intel Xeon E5-2660 v3 running at 2.6GHz (not exactly accurate, more on that in a minute), with 24MB of L3 cache and 24GB of DRAM. When it first arrived I started testing and found some interesting things.
It has both turbo-boost and hyper-threading, obviously. If I run Crafty on this box using 40 threads, the processor cores all run at 2.6GHz non-stop, never changing. If I disable hyper-threading (which is how it is configured now, in fact) and run Crafty using 20 threads, it runs at 2.9GHz continuously, never going up or down (I actually wrote a small program to continuously check the frequencies and report any instance of more than +/- 50MHz variance, and ran it for 12 hours with no "warnings"). So far, so good. 2.9GHz x 20 is not bad. But the problems start soon after. If I run fewer than 20 threads, the processors will run at up to 3.2GHz. I did not take the time to figure out exactly how they vary; the fact that they were varying at all is a problem if one wants to do SMP speedup measurements. I have seen this mistake made here multiple times in the past and didn't want it in my data.
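For reference, a minimal sketch of the sort of frequency monitor I mean (not my actual program; this one assumes the Linux cpufreq sysfs files exist and report kHz via scaling_cur_freq):

Code: Select all

/* freqwatch.c - poll per-core frequencies and warn on +/- 50MHz drift.
   A minimal sketch, assuming the cpufreq sysfs interface is present.
   Compile: cc -std=c99 -o freqwatch freqwatch.c */
#include <stdio.h>
#include <unistd.h>

#define NCPUS    20        /* cores to watch (HT disabled)        */
#define BASELINE 2900000   /* expected frequency in kHz (2.9GHz)  */
#define SLOP     50000     /* allowed variance in kHz (+/- 50MHz) */

int main(void) {
  char path[128];
  long khz;
  FILE *f;
  for (;;) {
    for (int cpu = 0; cpu < NCPUS; cpu++) {
      snprintf(path, sizeof path,
               "/sys/devices/system/cpu/cpu%d/cpufreq/scaling_cur_freq", cpu);
      if ((f = fopen(path, "r")) == NULL)
        continue;                       /* cpu offline or no cpufreq */
      if (fscanf(f, "%ld", &khz) == 1 &&
          (khz < BASELINE - SLOP || khz > BASELINE + SLOP))
        printf("warning: cpu %d at %ld kHz\n", cpu, khz);
      fclose(f);
    }
    sleep(1);                           /* poll once per second */
  }
  return 0;
}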
What I decided to do was to run this set of positions with 1, 2, 4, 8, 16 and 20 cores. When I kicked off one of those tests, the script actually ran two instances of Crafty. It would first start a dummy run using 20 - N threads, where N is the number I wanted to test. It would let that get going for a few seconds (just running one position to infinite search depth) and then start the actual test run using N threads. This way I always had 20 threads running, all processors locked at 2.9GHz, and I could directly compare 1 vs 2, up to 1 vs 20, without introducing the error where the 1-thread run would go at 3.2GHz while the 20-thread run went at 2.9GHz.
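In outline, the driver logic looks something like this (a sketch of the idea only, not my actual script; the crafty command-line arguments and input file names are hypothetical placeholders):

Code: Select all

/* driver.c - keep all 20 cores busy while measuring an N-thread run.
   A sketch; "mt=N", dummy.in and test.in are hypothetical placeholders. */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char **argv) {
  char cmd[256];
  if (argc < 2) {
    fprintf(stderr, "usage: %s threads\n", argv[0]);
    return 1;
  }
  int n = atoi(argv[1]);                /* threads to actually test */
  if (n < 20) {
    /* dummy load: 20-N threads on one position at infinite depth,
       nice +19 so it yields to the measured run */
    snprintf(cmd, sizeof cmd,
             "nice -n 19 ./crafty mt=%d < dummy.in > /dev/null &", 20 - n);
    system(cmd);
    sleep(5);                           /* let the dummy spin up */
  }
  snprintf(cmd, sizeof cmd, "./crafty mt=%d < test.in > test-%d.log", n, n);
  return system(cmd);                   /* the measured run */
}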
My next step is going to be to see if I can't modify the Linux kernel "governor" that controls the cpu speed, and just lock them at 2.9GHz no matter what, since the dummy runs burn a lot of unnecessary compute cycles. This test takes about 200,000 elapsed seconds, which is getting close to 2.5 days. I'd like to be able to run without soaking up 20 cores when testing on just one. I do run the "dummy" at nice +19, so once I get past the 20-core run, others can actually do stuff on the machine without hurting my accuracy significantly.
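It may not even need a kernel modification; something like the following (a sketch, assuming root and a cpufreq driver that honors the scaling_min_freq/scaling_max_freq clamps; whether it pins at exactly 2.9GHz depends on how the driver exposes turbo states) would do it from userspace:

Code: Select all

/* lockfreq.c - try to pin every core at 2.9GHz via the cpufreq sysfs
   interface. A sketch under the assumptions stated above. */
#include <stdio.h>

#define NCPUS 20
#define KHZ   "2900000"   /* 2.9GHz expressed in kHz */

static void put(int cpu, const char *file, const char *val) {
  char path[128];
  FILE *f;
  snprintf(path, sizeof path,
           "/sys/devices/system/cpu/cpu%d/cpufreq/%s", cpu, file);
  if ((f = fopen(path, "w")) == NULL) {
    perror(path);
    return;
  }
  fputs(val, f);
  fclose(f);
}

int main(void) {
  for (int cpu = 0; cpu < NCPUS; cpu++) {
    put(cpu, "scaling_governor", "performance");
    put(cpu, "scaling_min_freq", KHZ);   /* clamp both ends so the   */
    put(cpu, "scaling_max_freq", KHZ);   /* core can't turbo or idle */
  }
  return 0;
}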
I'm going to provide 3 sets of summary data, for 1, 2, 4, 8, 16 and 20 processors. The data will be pure speedup (time to depth; all runs go to the same depth, but each position has its own depth), NPS speedup (which gives an idea of the maximum theoretical speedup, since if the NPS doesn't run 20x faster on 20 cores, the search can't run 20x faster in time to depth either), and finally total nodes searched for each run, to see how the tree size changes (interesting data here, by the way).
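To be explicit about how the rows below are computed:

Code: Select all

speedup(N)     = time(1 cpu) / time(N cpus)    (same depth, i.e. time to depth)
NPS speedup(N) = NPS(N cpus) / NPS(1 cpu)      (upper bound on speedup(N))
avg nodes      = total nodes searched, averaged over the positions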
This is run with the NEW version of Crafty, which has a remarkably lightweight splitting algorithm that completely avoids any sort of central locking. A local split block is held for maybe a dozen instructions (at the asm level) max, which is far more effective than in the previous versions. NPS scaling is still a ways from perfect here, as the data will show. Here it is:
Code: Select all
                 1p      2p      4p      8p     16p     20p
speedup           -    1.64    3.68    6.64   10.76   13.17
NPS speedup       -    1.94    3.83    7.25   12.27   14.00
avg nodes     18.3B   21.7B   19.1B   20.3B   21.3B   25.2B
(1) I have not yet tweaked anything to see if I can improve that ~77% NPS scaling (16 cpus running just 12.27x faster). I'd expect that can be better.
(2) The speedup numbers are VERY good when compared to the NPS speedup, better than I had expected, and better than I had seen previously. But as I said, this parallel search is completely new and has been in development for going on a year now, and it is a clean departure from the old approach. I have a couple more ideas to work on, with the goal of using a VERY minimal locking strategy, since locks kill performance (see the sketch after this list).
(3) The average nodes per position doesn't grow any faster than it did with Cray Blitz (here the 20-core tree is only about 38% larger than the 1-core tree). There has been speculation for years that all the pruning, reductions and such have been making speedup numbers drop over the years. I'm no longer buying that argument.
(4) In looking at these numbers, if the NPS can be driven to near-optimal scaling, then the parallel search is going to be "right there" as well. For 20 cores the speedup was 13.17, while the NPS speedup was 14.0, which is an absolute upper bound on the speedup. My old linear approximation, speedup = 1 + 0.7 * (N - 1), would predict 1 + 19*0.7 = 14.3x for 20 cores. Actual is 13.17, so 0.65 rather than 0.7 is a better "approximation" here (1 + 19*0.65 = 13.35). For 16 cores, the old formula predicts 11.5, which is a little high compared to the measured 10.76; the 0.65 coefficient gives 10.75, almost exact.
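To give a flavor of what "minimal locking" means here (a sketch of the general technique, not Crafty's actual code): the only lock is a short per-split-block spinlock, held just long enough to grab the next move, so the critical section is a handful of instructions and there is no central lock to fight over.

Code: Select all

/* A sketch of a brief per-split-block critical section; illustrative
   only, not Crafty's actual code. Requires C11 atomics; initialize
   lock with ATOMIC_FLAG_INIT before use. */
#include <stdatomic.h>

typedef struct {
  atomic_flag lock;      /* per-split-block spinlock, no central lock */
  int next_move;         /* index of next move to hand out            */
  int num_moves;         /* moves at this split point                 */
  int moves[256];        /* the move list itself                      */
} SPLIT_BLOCK;

/* return next move to search, or -1 if this split point is done */
int NextMove(SPLIT_BLOCK *sb) {
  int move = -1;
  while (atomic_flag_test_and_set_explicit(&sb->lock, memory_order_acquire))
    ;                                    /* spin; contention is brief */
  if (sb->next_move < sb->num_moves)     /* critical section: just a  */
    move = sb->moves[sb->next_move++];   /* compare and an increment  */
  atomic_flag_clear_explicit(&sb->lock, memory_order_release);
  return move;
}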
I am going to work on this NPS issue for a bit and then re-run, but at least these results provide some interesting information. I will note that I normalized on the 20-cpu run: I wanted a search that took long enough at 20 cores to be meaningful, so I targeted 3-6 minutes per position and, for each position, twiddled with the depth until Crafty was averaging somewhere in that range. That makes the 1-cpu run take absolutely forever.
If anyone is interested, send me an email and I'll be happy to email you the actual spreadsheet, which gives all the above data both position by position and as an average (which is all I presented above). In the spreadsheet I even compute the speedup going from N to 2N cores in addition to the speedup from 1 to N cores, so that I could see how the search efficiency changed moving from 4 to 8 cores, then 8 to 16. Of course, you can use the above data to calculate that yourself by working backward. If you'd prefer the raw data, I have log files for each run, but it IS a lot of data. I have a simple tool that eats a log file and gives me the time, NPS and nodes searched for each position, which I can cut and paste directly into Excel.
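Something in the spirit of that tool (a sketch only; the "time=.. nps=.. nodes=.." line format matched here is a hypothetical stand-in for the actual log format):

Code: Select all

/* logsum.c - pull time, NPS and node counts out of a log file into
   tab-separated columns for a spreadsheet. A sketch; the log-line
   format is a hypothetical stand-in. Usage: ./logsum < run.log */
#include <stdio.h>

int main(void) {
  char line[1024];
  double secs, nps, nodes;
  printf("time\tnps\tnodes\n");
  while (fgets(line, sizeof line, stdin))
    if (sscanf(line, " time=%lf nps=%lf nodes=%lf",
               &secs, &nps, &nodes) == 3)
      printf("%.1f\t%.0f\t%.0f\n", secs, nps, nodes);
  return 0;
}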
Finally, this data lacks one sanity test that I will do soon: I want to run the whole test again and average the two runs. You can look at the spreadsheet and see anomalies, with super-linear speedups on some positions and horrible speedups on others. At least a couple of the positions are problematic in that they produce poor speedup across the board, for example 1.3x on 2 cores, 2x on 4, 3x on 8, 3x on 16 and 5x on 20.
But this is at least a set of carefully measured results. One thing I did do differently: after each position (these are consecutive positions from a game between MChess Pro and Cray Blitz, somewhere around 1992 or so), I cleared the hash table. When I was trying to set the depths to take 3-6 minutes, it was just about impossible to do otherwise, since if I changed the depth on an earlier position, later positions would suddenly go faster or slower as the hash hits changed. So actual game speedups should be a bit better than these numbers, since in a game the hash table carries across from position to position, producing greater depth (and greater depth improves speedup).
If you have questions, feel free to ask. If you want the raw data or the spreadsheet, ask via email (my UAB email).