Joost Buijs wrote:
bob wrote: 2mb pages is easy. 1gb pages I don't know about yet...
2MB pages is what I meant; this is already a big improvement over the default 4KB pages.
Windows doesn't have support for 1GB pages yet, and I don't think it will be any different with Linux.
I read something about it here:
http://stackoverflow.com/questions/2795 ... s-on-linux
Linux does have support. In fact, you can even see what it has configured. I've just got to figure out how to tell it to use 'em.
parallel speedup and assorted trivia
-
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: parallel speedup and assorted trivia
-
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: parallel speedup and assorted trivia
Just noticed that after the last bios twiddling and re-boot (to disable hyper-threading for one thing) this machine is now reporting 128gb of ram.
I'm going to see if anyone objects to taking 64 gb of that and turning it into 64 huge (1gb) pages at start-up. There is a mechanism for making huge pages out of normal pages, but it is anything but efficient. 64 huge pages would be a plus, eliminating a LOT of page-table "walking".
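To put rough numbers on the page-table point, here is a minimal Python sketch that counts how many pages back a 64 GB region at each page size mentioned in this thread (4 KB, 2 MB, 1 GB), plus a small parser for the HugePages_* lines Linux reports in /proc/meminfo. The function names are hypothetical and the sample text is made up for illustration; it is not output from this machine.

```python
GIB = 1024 ** 3

# Page sizes discussed in the thread, in bytes.
PAGE_SIZES = {"4KB": 4 * 1024, "2MB": 2 * 1024 ** 2, "1GB": GIB}


def pages_needed(region_bytes, page_bytes):
    """Number of pages needed to map region_bytes, rounded up."""
    return -(-region_bytes // page_bytes)


def hugepage_info(meminfo_text):
    """Extract the HugePages_* counters and Hugepagesize (kB) from /proc/meminfo text."""
    info = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        if key.startswith("HugePages_") or key == "Hugepagesize":
            info[key] = int(rest.split()[0])
    return info


if __name__ == "__main__":
    for name, size in PAGE_SIZES.items():
        print(name, pages_needed(64 * GIB, size))

    # Illustrative sample only; on a real box read /proc/meminfo instead.
    sample = (
        "MemTotal:       131072000 kB\n"
        "HugePages_Total:      64\n"
        "HugePages_Free:       64\n"
        "Hugepagesize:    1048576 kB\n"
    )
    print(hugepage_info(sample))
```

Fewer pages means fewer distinct translations for the TLB to hold, which is where the reduction in page-table walking would come from.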
-
- Posts: 4366
- Joined: Fri Mar 10, 2006 5:23 am
- Location: http://www.arasanchess.org
Re: parallel speedup and assorted trivia
It would be interesting to test beyond 20 cores. You can lease a quad-CPU AMD system with 64 cores for about $500 a month. AMD is kind of weird because the cores have their own integer units but can share floating-point units. But that shouldn't matter for chess because there is little FPU usage.
I expect many scaling issues to get significantly worse as you move to these big core counts. But as CPUs evolve these should become more common.
--Jon
-
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: parallel speedup and assorted trivia
jdart wrote:
It would be interesting to test beyond 20 cores. You can lease a quad-CPU AMD system with 64 cores for about $500 a month. AMD is kind of weird because the cores have their own integer units but can share floating-point units. But that shouldn't matter for chess because there is little FPU usage.
I expect many scaling issues to get significantly worse as you move to these big core counts. But as CPUs evolve these should become more common.
--Jon
I've worked with the developer lab in the past. They have put together various machines and made them available for several months at a time. Last time I asked (a good while back) they didn't go beyond 4-socket motherboards, but 4 sockets is now quite a number of processors.
So far, with 20 cores, the scaling I am getting is really good. I'm trying to get huge pages up and going on this box, planning on using 64 1GB pages to see what that does. I'd expect a 10% (or somewhat better) NPS increase.
I'll post more scaling data soon. I have run 4 complete runs with 20 cores, and I am 1/3 through a one-core run so I can compute speedups as in the previous post. This version is probably 10% better in terms of NPS scaling than the version at the front of this thread... I have almost no "central locks" left. One for I/O that is only for debugging, and one where I stop threads when reaching a fail high. There I worry about a pair of threads trying to stop other threads, and not everybody can be stopped, as I need one of the results (a fail high) to be backed up. I'm thinking about this one. If I get rid of that, there will be nothing but split block locks, and this has already made it MUCH faster...
-
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
New data
I am going to try this as a test. I keep this data in an excel spreadsheet. But that's not postable here as far as I can tell, so I told excel to save it as windows .txt. After a lot of editing, the following is what I ended up with.
This test is 1 cpu vs 16 cpu, with 4 runs at 16 cpus. I've separated the data into time, then NPS, then total nodes. NPS gives an idea of the max speedup I can expect: the NPS improvement from 1 to 16 cores is 13.7x, so the speedup ceiling is in the 13.7x range. For the speedup, I give both the arithmetic mean of the individual position speedups and, at Kai's suggestion, the geometric mean.
This was run on the 2660 v3 xeon (dual chip, 10 cores per chip, 24MB cache, 128GB ram). For this run I used 16gb of ram (1g hash entries). We are working on getting huge pages to work in the kernel we use, and once that is working I am going to rerun with 64gb of hash and much less TLB thrashing.
Here's the 16 cpu results. I have 20 cpu results as well, and am working on 8, 4 and 2 but they take longer... It is not quite lining up perfectly, and I am not editing it any more. I can make the .xlsx files available on my web page if anyone is interested in something that is easier to read...
1 processor 16 processors 16 processors 16 processors 16 processors
time time speedup time speedup time speedup time speedup
682 222 3.07 292 2.34 194 3.52 230 2.97
1,277 106 12.05 104 12.28 139 9.19 105 12.16
3,878 349 11.11 207 18.73 194 19.99 164 23.65
2,675 321 8.33 347 7.71 168 15.92 169 15.83
3,752 182 20.62 297 12.63 268 14.00 218 17.21
4,370 222 19.68 391 11.18 245 17.84 462 9.46
1,731 163 10.62 106 16.33 90 19.23 129 13.42
3,761 588 6.40 477 7.88 406 9.26 143 26.30
2,295 141 16.28 150 15.30 135 17.00 174 13.19
2,340 157 14.90 319 7.34 140 16.71 143 16.36
5,545 222 24.98 366 15.15 87 63.74 131 42.33
2,321 206 11.27 177 13.11 161 14.42 167 13.90
3,387 219 15.47 190 17.83 227 14.92 271 12.50
2,292 287 7.99 196 11.69 597 3.84 94 24.38
4,656 166 28.05 221 21.07 375 12.42 137 33.99
6,152 152 40.47 370 16.63 240 25.63 277 22.21
1,539 140 10.99 228 6.75 113 13.62 151 10.19
1,453 127 11.44 89 16.33 110 13.21 115 12.63
1,348 141 9.56 93 14.49 90 14.98 94 14.34
1,891 257 7.36 118 16.03 160 11.82 145 13.04
1,677 89 18.84 132 12.70 165 10.16 125 13.42
1,341 297 4.52 461 2.91 256 5.24 332 4.04
954 188 5.07 294 3.24 215 4.44 171 5.58
1,503 109 13.79 169 8.89 120 12.53 118 12.74
62,820 5,051 12.44 5,794 10.84 4,895 12.83 4,265 14.73 mean
11.75 10.52 12.48 13.75 geometric mean
NPS(M) NPS(M) speedup NPS(M) speedup NPS(M) speedup NPS(M) speedup
4.9 67.3 13.68 67.9 13.80 69.7 14.17 70.5 14.33
5.1 70.8 13.88 71.6 14.04 72.8 14.27 73.1 14.33
5.0 74.1 14.79 74.0 14.77 72.8 14.53 74.2 14.81
5.4 79.1 14.76 75.7 14.12 77.1 14.38 76.8 14.33
5.2 72.6 14.02 72.8 14.05 72.8 14.05 74.3 14.34
5.1 70.8 13.88 71.6 14.04 71.5 14.02 69.5 13.63
5.2 73.1 14.11 72.5 14.00 72.3 13.96 72.1 13.92
5.1 75.1 14.73 74.2 14.55 73.5 14.41 71.5 14.02
5.2 74.6 14.40 72.6 14.02 72.0 13.90 73.4 14.17
5.4 76.0 14.18 76.1 14.20 74.6 13.92 78.1 14.57
5.2 74.0 14.29 74.7 14.42 71.4 13.78 75.8 14.63
5.5 77.0 14.13 75.1 13.78 72.9 13.38 75.8 13.91
5.0 70.3 14.03 72.1 14.39 71.6 14.29 72.9 14.55
5.2 74.5 14.38 70.1 13.53 70.5 13.61 71.1 13.73
5.1 73.5 14.41 71.8 14.08 69.6 13.65 71.2 13.96
5.0 68.9 13.75 70.7 14.11 69.7 13.91 69.6 13.89
5.0 68.0 13.57 68.3 13.63 65.5 13.07 67.3 13.43
4.8 61.5 12.73 58.2 12.05 60.0 12.42 58.2 12.05
4.9 60.1 12.22 59.9 12.17 58.4 11.87 60.2 12.24
5.1 62.0 12.16 64.8 12.71 60.4 11.84 65.4 12.82
5.0 60.8 12.14 62.4 12.46 60.9 12.16 62.0 12.38
5.4 70.4 13.13 72.5 13.53 68.8 12.84 73.2 13.66
5.4 73.3 13.68 73.6 13.73 71.2 13.28 70.8 13.21
5.3 64.0 12.14 64.8 12.30 64.3 12.20 61.8 11.73
5.1 70.5 13.72 70.3 13.69 69.3 13.50 70.4 13.70 mean
Nodes(M) Nodes(M) change Nodes(M) change Nodes(M) change Nodes(M) change
3,372 14,991 344.6% 19,856 488.8% 13,527 301.2% 16,273 382.6%
6,533 7,560 15.7% 7,500 14.8% 10,148 55.3% 7,703 17.9%
19,299 25,888 34.1% 15,325 -20.6% 14,141 -26.7% 12,225 -36.7%
14,360 25,427 77.1% 26,288 83.1% 12,986 -9.6% 13,031 -9.3%
19,318 13,236 -31.5% 21,694 12.3% 19,580 1.4% 16,224 -16.0%
22,310 15,750 -29.4% 28,079 25.9% 17,526 -21.4% 32,164 44.2%
9,037 11,990 32.7% 7,725 -14.5% 6,553 -27.5% 9,324 3.2%
19,093 44,228 131.6% 35,381 85.3% 29,884 56.5% 10,300 -46.1%
11,805 10,553 -10.6% 10,892 -7.7% 9,723 -17.6% 12,817 8.6%
12,630 11,932 -5.5% 24,296 92.4% 10,477 -17.0% 11,187 -11.4%
28,936 16,449 -43.2% 27,377 -5.4% 6,269 -78.3% 9,939 -65.7%
12,652 15,916 25.8% 13,298 5.1% 11,760 -7.1% 12,729 0.6%
17,105 15,455 -9.6% 13,775 -19.5% 16,313 -4.6% 19,752 15.5%
11,798 21,404 81.4% 13,800 17.0% 42,140 257.2% 6,715 -43.1%
23,803 12,242 -48.6% 15,892 -33.2% 26,148 9.9% 9,782 -58.9%
30,730 10,483 -65.9% 26,156 -14.9% 16,768 -45.4% 19,291 -37.2%
7,678 9,541 24.3% 15,621 103.5% 7,461 -2.8% 10,160 32.3%
7,074 7,828 10.7% 5,200 -26.5% 6,646 -6.1% 6,723 -5.0%
6,630 8,529 28.6% 5,610 -15.4% 5,297 -20.1% 5,672 -14.4%
9,587 15,958 66.5% 7,667 -20.0% 9,697 1.1% 9,522 -0.7%
8,420 5,425 -35.6% 8,253 -2.0% 10,065 19.5% 7,809 -7.3%
7,213 20,968 190.7% 33,479 364.1% 17,657 144.8% 24,346 237.5%
5,097 13,834 171.4% 21,699 325.7% 15,363 201.4% 12,174 138.8%
7,909 6,971 -11.9% 11,001 39.1% 7,753 -2.0% 7,342 -7.2%
13,433 15,107 12.5% 17,328 29.0% 14,328 6.7% 12,634 -6.0%
mean 20,897 14,715 14,276 12,866
geometric mean 19,135 20,270 21,236 25,266
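For anyone who wants to recompute the summary figures from the raw columns, here is a minimal Python sketch. It uses only the first four positions of the first 16-processor run from the table above; the function names are my own, and this is not the spreadsheet's actual formula set. It shows three summaries that can differ noticeably on data this noisy: the arithmetic mean of per-position speedups, the geometric mean, and the ratio of total times (the 12.44 on the first run's summary row matches 62,820 / 5,051). The last helper reproduces the "change" column of the node table.

```python
import math

# First four positions, first 16-processor run, times in seconds (from the table).
T1 = [682, 1277, 3878, 2675]
T16 = [222, 106, 349, 321]


def speedups(base, parallel):
    """Per-position speedup: 1-cpu time divided by N-cpu time."""
    return [b / p for b, p in zip(base, parallel)]


def arithmetic_mean(xs):
    return sum(xs) / len(xs)


def geometric_mean(xs):
    """n-th root of the product, computed via logs to avoid overflow."""
    return math.exp(sum(math.log(x) for x in xs) / len(xs))


def ratio_of_totals(base, parallel):
    """Speedup of the whole test set: total 1-cpu time over total N-cpu time."""
    return sum(base) / sum(parallel)


def node_change_pct(base_nodes, parallel_nodes):
    """Percentage change in nodes searched relative to the 1-cpu search."""
    return 100.0 * (parallel_nodes - base_nodes) / base_nodes


if __name__ == "__main__":
    s = speedups(T1, T16)
    print([round(x, 2) for x in s])
    print(round(arithmetic_mean(s), 2), round(geometric_mean(s), 2))
    print(round(ratio_of_totals(T1, T16), 2))
    print(round(node_change_pct(3372, 14991), 1))
```

The geometric mean is never larger than the arithmetic mean, so it damps the effect of outliers such as the 63.74x position.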
-
- Posts: 38
- Joined: Tue Jul 01, 2008 9:36 pm
Re: New data
Thanks a lot!
Putting files on your web site is fine. Another possibility is saving as *.csv and pasting here.
-
- Posts: 3226
- Joined: Wed May 06, 2009 10:31 pm
- Location: Fuquay-Varina, North Carolina
Re: New data
Thanks for sharing this information Bob. The distribution of the time to depth for the 16 cpus samples looks like what I have seen at 8 cpus for various engines. No surprise there, but it is nice to have some confirmation that my data is not unusual.
-
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: New data
Matthias Hartwich wrote:
Thanks a lot!
Putting files on your web site is fine. Another possibility is saving as *.csv and pasting here.
That looks like hell. Already tried it. I don't know why Microsoft is so damned enamoured with the <tab> character, but using it as a separator just doesn't work.
I'll add the spread sheets to my web page right now. Notice that only 1-16 and 1-20 are done. I am going to try to run 1-8 tonight, but doing 4 runs starts to stretch the time out...
The files can be found at www.cis.uab.edu/hyatt/crafty/SMP
In that directory there are two files, smp-16.xlsx and mp-20.xlsx
If you notice any errors in formulas or numbers, let me know. I did this pretty quickly. The raw data is just a cut and paste, so those are OK, but many of the columns are formulas. I checked them all, but anything can happen.
Last edited by bob on Wed Jun 10, 2015 11:59 pm, edited 1 time in total.
-
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: New data
Adam Hair wrote:
Thanks for sharing this information Bob. The distribution of the time to depth for the 16 cpus samples looks like what I have seen at 8 cpus for various engines. No surprise there, but it is nice to have some confirmation that my data is not unusual.
For my dissertation I did something like 32 runs with 16 processors. Then averaged each run. Threw out the best 2 and worst 2, and then used the rest. As you can see the times fluctuate. When I ran with just 1gb of hash, I had one that was terrible: a speedup for one position of .15x or something like that. Didn't see such wild fluctuations with this 16gb run. There are lots of fluctuations, but pretty sane fluctuations based on all the SMP testing I have done/seen over the years. There are a few positions that will produce almost exactly the same speedup run after run. There are some that look like random numbers. Most are somewhere in the middle of that...
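The "throw out the best and worst, average the rest" step is a trimmed mean. A minimal Python sketch follows; the function name is hypothetical. The dissertation procedure described above dropped 2 from each end of about 32 runs; since only four runs are posted in this thread, the demo drops 1 from each end of the four per-run speedup figures on the table's "mean" row, purely as an illustration.

```python
def trimmed_mean(values, drop_each_end):
    """Mean of values after discarding the lowest and highest drop_each_end entries."""
    if drop_each_end < 0 or 2 * drop_each_end >= len(values):
        raise ValueError("nothing left after trimming")
    kept = sorted(values)[drop_each_end:len(values) - drop_each_end]
    return sum(kept) / len(kept)


if __name__ == "__main__":
    # Per-run speedups for the four 16-processor runs (from the "mean" row).
    run_speedups = [12.44, 10.84, 12.83, 14.73]
    print(trimmed_mean(run_speedups, 1))
```

With 32 runs, `trimmed_mean(runs, 2)` would average the middle 28.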
-
- Posts: 38
- Joined: Tue Jul 01, 2008 9:36 pm
Re: New data
bob wrote:
Matthias Hartwich wrote:
Thanks a lot!
Putting files on your web site is fine. Another possibility is saving as *.csv and pasting here.
That looks like hell. Already tried it. I don't know why Microsoft is so damned enamoured with the <tab> character, but using it as a separator just doesn't work.
Yes, that's why I use ';' as separator.
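As a small illustration of the ';' suggestion, here is a Python sketch that writes and reads semicolon-separated rows with the standard csv module. The column names are hypothetical; the two data rows are taken from the first 16-processor run in the table.

```python
import csv
import io

# Header names are made up for this example; the numbers come from the table.
ROWS = [
    ["time_1cpu", "time_16cpu", "speedup"],
    [682, 222, 3.07],
    [1277, 106, 12.05],
]


def to_semicolon_csv(rows):
    """Serialize rows as text using ';' as the field separator."""
    buf = io.StringIO()
    csv.writer(buf, delimiter=";", lineterminator="\n").writerows(rows)
    return buf.getvalue()


def from_semicolon_csv(text):
    """Parse ';'-separated text back into a list of string rows."""
    return list(csv.reader(io.StringIO(text), delimiter=";"))


if __name__ == "__main__":
    text = to_semicolon_csv(ROWS)
    print(text)
    print(from_semicolon_csv(text))
```

A semicolon also sidesteps the thousands-separator commas that appear in figures such as 1,277.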