parallel speedup and assorted trivia

bob · Post by **bob** » Tue Jun 09, 2015 5:01 pm

Joost Buijs wrote:
bob wrote: 2mb pages is easy. 1gb pages I don't know about yet...
2MB pages is what I meant; this is already a big improvement over the default 4KB pages.
Windows doesn't have support for 1GB pages yet, and I don't think it will be any different with Linux.
I read something about it here:

http://stackoverflow.com/questions/2795 ... s-on-linux

Linux does have support. In fact, you can even see what it has configured. I've just got to figure out how to tell it to use 'em.

bob · Post by **bob** » Tue Jun 09, 2015 9:50 pm

Just noticed that after the last bios twiddling and re-boot (to disable hyper-threading for one thing) this machine is now reporting 128gb of ram.

I'm going to see if anyone objects to taking 64 gb of that and turning it into 64 huge (1gb) pages at start-up. There is a mechanism for making huge pages out of normal pages, but it is anything but efficient. 64 huge pages would be a plus, eliminating a LOT of page-table "walking".
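As a rough illustration of why 64 huge pages cut down on page-table walking, here is a minimal Python sketch that counts how many pages are needed to map 64 GB at each page size and pulls the huge-page lines out of `/proc/meminfo`-style text. The field names (`HugePages_Total`, `Hugepagesize`) are the ones Linux reports; the sample text and the helper names `pages_needed` and `hugepage_fields` are made up for the example.

```python
import re

GIB = 1024 ** 3
MIB = 1024 ** 2

def pages_needed(region_bytes, page_bytes):
    """Number of pages required to map a region (ceiling division)."""
    return -(-region_bytes // page_bytes)

def hugepage_fields(meminfo_text):
    """Extract the huge-page counters from /proc/meminfo-style text."""
    fields = {}
    for line in meminfo_text.splitlines():
        m = re.match(r"(HugePages_\w+|Hugepagesize):\s+(\d+)", line)
        if m:
            fields[m.group(1)] = int(m.group(2))
    return fields

# Illustrative sample only, not output from the machine in this thread.
SAMPLE = """\
MemTotal:       131915772 kB
HugePages_Total:      64
HugePages_Free:       64
Hugepagesize:    1048576 kB
"""

for label, size in (("4KB", 4096), ("2MB", 2 * MIB), ("1GB", GIB)):
    print(f"{label} pages needed for 64 GB: {pages_needed(64 * GIB, size):,}")

print(hugepage_fields(SAMPLE))
```

On a Linux box the same parser can be fed `open("/proc/meminfo").read()` to see what the kernel currently has configured.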

jdart · Post by **jdart** » Wed Jun 10, 2015 12:51 am

It would be interesting to test beyond 20 cores. You can lease a quad-CPU AMD system with 64 cores for about $500 a month. AMD is kind of weird because the cores have their own integer units but can share floating-point units. But that shouldn't matter for chess because there is little FPU usage.

I expect many scaling issues to get significantly worse as you move to these big core counts. But as CPUs evolve these should become more common.

--Jon

bob · Post by **bob** » Wed Jun 10, 2015 3:27 am

jdart wrote:It would be interesting to test beyond 20 cores. You can lease a quad-CPU AMD system with 64 cores for about $500 a month. AMD is kind of weird because the cores have their own integer units but can share floating-point units. But that shouldn't matter for chess because there is little FPU usage.

I expect many scaling issues to get significantly worse as you move to these big core counts. But as CPUs evolve these should become more common.

--Jon

I've worked with the developer lab in the past. They have put together various machines and made them available for several months at a time. Last time I asked (a good while back) they didn't go beyond 4 socket motherboards, but 4 sockets is now quite a number of processors.

So far, with 20 cores, the scaling I am getting is really good. I'm trying to get huge pages up and going on this box, planning on using 64 1GB pages to see what that does. I'd expect a 10% (or somewhat better) NPS increase.

I'll post more scaling data soon. I have run 4 complete runs with 20 cores, and I am 1/3 through a one core run so I can compute speedups as in the previous post. This version is probably 10% better in terms of NPS scaling than the version at the front of this thread... I have almost no "central locks" left. One for I/O that is only for debugging, and one where I stop threads when reaching a fail high. There I worry about a pair of threads trying to stop other threads, and everybody can't be stopped as I need one of the results to be backed up (a fail high). I'm thinking about this one. If I get rid of that, there will be nothing but split block locks, and this has already made it MUCH faster...

bob · Post by **bob** » Wed Jun 10, 2015 7:02 pm

I am going to try this as a test. I keep this data in an excel spreadsheet. But that's not postable here as far as I can tell, so I told excel to save it as windows .txt. After a lot of editing, the following is what I ended up with.

This test is 1 cpu vs 16 cpu. 4 runs with 16 cpus. I've separated the data into time, then nps, then total nodes. NPS gives an idea of the max speedup I can expect, which is about 13.7x, since that is the NPS improvement from 1 to 16 cores. For the speedup, I give both the arithmetic mean of the individual position speedups and, at Kai's suggestion, the geometric mean.
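For readers who want to reproduce the summary rows, here is a small Python sketch using the 1-processor times and the first 16-processor run from the table. It computes three summaries: the arithmetic mean of the per-position speedups, their geometric mean, and the ratio of the time totals. The function name `speedup_summaries` is just for this example. With these rounded times the geometric mean comes out near 11.75 and the ratio of totals near 12.44, while the plain average of the per-position ratios is somewhat higher (about 13.9), which suggests the row labelled "mean" in the table is the ratio of the totals.

```python
import math

# Times to depth in seconds, copied from the table: the 1-processor column
# and the first of the four 16-processor runs.
T1 = [682, 1277, 3878, 2675, 3752, 4370, 1731, 3761, 2295, 2340, 5545, 2321,
      3387, 2292, 4656, 6152, 1539, 1453, 1348, 1891, 1677, 1341, 954, 1503]
T16 = [222, 106, 349, 321, 182, 222, 163, 588, 141, 157, 222, 206,
       219, 287, 166, 152, 140, 127, 141, 257, 89, 297, 188, 109]

def speedup_summaries(t_serial, t_parallel):
    """Return (arithmetic mean, geometric mean, ratio of totals) of speedups."""
    ratios = [s / p for s, p in zip(t_serial, t_parallel)]
    arithmetic = sum(ratios) / len(ratios)
    geometric = math.exp(sum(math.log(r) for r in ratios) / len(ratios))
    total_ratio = sum(t_serial) / sum(t_parallel)
    return arithmetic, geometric, total_ratio

arith, geo, total = speedup_summaries(T1, T16)
print(f"arithmetic mean of speedups: {arith:.2f}")
print(f"geometric mean of speedups:  {geo:.2f}")
print(f"ratio of total times:        {total:.2f}")
```

The geometric mean is less sensitive to a single position with a huge speedup (such as the 40x row), which is the usual argument for reporting it alongside the arithmetic mean.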

This was run on the 2660 v3 xeon (dual chip, 10 cores per chip, 24MB cache, 128GB ram). For this run I used 16gb of ram (1g hash entries). We are working on getting huge pages to work in the kernel we use, and once that is working I am going to rerun with 64gb of hash and much less TLB thrashing.

Here are the 16 cpu results. I have 20 cpu results as well, and am working on 8, 4 and 2 but they take longer... It is not quite lining up perfectly, and I am not editing it any more. I can make the .xlsx files available on my web page if anyone is interested in something that is easier to read...


																			
1 processor	16 processors		16 processors		16 processors		16 processors
time		time  speedup		time  speedup		time  speedup		time  speedup
												
  682		222	 3.07		292	 2.34		194	 3.52		230	 2.97
1,277		106	12.05		104	12.28		139	 9.19		105	12.16
3,878		349	11.11		207	18.73		194	19.99		164	23.65
2,675		321	 8.33		347	 7.71		168	15.92		169	15.83
3,752		182	20.62		297	12.63		268	14.00		218	17.21
4,370		222	19.68		391	11.18		245	17.84		462	 9.46	
1,731		163	10.62		106	16.33		90	19.23		129	13.42	
3,761		588	 6.40		477	 7.88		406	 9.26		143	26.30	
2,295		141	16.28		150	15.30		135	17.00		174	13.19	
2,340		157	14.90		319	 7.34		140	16.71		143	16.36	
5,545		222	24.98		366	15.15		87	63.74		131	42.33	
2,321		206	11.27		177	13.11		161	14.42		167	13.90	
3,387		219	15.47		190	17.83		227	14.92		271	12.50	
2,292		287	 7.99		196	11.69		597	 3.84		94	24.38	
4,656		166	28.05		221	21.07		375	12.42		137	33.99	
6,152		152	40.47		370	16.63		240	25.63		277	22.21	
1,539		140	10.99		228	 6.75		113	13.62		151	10.19	
1,453		127	11.44		89	16.33		110	13.21		115	12.63	
1,348		141	 9.56		93	14.49		90	14.98		94	14.34	
1,891		257	 7.36		118	16.03		160	11.82		145	13.04	
1,677		89	18.84		132	12.70		165	10.16		125	13.42	
1,341		297	 4.52		461	 2.91		256	 5.24		332	 4.04	
  954		188	 5.07		294	 3.24		215	 4.44		171	 5.58	
1,503		109	13.79		169	 8.89		120	12.53		118	12.74
												
62,820        5,051     12.44		5,794	10.84		4,895	12.83		4,265	14.73    mean
			11.75			10.52			12.48			13.75    geometric mean
																			
NPS(M)		NPS(M) speedup		NPS(M) speedup		NPS(M) speedup		NPS(M) speedup
												
4.9		67.3	13.68		67.9	13.80		69.7	14.17		70.5	14.33
5.1		70.8	13.88		71.6	14.04		72.8	14.27		73.1	14.33
5.0		74.1	14.79		74.0	14.77		72.8	14.53		74.2	14.81
5.4		79.1	14.76		75.7	14.12		77.1	14.38		76.8	14.33
5.2		72.6	14.02		72.8	14.05		72.8	14.05		74.3	14.34
5.1		70.8	13.88		71.6	14.04		71.5	14.02		69.5	13.63
5.2		73.1	14.11		72.5	14.00		72.3	13.96		72.1	13.92
5.1		75.1	14.73		74.2	14.55		73.5	14.41		71.5	14.02
5.2		74.6	14.40		72.6	14.02		72.0	13.90		73.4	14.17
5.4		76.0	14.18		76.1	14.20		74.6	13.92		78.1	14.57
5.2		74.0	14.29		74.7	14.42		71.4	13.78		75.8	14.63
5.5		77.0	14.13		75.1	13.78		72.9	13.38		75.8	13.91
5.0		70.3	14.03		72.1	14.39		71.6	14.29		72.9	14.55
5.2		74.5	14.38		70.1	13.53		70.5	13.61		71.1	13.73
5.1		73.5	14.41		71.8	14.08		69.6	13.65		71.2	13.96
5.0		68.9	13.75		70.7	14.11		69.7	13.91		69.6	13.89
5.0		68.0	13.57		68.3	13.63		65.5	13.07		67.3	13.43
4.8		61.5	12.73		58.2	12.05		60.0	12.42		58.2	12.05
4.9		60.1	12.22		59.9	12.17		58.4	11.87		60.2	12.24
5.1		62.0	12.16		64.8	12.71		60.4	11.84		65.4	12.82
5.0		60.8	12.14		62.4	12.46		60.9	12.16		62.0	12.38
5.4		70.4	13.13		72.5	13.53		68.8	12.84		73.2	13.66
5.4		73.3	13.68		73.6	13.73		71.2	13.28		70.8	13.21
5.3		64.0	12.14		64.8	12.30		64.3	12.20		61.8	11.73
												
5.1		70.5	13.72		70.3	13.69		69.3	13.50		70.4	13.70    mean
											
										
Nodes(M)	Nodes(M) change		Nodes(M) change		Nodes(M) change		Nodes(M) change
																		
 3,372		14,991	344.6%		19,856	488.8%		13,527	301.2%		16,273	382.6%
 6,533		7,560	 15.7%		7,500	 14.8%		10,148	 55.3%		7,703	 17.9%	
19,299		25,888	 34.1%		15,325	-20.6%		14,141	-26.7%		12,225	-36.7%
14,360		25,427	 77.1%		26,288	 83.1%		12,986	 -9.6%		13,031	 -9.3%
19,318		13,236	-31.5%		21,694	 12.3%		19,580	  1.4%		16,224	-16.0%
22,310		15,750	-29.4%		28,079	 25.9%		17,526	-21.4%		32,164	 44.2%
 9,037		11,990	 32.7%		7,725	-14.5%		6,553	-27.5%		9,324	  3.2%
19,093		44,228	131.6%		35,381	 85.3%		29,884	 56.5%		10,300	-46.1%
11,805		10,553	-10.6%		10,892	 -7.7%		9,723	-17.6%		12,817	  8.6%
12,630		11,932	 -5.5%		24,296	 92.4%		10,477	-17.0%		11,187	-11.4%
28,936		16,449	-43.2%		27,377	 -5.4%		6,269	-78.3%		9,939	-65.7%
12,652		15,916	 25.8%		13,298	  5.1%		11,760	 -7.1%		12,729	  0.6%
17,105		15,455	 -9.6%		13,775	-19.5%		16,313	 -4.6%		19,752	 15.5%
11,798		21,404	 81.4%		13,800	 17.0%		42,140	257.2%		6,715	-43.1%
23,803		12,242	-48.6%		15,892	-33.2%		26,148	  9.9%		9,782	-58.9%
30,730		10,483	-65.9%		26,156	-14.9%		16,768	-45.4%		19,291	-37.2%
 7,678		9,541	 24.3%		15,621	103.5%		7,461	 -2.8%		10,160	 32.3%
 7,074		7,828	 10.7%		5,200	-26.5%		6,646	 -6.1%		6,723	 -5.0%
 6,630		8,529	 28.6%		5,610	-15.4%		5,297	-20.1%		5,672	-14.4%
 9,587		15,958	 66.5%		7,667	-20.0%		9,697	  1.1%		9,522	 -0.7%
 8,420		5,425	-35.6%		8,253	 -2.0%		10,065	 19.5%		7,809	 -7.3%
 7,213		20,968	190.7%		33,479	364.1%		17,657	144.8%		24,346	237.5%
 5,097		13,834	171.4%		21,699	325.7%		15,363	201.4%		12,174	138.8%
 7,909		6,971	-11.9%		11,001	 39.1%		7,753	 -2.0%		7,342	 -7.2%
13,433		15,107	 12.5%		17,328	 29.0%		14,328	  6.7%		12,634	 -6.0%

mean		20,897			14,715			14,276			12,866  
geometric mean	19,135			20,270			21,236			25,266
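The three tables are tied together by the identity time = nodes / NPS, so the time-to-depth speedup for a position is roughly the NPS speedup divided by the growth in the tree searched. The Python sketch below checks that on three rows taken from the first 16-processor run (positions 1, 2 and 11); the tuple layout and the function name `decompose` are assumptions for this example, and small mismatches come from rounding in the posted figures.

```python
# Each row: (time_1cpu_s, time_16cpu_s, nps_1cpu_M, nps_16cpu_M,
#            nodes_1cpu_M, nodes_16cpu_M), from the first 16-processor run.
ROWS = [
    (682, 222, 4.9, 67.3, 3372, 14991),    # tree grew ~4.4x, poor speedup
    (1277, 106, 5.1, 70.8, 6533, 7560),    # tree grew ~16%
    (5545, 222, 5.2, 74.0, 28936, 16449),  # tree shrank, super-linear speedup
]

def decompose(t1, tn, nps1, npsn, nodes1, nodesn):
    """Return (measured speedup, speedup predicted from NPS and nodes, node ratio)."""
    measured = t1 / tn
    nps_speedup = npsn / nps1
    node_ratio = nodesn / nodes1  # > 1 means the parallel search did extra work
    predicted = nps_speedup / node_ratio
    return measured, predicted, node_ratio

for row in ROWS:
    measured, predicted, node_ratio = decompose(*row)
    print(f"measured {measured:6.2f}  predicted {predicted:6.2f}  "
          f"tree size x{node_ratio:.2f}")
```

This is why the NPS speedup acts as a ceiling only on average: a position where the parallel search happens to visit fewer nodes than the serial one can beat it.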

Matthias Hartwich · Post by **Matthias Hartwich** » Wed Jun 10, 2015 10:04 pm

Thanks a lot!

Putting files on your web site is fine. Another possibility is saving as *.csv and pasting here.

Adam Hair · Post by **Adam Hair** » Wed Jun 10, 2015 11:33 pm

Thanks for sharing this information Bob. The distribution of the time to depth for the 16 cpus samples looks like what I have seen at 8 cpus for various engines. No surprise there, but it is nice to have some confirmation that my data is not unusual.

bob · Post by **bob** » Wed Jun 10, 2015 11:52 pm

Matthias Hartwich wrote:Thanks a lot!

Putting files on your web site is fine. Another possibility is saving as *.csv and pasting here.

That looks like hell. Already tried it.

I don't know why Microsoft is so damned enamoured with the <tab> character, but using it as a separator just doesn't work.

I'll add the spread sheets to my web page right now. Notice that only 1-16 and 1-20 are done. I am going to try to run 1-8 tonight, but doing 4 runs starts to stretch the time out...

The files can be found at www.cis.uab.edu/hyatt/crafty/SMP

In that directory there are two files, smp-16.xlsx and mp-20.xlsx.

If you notice any errors in formulas or numbers, let me know. I did this pretty quickly. The raw data is just a cut and paste, so those are OK, but many of the columns are formulas. I checked them all, but anything can happen.

bob · Post by **bob** » Wed Jun 10, 2015 11:56 pm

Adam Hair wrote:Thanks for sharing this information Bob. The distribution of the time to depth for the 16 cpus samples looks like what I have seen at 8 cpus for various engines. No surprise there, but it is nice to have some confirmation that my data is not unusual.

For my dissertation I did something like 32 runs with 16 processors. Then averaged each run. Threw out the best 2 and worst 2, and then used the rest. As you can see the times fluctuate. When I ran with just 1gb of hash, I had one that was terrible: a speedup for one position of .15x or something like that. Didn't see such wild fluctuations with this 16gb run. There are lots of fluctuations, but pretty sane fluctuations based on all the SMP testing I have done/seen over the years. There are a few positions that will produce almost exactly the same speedup run after run. There are some that look like random numbers. Most are somewhere in the middle of that...
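The "throw out the best 2 and worst 2" procedure is a trimmed mean. A minimal Python sketch follows; the function name `trimmed_mean` and the run averages in the usage lines are hypothetical, not data from the dissertation or from this thread.

```python
def trimmed_mean(values, drop_each_end=2):
    """Average the values after discarding the lowest and highest few."""
    if len(values) <= 2 * drop_each_end:
        raise ValueError("not enough runs to trim")
    ordered = sorted(values)
    kept = ordered[drop_each_end:len(ordered) - drop_each_end]
    return sum(kept) / len(kept)

# Hypothetical per-run average speedups, including two obvious outliers.
run_averages = [12.4, 10.8, 12.8, 14.7, 0.9, 13.1, 11.9, 25.0, 12.2]
print(f"plain mean:   {sum(run_averages) / len(run_averages):.2f}")
print(f"trimmed mean: {trimmed_mean(run_averages):.2f}")
```

Trimming keeps one freak run (like the .15x position) from dominating the reported number, at the cost of needing many more runs.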

Matthias Hartwich · Post by **Matthias Hartwich** » Thu Jun 11, 2015 6:00 am

bob wrote:
Matthias Hartwich wrote:Thanks a lot!

Putting files on your web site is fine. Another possibility is saving as *.csv and pasting here.
That looks like hell. Already tried it. I don't know why Microsoft is so damned enamoured with the <tab> character, but using it as a separator just doesn't work.

Yes, that's why I use ';' as separator.
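As a small sketch of the semicolon approach, Python's standard `csv` module can write a table with `;` as the delimiter, which also leaves thousands separators such as `1,277` unquoted. The helper name `to_semicolon_csv` is made up for this example; the two data rows are taken from the time table above.

```python
import csv
import io

rows = [
    ["1 cpu time", "16 cpu time", "speedup"],
    ["1,277", "106", "12.05"],
    ["3,878", "349", "11.11"],
]

def to_semicolon_csv(table):
    """Serialize a list of rows as semicolon-separated text."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter=";", lineterminator="\n")
    writer.writerows(table)
    return buf.getvalue()

print(to_semicolon_csv(rows), end="")
```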
