Interesting. I recently downloaded the ICC trial and tested it on several programs, from a toy n-queens puzzle solver through a simple sudoku solver to an advanced poker equilibrium solver, and GCC (4.8.1) performed significantly better on both single-threaded and multi-threaded (OpenMP) code.
I am not competent enough to say why that is, but it would be useful if you gave the compiler options for both. It's easy to miss some useful ones for GCC.
There are compilers and there are compilers
Moderators: hgm, Rebel, chrisw
-
- Posts: 157
- Joined: Tue Apr 30, 2013 1:29 am
-
- Posts: 6442
- Joined: Tue Jan 09, 2007 12:31 am
- Location: PA USA
- Full name: Louis Zulli
Re: There are compilers and there are compilers
OneTrickPony wrote:
Interesting. I recently downloaded the ICC trial and tested it on several programs, from a toy n-queens puzzle solver through a simple sudoku solver to an advanced poker equilibrium solver, and GCC (4.8.1) performed significantly better on both single-threaded and multi-threaded (OpenMP) code.
I am not competent enough to say why that is, but it would be useful if you gave the compiler options for both. It's easy to miss some useful ones for GCC.

What CPU did you compile for?
-
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: There are compilers and there are compilers
ymatioun wrote:
New Intel Xeon CPUs include something called "Transactional Synchronization Extensions".
I don't know exactly how they work, but those instructions were specifically designed to reduce synchronization overhead. Perhaps ICC emits those instructions, while GCC most certainly does not? That would explain why the improvement only happens in parallel search.

Joost Buijs wrote:
TSX is disabled on all Haswell processors because there is an error in the implementation.
I think the E5-2660v2 is a Haswell processor, so the performance increase Bob sees must have another explanation.

bob wrote:
BTW this is a v3 chip
model name : Intel(R) Xeon(R) CPU E5-2660 v3 @ 2.60GHz
http://ark.intel.com/nl/products/81706/ ... e-2_60-GHz

Joost Buijs wrote:
The v3 doesn't have TSX either.

I didn't intend to imply anything about TSX. Not a big fan...
-
- Posts: 859
- Joined: Mon Aug 10, 2009 10:05 pm
- Location: Italy
- Full name: Stefano Gemma
Re: There are compilers and there are compilers
bob wrote:
[...]
Fixed depth shows the same thing as fixed time here. Intel compiler produces an executable over 10% faster. Fixed depth or fixed time doesn't matter a bit here when I am just measuring the effectiveness of the compiler.

First of all, I wish I had the money to buy this "monster".
Second, what I want to say may be purely theoretical and, in this context, maybe even pointless. My point was that examining a node at depth N takes time proportional to the number of moves at that node, and that a node at depth N+1 statistically has fewer moves than one at depth N. If you have a faster build and test with a time limit, the faster version reaches depth N+1 more often, and those nodes are examined in less time than nodes at depth N, which inflates the NPS. It is like a race between two runners on a road that ends in a downhill stretch ("discesa" in Italian). The slower runner never reaches the downhill, but the faster one does. The faster runner's speed is already higher before the downhill, and it increases further on the downhill itself. That's why I think you should compare at fixed depth.
Of course, it is not proven that nodes at depth N+1 statistically have fewer moves than nodes at depth N, and your experiment says that the difference, if any, must be very small.
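The runner analogy can be put into a toy model. The sketch below is hypothetical throughout: it assumes iterative deepening in which iteration d visits `branching**d` nodes and each node costs `base_cost * decay**d` seconds, so `decay < 1` encodes the unproven assumption that deeper nodes are cheaper. None of the names or numbers come from Crafty or from the measurements in this thread.

```python
def nps_fixed_depth(speed, depth, branching=2.0, base_cost=1e-6, decay=0.95):
    """NPS of a toy iterative-deepening search run to a fixed depth.

    speed is the relative speed of the binary (1.1 = 10% faster build).
    All other parameters are hypothetical model knobs.
    """
    nodes = seconds = 0.0
    for d in range(1, depth + 1):
        iter_nodes = branching ** d
        nodes += iter_nodes
        seconds += iter_nodes * base_cost * decay ** d / speed
    return nodes / seconds


def nps_fixed_time(speed, budget_s, branching=2.0, base_cost=1e-6,
                   decay=0.95, max_depth=200):
    """NPS of the same toy search stopped after budget_s seconds."""
    nodes = seconds = 0.0
    for d in range(1, max_depth + 1):
        per_node = base_cost * decay ** d / speed
        iter_nodes = branching ** d
        iter_seconds = iter_nodes * per_node
        if seconds + iter_seconds >= budget_s:
            # The budget runs out inside this iteration: count the part that fits.
            nodes += (budget_s - seconds) / per_node
            return nodes / budget_s
        nodes += iter_nodes
        seconds += iter_seconds
    return nodes / seconds


# Compare a build that is truly 10% faster against the baseline.
depth_ratio = nps_fixed_depth(1.1, 20) / nps_fixed_depth(1.0, 20)
time_ratio = nps_fixed_time(1.1, 1.0) / nps_fixed_time(1.0, 1.0)
print(f"fixed depth NPS ratio: {depth_ratio:.4f}")
print(f"fixed time  NPS ratio: {time_ratio:.4f}")
```

In this model the fixed-depth ratio equals the true speed ratio, while the fixed-time ratio comes out slightly higher whenever `decay < 1`. With `decay=1.0` (every node costs the same) the two ratios coincide, so the effect exists only if deeper nodes really are cheaper.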
-
- Posts: 5228
- Joined: Thu Mar 09, 2006 9:40 am
- Full name: Vincent Lejeune
Re: There are compilers and there are compilers
ymatioun wrote:
New Intel Xeon CPUs include something called "Transactional Synchronization Extensions".
I don't know exactly how they work, but those instructions were specifically designed to reduce synchronization overhead. Perhaps ICC emits those instructions, while GCC most certainly does not? That would explain why the improvement only happens in parallel search.

Joost Buijs wrote:
TSX is disabled on all Haswell processors because there is an error in the implementation.
I think the E5-2660v2 is a Haswell processor, so the performance increase Bob sees must have another explanation.

bob wrote:
BTW this is a v3 chip:
model name : Intel(R) Xeon(R) CPU E5-2660 v3 @ 2.60GHz

The v2 runs @2.2 GHz -> http://www.cpubenchmark.net/cpu.php?cpu ... 40+2.20GHz
and the v3 runs @2.6 GHz -> http://www.cpubenchmark.net/cpu.php?cpu ... 40+2.60GHz
-
- Posts: 1563
- Joined: Thu Jul 16, 2009 10:47 am
- Location: Almere, The Netherlands
Re: There are compilers and there are compilers
ymatioun wrote:
New Intel Xeon CPUs include something called "Transactional Synchronization Extensions".
I don't know exactly how they work, but those instructions were specifically designed to reduce synchronization overhead. Perhaps ICC emits those instructions, while GCC most certainly does not? That would explain why the improvement only happens in parallel search.

Joost Buijs wrote:
TSX is disabled on all Haswell processors because there is an error in the implementation.
I think the E5-2660v2 is a Haswell processor, so the performance increase Bob sees must have another explanation.

bob wrote:
BTW this is a v3 chip:
model name : Intel(R) Xeon(R) CPU E5-2660 v3 @ 2.60GHz

Vinvin wrote:
The v2 runs @2.2 GHz -> http://www.cpubenchmark.net/cpu.php?cpu ... 40+2.20GHz
and the v3 runs @2.6 GHz -> http://www.cpubenchmark.net/cpu.php?cpu ... 40+2.60GHz

2.6 GHz is the base clock, but in practice it runs at 2.9 GHz with all cores fully loaded; when only 2 cores are loaded it even runs at 3.3 GHz.
This makes it difficult to determine the exact SMP speedup.
Maybe it is wise to disable turbo mode in the BIOS before doing any scaling measurements.
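To see how much turbo clocks can distort a scaling number, here is a small worked example. It is only a sketch: it assumes NPS scales linearly with clock frequency (which memory-bound code will not do exactly), and the NPS figures in the usage line are made up for illustration. Only the 3.3 GHz and 2.9 GHz clocks come from the post above; the function name is hypothetical.

```python
def smp_speedup(nps_one_thread, nps_all_threads, ghz_one_thread, ghz_all_threads):
    """Return (raw, clock_normalized) SMP speedup.

    raw is the plain NPS ratio. clock_normalized rescales it to what the
    ratio would be if both runs had used the same clock, under the
    assumption that NPS is proportional to clock frequency.
    """
    raw = nps_all_threads / nps_one_thread
    normalized = raw * (ghz_one_thread / ghz_all_threads)
    return raw, normalized


# Hypothetical NPS values: one thread at turbo (3.3 GHz), all cores at 2.9 GHz.
raw, normalized = smp_speedup(6.5e6, 100e6, 3.3, 2.9)
print(f"raw speedup: {raw:.2f}, clock-normalized speedup: {normalized:.2f}")
```

The raw figure understates the scaling because the single-thread baseline ran at a higher clock. Disabling turbo, or otherwise forcing the same clock in both runs, removes the need for this correction.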
-
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: There are compilers and there are compilers
ymatioun wrote:
New Intel Xeon CPUs include something called "Transactional Synchronization Extensions".
I don't know exactly how they work, but those instructions were specifically designed to reduce synchronization overhead. Perhaps ICC emits those instructions, while GCC most certainly does not? That would explain why the improvement only happens in parallel search.

Joost Buijs wrote:
TSX is disabled on all Haswell processors because there is an error in the implementation.
I think the E5-2660v2 is a Haswell processor, so the performance increase Bob sees must have another explanation.

bob wrote:
BTW this is a v3 chip:
model name : Intel(R) Xeon(R) CPU E5-2660 v3 @ 2.60GHz

Vinvin wrote:
The v2 runs @2.2 GHz -> http://www.cpubenchmark.net/cpu.php?cpu ... 40+2.20GHz
and the v3 runs @2.6 GHz -> http://www.cpubenchmark.net/cpu.php?cpu ... 40+2.60GHz

Actually Intel understates the speed. Running all 20 cores at once sees a constant clock speed of 2.9 GHz. Only when you enable hyper-threading and run 40 threads will it slow down to 2.6 GHz. Run 3, 4 or 5 threads and you will see 3.3 GHz non-stop...
-
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: There are compilers and there are compilers
ymatioun wrote:
New Intel Xeon CPUs include something called "Transactional Synchronization Extensions".
I don't know exactly how they work, but those instructions were specifically designed to reduce synchronization overhead. Perhaps ICC emits those instructions, while GCC most certainly does not? That would explain why the improvement only happens in parallel search.

Joost Buijs wrote:
TSX is disabled on all Haswell processors because there is an error in the implementation.
I think the E5-2660v2 is a Haswell processor, so the performance increase Bob sees must have another explanation.

bob wrote:
BTW this is a v3 chip:
model name : Intel(R) Xeon(R) CPU E5-2660 v3 @ 2.60GHz

Vinvin wrote:
The v2 runs @2.2 GHz -> http://www.cpubenchmark.net/cpu.php?cpu ... 40+2.20GHz
and the v3 runs @2.6 GHz -> http://www.cpubenchmark.net/cpu.php?cpu ... 40+2.60GHz

Joost Buijs wrote:
2.6 GHz is the base clock, but in practice it runs at 2.9 GHz with all cores fully loaded; when only 2 cores are loaded it even runs at 3.3 GHz.
This makes it difficult to determine the exact SMP speedup.
Maybe it is wise to disable turbo mode in the BIOS before doing any scaling measurements.

You can do that and take a 10% speed penalty. Or you can test as I do and always run 20 cores. If I want 4-CPU numbers, I run a 4-CPU test plus a 16-CPU dummy load, just to make the clocks stick at 2.9 GHz...
You should be able to limit this with the Linux kernel, but it doesn't quite work as advertised yet. I.e. you can set the max clock frequency to 2.9 GHz, but it still clocks up to 3.3 GHz, unfortunately.
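Whichever approach is used, it helps to verify the clocks during a run rather than trust the configured limit. A minimal sketch, assuming a Linux system that exposes the cpufreq sysfs interface (`scaling_cur_freq`, in kHz, under `/sys/devices/system/cpu/cpuN/cpufreq/`). What that file reports depends on the frequency driver, so treat the values as indicative; the function name is hypothetical.

```python
from pathlib import Path


def read_core_clocks_mhz(sysfs_root="/sys/devices/system/cpu"):
    """Return {core_name: current_clock_in_MHz} from the cpufreq sysfs files.

    Cores without a readable scaling_cur_freq file are skipped, so the
    result is simply empty on systems that lack the cpufreq interface.
    """
    clocks = {}
    pattern = "cpu[0-9]*/cpufreq/scaling_cur_freq"
    for freq_file in sorted(Path(sysfs_root).glob(pattern)):
        core = freq_file.parent.parent.name  # e.g. "cpu0"
        try:
            khz = int(freq_file.read_text().strip())
        except (OSError, ValueError):
            continue
        clocks[core] = khz / 1000.0
    return clocks


# Sample once; during a benchmark this could be polled every second or so.
for core, mhz in read_core_clocks_mhz().items():
    print(f"{core}: {mhz:.0f} MHz")
```

Polling this while the 4-thread test plus dummy load is running would show whether every loaded core actually sits at the intended frequency.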
-
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: There are compilers and there are compilers
bob wrote:
[...]
Fixed depth shows the same thing as fixed time here. Intel compiler produces an executable over 10% faster. Fixed depth or fixed time doesn't matter a bit here when I am just measuring the effectiveness of the compiler.

stegemma wrote:
First of all, I wish I had the money to buy this "monster".
Second, what I want to say may be purely theoretical and, in this context, maybe even pointless. My point was that examining a node at depth N takes time proportional to the number of moves at that node, and that a node at depth N+1 statistically has fewer moves than one at depth N. If you have a faster build and test with a time limit, the faster version reaches depth N+1 more often, and those nodes are examined in less time than nodes at depth N, which inflates the NPS. It is like a race between two runners on a road that ends in a downhill stretch ("discesa" in Italian). The slower runner never reaches the downhill, but the faster one does. The faster runner's speed is already higher before the downhill, and it increases further on the downhill itself. That's why I think you should compare at fixed depth.
Of course, it is not proven that nodes at depth N+1 statistically have fewer moves than nodes at depth N, and your experiment says that the difference, if any, must be very small.

I ran a quick test and for Crafty this doesn't seem to happen. I.e. tests to depth 20, 21 and 22 produce the same NPS numbers: in my test (one core only), depth=20 hit 6.5M NPS, depth=21 hit 6.5M NPS, depth=22 hit 6.5M NPS, etc. Very shallow searches are slower, but I don't run shallow searches for SMP testing.
-
- Posts: 793
- Joined: Sun Aug 03, 2014 4:48 am
- Location: London, UK
Re: There are compilers and there are compilers
ymatioun wrote:
New Intel Xeon CPUs include something called "Transactional Synchronization Extensions".
I don't know exactly how they work, but those instructions were specifically designed to reduce synchronization overhead. Perhaps ICC emits those instructions, while GCC most certainly does not? That would explain why the improvement only happens in parallel search.

Joost Buijs wrote:
TSX is disabled on all Haswell processors because there is an error in the implementation.
I think the E5-2660v2 is a Haswell processor, so the performance increase Bob sees must have another explanation.

bob wrote:
BTW this is a v3 chip:
model name : Intel(R) Xeon(R) CPU E5-2660 v3 @ 2.60GHz

Vinvin wrote:
The v2 runs @2.2 GHz -> http://www.cpubenchmark.net/cpu.php?cpu ... 40+2.20GHz
and the v3 runs @2.6 GHz -> http://www.cpubenchmark.net/cpu.php?cpu ... 40+2.60GHz

bob wrote:
Actually Intel understates the speed. Running all 20 cores at once sees a constant clock speed of 2.9 GHz. Only when you enable hyper-threading and run 40 threads will it slow down to 2.6 GHz. Run 3, 4 or 5 threads and you will see 3.3 GHz non-stop...

The reason why Haswell's base clock is so low is mostly because of AVX.
The limit is probably either power or thermal, and when you are not using AVX, half of the CPU's floating-point units are idle and therefore draw much less power.
By switching from SSE2 (128-bit) to AVX (256-bit) I get almost double the matrix multiplication throughput (and of course, if you are not doing floating-point work at all, all the FP units are idle).
If you run 20 threads of an AVX-intensive workload, it will probably go down to the base clocks.
People (overclockers) have done a lot of tests showing that AVX very significantly increases power draw and temperature on Haswells.
Disclosure: I work for DeepMind on the AlphaZero project, but everything I say here is personal opinion and does not reflect the views of DeepMind / Alphabet.
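The "almost double" figure in the post above is consistent with simple register-width arithmetic. The sketch below assumes 32-bit single-precision elements by default (the post does not say which precision was used), and the function name is made up for illustration.

```python
def simd_lanes(register_bits, element_bits=32):
    """Number of elements that fit in one SIMD register."""
    if register_bits % element_bits:
        raise ValueError("register width must be a multiple of element width")
    return register_bits // element_bits


sse2_lanes = simd_lanes(128)  # 128-bit SSE2 register
avx_lanes = simd_lanes(256)   # 256-bit AVX register
print(f"SSE2: {sse2_lanes} floats/register, AVX: {avx_lanes} floats/register")
print(f"ideal per-instruction ratio: {avx_lanes / sse2_lanes:.1f}x")
```

The ratio is an upper bound per instruction. Memory bandwidth and any clock reduction under heavy AVX load would keep a measured gain somewhat below it.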