There are compilers and there are compilers

OneTrickPony · Post by **OneTrickPony** » Wed Jun 24, 2015 6:34 pm

Interesting. I recently downloaded the ICC trial and tested it on several things, from a toy n-queens puzzle solver through a simple sudoku solver to an advanced poker equilibrium solver, and GCC (4.8.1) performed significantly better for both single-threaded and multi-threaded (OpenMP) code.
I am not competent enough to say why that is, but it would be useful if you gave the compiler options for both. It's easy to miss some useful ones for GCC.

zullil · Post by **zullil** » Wed Jun 24, 2015 6:38 pm

OneTrickPony wrote:Interesting. I recently downloaded the ICC trial and tested it on several things, from a toy n-queens puzzle solver through a simple sudoku solver to an advanced poker equilibrium solver, and GCC (4.8.1) performed significantly better for both single-threaded and multi-threaded (OpenMP) code.
I am not competent enough to say why that is, but it would be useful if you gave the compiler options for both. It's easy to miss some useful ones for GCC.

What CPU did you compile for?

bob · Post by **bob** » Wed Jun 24, 2015 7:35 pm

Joost Buijs wrote:
bob wrote:
Joost Buijs wrote:
ymatioun wrote:New Intel Xeon CPUs include something called "Transactional Synchronization Extensions" (TSX).

I don't know exactly how they work, but those instructions were specifically designed to reduce synchronization overhead. Perhaps ICC emits those instructions, while GCC most certainly does not? That would explain why the improvement only happens in parallel search.
TSX is disabled on all Haswell processors because there is an error in the implementation.
I think the E5-2660v2 is a Haswell processor, so the performance increase Bob sees must have another explanation.
BTW this is a v3 chip

model name : Intel(R) Xeon(R) CPU E5-2660 v3 @ 2.60GHz
The v3 doesn't have TSX either.

http://ark.intel.com/nl/products/81706/ ... e-2_60-GHz

I didn't intend to imply anything about TSX. Not a big fan...

stegemma · Post by **stegemma** » Thu Jun 25, 2015 9:11 am

bob wrote:[...]
Fixed depth shows the same thing as fixed time here. Intel compiler produces an executable over 10% faster. Fixed depth or fixed time doesn't matter a bit here when I am just measuring the effectiveness of the compiler.

First of all, I wish I had the money to buy this "monster".

Second, what I want to say may be only theoretical and, in this context, maybe even pointless. I was just saying that examining node N takes time proportional to the number of moves in that node, and that node N+1 has statistically fewer moves than the previous one. If you have a faster executable and test with a time limit, the faster version can reach node N+1 more often, and those nodes will be examined in less time than node N... this gives an inflated NPS. It is like a race between two runners that ends in a downhill stretch (discesa, in Italian) just for the last part of the road. The slower runner can't reach the downhill but the faster one can. The miles per second of the faster runner will already be higher before the downhill, but will increase further while he/she runs the downhill itself. That's why I think you should compare at fixed depth.

Of course, it is not proven that N+1 nodes have statistically fewer moves than node N, and your experiment says that the difference, if any, should be very small.
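
The downhill argument can be checked with a toy model. The sketch below is not Crafty's search: the branching factor, the per-move cost, and the assumption that the move count shrinks linearly with depth are all made up for illustration, and `search_stats` and `nps` are hypothetical names. It compares the NPS ratio of a 10% faster executable at fixed depth and at fixed time.

```python
def search_stats(time_budget, speed, max_depth=30, branching=2.0,
                 base_cost=1e-6, moves_at_depth=lambda d: 40.0 - 0.5 * d):
    """Return (nodes, seconds) for a toy iterative-deepening search.

    Assumptions (illustrative only): layer d holds branching**d nodes, each
    node costs base_cost * moves_at_depth(d) / speed seconds, and deeper
    nodes have fewer moves, so they are cheaper to examine.
    """
    nodes = 0.0
    elapsed = 0.0
    for depth in range(1, max_depth + 1):
        layer_nodes = branching ** depth
        cost_per_node = base_cost * moves_at_depth(depth) / speed
        layer_time = layer_nodes * cost_per_node
        if elapsed + layer_time > time_budget:
            # Out of time: examine as many nodes of this layer as still fit.
            nodes += (time_budget - elapsed) / cost_per_node
            elapsed = time_budget
            break
        nodes += layer_nodes
        elapsed += layer_time
    return nodes, elapsed


def nps(time_budget, speed, **kwargs):
    """Nodes per second for the toy search."""
    nodes, seconds = search_stats(time_budget, speed, **kwargs)
    return nodes / seconds


SPEED = 1.1  # executable that is 10% faster per node

# Fixed depth: both versions visit the same nodes, so the ratio is the raw speed gain.
fixed_depth_ratio = nps(float("inf"), SPEED, max_depth=20) / nps(float("inf"), 1.0, max_depth=20)

# Fixed time: the faster version reaches cheaper, deeper nodes, which inflates its NPS.
fixed_time_ratio = nps(1.0, SPEED) / nps(1.0, 1.0)

print(f"fixed depth NPS ratio: {fixed_depth_ratio:.4f}")
print(f"fixed time  NPS ratio: {fixed_time_ratio:.4f}")
```

In this model the fixed-time ratio exceeds the true speed gain only if nodes really get cheaper with depth. If NPS is flat across depths, the two ratios coincide.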

Vinvin · Post by **Vinvin** » Thu Jun 25, 2015 10:18 am

bob wrote:
Joost Buijs wrote:
ymatioun wrote:New Intel Xeon CPUs include something called "Transactional Synchronization Extensions" (TSX).

I don't know exactly how they work, but those instructions were specifically designed to reduce synchronization overhead. Perhaps ICC emits those instructions, while GCC most certainly does not? That would explain why the improvement only happens in parallel search.
TSX is disabled on all Haswell processors because there is an error in the implementation.
I think the E5-2660v2 is a Haswell processor, so the performance increase Bob sees must have another explanation.
BTW this is a v3 chip:

model name : Intel(R) Xeon(R) CPU E5-2660 v3 @ 2.60GHz

The v2 runs @2.2 GHz -> http://www.cpubenchmark.net/cpu.php?cpu ... 40+2.20GHz
and the v3 runs @2.6 GHz -> http://www.cpubenchmark.net/cpu.php?cpu ... 40+2.60GHz

Joost Buijs · Post by **Joost Buijs** » Thu Jun 25, 2015 1:55 pm

Vinvin wrote:
bob wrote:
Joost Buijs wrote:
ymatioun wrote:New Intel Xeon CPUs include something called "Transactional Synchronization Extensions" (TSX).

I don't know exactly how they work, but those instructions were specifically designed to reduce synchronization overhead. Perhaps ICC emits those instructions, while GCC most certainly does not? That would explain why the improvement only happens in parallel search.
TSX is disabled on all Haswell processors because there is an error in the implementation.
I think the E5-2660v2 is a Haswell processor, so the performance increase Bob sees must have another explanation.
BTW this is a v3 chip:

model name : Intel(R) Xeon(R) CPU E5-2660 v3 @ 2.60GHz
The v2 runs @2.2 GHz -> http://www.cpubenchmark.net/cpu.php?cpu ... 40+2.20GHz
and the v3 runs @2.6 GHz -> http://www.cpubenchmark.net/cpu.php?cpu ... 40+2.60GHz

2.6 GHz is the base clock, but in practice it runs at 2.9 GHz with all cores fully loaded. When only 2 cores are loaded, it even runs at 3.3 GHz.
This makes it difficult to determine the exact SMP speedup.
It may be wise to disable turbo mode in the BIOS before doing any scaling measurements.
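
To see how much turbo can distort a scaling number, here is a minimal sketch that compares nodes per clock cycle instead of nodes per second. The clock speeds are the ones quoted in this thread (2.9 GHz all-core, 3.3 GHz lightly loaded); the NPS figures and the assumption that a single-thread run sits at 3.3 GHz are hypothetical, as are the function names.

```python
def raw_speedup(nps_parallel, nps_single):
    """Plain NPS ratio, ignoring any clock difference between the two runs."""
    return nps_parallel / nps_single


def clock_normalized_speedup(nps_parallel, nps_single, ghz_parallel, ghz_single):
    """NPS ratio after dividing each run by the clock it actually ran at."""
    nodes_per_cycle_parallel = nps_parallel / (ghz_parallel * 1e9)
    nodes_per_cycle_single = nps_single / (ghz_single * 1e9)
    return nodes_per_cycle_parallel / nodes_per_cycle_single


# Hypothetical measurements, for illustration only.
NPS_1_THREAD = 6.5e6     # assumed to run at the 3.3 GHz turbo clock
NPS_20_THREADS = 100e6   # assumed to run at the 2.9 GHz all-core clock

print("raw speedup:             ", raw_speedup(NPS_20_THREADS, NPS_1_THREAD))
print("clock-normalized speedup:",
      clock_normalized_speedup(NPS_20_THREADS, NPS_1_THREAD, 2.9, 3.3))
```

With turbo enabled, the single-thread baseline runs at a higher clock than the fully loaded run, so the raw ratio understates how well the program scales per cycle.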

bob · Post by **bob** » Thu Jun 25, 2015 4:49 pm

Vinvin wrote:
bob wrote:
Joost Buijs wrote:
ymatioun wrote:New Intel Xeon CPUs include something called "Transactional Synchronization Extensions" (TSX).

I don't know exactly how they work, but those instructions were specifically designed to reduce synchronization overhead. Perhaps ICC emits those instructions, while GCC most certainly does not? That would explain why the improvement only happens in parallel search.
TSX is disabled on all Haswell processors because there is an error in the implementation.
I think the E5-2660v2 is a Haswell processor, so the performance increase Bob sees must have another explanation.
BTW this is a v3 chip:

model name : Intel(R) Xeon(R) CPU E5-2660 v3 @ 2.60GHz
The v2 runs @2.2 GHz -> http://www.cpubenchmark.net/cpu.php?cpu ... 40+2.20GHz
and the v3 runs @2.6 GHz -> http://www.cpubenchmark.net/cpu.php?cpu ... 40+2.60GHz

Actually, Intel understates the speed. Running all 20 cores at once gives a constant clock speed of 2.9 GHz. Only when you enable hyper-threading and run 40 threads will it slow down to 2.6 GHz. Run 3, 4 or 5 threads and you will see 3.3 GHz non-stop...

bob · Post by **bob** » Thu Jun 25, 2015 4:52 pm

Joost Buijs wrote:
Vinvin wrote:
bob wrote:
Joost Buijs wrote:
ymatioun wrote:New Intel Xeon CPUs include something called "Transactional Synchronization Extensions" (TSX).

I don't know exactly how they work, but those instructions were specifically designed to reduce synchronization overhead. Perhaps ICC emits those instructions, while GCC most certainly does not? That would explain why the improvement only happens in parallel search.
TSX is disabled on all Haswell processors because there is an error in the implementation.
I think the E5-2660v2 is a Haswell processor, so the performance increase Bob sees must have another explanation.
BTW this is a v3 chip:

model name : Intel(R) Xeon(R) CPU E5-2660 v3 @ 2.60GHz
The v2 runs @2.2 GHz -> http://www.cpubenchmark.net/cpu.php?cpu ... 40+2.20GHz
and the v3 runs @2.6 GHz -> http://www.cpubenchmark.net/cpu.php?cpu ... 40+2.60GHz
2.6 GHz is the base clock, but in practice it runs at 2.9 GHz with all cores fully loaded. When only 2 cores are loaded, it even runs at 3.3 GHz.
This makes it difficult to determine the exact SMP speedup.
It may be wise to disable turbo mode in the BIOS before doing any scaling measurements.

You can do that and take a 10% speed penalty. Or you can test as I do and always run 20 cores. If I want 4-CPU numbers, I run a 4-CPU test plus a 16-CPU dummy load, just to make the clocks stick at 2.9 GHz...

You should be able to limit this stuff with the Linux kernel, but it doesn't quite work as advertised yet. I.e., you can set the max clock frequency to 2.9 GHz, but it still clocks up to 3.3 GHz, unfortunately.
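
One way to check whether a clock cap is actually holding during a test run is to sample the per-core frequency the kernel reports. The sketch below assumes a Linux machine whose cpufreq driver exposes `scaling_cur_freq` (in kHz) under `/sys/devices/system/cpu/cpuN/cpufreq/`; the function names are hypothetical, and the reported value is the kernel's view of the clock, which may lag or differ from the true instantaneous frequency.

```python
from pathlib import Path


def read_core_clocks_ghz(sysfs_root="/sys/devices/system/cpu"):
    """Return {core_name: GHz} from each core's cpufreq scaling_cur_freq file."""
    clocks = {}
    pattern = "cpu[0-9]*/cpufreq/scaling_cur_freq"
    for freq_file in Path(sysfs_root).glob(pattern):
        try:
            khz = int(freq_file.read_text().strip())
        except (OSError, ValueError):
            continue  # core offline or file unreadable: skip it
        core_name = freq_file.parent.parent.name  # e.g. "cpu0"
        clocks[core_name] = khz / 1e6             # kHz -> GHz
    return clocks


def cores_above(clocks, limit_ghz):
    """Names of cores currently reported above the intended clock cap."""
    return sorted(core for core, ghz in clocks.items() if ghz > limit_ghz)


if __name__ == "__main__":
    current = read_core_clocks_ghz()
    # 2.9 GHz is the all-core clock mentioned above; adjust for other machines.
    print("cores above 2.9 GHz:", cores_above(current, 2.9))
```

Sampling this a few times while the benchmark runs shows whether some cores are still boosting past the intended limit.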

bob · Post by **bob** » Thu Jun 25, 2015 4:56 pm

stegemma wrote:
bob wrote:[...]
Fixed depth shows the same thing as fixed time here. Intel compiler produces an executable over 10% faster. Fixed depth or fixed time doesn't matter a bit here when I am just measuring the effectiveness of the compiler.
First of all, I wish I had the money to buy this "monster".

Second, what I want to say may be only theoretical and, in this context, maybe even pointless. I was just saying that examining node N takes time proportional to the number of moves in that node, and that node N+1 has statistically fewer moves than the previous one. If you have a faster executable and test with a time limit, the faster version can reach node N+1 more often, and those nodes will be examined in less time than node N... this gives an inflated NPS. It is like a race between two runners that ends in a downhill stretch (discesa, in Italian) just for the last part of the road. The slower runner can't reach the downhill but the faster one can. The miles per second of the faster runner will already be higher before the downhill, but will increase further while he/she runs the downhill itself. That's why I think you should compare at fixed depth.

Of course, it is not proven that N+1 nodes have statistically fewer moves than node N, and your experiment says that the difference, if any, should be very small.

I ran a quick test and, for Crafty, this doesn't seem to happen. Tests to depth 20, 21 and 22 produce the same NPS numbers: in my test, depth=20 hit 6.5M NPS (one core only), depth=21 hit 6.5M NPS, depth=22 hit 6.5M NPS, etc. Very shallow searches are slower, but I don't run shallow searches for SMP testing.

matthewlai · Post by **matthewlai** » Thu Jun 25, 2015 5:05 pm

bob wrote:
Vinvin wrote:
bob wrote:
Joost Buijs wrote:
ymatioun wrote:New Intel Xeon CPUs include something called "Transactional Synchronization Extensions" (TSX).

I don't know exactly how they work, but those instructions were specifically designed to reduce synchronization overhead. Perhaps ICC emits those instructions, while GCC most certainly does not? That would explain why the improvement only happens in parallel search.
TSX is disabled on all Haswell processors because there is an error in the implementation.
I think the E5-2660v2 is a Haswell processor, so the performance increase Bob sees must have another explanation.
BTW this is a v3 chip:

model name : Intel(R) Xeon(R) CPU E5-2660 v3 @ 2.60GHz
The v2 runs @2.2 GHz -> http://www.cpubenchmark.net/cpu.php?cpu ... 40+2.20GHz
and the v3 runs @2.6 GHz -> http://www.cpubenchmark.net/cpu.php?cpu ... 40+2.60GHz
Actually, Intel understates the speed. Running all 20 cores at once gives a constant clock speed of 2.9 GHz. Only when you enable hyper-threading and run 40 threads will it slow down to 2.6 GHz. Run 3, 4 or 5 threads and you will see 3.3 GHz non-stop...

The reason why Haswell's base clock is so low is mostly because of AVX.

The limit is probably either power or thermal. When you are not using AVX, half of the width of the CPU's floating-point units sits idle, so they draw much less power.

By switching from SSE2 (128-bit) to AVX (256-bit) I get almost double the matrix multiplication throughput (and of course, if you are not doing floating-point work at all, all the FP units are idle).

If you run 20 threads of AVX-intensive workload, it will probably go down to the base clocks.

People (overclockers) have done a lot of tests to show that AVX very significantly increases power draw and temperature for Haswells.
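
The "almost double" figure follows from register width alone. A minimal sketch of that arithmetic, assuming 32-bit single-precision elements (the post does not say which precision was used) and ignoring memory bandwidth and clock throttling, which is why real gains fall a little short of the ideal ratio:

```python
def lanes(register_bits, element_bits=32):
    """Number of elements one SIMD register holds (32-bit floats by default)."""
    return register_bits // element_bits


sse2_lanes = lanes(128)  # 128-bit SSE2 register
avx_lanes = lanes(256)   # 256-bit AVX register

# Ideal per-instruction throughput ratio if nothing else is the bottleneck.
ideal_ratio = avx_lanes / sse2_lanes
print(sse2_lanes, avx_lanes, ideal_ratio)
```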
