There are compilers and there are compilers

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

OneTrickPony
Posts: 157
Joined: Tue Apr 30, 2013 1:29 am

Re: There are compilers and there are compilers

Post by OneTrickPony »

Interesting. I recently downloaded the ICC trial and tested it on several programs, from a toy n-queens solver and a simple sudoku solver to an advanced poker equilibrium solver, and GCC (4.8.1) performed significantly better for both single-threaded and multi-threaded (OpenMP) code.
I am not competent enough to say why that is, but it would be useful if you gave the compiler options for both. It's easy to miss some useful ones for GCC.
zullil
Posts: 6442
Joined: Tue Jan 09, 2007 12:31 am
Location: PA USA
Full name: Louis Zulli

Re: There are compilers and there are compilers

Post by zullil »

OneTrickPony wrote:Interesting. I recently downloaded the ICC trial and tested it on several programs, from a toy n-queens solver and a simple sudoku solver to an advanced poker equilibrium solver, and GCC (4.8.1) performed significantly better for both single-threaded and multi-threaded (OpenMP) code.
I am not competent enough to say why that is, but it would be useful if you gave the compiler options for both. It's easy to miss some useful ones for GCC.
What CPU did you compile for?
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: There are compilers and there are compilers

Post by bob »

Joost Buijs wrote:
bob wrote:
Joost Buijs wrote:
ymatioun wrote:New Intel Xeon CPUs include something called "Transaction Synchronization Extensions".

I don't know exactly how they work, but those instructions were specifically designed to reduce synch overhead. Perhaps ICC emits those instructions, while GCC most certainly does not? That would explain why improvement only happens in parallel search.
TSX is disabled on all Haswell processors because there is an error in the implementation.
I think the E5-2660v2 is a Haswell processor, the performance increase Bob sees must have an other explanation.
BTW this is a v3 chip

model name : Intel(R) Xeon(R) CPU E5-2660 v3 @ 2.60GHz
The v3 doesn't have TSX either.

http://ark.intel.com/nl/products/81706/ ... e-2_60-GHz
I didn't intend to imply anything about TSX. Not a big fan...
stegemma
Posts: 859
Joined: Mon Aug 10, 2009 10:05 pm
Location: Italy
Full name: Stefano Gemma

Re: There are compilers and there are compilers

Post by stegemma »

bob wrote:[...]
Fixed depth shows the same thing as fixed time here. Intel compiler produces an executable over 10% faster. Fixed depth or fixed time doesn't matter a bit here when I am just measuring the effectiveness of the compiler.
First of all, I wish I had the money to buy this "monster" ;)

Second, what I would like to say may be only theoretical and, in this context, maybe even pointless. I was just saying that examining node N takes time proportional to the number of moves in that node, and that node N+1 has statistically fewer moves than the previous one. If you have a faster build and test with a time limit, the faster version can reach node N+1 more often, and those nodes will be examined in less time than node N... this gives a higher NPS. It is like a race between two runners that ends in a downhill (discesa, in Italian) on the last stretch of road. The slower runner can't reach the downhill but the faster one can. The faster runner's miles per second are already higher before the downhill, but they increase further on the downhill itself. That's why I think you should compare at fixed depth.

Of course, it's not proven that node N+1 has statistically fewer moves than node N, and your experiment says that the difference, if any, should be very small.
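
To put rough numbers on this argument, here is a toy model. It is only a sketch: the node counts, the moves per node and the per-move cost are all made-up values, and `aggregate_nps` is a hypothetical helper, not anything taken from an engine. It assumes the time spent on a node is exactly proportional to its number of moves.

```python
def aggregate_nps(node_counts, moves_per_node, seconds_per_move):
    """Overall nodes per second when node cost is proportional to move count.

    node_counts[i]    -- how many nodes of kind i were searched
    moves_per_node[i] -- average number of moves in a node of kind i
    """
    total_nodes = sum(node_counts)
    total_time = sum(
        count * moves * seconds_per_move
        for count, moves in zip(node_counts, moves_per_node)
    )
    return total_nodes / total_time


# Hypothetical values: the slower build only searches the "expensive" nodes,
# the faster build also reaches an equal number of "cheaper" nodes.
COST = 1e-8  # seconds per move, made up
slow_build = aggregate_nps([1000], [35], COST)
fast_build = aggregate_nps([1000, 1000], [35, 30], COST)
print(f"only expensive nodes: {slow_build:,.0f} NPS")
print(f"with cheaper nodes:   {fast_build:,.0f} NPS")
```

In this model the second figure is higher only because cheaper nodes were mixed in, not because the code ran faster, which is the effect a fixed-depth comparison would rule out.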
Vinvin
Posts: 5228
Joined: Thu Mar 09, 2006 9:40 am
Full name: Vincent Lejeune

Re: There are compilers and there are compilers

Post by Vinvin »

bob wrote:
Joost Buijs wrote:
ymatioun wrote:New Intel Xeon CPUs include something called "Transaction Synchronization Extensions".

I don't know exactly how they work, but those instructions were specifically designed to reduce synch overhead. Perhaps ICC emits those instructions, while GCC most certainly does not? That would explain why improvement only happens in parallel search.
TSX is disabled on all Haswell processors because there is an error in the implementation.
I think the E5-2660v2 is a Haswell processor, the performance increase Bob sees must have an other explanation.
BTW this is a v3 chip:

model name : Intel(R) Xeon(R) CPU E5-2660 v3 @ 2.60GHz
The v2 runs @2.2 GHz -> http://www.cpubenchmark.net/cpu.php?cpu ... 40+2.20GHz
and the v3 runs @2.6 GHz -> http://www.cpubenchmark.net/cpu.php?cpu ... 40+2.60GHz
Joost Buijs
Posts: 1563
Joined: Thu Jul 16, 2009 10:47 am
Location: Almere, The Netherlands

Re: There are compilers and there are compilers

Post by Joost Buijs »

Vinvin wrote:
bob wrote:
Joost Buijs wrote:
ymatioun wrote:New Intel Xeon CPUs include something called "Transaction Synchronization Extensions".

I don't know exactly how they work, but those instructions were specifically designed to reduce synch overhead. Perhaps ICC emits those instructions, while GCC most certainly does not? That would explain why improvement only happens in parallel search.
TSX is disabled on all Haswell processors because there is an error in the implementation.
I think the E5-2660v2 is a Haswell processor, the performance increase Bob sees must have an other explanation.
BTW this is a v3 chip:

model name : Intel(R) Xeon(R) CPU E5-2660 v3 @ 2.60GHz
The v2 runs @2.2 GHz -> http://www.cpubenchmark.net/cpu.php?cpu ... 40+2.20GHz
and the v3 runs @2.6 GHz -> http://www.cpubenchmark.net/cpu.php?cpu ... 40+2.60GHz
2.6 GHz is the base clock, but in practice it runs at 2.9 GHz with all cores fully loaded; when only 2 cores are loaded it even runs at 3.3 GHz.
This makes it difficult to determine the exact SMP speedup.
Maybe it is wise to disable turbo mode in the BIOS before doing any scaling measurements.
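
If turbo mode stays on, one rough way to account for it is to rescale the measured speedup by the clock ratio. This is a minimal sketch, assuming NPS scales linearly with core clock (which ignores memory effects) and assuming the single-thread run sits at 3.3 GHz. The NPS values in the usage lines are made up and `clock_normalized_speedup` is a hypothetical helper.

```python
def clock_normalized_speedup(nps_single, nps_parallel, ghz_single, ghz_parallel):
    """Parallel speedup with the turbo clock difference factored out.

    Assumes NPS is proportional to core clock, which is only an approximation.
    """
    if min(nps_single, nps_parallel, ghz_single, ghz_parallel) <= 0:
        raise ValueError("all inputs must be positive")
    raw_speedup = nps_parallel / nps_single
    # The single-thread run enjoyed a higher clock, so scale the raw figure
    # up by the ratio of the two clocks to compare at equal frequency.
    return raw_speedup * (ghz_single / ghz_parallel)


# Hypothetical NPS figures; 3.3 and 2.9 GHz are the clocks mentioned above.
raw = 100e6 / 6.5e6
adjusted = clock_normalized_speedup(6.5e6, 100e6, ghz_single=3.3, ghz_parallel=2.9)
print(f"raw speedup {raw:.2f}, clock-adjusted {adjusted:.2f}")
```

The adjusted figure is always larger than the raw one whenever the single-thread clock is the higher of the two, so a raw comparison understates the scaling.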
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: There are compilers and there are compilers

Post by bob »

Vinvin wrote:
bob wrote:
Joost Buijs wrote:
ymatioun wrote:New Intel Xeon CPUs include something called "Transaction Synchronization Extensions".

I don't know exactly how they work, but those instructions were specifically designed to reduce synch overhead. Perhaps ICC emits those instructions, while GCC most certainly does not? That would explain why improvement only happens in parallel search.
TSX is disabled on all Haswell processors because there is an error in the implementation.
I think the E5-2660v2 is a Haswell processor, the performance increase Bob sees must have an other explanation.
BTW this is a v3 chip:

model name : Intel(R) Xeon(R) CPU E5-2660 v3 @ 2.60GHz
The v2 runs @2.2 GHz -> http://www.cpubenchmark.net/cpu.php?cpu ... 40+2.20GHz
and the v3 runs @2.6 GHz -> http://www.cpubenchmark.net/cpu.php?cpu ... 40+2.60GHz
Actually Intel understates the speed. Running all 20 cores at once gives a constant clock speed of 2.9 GHz. Only when you enable hyper-threading and run 40 threads will it slow down to 2.6 GHz. Run 3, 4 or 5 threads and you will see 3.3 GHz non-stop...
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: There are compilers and there are compilers

Post by bob »

Joost Buijs wrote:
Vinvin wrote:
bob wrote:
Joost Buijs wrote:
ymatioun wrote:New Intel Xeon CPUs include something called "Transaction Synchronization Extensions".

I don't know exactly how they work, but those instructions were specifically designed to reduce synch overhead. Perhaps ICC emits those instructions, while GCC most certainly does not? That would explain why improvement only happens in parallel search.
TSX is disabled on all Haswell processors because there is an error in the implementation.
I think the E5-2660v2 is a Haswell processor, the performance increase Bob sees must have an other explanation.
BTW this is a v3 chip:

model name : Intel(R) Xeon(R) CPU E5-2660 v3 @ 2.60GHz
The v2 runs @2.2 GHz -> http://www.cpubenchmark.net/cpu.php?cpu ... 40+2.20GHz
and the v3 runs @2.6 GHz -> http://www.cpubenchmark.net/cpu.php?cpu ... 40+2.60GHz
2.6 GHz is the base clock, but in practice it runs at 2.9 GHz with all cores fully loaded; when only 2 cores are loaded it even runs at 3.3 GHz.
This makes it difficult to determine the exact SMP speedup.
Maybe it is wise to disable turbo mode in the BIOS before doing any scaling measurements.
You can do that and take a 10% speed penalty. Or you can test as I do and always run 20 cores. If I want 4-CPU numbers, I run a 4-CPU test plus a 16-CPU dummy load, just to make the clocks stick at 2.9 GHz...

You should be able to limit this stuff with the Linux kernel, but it doesn't quite work as advertised yet. I.e., you can set the max clock frequency to 2.9 GHz, but it still clocks up to 3.3 GHz, unfortunately.
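
For anyone who wants to try the dummy-load approach, here is a minimal sketch of one way to do it. It is not the actual setup used for these tests: the helper names are hypothetical, the busy loop is plain integer work, and whether such a load holds the clocks at the all-core turbo on a given machine is an assumption to verify.

```python
import multiprocessing as mp


def dummy_worker_count(total_cores: int, test_threads: int) -> int:
    """Number of busy-loop workers needed to keep every core loaded."""
    if test_threads < 0 or test_threads > total_cores:
        raise ValueError("test_threads must be between 0 and total_cores")
    return total_cores - test_threads


def _spin(stop) -> None:
    """Burn CPU with cheap integer work until the stop event is set."""
    x = 0
    while not stop.is_set():
        x = (x * 1103515245 + 12345) & 0xFFFFFFFF


def start_dummy_load(workers: int):
    """Start `workers` busy-loop processes; returns (stop_event, processes)."""
    stop = mp.Event()
    procs = [
        mp.Process(target=_spin, args=(stop,), daemon=True)
        for _ in range(workers)
    ]
    for proc in procs:
        proc.start()
    return stop, procs


def stop_dummy_load(stop, procs) -> None:
    """Signal the workers to exit and wait for them."""
    stop.set()
    for proc in procs:
        proc.join()


# Intended use, from a script's main guard (20 cores, 4-thread test):
#     stop, procs = start_dummy_load(dummy_worker_count(20, 4))
#     ... run the 4-thread benchmark here ...
#     stop_dummy_load(stop, procs)
```

Processes are used rather than threads because Python threads would not load separate cores.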
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: There are compilers and there are compilers

Post by bob »

stegemma wrote:
bob wrote:[...]
Fixed depth shows the same thing as fixed time here. Intel compiler produces an executable over 10% faster. Fixed depth or fixed time doesn't matter a bit here when I am just measuring the effectiveness of the compiler.
First of all, I wish I had the money to buy this "monster" ;)

Second, what I would like to say may be only theoretical and, in this context, maybe even pointless. I was just saying that examining node N takes time proportional to the number of moves in that node, and that node N+1 has statistically fewer moves than the previous one. If you have a faster build and test with a time limit, the faster version can reach node N+1 more often, and those nodes will be examined in less time than node N... this gives a higher NPS. It is like a race between two runners that ends in a downhill (discesa, in Italian) on the last stretch of road. The slower runner can't reach the downhill but the faster one can. The faster runner's miles per second are already higher before the downhill, but they increase further on the downhill itself. That's why I think you should compare at fixed depth.

Of course, it's not proven that node N+1 has statistically fewer moves than node N, and your experiment says that the difference, if any, should be very small.
I ran a quick test and, for Crafty, this doesn't seem to happen. Tests to depth 20, 21 and 22 produce the same NPS numbers, i.e. for my test, depth=20 hit 6.5M NPS (one core only), depth=21 hit 6.5M NPS, depth=22 hit 6.5M NPS, etc. Very shallow searches are slower, but I don't run shallow searches for SMP testing.
matthewlai
Posts: 793
Joined: Sun Aug 03, 2014 4:48 am
Location: London, UK

Re: There are compilers and there are compilers

Post by matthewlai »

bob wrote:
Vinvin wrote:
bob wrote:
Joost Buijs wrote:
ymatioun wrote:New Intel Xeon CPUs include something called "Transaction Synchronization Extensions".

I don't know exactly how they work, but those instructions were specifically designed to reduce synch overhead. Perhaps ICC emits those instructions, while GCC most certainly does not? That would explain why improvement only happens in parallel search.
TSX is disabled on all Haswell processors because there is an error in the implementation.
I think the E5-2660v2 is a Haswell processor, the performance increase Bob sees must have an other explanation.
BTW this is a v3 chip:

model name : Intel(R) Xeon(R) CPU E5-2660 v3 @ 2.60GHz
The v2 runs @2.2 GHz -> http://www.cpubenchmark.net/cpu.php?cpu ... 40+2.20GHz
and the v3 runs @2.6 GHz -> http://www.cpubenchmark.net/cpu.php?cpu ... 40+2.60GHz
Actually Intel understates the speed. Running all 20 cores at once gives a constant clock speed of 2.9 GHz. Only when you enable hyper-threading and run 40 threads will it slow down to 2.6 GHz. Run 3, 4 or 5 threads and you will see 3.3 GHz non-stop...
The reason why Haswell's base clock is so low is mostly because of AVX.

The limit is probably either power or thermal, and when you are not using AVX, half of the CPUs' floating point units are idle, and therefore they draw much less power.

By switching from SSE2 (128-bit) to AVX (256-bit) I get almost double the matrix multiplication throughput (and of course, if you are not doing floating-point work at all, all the FP units are idle).

If you run 20 threads of AVX-intensive workload, it will probably go down to the base clocks.

People (overclockers) have done a lot of tests to show that AVX very significantly increases power draw and temperature for Haswells.
Disclosure: I work for DeepMind on the AlphaZero project, but everything I say here is personal opinion and does not reflect the views of DeepMind / Alphabet.