I got ICC installed on the 20-core box finally, and ran a test. Results were beyond surprising. Both builds were run with the same settings, and both used transparent huge pages in Linux.
Here is the icc output using 20 cores:
time=1:30(99%) nodes=9072501462(9.1B) fh1=89% nps=100.7M
Here is the gcc output using 20 cores:
time=1:30(98%) nodes=7547414706(7.5B) fh1=90% nps=83.7M
Here are the single-core runs for comparison, first Intel, then gcc:
time=1:00(100%) nodes=385915568(385.9M) fh1=92% nps=6.4M
time=1:00(100%) nodes=391303121(391.3M) fh1=92% nps=6.5M
So a dead heat (gcc slightly better) with one thread, not so dead a heat with 20 cores. Note that these were run with turbo-boost enabled again, for other testing. The 20-core runs hold 2.9 GHz rock-solid; the 1-core tests run at 3.3 GHz. To compute NPS scaling, first scale the single-core figure down to the 20-core clock:
6.5 * 2.9 / 3.3 = 5.71M nps for one 2.9 GHz processor.
100.7 / 5.7 = 17.7x, which is more reasonable. I suppose I am going to have to re-run all of my tests yet again with this version...
More on this later.
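For anyone who wants to redo the scaling arithmetic above, here is a minimal Python sketch. It parses the four status lines from this post and applies the same clock correction (2.9 GHz for the 20-core runs, 3.3 GHz for the single-core runs). The names `parse_status` and `clock_adjusted_scaling` are made up for illustration, and the regular expression only assumes the line layout shown above; none of this is part of the engine. Carried at full precision the result is about 17.6x; the 17.7x above comes from rounding the baseline to 5.7M first.

```python
import re

# Status lines copied verbatim from the post above, keyed by (compiler, cores).
RUNS = {
    ("icc", 20): "time=1:30(99%) nodes=9072501462(9.1B) fh1=89% nps=100.7M",
    ("gcc", 20): "time=1:30(98%) nodes=7547414706(7.5B) fh1=90% nps=83.7M",
    ("icc", 1): "time=1:00(100%) nodes=385915568(385.9M) fh1=92% nps=6.4M",
    ("gcc", 1): "time=1:00(100%) nodes=391303121(391.3M) fh1=92% nps=6.5M",
}

# Assumed layout: time=M:SS(cpu%) nodes=N(short) fh1=P% nps=X.YM
STATUS_RE = re.compile(
    r"time=(\d+):(\d+)\(\d+%\)\s+nodes=(\d+)\([^)]*\)\s+fh1=\d+%\s+nps=([\d.]+)M"
)


def parse_status(line):
    """Extract elapsed seconds, node count and reported nps (in millions)."""
    match = STATUS_RE.match(line)
    if match is None:
        raise ValueError(f"unrecognised status line: {line!r}")
    minutes, seconds, nodes, nps_m = match.groups()
    return {
        "seconds": int(minutes) * 60 + int(seconds),
        "nodes": int(nodes),
        "nps_m": float(nps_m),
    }


def clock_adjusted_scaling(multi_nps, single_nps, multi_ghz, single_ghz):
    """NPS speedup after scaling the single-core figure to the multi-core clock."""
    baseline = single_nps * multi_ghz / single_ghz
    return multi_nps / baseline


icc20 = parse_status(RUNS[("icc", 20)])
gcc20 = parse_status(RUNS[("gcc", 20)])
gcc1 = parse_status(RUNS[("gcc", 1)])

# Same inputs as the post: icc 20-core nps against the gcc single-core nps.
scaling = clock_adjusted_scaling(icc20["nps_m"], gcc1["nps_m"], 2.9, 3.3)
print(f"clock-adjusted NPS scaling: {scaling:.2f}x")
print(f"icc/gcc node ratio on 20 cores: {icc20['nodes'] / gcc20['nodes']:.3f}")
```

Like the post, this uses the 6.5M single-core figure as the baseline; swapping in the 6.4M figure from the icc single-core run would change the result slightly.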
There are compilers and there are compilers
- bob
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
- matthewlai
- Posts: 793
- Joined: Sun Aug 03, 2014 4:48 am
- Location: London, UK
Re: There are compilers and there are compilers
I have also been playing with 20-core nodes on a cluster, and I found GCC 4.9 to be very slightly faster than ICC for my engine.
That said, my engine does a lot of large matrix multiplications using Eigen with explicit vectorization, and is parallelized using OpenMP, so its performance characteristics are probably entirely unlike normal chess engines. Surprisingly, Eigen's matrix multiplication is even faster than ICC + MKL for me.
Disclosure: I work for DeepMind on the AlphaZero project, but everything I say here is personal opinion and does not reflect the views of DeepMind / Alphabet.
- bob
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: There are compilers and there are compilers
matthewlai wrote: "I have also been playing with 20 cores nodes on a cluster, and I found GCC 4.9 to be very slightly faster than ICC for my engine. [...] Surprisingly, Eigen's matrix multiplication is even faster than ICC + MKL for me."
This is not a cluster. This is one box with 2 Intel 2660 10-core chips... Intel has a major speed advantage over gcc. Both were profile-guided optimizations, so apples-to-apples.
If something beats MKL, something is definitely fishy... That has been optimized, re-optimized, re-re-optimized, etc...
- matthewlai
- Posts: 793
- Joined: Sun Aug 03, 2014 4:48 am
- Location: London, UK
Re: There are compilers and there are compilers
bob wrote: "This is not a cluster. This is one box with 2 Intel 2660 10 core chips... [...] If something beats MKL something is definitely fishy..."
It's a cluster of a few hundred 20-core nodes totalling >13000 cores. Each node has 2x E5-2660 v2.
In my case neither is PGO'ed, so that's apples-to-apples as well.
I am surprised that it beats MKL as well, but Eigen is also a very mature linear algebra library with explicit vectorization for all the SIMD instruction sets out there.
For parallel matrix multiplications MKL scales slightly better, but for my application I can parallelize at a higher level, and Eigen comes out slightly ahead for single-threaded multiplications.
- bob
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: There are compilers and there are compilers
matthewlai wrote: "It's a cluster of a few hundred 20-core nodes totalling >13000 cores. [...] In my case neither is PGO'ed, so that's apple-to-apple as well. [...]"
I think Intel's PGO is better than GCC's by a significant margin... I ran a non-PGO version and just broke 90M nps, where the profiled version breaks 100M. I suspect it shuffles memory variables around a bit to optimize locality and minimize cache block invalidates.
- MikeB
- Posts: 4889
- Joined: Thu Mar 09, 2006 6:34 am
- Location: Pen Argyl, Pennsylvania
Re: There are compilers and there are compilers
bob wrote: "I got ICC installed on the 20 core box finally, and ran a test. Results were beyond surprising. [...]"
NICE! Did you change anything in your current makefile?
- matthewlai
- Posts: 793
- Joined: Sun Aug 03, 2014 4:48 am
- Location: London, UK
Re: There are compilers and there are compilers
bob wrote: "I think Intel's PGO is better than GCC by a significant margin... [...]"
That's possible. GCC's PGO doesn't seem to help very much on things I've tried it on. Usually something like 3% at best.
- bob
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: There are compilers and there are compilers
MikeB wrote: "NICE! Did you change anything on your current makefile?"
Just changed -prof_genx to -prof_gen.
-
- Posts: 859
- Joined: Mon Aug 10, 2009 10:05 pm
- Location: Italy
- Full name: Stefano Gemma
Re: There are compilers and there are compilers
bob wrote: "I got ICC installed on the 20 core box finally, and ran a test. Results were beyond surprising. [...]"
It would be more general if you could try other engines too. In some cases, I've seen the very old C++ Builder 6.0 generate faster code than current MSVC++, but only in 32-bit; in 64-bit, MSVC is the faster one. More compilers, more software, more results.
-
- Posts: 892
- Joined: Sun Nov 19, 2006 9:16 pm
- Location: Russia
Re: There are compilers and there are compilers
The node counts are different. I could say that the GCC build is 20% smarter in the 20-core run. :)
Seriously, nodes and nps do not matter. Time to solution matters.