There are compilers and there are compilers

bob
Posts: 20549
Joined: Mon Feb 27, 2006 6:30 pm
Location: Birmingham, AL

There are compilers and there are compilers

Post by bob » Tue Jun 23, 2015 9:36 pm

I got ICC installed on the 20 core box finally, and ran a test. Results were beyond surprising. Both of these were run using the same settings, both using the transparent huge pages in Linux.

Here is the icc output using 20 cores:

time=1:30(99%) nodes=9072501462(9.1B) fh1=89% nps=100.7M

Here is the gcc output using 20 cores:

time=1:30(98%) nodes=7547414706(7.5B) fh1=90% nps=83.7M



Here's the single-core run to compare them, first Intel, then gcc

time=1:00(100%) nodes=385915568(385.9M) fh1=92% nps=6.4M
time=1:00(100%) nodes=391303121(391.3M) fh1=92% nps=6.5M


So a dead heat (gcc slightly better) with one thread, not so much of a dead heat with 20 cores. Note that these were run with turbo-boost enabled again, for other testing. The 20-core runs hold 2.9 GHz rock-solid; the 1-core tests run at 3.3 GHz. To compute NPS scaling, scale the 1-core number down to the 20-core clock first:

6.5 * 2.9 / 3.3 = 5.71M nps for one 2.9 GHz processor.

100.7 / 5.7 = 17.7x, which is more reasonable. I suppose I am going to have to re-run all of my tests yet again with this version...
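
As a sanity check on that arithmetic, here is a minimal Python sketch of the clock-normalized scaling calculation. The function and argument names are made up for illustration; the numbers are the ones quoted above (gcc's 6.5M single-core figure, 100.7M for icc on 20 cores, 3.3 GHz vs 2.9 GHz), and it assumes NPS scales linearly with clock speed.

```python
def clock_normalized_speedup(parallel_nps, single_nps, parallel_ghz, single_ghz):
    """Return (per-core NPS at the parallel clock, NPS speedup).

    Assumes single-thread NPS scales linearly with clock frequency.
    """
    # Scale the single-core result down to the clock the parallel run used.
    per_core_nps = single_nps * parallel_ghz / single_ghz
    # Speedup is the parallel NPS over that clock-adjusted per-core NPS.
    return per_core_nps, parallel_nps / per_core_nps


# Figures from the post, in millions of nodes per second.
per_core, speedup = clock_normalized_speedup(
    parallel_nps=100.7, single_nps=6.5, parallel_ghz=2.9, single_ghz=3.3
)
print(f"{per_core:.2f}M nps per core, {speedup:.1f}x on 20 cores")
```

Without rounding the per-core figure to 5.7 first, the ratio comes out at about 17.6x rather than 17.7x, which does not change the conclusion.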

More on this later.

matthewlai
Posts: 793
Joined: Sun Aug 03, 2014 2:48 am
Location: London, UK

Re: There are compilers and there are compilers

Post by matthewlai » Tue Jun 23, 2015 10:55 pm

I have also been playing with 20-core nodes on a cluster, and I found GCC 4.9 to be very slightly faster than ICC for my engine.

That said, my engine does a lot of large matrix multiplications using Eigen with explicit vectorization, and is parallelized using OpenMP, so its performance characteristics are probably entirely unlike normal chess engines. Surprisingly, Eigen's matrix multiplication is even faster than ICC + MKL for me.
Disclosure: I work for DeepMind on the AlphaZero project, but everything I say here is personal opinion and does not reflect the views of DeepMind / Alphabet.

bob
Posts: 20549
Joined: Mon Feb 27, 2006 6:30 pm
Location: Birmingham, AL

Re: There are compilers and there are compilers

Post by bob » Tue Jun 23, 2015 11:01 pm

matthewlai wrote:I have also been playing with 20-core nodes on a cluster, and I found GCC 4.9 to be very slightly faster than ICC for my engine.

That said, my engine does a lot of large matrix multiplications using Eigen with explicit vectorization, and is parallelized using OpenMP, so its performance characteristics are probably entirely unlike normal chess engines. Surprisingly, Eigen's matrix multiplication is even faster than ICC + MKL for me.
This is not a cluster. This is one box with two Intel 2660 10-core chips... Intel has a major speed advantage over gcc. Both builds used profile-guided optimization, so it's apples-to-apples.

If something beats MKL, something is definitely fishy... That has been optimized, re-optimized, re-re-optimized, etc...

matthewlai
Posts: 793
Joined: Sun Aug 03, 2014 2:48 am
Location: London, UK

Re: There are compilers and there are compilers

Post by matthewlai » Tue Jun 23, 2015 11:10 pm

bob wrote:
This is not a cluster. This is one box with two Intel 2660 10-core chips... Intel has a major speed advantage over gcc. Both builds used profile-guided optimization, so it's apples-to-apples.

If something beats MKL, something is definitely fishy... That has been optimized, re-optimized, re-re-optimized, etc...
It's a cluster of a few hundred 20-core nodes totalling >13000 cores.

Each node has 2x E5-2660 v2.

In my case neither is PGO'ed, so that's apple-to-apple as well :).

I am surprised that it beats MKL as well, but Eigen is also a very mature linear algebra library with explicit vectorization for all the SIMD instruction sets out there.

For parallel matrix multiplications MKL scales slightly better, but for my application I can parallelize at a higher level, and Eigen comes out slightly ahead for single threaded multiplications.

bob
Posts: 20549
Joined: Mon Feb 27, 2006 6:30 pm
Location: Birmingham, AL

Re: There are compilers and there are compilers

Post by bob » Wed Jun 24, 2015 1:41 am

matthewlai wrote:
It's a cluster of a few hundred 20-core nodes totalling >13000 cores.

Each node has 2x E5-2660 v2.

In my case neither is PGO'ed, so that's apple-to-apple as well :).

I am surprised that it beats MKL as well, but Eigen is also a very mature linear algebra library with explicit vectorization for all the SIMD instruction sets out there.

For parallel matrix multiplications MKL scales slightly better, but for my application I can parallelize at a higher level, and Eigen comes out slightly ahead for single threaded multiplications.
I think Intel's PGO is better than GCC's by a significant margin... I ran a non-PGO version and it just broke 90M nps, where the profiled version breaks 100M. I suspect it shuffles variables around in memory a bit to improve locality and minimize cache block invalidations.

MikeB
Posts: 3461
Joined: Thu Mar 09, 2006 5:34 am
Location: Pen Argyl, Pennsylvania

Re: There are compilers and there are compilers

Post by MikeB » Wed Jun 24, 2015 1:56 am

bob wrote:I got ICC installed on the 20 core box finally, and ran a test. Results were beyond surprising. Both of these were run using the same settings, both using the transparent huge pages in Linux.

NICE! Did you change anything on your current makefile?

matthewlai
Posts: 793
Joined: Sun Aug 03, 2014 2:48 am
Location: London, UK

Re: There are compilers and there are compilers

Post by matthewlai » Wed Jun 24, 2015 1:58 am

bob wrote:
I think Intel's PGO is better than GCC's by a significant margin... I ran a non-PGO version and it just broke 90M nps, where the profiled version breaks 100M. I suspect it shuffles variables around in memory a bit to improve locality and minimize cache block invalidations.
That's possible. GCC's PGO doesn't seem to help very much on things I've tried it on. Usually something like 3% at best.

bob
Posts: 20549
Joined: Mon Feb 27, 2006 6:30 pm
Location: Birmingham, AL

Re: There are compilers and there are compilers

Post by bob » Wed Jun 24, 2015 2:58 am

MikeB wrote:
NICE! Did you change anything on your current makefile?
Just changed -prof_genx to -prof_gen.

stegemma
Posts: 859
Joined: Mon Aug 10, 2009 8:05 pm
Location: Italy
Full name: Stefano Gemma

Re: There are compilers and there are compilers

Post by stegemma » Wed Jun 24, 2015 7:20 am

bob wrote:I got ICC installed on the 20 core box finally, and ran a test. Results were beyond surprising. Both of these were run using the same settings, both using the transparent huge pages in Linux.

It would be more general if you could try other engines too. In some cases, I've seen the very old C++ Builder 6.0 generate faster code than current MSVC++, but only in 32-bit; in 64-bit, MSVC is the faster one. More compilers, more software, more results :)

Aleks Peshkov
Posts: 870
Joined: Sun Nov 19, 2006 8:16 pm
Location: Russia

Re: There are compilers and there are compilers

Post by Aleks Peshkov » Wed Jun 24, 2015 8:36 am

The node counts are different. I could say the GCC build is 20% smarter on 20 cores. :)
Seriously, nodes and nps do not matter. Time to solution matters.
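
The 20% figure can be checked against the two 20-core output lines from the first post. A minimal Python sketch, assuming the log format shown there; the regex and the `parse_line` helper are illustrative and not part of any engine's tooling:

```python
import re

# Matches the "time=M:SS(NN%) nodes=N" prefix of the output lines in the first post.
LINE_RE = re.compile(r"time=(\d+):(\d+)\(\d+%\)\s+nodes=(\d+)")


def parse_line(line):
    """Return (elapsed seconds, node count) from one search-summary line."""
    match = LINE_RE.search(line)
    if match is None:
        raise ValueError(f"unrecognized line: {line!r}")
    minutes, seconds, nodes = (int(group) for group in match.groups())
    return minutes * 60 + seconds, nodes


icc_seconds, icc_nodes = parse_line(
    "time=1:30(99%) nodes=9072501462(9.1B) fh1=89% nps=100.7M"
)
gcc_seconds, gcc_nodes = parse_line(
    "time=1:30(98%) nodes=7547414706(7.5B) fh1=90% nps=83.7M"
)

# Same wall-clock budget for both builds, so compare raw node counts.
node_ratio = icc_nodes / gcc_nodes
print(f"icc searched {100 * (node_ratio - 1):.0f}% more nodes in {icc_seconds} s")
```

Both runs were stopped at the same 1:30, so the ratio only says the icc build visited about 20% more nodes in that time; it says nothing about which build reaches a given depth or solution sooner.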
