There are compilers and there are compilers

bob · Post by **bob** » Tue Jun 23, 2015 11:36 pm

I got ICC installed on the 20 core box finally, and ran a test. Results were beyond surprising. Both of these were run using the same settings, both using the transparent huge pages in Linux.

Here is the icc output using 20 cores:

time=1:30(99%) nodes=9072501462(9.1B) fh1=89% nps=100.7M

Here is the gcc output using 20 cores:

time=1:30(98%) nodes=7547414706(7.5B) fh1=90% nps=83.7M

Here's the single-core run to compare them, first Intel, then gcc

time=1:00(100%) nodes=385915568(385.9M) fh1=92% nps=6.4M
time=1:00(100%) nodes=391303121(391.3M) fh1=92% nps=6.5M

So a dead heat (gcc slightly better) with one thread, not so dead heat with 20 cores. Note that these are run with turbo-boost enabled again, for other testing. The 20 core runs run at 2.9ghz rock-solid, the 1-core tests run at 3.3ghz. To compute NPS scaling, it looks like this:

6.5 * 2.9 / 3.3 = 5.71M nps for one 2.9ghz processor.

100.7 / 5.7 = 17.7x, which is more reasonable. I suppose I am going to have to re-run all of my tests yet again with this version...
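For anyone who wants to redo the scaling arithmetic above, here is a minimal Python sketch. It assumes NPS scales linearly with clock speed, which is the same assumption the calculation above makes; the function names are made up for illustration and are not part of any engine. Note that without rounding the baseline to 5.7 first, the ICC figure comes out closer to 17.6x than 17.7x.

```python
def normalize_nps(nps_millions, measured_ghz, target_ghz):
    """Rescale a node rate measured at one clock speed to another clock speed.

    Assumes NPS is directly proportional to clock frequency (an assumption,
    not something the benchmark itself verifies).
    """
    return nps_millions * target_ghz / measured_ghz


def nps_speedup(parallel_nps, single_nps, single_ghz, parallel_ghz):
    """NPS speedup of a parallel run over a clock-adjusted single-core run."""
    baseline = normalize_nps(single_nps, single_ghz, parallel_ghz)
    return parallel_nps / baseline


# Figures from the post: 6.5M nps on one core at 3.3 GHz,
# 100.7M (icc) and 83.7M (gcc) on 20 cores at 2.9 GHz.
baseline = normalize_nps(6.5, measured_ghz=3.3, target_ghz=2.9)
icc_speedup = nps_speedup(100.7, 6.5, single_ghz=3.3, parallel_ghz=2.9)
gcc_speedup = nps_speedup(83.7, 6.5, single_ghz=3.3, parallel_ghz=2.9)

print(f"baseline at 2.9 GHz: {baseline:.2f}M nps")
print(f"icc 20-core NPS speedup: {icc_speedup:.1f}x")
print(f"gcc 20-core NPS speedup: {gcc_speedup:.1f}x")
```

The sketch uses the 6.5M single-core figure as the baseline for both compilers, as the post does; using icc's own 6.4M figure would give a slightly higher icc speedup.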

More on this later.

matthewlai · Post by **matthewlai** » Wed Jun 24, 2015 12:55 am

I have also been playing with 20 cores nodes on a cluster, and I found GCC 4.9 to be very slightly faster than ICC for my engine.

That said, my engine does a lot of large matrix multiplications using Eigen with explicit vectorization, and is parallelized using OpenMP, so its performance characteristics are probably entirely unlike normal chess engines. Surprisingly, Eigen's matrix multiplication is even faster than ICC + MKL for me.

bob · Post by **bob** » Wed Jun 24, 2015 1:01 am

matthewlai wrote:I have also been playing with 20 cores nodes on a cluster, and I found GCC 4.9 to be very slightly faster than ICC for my engine.

That said, my engine does a lot of large matrix multiplications using Eigen with explicit vectorization, and is parallelized using OpenMP, so its performance characteristics are probably entirely unlike normal chess engines. Surprisingly, Eigen's matrix multiplication is even faster than ICC + MKL for me.

This is not a cluster. This is one box with 2 Intel 2660 10-core chips... Intel has a major speed advantage over gcc. Both builds used profile-guided optimization, so apples-to-apples.

If something beats MKL something is definitely fishy... That has been optimized, re-optimized, re-re-optimized, etc...

matthewlai · Post by **matthewlai** » Wed Jun 24, 2015 1:10 am

bob wrote:
matthewlai wrote:I have also been playing with 20 cores nodes on a cluster, and I found GCC 4.9 to be very slightly faster than ICC for my engine.

That said, my engine does a lot of large matrix multiplications using Eigen with explicit vectorization, and is parallelized using OpenMP, so its performance characteristics are probably entirely unlike normal chess engines. Surprisingly, Eigen's matrix multiplication is even faster than ICC + MKL for me.
This is not a cluster. This is one box with 2 Intel 2660 10-core chips... Intel has a major speed advantage over gcc. Both builds used profile-guided optimization, so apples-to-apples.

If something beats MKL something is definitely fishy... That has been optimized, re-optimized, re-re-optimized, etc...

It's a cluster of a few hundred 20-core nodes totalling >13000 cores.

Each node has 2x E5-2660 v2.

In my case neither is PGO'ed, so that's apples-to-apples as well.

I am surprised that it beats MKL as well, but Eigen is also a very mature linear algebra library with explicit vectorization for all the SIMD instruction sets out there.

For parallel matrix multiplications MKL scales slightly better, but for my application I can parallelize at a higher level, and Eigen comes out slightly ahead for single threaded multiplications.

bob · Post by **bob** » Wed Jun 24, 2015 3:41 am

matthewlai wrote:
bob wrote:
matthewlai wrote:I have also been playing with 20 cores nodes on a cluster, and I found GCC 4.9 to be very slightly faster than ICC for my engine.

That said, my engine does a lot of large matrix multiplications using Eigen with explicit vectorization, and is parallelized using OpenMP, so its performance characteristics are probably entirely unlike normal chess engines. Surprisingly, Eigen's matrix multiplication is even faster than ICC + MKL for me.
This is not a cluster. This is one box with 2 Intel 2660 10-core chips... Intel has a major speed advantage over gcc. Both builds used profile-guided optimization, so apples-to-apples.

If something beats MKL something is definitely fishy... That has been optimized, re-optimized, re-re-optimized, etc...
It's a cluster of a few hundred 20-core nodes totalling >13000 cores.

Each node has 2x E5-2660 v2.

In my case neither is PGO'ed, so that's apples-to-apples as well.

I am surprised that it beats MKL as well, but Eigen is also a very mature linear algebra library with explicit vectorization for all the SIMD instruction sets out there.

For parallel matrix multiplications MKL scales slightly better, but for my application I can parallelize at a higher level, and Eigen comes out slightly ahead for single threaded multiplications.

I think Intel's PGO is better than GCC by a significant margin... I ran a non-PGO version and just broke 90M nps, where the profiled version breaks 100M. I'm suspecting it shuffles memory variables around a bit to optimize locality and minimize cache block invalidates.
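To put rough numbers on those gains, here is a small Python sketch. The 90.0 figure is an assumed stand-in for "just broke 90M" (the exact non-PGO rate is not given), and `relative_gain` is a hypothetical helper name, so treat the results as ballpark only.

```python
def relative_gain(new, old):
    """Percentage improvement of `new` over `old`."""
    return (new - old) / old * 100.0


# icc with PGO (100.7M nps) vs. icc without PGO (assumed ~90.0M nps).
icc_pgo_gain = relative_gain(100.7, 90.0)

# icc with PGO (100.7M nps) vs. gcc with PGO (83.7M nps), both on 20 cores.
icc_vs_gcc = relative_gain(100.7, 83.7)

print(f"icc PGO gain over non-PGO icc: about {icc_pgo_gain:.0f}%")
print(f"icc PGO gain over gcc PGO:     about {icc_vs_gcc:.0f}%")
```

Under that assumption, profiling is worth roughly 12% for the icc build on 20 cores, and the profiled icc build is roughly 20% ahead of the profiled gcc build in raw NPS.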

MikeB · Post by **MikeB** » Wed Jun 24, 2015 3:56 am

bob wrote:I got ICC installed on the 20 core box finally, and ran a test. Results were beyond surprising. Both of these were run using the same settings, both using the transparent huge pages in Linux.

Here is the icc output using 20 cores:

time=1:30(99%) nodes=9072501462(9.1B) fh1=89% nps=100.7M

Here is the gcc output using 20 cores:

time=1:30(98%) nodes=7547414706(7.5B) fh1=90% nps=83.7M

Here's the single-core run to compare them, first Intel, then gcc

time=1:00(100%) nodes=385915568(385.9M) fh1=92% nps=6.4M
time=1:00(100%) nodes=391303121(391.3M) fh1=92% nps=6.5M

So a dead heat (gcc slightly better) with one thread, not so dead heat with 20 cores. Note that these are run with turbo-boost enabled again, for other testing. The 20 core runs run at 2.9ghz rock-solid, the 1-core tests run at 3.3ghz. To compute NPS scaling, it looks like this:

6.5 * 2.9 / 3.3 = 5.71M nps for one 2.9ghz processor.

100.7 / 5.7 = 17.7x, which is more reasonable. I suppose I am going to have to re-run all of my tests yet again with this version...

More on this later.

NICE! Did you change anything on your current makefile?

matthewlai · Post by **matthewlai** » Wed Jun 24, 2015 3:58 am

bob wrote:
matthewlai wrote:
bob wrote:
matthewlai wrote:I have also been playing with 20 cores nodes on a cluster, and I found GCC 4.9 to be very slightly faster than ICC for my engine.

That said, my engine does a lot of large matrix multiplications using Eigen with explicit vectorization, and is parallelized using OpenMP, so its performance characteristics are probably entirely unlike normal chess engines. Surprisingly, Eigen's matrix multiplication is even faster than ICC + MKL for me.
This is not a cluster. This is one box with 2 Intel 2660 10-core chips... Intel has a major speed advantage over gcc. Both builds used profile-guided optimization, so apples-to-apples.

If something beats MKL something is definitely fishy... That has been optimized, re-optimized, re-re-optimized, etc...
It's a cluster of a few hundred 20-core nodes totalling >13000 cores.

Each node has 2x E5-2660 v2.

In my case neither is PGO'ed, so that's apples-to-apples as well.

I am surprised that it beats MKL as well, but Eigen is also a very mature linear algebra library with explicit vectorization for all the SIMD instruction sets out there.

For parallel matrix multiplications MKL scales slightly better, but for my application I can parallelize at a higher level, and Eigen comes out slightly ahead for single threaded multiplications.
I think Intel's PGO is better than GCC by a significant margin... I ran a non-PGO version and just broke 90M nps, where the profiled version breaks 100M. I'm suspecting it shuffles memory variables around a bit to optimize locality and minimize cache block invalidates.

That's possible. GCC's PGO doesn't seem to help very much on things I've tried it on. Usually something like 3% at best.

bob · Post by **bob** » Wed Jun 24, 2015 4:58 am

MikeB wrote:
bob wrote:I got ICC installed on the 20 core box finally, and ran a test. Results were beyond surprising. Both of these were run using the same settings, both using the transparent huge pages in Linux.

Here is the icc output using 20 cores:

time=1:30(99%) nodes=9072501462(9.1B) fh1=89% nps=100.7M

Here is the gcc output using 20 cores:

time=1:30(98%) nodes=7547414706(7.5B) fh1=90% nps=83.7M

Here's the single-core run to compare them, first Intel, then gcc

time=1:00(100%) nodes=385915568(385.9M) fh1=92% nps=6.4M
time=1:00(100%) nodes=391303121(391.3M) fh1=92% nps=6.5M

So a dead heat (gcc slightly better) with one thread, not so dead heat with 20 cores. Note that these are run with turbo-boost enabled again, for other testing. The 20 core runs run at 2.9ghz rock-solid, the 1-core tests run at 3.3ghz. To compute NPS scaling, it looks like this:

6.5 * 2.9 / 3.3 = 5.71M nps for one 2.9ghz processor.

100.7 / 5.7 = 17.7x, which is more reasonable. I suppose I am going to have to re-run all of my tests yet again with this version...

More on this later.
NICE! Did you change anything on your current makefile?

just -prof_genx to -prof_gen

stegemma · Post by **stegemma** » Wed Jun 24, 2015 9:20 am

bob wrote:I got ICC installed on the 20 core box finally, and ran a test. Results were beyond surprising. Both of these were run using the same settings, both using the transparent huge pages in Linux.

Here is the icc output using 20 cores:

time=1:30(99%) nodes=9072501462(9.1B) fh1=89% nps=100.7M

Here is the gcc output using 20 cores:

time=1:30(98%) nodes=7547414706(7.5B) fh1=90% nps=83.7M

Here's the single-core run to compare them, first Intel, then gcc

time=1:00(100%) nodes=385915568(385.9M) fh1=92% nps=6.4M
time=1:00(100%) nodes=391303121(391.3M) fh1=92% nps=6.5M

So a dead heat (gcc slightly better) with one thread, not so dead heat with 20 cores. Note that these are run with turbo-boost enabled again, for other testing. The 20 core runs run at 2.9ghz rock-solid, the 1-core tests run at 3.3ghz. To compute NPS scaling, it looks like this:

6.5 * 2.9 / 3.3 = 5.71M nps for one 2.9ghz processor.

100.7 / 5.7 = 17.7x, which is more reasonable. I suppose I am going to have to re-run all of my tests yet again with this version...

More on this later.

It would be more general if you could try other engines too. In some cases, I've seen the very old C++ Builder 6.0 generate faster code than the current MSVC++, but only in 32-bit; in 64-bit, MSVC is the faster one. More compilers, more software, more results.

Aleks Peshkov · Post by **Aleks Peshkov** » Wed Jun 24, 2015 10:36 am

The node counts are different. I could say that the GCC build is 20% smarter with 20 cores. :)
Seriously, nodes and nps do not matter. Time to solution matters.
