Jim Ablett wrote:
diep wrote:
I'll try to give this a test later today to confirm it is still true, but I would be absolutely amazed if gcc beats a commercial compiler written by the company that designed the cpu itself...
BTW I don't see anything like 50%. Maybe 10%. But 10% is still 10%...
Make sure to use the latest link-time optimizations:
Code:
-Ofast -flto -fwhole-program -fprofile-generate / -fprofile-use
Jim.
I tested it carefully on an 8-core Core2 Xeon box (L5420 @ 2.5 GHz).
Note I have a somewhat older Intel C++ here, 11.0.0.something.
I did use PGO in both cases of course, and had Intel C++ produce a generic executable that works both on my AMD hardware and on Intel Core2.
Running on 8 cores, the difference is at least 6.6% in favour of Intel C++ over GCC 4.7.0 for Diep in 32 bits.
I redid the measurements several times, and of course things depend upon ply depth. At most ply depths it's a tad above 7%; the best case I could get was 6.6%.
Now this is a big improvement over what it was. Major league, I'd say.
If you wonder why I ran 32 bits: I happened to have put in a random DVD to install Linux. But with just 2 GB of RAM in most machines, and 4 GB in only 1 or 2, that's not really an issue. Around 6 of the cluster's 8 nodes have 2 GB of RAM.
The next job is to get GCC 4.7.0 or Intel C++ somehow working for the whole cluster; right now I ran everything on a single node.
Nowadays Intel doesn't seem to be giving it away for free.
It's good progress for GCC to be 'only 6.6%' slower now. However, in Formula 1 there is a 5% rule: if in qualification you are 5% slower than the fastest car, you are not allowed to start the race....
Vincent
p.s. Considering the many options you had to figure out by hand, instead of GCC working that out itself by means of parameter tuning, it seems GCC can still be improved a lot with better parameter tuning. Modifying options by hand with Intel simply doesn't help: it automatically calculates much better what works for Diep. Could that be the difference?
Maybe you do this already, but can you try again, this time using the 'Crafty' trick of combining all files into one? This usually gives me a little extra speed.
Code:
'Crafty.c'
---------------------------
#include "search.c"
#include "thread.c"
#include "repeat.c"
#include "next.c"
#include "killer.c"
#include "quiesce.c"
#include "evaluate.c"
#include "movgen.c"
#include "make.c"
#include "unmake.c"
#include "hash.c"
#include "attacks.c"
#include "swap.c"
#include "boolean.c"
#include "utility.c"
#include "probe.c"
#include "book.c"
-----------------------
Also, sometimes using PGO together with -flto is detrimental. Try using -flto alone.
Jim.
This is a trick from the 90s, on much weaker processors, and something that typically works in bitboard programs, especially in C++, where you have a few layers of function primitives: in bitboards you need a special function just to convert a bitboard index to, for example, a square. Crafty inlines that, yet most C++ programs suffer big time there because they built special layers on top of it. In Diep I don't have this problem. Attack tables, for example, get updated incrementally; I don't need to regenerate them the way the bitboarders do. To get, for example, the number of black attackers of g7,
I just need to do:
int attackers = attackarray[black][sq_g7] & 15;
One instruction and a prefetch from L1 cache.
In bitboards it is a bunch of functions you call; some will be inlined, or maybe everything needs to get inlined. Compiler efficiency for Diep, by contrast, comes down to compiling branches very efficiently and generating little code to get things done.
Note that I also have a jump table in Diep, which dates back to the 90s; SPEC's Sjeng has a similar table. I would assume compilers that have been tuned on SPEC Sjeng have no problems with that...
If I give Diep more time and let it search deeper, I see that GCC gets slower relative to Intel C++; it then loses big time. Diep's NPS slows down considerably at depth, and it deteriorates less with Intel C++ than it does with GCC.
I didn't use cachegrind yet, but I suspect the code is way out of L1i by then and takes the full hit of the increased L1i misses.
Of course it could also be that Intel C++ prefetches things better there.
What I didn't test is how efficiently GCC deals with 'volatile' variables. Diep has quite a bunch of them in its search, yet that eats peanuts in system time compared to the evaluation.
It's also quite possible that GCC loses some with the more complex pointer work. It won't be a big loss, but effectively a loss to the average (like 20% slower at one spot and 20% at another). GCC always lost major league on complex pointer work in the past. For decades it also had compiler bugs there; I simply rewrote parts of Diep's initialisation because GCC generated buggy code for it. Pointers are too complex for it.
Note I experimented with -flto in all sorts of ways, and also tried combinations from your makefile for SF, removing the C++-specific options (showing once again why C++ has become too complex).
Here, -flto is simply a SIGNIFICANT slowdown. I don't know why GCC doesn't do automatic parameter tuning to decide which features to use for which program; that really seems to be where Intel C++ wins the battle major league. If I force options by hand in Intel C++, it just generates slower code; left alone, it makes superior choices by automatic tuning. GCC is seemingly not like that.
It needs an automatic tuning framework.