Jim Ablett wrote:
diep wrote:
I'll try to give this a test later today to confirm it is still true, but I would be absolutely amazed if gcc beats a commercial compiler written by the company that designed the cpu itself...
BTW I don't see anything like 50%. Maybe 10%. But 10% is still 10%...
Make sure to use the latest link-time optimizations:
Code:
-Ofast -flto -fwhole-program -fprofile-generate / -fprofile-use
Jim.
I tested it carefully on an 8-core Core2 Xeon box (L5420 @ 2.5 GHz).
Note I have a somewhat older Intel C++ here, 11.0.0.something.
I did use PGO in both cases of course, and had Intel C++ produce a generic executable that works both on my AMD hardware and on Intel Core2.
Running on 8 cores, the difference is at least 6.6% in favour of Intel C++ over GCC 4.7.0 for Diep in 32 bits.
I redid the measurements several times, and of course things depend upon ply depth. At most ply depths it's a tad above 7%; the best case I could get was 6.6%.
Now this is a big improvement over what it was. Major league, I'd say.
If you wonder why I ran 32 bits: I happened to have put in a random DVD to install Linux. But with just 2 GB of RAM in most machines, and 4 GB in only 1 or 2, that's not really an issue. Around 6 of the cluster's 8 nodes have 2 GB of RAM.
The next job is to get GCC 4.7.0 or Intel C++ somehow working for the whole cluster; right now I ran everything on a single node.
Nowadays Intel doesn't seem to be giving it away for free.
It's good progress for GCC to be 'only 6.6%' slower now. However, in Formula 1 there is a 5% rule: if in qualification you are 5% slower than the fastest car, you are not allowed to start the race....
Vincent
p.s. Considering the many options you had to figure out by hand, instead of GCC working that out itself by means of parameter tuning, it seems GCC can still be improved a lot with better parameter tuning. Modifying options by hand with Intel simply doesn't help: it automatically calculates much better what works for Diep. Could that be the difference?
Maybe you do this already, but can you try again, this time using the 'Crafty' trick of combining all files into one? This usually gives me a little extra speed.
Code:
'Crafty.c'
---------------------------
#include "search.c"
#include "thread.c"
#include "repeat.c"
#include "next.c"
#include "killer.c"
#include "quiesce.c"
#include "evaluate.c"
#include "movgen.c"
#include "make.c"
#include "unmake.c"
#include "hash.c"
#include "attacks.c"
#include "swap.c"
#include "boolean.c"
#include "utility.c"
#include "probe.c"
#include "book.c"
-----------------------
Also, sometimes using PGO together with -flto is detrimental. Try using -flto alone.
Jim.
This is a trick from the 90s, on much weaker processors, and something that typically works in bitboard programs, especially in C++, where you have a few layers of function primitives: in bitboards you need a special function just to convert a bitboard index to, for example, a square. Crafty inlines that, yet most C++ programs suffer big time there because they built special layers on top of it. In Diep I don't have this problem. Attack tables, for example, get updated incrementally; I don't need to regenerate them the way the bitboarders do. To get, for example, the number of black attackers of g7,
I just need to do:
int attackers = attackarray[black][sq_g7] & 15;
One instruction and a prefetch from L1 cache.
In bitboards it is a bunch of functions you call; some will be inlined, or maybe everything needs to get inlined. Compiler efficiency for Diep, by contrast, comes down to compiling branches very efficiently and generating little code to get things done.
Note that I also have a jump table in Diep, which dates back to the 90s; SPEC's Sjeng has a similar table. I would assume compilers that have been tuned on SPEC Sjeng have no problems with that...
If I give Diep more time and let it search deeper, I see that GCC gets slower relative to Intel C++; it then loses big time. Diep's NPS slows down considerably at depth, and it deteriorates less with Intel C++ than it does with GCC.
I didn't use cachegrind yet, but I suspect the code is way out of L1i by then and takes the full hit of the increased L1i misses.
Of course it could also be that Intel C++ prefetches things better there.
What I didn't test is how efficiently GCC deals with 'volatile' variables. Diep has quite a bunch of them in its search, yet that eats peanuts in system time compared to the evaluation.
It's also quite possible that GCC loses some with the more complex pointer work. It won't be a big loss, but effectively a loss to the average (like 20% slower at one spot and 20% at another). GCC always lost major league on complex pointer work in the past. For decades it also had compiler bugs there; I simply rewrote parts of Diep's initialisation because GCC generated buggy code for it. Pointers are too complex for it.
Note I experimented with -flto in all sorts of ways, and also tried combinations from your makefile for SF, removing the C++-specific options (showing once again why C++ has become too complex).
Here, -flto is simply a SIGNIFICANT slowdown. I don't know why GCC doesn't do automatic parameter tuning to decide which features to use for which program; that really seems to be where Intel C++ wins the battle major league. If I force options by hand in Intel C++, it just generates slower code; left alone, it makes superior choices by automatic tuning. GCC is seemingly not like that.
It needs an automatic tuning framework.