The released 4.7.0 is indeed a lot faster than the initial snapshots I tested.

Jim Ablett wrote (quoting the exchange below, oldest first):

diep wrote: Every compare here is with GCC. A junk compiler I use daily, so I know everything about the junk it produces.

mcostalba wrote: GCC has improved a bit since the 90s. Today this produces the fastest binary on Windows:
http://www.equation.com/servlet/equation.cmd?fa=fortran
For your information, this is the compiler Jim used to produce the fastest x86-64 SSE4.2 Windows binary released for SF 2.2.2 a few months ago. Until last year he was using the Intel compiler, but found this one faster.

diep wrote: Why do you claim this nonsense?

mcostalba wrote: When, in the 90s, did you stop testing and start going "by memory"?

diep wrote: The latest GCC snapshot I tested a few weeks ago was still light-years behind Intel C++ and even Visual Studio.
Just because it hardly gets any speedup from PGO.
To avoid a bug in GCC's PGO I do the profile run single-threaded with Diep. Even then it gives only a 3% speedup.
I've posted extensive examples all over the net of how GCC messes up, starting in 2007.
The latest snapshot still hasn't fixed that.
So it already STARTS with a disadvantage over other compilers of 25% or so. Such bad PGO performance is of course a joke.
Note that around 2004-2005 some snapshots did pretty OK at PGO; then suddenly, boom, and it no longer worked at all, for Diep at least.
Default PGO gives 0.5% in GCC. Bug after bug, and seven years later it still hasn't been fixed.
One of the big screw-ups in GCC, which hits a lot of software hard, is the rewrite to the end of the function: instead of generating a simple CMOV, it grabs your code, moves it to the end of the function, sometimes jumps there, executes two instructions, and then jumps back to where it was.
That *hurts*.
To quote Linus: "there is no excuse to not generate CMOVs".
A Polish guy replied to Linus back in 2007: "but then it is slower on my P4".
Only around the end of 2011 did they start moving. We're some months later now, but a snapshot from a few weeks ago was still turtle slow, with the same bugs and bottlenecks.
Of course I am compiling for 64 bits, even though Diep's code would be faster in 32 bits; I just want efficient code that doesn't mess with branch prediction.
I want normal PGO, just like other compilers have it!
They aren't capable of producing that, and in overruling Linus they have been refusing to generate effectively shorter code for *many* years.
Now that they have some competition from other compilers that are on track to overtake GCC, it wouldn't amaze me if they 'magically' improve a lot all of a sudden. They need a kick in the butt, man.
For many years the GCC team showed the middle finger to dozens of very important and influential guys such as Linus.
I'm amazed they know how to produce SSE 4.2 code for SF, as they still haven't figured out how to generate efficient code for branches. Intel's fall-through model simply hasn't been implemented in GCC.
When did Intel introduce this?
Oh, 1994 or so?
The difference between GCC 4.0 and the latest snapshots I tried is just a few percent for Diep, while Visual Studio and Intel C++ got dozens of percent faster on the modern hardware I have here: an Opteron (Barcelona core) and Core2 Xeons.
Vincent
"x86-64 and IA-64 will prove to be the ultimate disaster for GCC"
Marc Lehmann, in a private email to me
Try latest link-time optimizations.
Jim.

Code:
-Ofast -flto -fwhole-program -fprofile-generate / -fprofile-use
By the way, -flto just slows Diep down, by quite a bit.
Even when using the flag in the linker as well, it still slows things down 2%.
The problem is probably that GCC, as usual, produces too many instructions to get things done, causing more L1 instruction-cache misses.
Note I'm using it in 32 bits; in 64 bits this effect would be worse.
But where the snapshots are hardly faster than the old GCCs on AMD hardware, on the Core2 hardware here default -O2 makes it 10% faster in 32 bits. Not sure about 64 bits; the effect could be more limited there.
What they seem to have fixed compared to the 4.6+ snapshots is PGO. On Intel, at least, it gives around an 8% speedup without needing to modify Diep to run single-threaded on a single core; I could run it multithreaded (that means one search thread and one I/O-to-user thread). Not sure about AMD yet. So at first sight GCC is overall 18% faster on Core2 than it used to be.
Note that Visual Studio speeds up some 22% from PGO; Intel C++ way more than that. On every "normal 21st-century compiler feature", if I may call it that, the commercial compilers still totally hammer GCC, as they profit more from those features than GCC does.
But there is progress there for the first time in seven years!
All together, what I see from 4.7.0 is not bad. One of these days I'll compare it with Intel C++, and on AMD as well; the previous 4.6 snapshots I had tried on the AMD Barcelona core. On the Intel Core2 Xeons I have here this compiler is a lot faster. That's good news!
Maybe the speed difference has halved now, making Intel C++ some 20% faster than GCC. Exact measurements are still needed, though that isn't easy as Intel nowadays charges big cash for its compiler!
The -flto is a big bummer though. In 32 bits, where instruction sizes are a lot smaller, it's 7% slower by default; when the linker also uses it, it's 2% slower.