Developments of the last two years

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: Developments of the last two years

Post by Don »

hgm wrote:I don't think that reasoning is sound. I can build an engine that plays better Chess than I do. Why shouldn't I be able to write a compiler that makes better assembly than I do, even if I were the world's number one expert? We have engines that beat the World Champion at Chess...
There are some examples on the web of fairly simple programs being optimized in assembler to get a 2x speedup. The truth of the matter is that compilers are much better than most of us, but people who really understand the low-level details, specifically compiler writers themselves, can do a much better job.

The idea that humans cannot match the performance of compilers is a myth that has been propagated by sheer repetition; people repeat it like a mantra.

I found several examples on the web. Here is one:

http://www.codinghorror.com/blog/2008/0 ... -code.html

There are several dark corners where compilers have no clue but people do.

Look at some of the writings of Michael Abrash and I think you will change your tune - your thinking on this is not really your thinking, it's someone else's thinking. Don't believe things just because they are in print.
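
To give one concrete (hypothetical) illustration of such a dark corner - pointer aliasing. This little sketch is my own, not from the linked article: the compiler must assume the two arrays below might overlap and generate cautious code, while a human who knows they never overlap can promise it with C99's restrict (or simply schedule the loads himself in assembler):

Code:
    /* saxpy-style loop: dst[i] += k * src[i].
       Without restrict the compiler must allow for dst and src
       overlapping, which can block vectorization and reordering. */
    void saxpy_plain(float *dst, const float *src, float k, int n)
    {
        for (int i = 0; i < n; i++)
            dst[i] += k * src[i];
    }

    /* The human knows the buffers never overlap and says so;
       the compiler is now free to vectorize and reschedule. */
    void saxpy_restrict(float *restrict dst, const float *restrict src,
                        float k, int n)
    {
        for (int i = 0; i < n; i++)
            dst[i] += k * src[i];
    }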

Don
Capital punishment would be more effective as a preventive measure if it were administered prior to the crime.
hgm
Posts: 27808
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Developments of the last two years

Post by hgm »

Don wrote:The reasoning is that a human cannot beat a compiler unless he has the same knowledge that a compiler has. The good compilers "understand" many of the important details of instruction scheduling. That can be worked out by hand and improved upon (because compilers STILL do not do it very well.) Chess is a different task which requires massive calculation.
But that is exactly my point: current CPUs are so complex that it would be a computationally intensive task to produce optimum code. Elementary operations that describe a task can be ordered in combinatorially many sequences. It is far from trivial how each of these performs; that would require complex simulation of the core. And what is worse, the knowledge needed to do such a simulation is often a jealously guarded company secret. (But compiler writers who are associated with the CPU manufacturer do have access to it.)

In the Pentium-II days I did try to optimize an FFT routine. This pretty much proved a hopeless task. Many of the bottlenecks in the CPU were completely undocumented. Those at the beginning of the pipeline could be discovered by reverse engineering. But I always ran into the situation that when I designed the code so well that it perfectly satisfied all known restrictions, there was a disastrous collapse of performance. It was very easy to calculate how many MFLOPS you could do in theory, based on the throughput of the multiplier and adder units and the 3-uOps/cycle capacity of the retirement unit. Unoptimized code would reach about 50% of that. Well-optimized code could reach 75%, and you could always point out which bottleneck caused the pipeline bubbles. Perfectly optimized code would then achieve only 25%...
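
For what it is worth, the bookkeeping I mean is no more than this kind of measurement (a minimal sketch; the 900-MFLOPS peak is a made-up illustration, not the actual P-II figure):

Code:
    #include <stdio.h>
    #include <time.h>

    #define N (1 << 20)
    static double a[N], b[N];

    int main(void)
    {
        /* Hypothetical peak: one mul plus one add retired per cycle
           at 450 MHz would give 900 MFLOPS; retirement width, port
           conflicts and undocumented stalls decide what you get. */
        const double peak_mflops = 900.0;
        double sum = 0.0;

        for (int i = 0; i < N; i++) { a[i] = 1.0; b[i] = 2.0; }

        clock_t t0 = clock();
        for (int r = 0; r < 100; r++)
            for (int i = 0; i < N; i++)
                sum += a[i] * b[i];      /* 2 flops per iteration */
        clock_t t1 = clock();

        double secs = (double)(t1 - t0) / CLOCKS_PER_SEC;
        double mflops = 100.0 * N * 2.0 / secs / 1e6;
        printf("achieved %.0f MFLOPS = %.0f%% of assumed peak (sum=%g)\n",
               mflops, 100.0 * mflops / peak_mflops, sum);
        return 0;
    }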
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: Developments of the last two years

Post by Don »

hgm wrote:
Don wrote:The reasoning is that a human cannot beat a compiler unless he has the same knowledge that a compiler has. The good compilers "understand" many of the important details of instruction scheduling. That can be worked out by hand and improved upon (because compilers STILL do not do it very well.) Chess is a different task which requires massive calculation.
But that is exactly my point: current CPUs are so complex that it would be a computationally intensive task to produce optimum code.
It's true that they are much more complex than they used to be, but they still follow simple rules, and people who really understand the low-level details can out-code a compiler. Very few people will bother to try, because compilers are good enough and in fact probably much better than most of us - who probably know almost nothing about instruction scheduling issues, register utilization and other gory low-level details. So maybe to YOU and me it seems hopelessly complicated, but to a beginner at chess the moves of a master seem hopelessly out of reach too.

But maybe more to the point is that compilers are downright stupid about many things. They cannot figure everything out, and perhaps they write assembly the way a chess program plays very closed positions: with a great deal of naivety. And yet even in a closed position a chess program will probably beat you and me, though not a really good player.
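
A small sketch of the sort of stupidity I mean (my own example, nothing exotic): with floating point the compiler is not allowed to reorder the sums below, because that changes the rounding, so it leaves the long dependency chain alone. A human who knows the extra rounding error is harmless can split the accumulator and keep the pipeline full:

Code:
    /* One accumulator: every add depends on the previous one,
       so the loop runs at the latency of the FP adder. */
    double dot_chain(const double *x, const double *y, int n)
    {
        double s = 0.0;
        for (int i = 0; i < n; i++)
            s += x[i] * y[i];
        return s;
    }

    /* Two independent accumulators: the compiler may not do this
       on its own (it changes the rounding), but the human can. */
    double dot_split(const double *x, const double *y, int n)
    {
        double s0 = 0.0, s1 = 0.0;
        int i;
        for (i = 0; i + 1 < n; i += 2) {
            s0 += x[i]     * y[i];
            s1 += x[i + 1] * y[i + 1];
        }
        if (i < n)
            s0 += x[i] * y[i];
        return s0 + s1;
    }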

Capital punishment would be more effective as a preventive measure if it were administered prior to the crime.
Rebel
Posts: 6994
Joined: Thu Aug 18, 2011 12:04 pm

Re: Developments of the last two years

Post by Rebel »

Don wrote: Look at some of the writings of Michael Abrash and I think you will change your tune
Precisely, the man has been my teacher :lol:
Gerd Isenberg
Posts: 2250
Joined: Wed Mar 08, 2006 8:47 pm
Location: Hattingen, Germany

Re: Developments of the last two years

Post by Gerd Isenberg »

Rebel wrote:
Don wrote: Look at some of the writings of Michael Abrash and I think you will change your tune
Precisely, the man has been my teacher :lol:
Hey, you never told us before ;-)
Thanks to Don for the link and the hint.
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: Developments of the last two years

Post by Don »

Rebel wrote:
Don wrote: Look at some of the writings of Michael Abrash and I think you will change your tune
Precisely, the man has been my teacher :lol:
I have learned over my short lifetime that you have to question almost anything that comes into the public domain as "conventional wisdom."

With the web it's just as true, because if someone writes something and it sounds good, especially if they are a good writer, it becomes a standard sound bite that will automatically get repeated until it's just accepted without any critical thinking. I think this is clearly one of those things.

Of course some things are much more believable than others, and this one is very believable - and for most people more or less true.

Another one that makes me cringe: "modern garbage collectors have better performance than people can achieve with manual memory management."

Another one that has been going around for 30 years is: "with Moore's law we no longer have any need for compiled languages." This one (or something like it) is based on the idea that an interpreted language running on a computer today runs faster than a compiled program running on a computer of a few years ago, and it assumes that you will never need more performance than we have "right now" with a compiled language.

There are a bunch of these types of silly statements going around that people just accept once they hear someone say it.
Capital punishment would be more effective as a preventive measure if it were administered prior to the crime.
Uri Blass
Posts: 10297
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: Developments of the last two years

Post by Uri Blass »

Don wrote:

There are some examples on the web of fairly simple programs being optimized in assembler to get a 2x speedup.

Don
I wonder if you think that a 2x speedup is possible for a chess program by some good programmer.

If yes, then maybe somebody can prove it by improving Stockfish only by translating it to assembler and making it twice as fast.
Joost Buijs
Posts: 1563
Joined: Thu Jul 16, 2009 10:47 am
Location: Almere, The Netherlands

Re: Developments of the last two years

Post by Joost Buijs »

Rebel wrote:
Joost Buijs wrote:In another thread I saw you are using the "Digital Mars" compiler. I have no experience with this compiler, but I can imagine that it produces (much) slower code than e.g. the Intel, Microsoft or latest GCC compilers.
I was curious how the MS compiler would perform, and the 2010 edition was a disappointing experience: a speed loss (ASM->C) of 15% (IIRC). Besides, although the DigitalMars compiler is old and development has stopped, its code is speed-wise on the same level as MSVC 2010.
A couple of years ago I wrote my whole evaluation function in 64 bit ASM and at best I could reach approximately the same performance as with the C++ version. I really tried very hard to get it better but it just didn't work.
I take your word for it. Just curious, did you apply the rules of "pairing" ?
No, I didn't apply the rules of instruction pairing.
I have a very old book over here called "Pentium Processor Optimization Tools"; that's where they talk about instruction pairing for the Pentium.
This doesn't seem to hold anymore for the Core architecture, although there still seems to be a pairing mechanism when you use MMX or SSE instructions.

It surprises me that a very old compiler like Digital Mars produces code that is speed-wise on the same level as code produced by MSVC 2010.

I can imagine that you lose some speed when you translate directly from ASM to C. You probably have to restructure your program at some point to regain the lost speed. It is just a matter of knowing what your compiler does; some constructs work better than others.
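
A (hypothetical) example of what I mean by one construct working better than another: something as small as the signedness of an index changes what the compiler may emit, because a signed division by a power of two has to round toward zero:

Code:
    /* For a signed square index the compiler must emit a shift plus
       a fix-up sequence, because -7 / 8 has to be 0, not -1. */
    int rank_of_signed(int sq)             { return sq / 8; }

    /* For an unsigned index a single shift is enough; if you know
       the square is never negative, say so in the type. */
    unsigned rank_of_unsigned(unsigned sq) { return sq / 8; }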
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: Developments of the last two years

Post by Don »

Uri Blass wrote:
Don wrote:

There are some examples on the web of fairly simple programs being optimized in assembler to get a 2x speedup.

Don
I wonder if you think that a 2x speedup is possible for a chess program by some good programmer.

If yes, then maybe somebody can prove it by improving Stockfish only by translating it to assembler and making it twice as fast.
You probably think that is ridiculous, right? I don't.

I think an exceptionally talented assembler guru could squeeze a lot out of Stockfish. I don't know if it would be 2X, but it might be even more, and that would not surprise me whatsoever.

There are 2 limits to how fast a "functionally identical" optimized version of Stockfish could get. The first limit is simply what is possible, which we can probably never know. A metaphor for this is to ask, "what could God achieve?" I can easily imagine that there are many possible optimizations that we never even thought of, and probably even a few "head slappers" in there.

The second limitation is the one that is most relevant: what we believe is possible - and that is the one that limits us the most. If we become satisfied at some point that we have pretty much gotten close to the limit, we won't look any further.

If you are talking about pure code optimization, where you completely ignore anything that might speed up the code substantially but is more of an algorithmic nature, then there is probably still at least a 50% speedup possible. Maybe a lot more. You should realize that compilers are probably missing quite a bit even though they continue to improve. They cannot understand anything that is too complicated, they have to make the most conservative assumptions to be sure of being correct, and they cannot rewrite algorithms in a better way. Keep in mind that C has limitations too, such as lack of access to all the instructions that a chip provides. In fact most programs now have to work around the popCount thing - either with special intrinsics to address this limitation or a small assembly language routine. That should tell you something!
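
To make the popCount point concrete, here is roughly what every bitboard program ends up writing (a sketch; __builtin_popcountll is the GCC/Clang spelling, other compilers have their own):

Code:
    #include <stdint.h>

    #if defined(__GNUC__)
    /* With GCC/Clang the builtin compiles down to a single POPCNT
       instruction when the target CPU supports it. */
    static int pop_count(uint64_t b) { return __builtin_popcountll(b); }
    #else
    /* Plain C has no way to name the instruction, so you fall
       back on a bit-twiddling routine like this one. */
    static int pop_count(uint64_t b)
    {
        int n = 0;
        while (b) { b &= b - 1; n++; }   /* clear lowest set bit */
        return n;
    }
    #endif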
Capital punishment would be more effective as a preventive measure if it were administered prior to the crime.
syzygy
Posts: 5566
Joined: Tue Feb 28, 2012 11:56 pm

Re: Developments of the last two years

Post by syzygy »

hgm wrote:
Don wrote:The reasoning is that a human cannot beat a compiler unless he has the same knowledge that a compiler has. The good compilers "understand" many of the important details of instruction scheduling. That can be worked out by hand and improved upon (because compilers STILL do not do it very well.) Chess is a different task which requires massive calculation.
But that is exactly my point: current CPUs are so complex that it would be a computationally intensive task to produce optimum code. Elementary operations that describe a task can be ordered in combinatorially many sequences. It is far from trivial how each of these performs; that would require complex simulation of the core.
Current compilers don't perform those simulations, either.
Don Dailey wrote:I have learned over my short lifetime that you have to question almost anything that comes into the public domain as "conventional wisdom."
Wasn't it not too long ago that the "conventional wisdom" was that compiled C code is a factor of 10 or so slower than manually written assembly? That is of course not true anymore. But I would be surprised if many people really believe that compiled C code is faster than assembly code.

Of course C does become faster than assembly when the program is sufficiently complex that writing it in C, due to its higher level of abstraction, allows you to come up with more efficient data structures and/or algorithms.
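
A trivial sketch of that trade-off (my own example, nothing chess-specific): the hand-polished linear scan below loses to even mediocre compiled code for the second version once n gets large, and the second version is the kind of restructuring that is cheap to try in C and painful to redo in assembly:

Code:
    #include <stddef.h>

    /* The asm-era version: a linear scan whose inner loop you can
       polish to perfection - an O(n) algorithm all the same. */
    int find_linear(const int *keys, size_t n, int key)
    {
        for (size_t i = 0; i < n; i++)
            if (keys[i] == key)
                return (int)i;
        return -1;
    }

    /* The version abstraction makes cheap to write and to change:
       keep the array sorted and probe it in O(log n). */
    int find_binary(const int *keys, size_t n, int key)
    {
        size_t lo = 0, hi = n;
        while (lo < hi) {
            size_t mid = lo + (hi - lo) / 2;
            if (keys[mid] < key)      lo = mid + 1;
            else if (keys[mid] > key) hi = mid;
            else                      return (int)mid;
        }
        return -1;
    }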