C vs ASM

Gerd Isenberg · Post by **Gerd Isenberg** » Wed Mar 06, 2013 9:06 pm

hgm wrote:Btw, what a weird bubble sort did you write. On my machine (i7-2600) the C code took 13 sec. I could not compile the assembly, but by adding one line oc C code I could accelerate it to 0.031 sec. I am sure I could do even better, because r1 is always r2+1 (even after my change), so there would not have been any need to have both of them. But a factor 419 seemed enough for a days work.
Code: Select all
void bubbles&#40;)

&#123;       unsigned int zz; unsigned char ch; int r1,r2;

sort05&#58; r1=0; r2=-1;
sort10&#58; r1++; r2++;
        if &#40;r1>=MAX_ENTRIES&#41; return;
//cnt1++;
        if &#40;key1&#91;r2&#93; <= key1&#91;r1&#93;) goto sort10;
        zz=key1&#91;r1&#93;; key1&#91;r1&#93;=key1&#91;r2&#93;; key1&#91;r2&#93;=zz;    // swap
        zz=key2&#91;r1&#93;; key2&#91;r1&#93;=key2&#91;r2&#93;; key2&#91;r2&#93;=zz;
        ch=byte1&#91;r1&#93;; byte1&#91;r1&#93;=byte1&#91;r2&#93;; byte1&#91;r2&#93;=ch;
        ch=byte2&#91;r1&#93;; byte2&#91;r1&#93;=byte2&#91;r2&#93;; byte2&#91;r2&#93;=ch;
//cnt2++;
	r2-=2; if&#40;r1-=2&#41; goto sort10; // &#91;HGM&#93; accelerating patch
        goto sort05;                                    // again
&#125;
The problem is that you run through the entire list comparing the entries to find the first reversed pair. Swapping those of course leaves all preceding entries untouched, but you run through them again nevertheless. While it would have been sufficient to step one back.

So the code is not at all what it seems (mostly swapping array elements). The swap part is only executed about 9e6 times, but the comparison part (the sort10 loop) is executed 24e9 times! Only this loop has any impact on performance, (2 increments, one redundant, two memory loads and two compare+branch instructions). And the loop is not properly aligned, because the compiler has difficulty recognizing it, due to the use of gotos rather than standard control structures. And alignment of such small loops is critical. This is why the results are so erratic.

Wow, excellent analysis!

Daniel Shawul · Post by **Daniel Shawul** » Wed Mar 06, 2013 9:28 pm

That is a good demonstration of why assembly is bad:) Everyone is busy trying to prove how assembly is faster than C, and this happens lol. You give too much focus on details you end up loosing the plot. No offense to anyone.

bob · Post by **bob** » Wed Mar 06, 2013 10:37 pm

Daniel Shawul wrote:That is a good demonstration of why assembly is bad:) Everyone is busy trying to prove how assembly is faster than C, and this happens lol. You give too much focus on details you end up loosing the plot. No offense to anyone.

Hence my quote, repeated by many compiler gurus... "There is no such thing as the perfect code for any algorithm." One might have "the best discovered so far" but that doesn't mean there is not something better.

However, asm is faster, execution wise, just not development wise.

bob · Post by **bob** » Wed Mar 06, 2013 10:39 pm

Gerd Isenberg wrote:
Rebel wrote:
rvida wrote:The slowdown is probably due to the same reason why I really _hate_ x86 asm: jump target alignment.
Perhaps chapter 9.6 is of any help.

http://www.agner.org/optimize/optimizing_assembly.pdf
Playing with these modern processors seems applied chaos theory, where tiny changes in code and data may yield to huge butterfly-effects, specially with small test programs.

If I only read chapter 2.2.4 Advanced Memory Access in the intel optimization guide on speculatively load and stores, DPL, store forewarding, memory disambiguator with dependency predicton, store retirement competing for cache banks with executing loads maintained as a background task by the memory order buffer ...

http://www.intel.com/content/www/us/en/ ... anual.html

Happy optimizing

This stuff is fun. In my architecture class, I have students write a C program that discerns cache characteristics (#levels, size of each level, blocksize and set associativity). It has gotten a lot harder to do this. If you cause multiple cache misses with a fixed pattern, the damned thing will recognize the pattern and start to prefetch where you would not expect it. There are ways to make it work, but not quite a simple as it used to be 20 years ago.

Rebel · Post by **Rebel** » Thu Mar 07, 2013 12:15 am

hgm wrote:Btw, what a weird bubble sort did you write.

Indeed, my own slowed_down invention

Doing so the focus entirely is on the swaps from which I expected to gain optimizations and within a measurable time for comparisons. The contrary seems to be true. Please explain why the SWAP define (on or off) (see ***) produces the same times, 15 seconds on my PC.

Code: Select all

void bubbles&#40;)

&#123;       asm &#123;

sort05&#58; xor EDX,EDX                         // r1=0
        mov EBX,0FFFFFFFFh                  // r2=-1

sort10&#58; add EDX,1                           // r1++
        add EBX,1                           // r2++

        cmp EDX,MAX_ENTRIES-1               // if &#40;r1 >= MAX_ENTRIES&#41; return
        jge done

        mov EAX,dword ptr key1&#91;EBX*4&#93;       // eax=key1&#91;r2&#93;
        cmp EAX,dword ptr key1&#91;EDX*4&#93;
        jbe sort10

        mov ECX,dword ptr key1&#91;EDX*4&#93;       // ecx=key1&#91;r1&#93;
        mov dword ptr key1&#91;EDX*4&#93;,EAX
        mov dword ptr key1&#91;EBX*4&#93;,ECX       // swap key1

        #if SWAP  *******************
        mov EDI,dword ptr key2&#91;EBX*4&#93;       // swap key2
        mov ESI,dword ptr key2&#91;EDX*4&#93;
        mov dword ptr key2&#91;EBX*4&#93;,ESI
        mov dword ptr key2&#91;EDX*4&#93;,EDI

        mov AL,byte ptr byte1&#91;EDX&#93;          // swap byte 1 & 2
        mov AH,byte ptr byte1&#91;EBX&#93;
        mov CL,byte ptr byte2&#91;EDX&#93;
        mov CH,byte ptr byte2&#91;EBX&#93;

        mov byte ptr byte1&#91;EDX&#93;,AH
        mov byte ptr byte1&#91;EBX&#93;,AL
        mov byte ptr byte2&#91;EDX&#93;,CH
        mov byte ptr byte2&#91;EBX&#93;,CL
        #endif *******************

        jmp sort05

            &#125;       // end of asm

done&#58;   return;

&#125;

Having seen all this I am beginning to understand your and Joost's failed tries to write or manually optimize ASM code. The above makes no sense.

Rebel · Post by **Rebel** » Thu Mar 07, 2013 12:24 am

Joost Buijs wrote: The routine that determines free memory does not work properly for a 64 bit build. Probably you have to change some variables to 64 bit. Of course it doesn't matter if you don't need more than 4GB.

My PC has 4Gb, so I am fine with 3Gb.

Rebel · Post by **Rebel** » Thu Mar 07, 2013 12:31 am

hgm wrote:So the code is not at all what it seems (mostly swapping array elements). The swap part is only executed about 9e6 times, but the comparison part (the sort10 loop) is executed 24e9 times! Only this loop has any impact on performance, (2 increments, one redundant, two memory loads and two compare+branch instructions). And the loop is not properly aligned, because the compiler has difficulty recognizing it, due to the use of gotos rather than standard control structures. And alignment of such small loops is critical. This is why the results are so erratic.

I think you are right so forget about above reply. Good analysis.

lucasart · Post by **lucasart** » Thu Mar 07, 2013 4:19 am

bob wrote:
Daniel Shawul wrote:That is a good demonstration of why assembly is bad:) Everyone is busy trying to prove how assembly is faster than C, and this happens lol. You give too much focus on details you end up loosing the plot. No offense to anyone.
Hence my quote, repeated by many compiler gurus... "There is no such thing as the perfect code for any algorithm." One might have "the best discovered so far" but that doesn't mean there is not something better.

However, asm is faster, execution wise, just not development wise.

So why did you keep trying to contradict me with a bunch of straw man arguments in the other thread ?

The conclusion of this case study is exactly what I tried to explain on C vs ASM in the other thread!

First we have a piece of C code that is sucky beyond belief (*). It's both illegible and very inefficient. And I am convinced that there is a *causality* effect between the two: writing clean code would have made that inefficiency self-obvious. Trying to use goto instad of control structures for an alleged speed increase is typically digging a bigger hole for oneself (only do that if and when it is provably faster than control structures, which I seriously doubt in this case). The code is so obfuscated and confusing that the programmer confuses himself, and the code doesn't translate his intention.

Then we waste time with assembly, forgetting the real problem (the C code sucks). And after an endless journey in the technical litterature, and hours of sweating, we realize that manually written assembly is not faster than C.

Besides, even if the C code was truly perfect and could not be optimized, assembly is *always* a bad idea for a lot reasons:
- you will not beat the compiler (by any significant margin) with hand written assembly on any *large enough* piece of code. And if you think you have, try different compilers, different CPUs and ponder on Gerd's comment about chaos theory before jumping to conclusions.
- assembly is not portable, obviously...
- writing the "most efficient" assembly means writing a different code for each and every different (even compatible) processor. It's likely that the best assembly for 80386 is not the best for 80586 (due to instruction pairing for example), and that for modern processors things are *much* more complicated (chaos theory)
- so you end up with a zillion different versions of the assembly code for different CPUs. The result is an unmaintainable piece of crap
- the only case when assembly is useful is for hardware features unavailable in C. In the case of chess, there are only 4 that I'm aware of: lsb() msb() popcnt() prefetch(). And even for that you should *not* use assembly, because you are losing the portability, and end up with a pile of sucky #ifdef's in the code, and your code won't work on many plateforms and will need modifications for future ones... Compiler intrinsics are made for that!
- GCC's inline assembly syntax is so horrible, that it's a huge deterrent against using inline assembly in the first place. Maybe the GNU people did it on purpose to discourage the ill-advised people who think that they improve their code performance by using inline assembly
- feel free to complete this list, ad nauseam...

So, in my opinion, and for a chess program, there is no place for a single line of assembly.

(*) Ed please don't take this as a personal attack. I write sucky code too, and so does everyone (then we fix it, programming is often an iterative process). And I would like to thank you for your efforts and time on this case study. It certainly helped to debunk the myth that putting assembly code in a chess program is a good ide: it looks tempting at first, until you do it and realize that it's a bloody stupid idea...

bob · Post by **bob** » Thu Mar 07, 2013 5:14 am

lucasart wrote:
bob wrote:
Daniel Shawul wrote:That is a good demonstration of why assembly is bad:) Everyone is busy trying to prove how assembly is faster than C, and this happens lol. You give too much focus on details you end up loosing the plot. No offense to anyone.
Hence my quote, repeated by many compiler gurus... "There is no such thing as the perfect code for any algorithm." One might have "the best discovered so far" but that doesn't mean there is not something better.

However, asm is faster, execution wise, just not development wise.
So why did you keep trying to contradict me with a bunch of straw man arguments in the other thread ?

The conclusion of this case study is exactly what I tried to explain on C vs ASM in the other thread!

First we have a piece of C code that is sucky beyond belief (*). It's both illegible and very inefficient. And I am convinced that there is a *causality* effect between the two: writing clean code would have made that inefficiency self-obvious. Trying to use goto instad of control structures for an alleged speed increase is typically digging a bigger hole for oneself (only do that if and when it is provably faster than control structures, which I seriously doubt in this case). The code is so obfuscated and confusing that the programmer confuses himself, and the code doesn't translate his intention.

Then we waste time with assembly, forgetting the real problem (the C code sucks). And after an endless journey in the technical litterature, and hours of sweating, we realize that manually written assembly is not faster than C.

Besides, even if the C code was truly perfect and could not be optimized, assembly is *always* a bad idea for a lot reasons:
- you will not beat the compiler (by any significant margin) with hand written assembly on any *large enough* piece of code. And if you think you have, try different compilers, different CPUs and ponder on Gerd's comment about chaos theory before jumping to conclusions.
- assembly is not portable, obviously...
- writing the "most efficient" assembly means writing a different code for each and every different (even compatible) processor. It's likely that the best assembly for 80386 is not the best for 80586 (due to instruction pairing for example), and that for modern processors things are *much* more complicated (chaos theory)
- so you end up with a zillion different versions of the assembly code for different CPUs. The result is an unmaintainable piece of crap
- the only case when assembly is useful is for hardware features unavailable in C. In the case of chess, there are only 4 that I'm aware of: lsb() msb() popcnt() prefetch(). And even for that you should *not* use assembly, because you are losing the portability, and end up with a pile of sucky #ifdef's in the code, and your code won't work on many plateforms and will need modifications for future ones... Compiler intrinsics are made for that!
- GCC's inline assembly syntax is so horrible, that it's a huge deterrent against using inline assembly in the first place. Maybe the GNU people did it on purpose to discourage the ill-advised people who think that they improve their code performance by using inline assembly
- feel free to complete this list, ad nauseam...

So, in my opinion, and for a chess program, there is no place for a single line of assembly.

(*) Ed please don't take this as a personal attack. I write sucky code too, and so does everyone (then we fix it, programming is often an iterative process). And I would like to thank you for your efforts and time on this case study. It certainly helped to debunk the myth that putting assembly code in a chess program is a good ide: it looks tempting at first, until you do it and realize that it's a bloody stupid idea...

1. I don't use "straw man arguments". Period.

2. I have not changed my statement from the first post on on this topic. ASM code will almost ALWAYS be faster than C code. But it takes longer to write, longer to debug, and longer to modify when that becomes necessary.

As a general rule, I do NOT rewrite "sucky C code". First step is to attempt to discover the most efficient algorithm possible. Writing in C. Then one can always rewrite that in asm to make it faster.

Rebel · Post by **Rebel** » Thu Mar 07, 2013 8:47 am

lucasart wrote: (*) Ed please don't take this as a personal attack. I write sucky code too, and so does everyone (then we fix it, programming is often an iterative process). And I would like to thank you for your efforts and time on this case study.

I don't feel offended, instead I blame myself for (unconsciencely) cherry picking a too small piece of code that performed faster in ASM than in C on my PC.

It certainly helped to debunk the myth that putting assembly code in a chess program is a good ide: it looks tempting at first, until you do it and realize that it's a bloody stupid idea...

Certainly my respect for the compiler has grown.

However when I was porting my ASM engine back to C using MSVC I ran into several problems causing speed losses. One of the examples:

In my eval I have a bunch of variables that need zeroing before starting. For instance, when I declare them as follows:

Code: Select all

static char a1,a2,a3,a4,a5,a6,a7,a8;
static char b1,b2,b3,b4,b5,b6,b7,b8;

Then using "Digital Mars" in ASM and C I could clear those 16 variables in 4 instructions:

Code: Select all

ASM
mov dword ptr a1,0
mov dword ptr a5,0
mov dword ptr b1,0
mov dword ptr b5,0

Code: Select all

C
long *p_a1 = &#40;long *) &a1;       // 32-bit redefinition
long *p_b1 = &#40;long *) &b1;       // 32-bit redefinition
p_a1&#91;0&#93; = p_a1&#91;1&#93; = p_b1&#91;0&#93; = p_b1&#91;1&#93;=0;

This was (still is in the 2012 version?) impossible with MSVC because the compiler apparently has its own philosophy organizing a1-a8 and b1-b8 into memory while Digital Mars just leaves the chain as declared by the programmer in tact.

C vs ASM

Re: C vs ASM

Re: C vs ASM

Re: C vs ASM

Re: C vs ASM

Re: C vs ASM

Re: C vs ASM

Re: C vs ASM

Re: C vs ASM

Re: C vs ASM

Re: C vs ASM