Wow, excellent analysis!hgm wrote:Btw, what a weird bubble sort did you write. On my machine (i7-2600) the C code took 13 sec. I could not compile the assembly, but by adding one line oc C code I could accelerate it to 0.031 sec. I am sure I could do even better, because r1 is always r2+1 (even after my change), so there would not have been any need to have both of them. But a factor 419 seemed enough for a days work.
The problem is that you run through the entire list comparing the entries to find the first reversed pair. Swapping those of course leaves all preceding entries untouched, but you run through them again nevertheless. While it would have been sufficient to step one back.Code: Select all
void bubbles() { unsigned int zz; unsigned char ch; int r1,r2; sort05: r1=0; r2=-1; sort10: r1++; r2++; if (r1>=MAX_ENTRIES) return; //cnt1++; if (key1[r2] <= key1[r1]) goto sort10; zz=key1[r1]; key1[r1]=key1[r2]; key1[r2]=zz; // swap zz=key2[r1]; key2[r1]=key2[r2]; key2[r2]=zz; ch=byte1[r1]; byte1[r1]=byte1[r2]; byte1[r2]=ch; ch=byte2[r1]; byte2[r1]=byte2[r2]; byte2[r2]=ch; //cnt2++; r2-=2; if(r1-=2) goto sort10; // [HGM] accelerating patch goto sort05; // again }
So the code is not at all what it seems (mostly swapping array elements). The swap part is only executed about 9e6 times, but the comparison part (the sort10 loop) is executed 24e9 times! Only this loop has any impact on performance, (2 increments, one redundant, two memory loads and two compare+branch instructions). And the loop is not properly aligned, because the compiler has difficulty recognizing it, due to the use of gotos rather than standard control structures. And alignment of such small loops is critical. This is why the results are so erratic.
C vs ASM
Moderators: hgm, Rebel, chrisw
-
- Posts: 2250
- Joined: Wed Mar 08, 2006 8:47 pm
- Location: Hattingen, Germany
Re: C vs ASM
-
- Posts: 4185
- Joined: Tue Mar 14, 2006 11:34 am
- Location: Ethiopia
Re: C vs ASM
That is a good demonstration of why assembly is bad:) Everyone is busy trying to prove how assembly is faster than C, and this happens lol. You give too much focus on details you end up loosing the plot. No offense to anyone.
-
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: C vs ASM
Hence my quote, repeated by many compiler gurus... "There is no such thing as the perfect code for any algorithm." One might have "the best discovered so far" but that doesn't mean there is not something better.Daniel Shawul wrote:That is a good demonstration of why assembly is bad:) Everyone is busy trying to prove how assembly is faster than C, and this happens lol. You give too much focus on details you end up loosing the plot. No offense to anyone.
However, asm is faster, execution wise, just not development wise.
-
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: C vs ASM
This stuff is fun. In my architecture class, I have students write a C program that discerns cache characteristics (#levels, size of each level, blocksize and set associativity). It has gotten a lot harder to do this. If you cause multiple cache misses with a fixed pattern, the damned thing will recognize the pattern and start to prefetch where you would not expect it. There are ways to make it work, but not quite a simple as it used to be 20 years ago.Gerd Isenberg wrote:Playing with these modern processors seems applied chaos theory, where tiny changes in code and data may yield to huge butterfly-effects, specially with small test programs.Rebel wrote:Perhaps chapter 9.6 is of any help.rvida wrote:The slowdown is probably due to the same reason why I really _hate_ x86 asm: jump target alignment.
http://www.agner.org/optimize/optimizing_assembly.pdf
If I only read chapter 2.2.4 Advanced Memory Access in the intel optimization guide on speculatively load and stores, DPL, store forewarding, memory disambiguator with dependency predicton, store retirement competing for cache banks with executing loads maintained as a background task by the memory order buffer ...
http://www.intel.com/content/www/us/en/ ... anual.html
Happy optimizing
-
- Posts: 7025
- Joined: Thu Aug 18, 2011 12:04 pm
- Full name: Ed Schröder
Re: C vs ASM
Indeed, my own slowed_down invention Doing so the focus entirely is on the swaps from which I expected to gain optimizations and within a measurable time for comparisons. The contrary seems to be true. Please explain why the SWAP define (on or off) (see ***) produces the same times, 15 seconds on my PC.hgm wrote:Btw, what a weird bubble sort did you write.
Code: Select all
void bubbles()
{ asm {
sort05: xor EDX,EDX // r1=0
mov EBX,0FFFFFFFFh // r2=-1
sort10: add EDX,1 // r1++
add EBX,1 // r2++
cmp EDX,MAX_ENTRIES-1 // if (r1 >= MAX_ENTRIES) return
jge done
mov EAX,dword ptr key1[EBX*4] // eax=key1[r2]
cmp EAX,dword ptr key1[EDX*4]
jbe sort10
mov ECX,dword ptr key1[EDX*4] // ecx=key1[r1]
mov dword ptr key1[EDX*4],EAX
mov dword ptr key1[EBX*4],ECX // swap key1
#if SWAP *******************
mov EDI,dword ptr key2[EBX*4] // swap key2
mov ESI,dword ptr key2[EDX*4]
mov dword ptr key2[EBX*4],ESI
mov dword ptr key2[EDX*4],EDI
mov AL,byte ptr byte1[EDX] // swap byte 1 & 2
mov AH,byte ptr byte1[EBX]
mov CL,byte ptr byte2[EDX]
mov CH,byte ptr byte2[EBX]
mov byte ptr byte1[EDX],AH
mov byte ptr byte1[EBX],AL
mov byte ptr byte2[EDX],CH
mov byte ptr byte2[EBX],CL
#endif *******************
jmp sort05
} // end of asm
done: return;
}
-
- Posts: 7025
- Joined: Thu Aug 18, 2011 12:04 pm
- Full name: Ed Schröder
Re: C vs ASM
My PC has 4Gb, so I am fine with 3Gb.Joost Buijs wrote: The routine that determines free memory does not work properly for a 64 bit build. Probably you have to change some variables to 64 bit. Of course it doesn't matter if you don't need more than 4GB.
-
- Posts: 7025
- Joined: Thu Aug 18, 2011 12:04 pm
- Full name: Ed Schröder
Re: C vs ASM
I think you are right so forget about above reply. Good analysis.hgm wrote:So the code is not at all what it seems (mostly swapping array elements). The swap part is only executed about 9e6 times, but the comparison part (the sort10 loop) is executed 24e9 times! Only this loop has any impact on performance, (2 increments, one redundant, two memory loads and two compare+branch instructions). And the loop is not properly aligned, because the compiler has difficulty recognizing it, due to the use of gotos rather than standard control structures. And alignment of such small loops is critical. This is why the results are so erratic.
-
- Posts: 3232
- Joined: Mon May 31, 2010 1:29 pm
- Full name: lucasart
Re: C vs ASM
So why did you keep trying to contradict me with a bunch of straw man arguments in the other thread ?bob wrote:Hence my quote, repeated by many compiler gurus... "There is no such thing as the perfect code for any algorithm." One might have "the best discovered so far" but that doesn't mean there is not something better.Daniel Shawul wrote:That is a good demonstration of why assembly is bad:) Everyone is busy trying to prove how assembly is faster than C, and this happens lol. You give too much focus on details you end up loosing the plot. No offense to anyone.
However, asm is faster, execution wise, just not development wise.
The conclusion of this case study is exactly what I tried to explain on C vs ASM in the other thread!
First we have a piece of C code that is sucky beyond belief (*). It's both illegible and very inefficient. And I am convinced that there is a *causality* effect between the two: writing clean code would have made that inefficiency self-obvious. Trying to use goto instad of control structures for an alleged speed increase is typically digging a bigger hole for oneself (only do that if and when it is provably faster than control structures, which I seriously doubt in this case). The code is so obfuscated and confusing that the programmer confuses himself, and the code doesn't translate his intention.
Then we waste time with assembly, forgetting the real problem (the C code sucks). And after an endless journey in the technical litterature, and hours of sweating, we realize that manually written assembly is not faster than C.
Besides, even if the C code was truly perfect and could not be optimized, assembly is *always* a bad idea for a lot reasons:
- you will not beat the compiler (by any significant margin) with hand written assembly on any *large enough* piece of code. And if you think you have, try different compilers, different CPUs and ponder on Gerd's comment about chaos theory before jumping to conclusions.
- assembly is not portable, obviously...
- writing the "most efficient" assembly means writing a different code for each and every different (even compatible) processor. It's likely that the best assembly for 80386 is not the best for 80586 (due to instruction pairing for example), and that for modern processors things are *much* more complicated (chaos theory)
- so you end up with a zillion different versions of the assembly code for different CPUs. The result is an unmaintainable piece of crap
- the only case when assembly is useful is for hardware features unavailable in C. In the case of chess, there are only 4 that I'm aware of: lsb() msb() popcnt() prefetch(). And even for that you should *not* use assembly, because you are losing the portability, and end up with a pile of sucky #ifdef's in the code, and your code won't work on many plateforms and will need modifications for future ones... Compiler intrinsics are made for that!
- GCC's inline assembly syntax is so horrible, that it's a huge deterrent against using inline assembly in the first place. Maybe the GNU people did it on purpose to discourage the ill-advised people who think that they improve their code performance by using inline assembly
- feel free to complete this list, ad nauseam...
So, in my opinion, and for a chess program, there is no place for a single line of assembly.
(*) Ed please don't take this as a personal attack. I write sucky code too, and so does everyone (then we fix it, programming is often an iterative process). And I would like to thank you for your efforts and time on this case study. It certainly helped to debunk the myth that putting assembly code in a chess program is a good ide: it looks tempting at first, until you do it and realize that it's a bloody stupid idea...
Theory and practice sometimes clash. And when that happens, theory loses. Every single time.
-
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: C vs ASM
1. I don't use "straw man arguments". Period.lucasart wrote:So why did you keep trying to contradict me with a bunch of straw man arguments in the other thread ?bob wrote:Hence my quote, repeated by many compiler gurus... "There is no such thing as the perfect code for any algorithm." One might have "the best discovered so far" but that doesn't mean there is not something better.Daniel Shawul wrote:That is a good demonstration of why assembly is bad:) Everyone is busy trying to prove how assembly is faster than C, and this happens lol. You give too much focus on details you end up loosing the plot. No offense to anyone.
However, asm is faster, execution wise, just not development wise.
The conclusion of this case study is exactly what I tried to explain on C vs ASM in the other thread!
First we have a piece of C code that is sucky beyond belief (*). It's both illegible and very inefficient. And I am convinced that there is a *causality* effect between the two: writing clean code would have made that inefficiency self-obvious. Trying to use goto instad of control structures for an alleged speed increase is typically digging a bigger hole for oneself (only do that if and when it is provably faster than control structures, which I seriously doubt in this case). The code is so obfuscated and confusing that the programmer confuses himself, and the code doesn't translate his intention.
Then we waste time with assembly, forgetting the real problem (the C code sucks). And after an endless journey in the technical litterature, and hours of sweating, we realize that manually written assembly is not faster than C.
Besides, even if the C code was truly perfect and could not be optimized, assembly is *always* a bad idea for a lot reasons:
- you will not beat the compiler (by any significant margin) with hand written assembly on any *large enough* piece of code. And if you think you have, try different compilers, different CPUs and ponder on Gerd's comment about chaos theory before jumping to conclusions.
- assembly is not portable, obviously...
- writing the "most efficient" assembly means writing a different code for each and every different (even compatible) processor. It's likely that the best assembly for 80386 is not the best for 80586 (due to instruction pairing for example), and that for modern processors things are *much* more complicated (chaos theory)
- so you end up with a zillion different versions of the assembly code for different CPUs. The result is an unmaintainable piece of crap
- the only case when assembly is useful is for hardware features unavailable in C. In the case of chess, there are only 4 that I'm aware of: lsb() msb() popcnt() prefetch(). And even for that you should *not* use assembly, because you are losing the portability, and end up with a pile of sucky #ifdef's in the code, and your code won't work on many plateforms and will need modifications for future ones... Compiler intrinsics are made for that!
- GCC's inline assembly syntax is so horrible, that it's a huge deterrent against using inline assembly in the first place. Maybe the GNU people did it on purpose to discourage the ill-advised people who think that they improve their code performance by using inline assembly
- feel free to complete this list, ad nauseam...
So, in my opinion, and for a chess program, there is no place for a single line of assembly.
(*) Ed please don't take this as a personal attack. I write sucky code too, and so does everyone (then we fix it, programming is often an iterative process). And I would like to thank you for your efforts and time on this case study. It certainly helped to debunk the myth that putting assembly code in a chess program is a good ide: it looks tempting at first, until you do it and realize that it's a bloody stupid idea...
2. I have not changed my statement from the first post on on this topic. ASM code will almost ALWAYS be faster than C code. But it takes longer to write, longer to debug, and longer to modify when that becomes necessary.
As a general rule, I do NOT rewrite "sucky C code". First step is to attempt to discover the most efficient algorithm possible. Writing in C. Then one can always rewrite that in asm to make it faster.
-
- Posts: 7025
- Joined: Thu Aug 18, 2011 12:04 pm
- Full name: Ed Schröder
Re: C vs ASM
I don't feel offended, instead I blame myself for (unconsciencely) cherry picking a too small piece of code that performed faster in ASM than in C on my PC.lucasart wrote: (*) Ed please don't take this as a personal attack. I write sucky code too, and so does everyone (then we fix it, programming is often an iterative process). And I would like to thank you for your efforts and time on this case study.
Certainly my respect for the compiler has grown.It certainly helped to debunk the myth that putting assembly code in a chess program is a good ide: it looks tempting at first, until you do it and realize that it's a bloody stupid idea...
However when I was porting my ASM engine back to C using MSVC I ran into several problems causing speed losses. One of the examples:
In my eval I have a bunch of variables that need zeroing before starting. For instance, when I declare them as follows:
Code: Select all
static char a1,a2,a3,a4,a5,a6,a7,a8;
static char b1,b2,b3,b4,b5,b6,b7,b8;
Code: Select all
ASM
mov dword ptr a1,0
mov dword ptr a5,0
mov dword ptr b1,0
mov dword ptr b5,0
Code: Select all
C
long *p_a1 = (long *) &a1; // 32-bit redefinition
long *p_b1 = (long *) &b1; // 32-bit redefinition
p_a1[0] = p_a1[1] = p_b1[0] = p_b1[1]=0;