Basically all these compiles are just loops of 8 load and 8 store instructions, using the same addressing modes in all cases. (How could they be anything different, considering the task...)
If they run at different speed (and it seems they run at spectacularly different speed, one way or the other), it doesn't seem to have anything to do with the code. More likely that it is determined by other factors, like how the data set is mapped into memory, and whether this causes cache flushing.
C vs ASM
Moderators: hgm, Rebel, chrisw
-
- Posts: 27792
- Joined: Fri Mar 10, 2006 10:06 am
- Location: Amsterdam
- Full name: H G Muller
-
- Posts: 1563
- Joined: Thu Jul 16, 2009 10:47 am
- Location: Almere, The Netherlands
Re: C vs ASM
hgm wrote:That is also crazy. They should be the same... So far the compiler outputs that have been posted here are virtually identical to Ed's ASM code.
The fact remains that the handwritten code runs 1.5 times slower on my machine.
This is the Intel compiler:
Code: Select all
?bubbles@@YAXXZ PROC NEAR
.B3.1: ; Preds .B3.0
$LN300:
push esi ;123.2
$LN301:
push edi ;123.2
$LN302:
push ebx ;123.2
$LN303:
jmp .B3.2 ; Prob 100% ;123.2
$LN304:
; LOE ebp
.B3.5: ; Preds .B3.4
$LN305:
mov DWORD PTR [?key1.0@@4PAIA+eax*4], ebx ;129.23
$LN306:
mov DWORD PTR [?key1.0@@4PAIA+edx*4], ecx ;129.42
$LN307:
movzx ecx, BYTE PTR [?byte1.0@@4PAEA+edx] ;131.24
$LN308:
movzx ebx, BYTE PTR [?byte1.0@@4PAEA+eax] ;131.10
$LN309:
mov BYTE PTR [?byte1.0@@4PAEA+eax], cl ;131.24
$LN310:
mov esi, DWORD PTR [?key2.0@@4PAIA+edx*4] ;130.23
$LN311:
mov BYTE PTR [?byte1.0@@4PAEA+edx], bl ;131.45
$LN312:
movzx ecx, BYTE PTR [?byte2.0@@4PAEA+edx] ;132.24
$LN313:
mov edi, DWORD PTR [?key2.0@@4PAIA+eax*4] ;130.10
$LN314:
movzx ebx, BYTE PTR [?byte2.0@@4PAEA+eax] ;132.10
$LN315:
mov DWORD PTR [?key2.0@@4PAIA+eax*4], esi ;130.23
$LN316:
mov BYTE PTR [?byte2.0@@4PAEA+eax], cl ;132.24
$LN317:
mov DWORD PTR [?key2.0@@4PAIA+edx*4], edi ;130.42
$LN318:
mov BYTE PTR [?byte2.0@@4PAEA+edx], bl ;132.45
$LN319:
; LOE ebp
.B3.2: ; Preds .B3.5 .B3.1
xor eax, eax ;
mov edx, -1 ;
$LN320:
; LOE eax edx ebp
.B3.3: ; Preds .B3.4 .B3.2
$LN321:
inc eax ;126.10
$LN322:
inc edx ;126.16
$LN323:
cmp eax, 6000 ;127.10
$LN324:
jge .B3.6 ; Prob 4% ;127.10
$LN325:
; LOE eax edx ebp
.B3.4: ; Preds .B3.3
$LN326:
mov ebx, DWORD PTR [?key1.0@@4PAIA+edx*4] ;128.10
$LN327:
mov ecx, DWORD PTR [?key1.0@@4PAIA+eax*4] ;128.10
$LN328:
cmp ebx, ecx ;128.10
$LN329:
jbe .B3.3 ; Prob 82% ;128.10
$LN330:
jmp .B3.5 ; Prob 100% ;128.10
$LN331:
; LOE eax edx ecx ebx ebp
.B3.6: ; Preds .B3.3 ; Infreq
$LN332:
pop ebx ;127.31
$LN333:
pop edi ;127.31
$LN334:
pop esi ;127.31
$LN335:
ret ;127.31
ALIGN 16
$LN336:
; LOE
$LN337:
; mark_end;
?bubbles@@YAXXZ ENDP
-
- Posts: 1563
- Joined: Thu Jul 16, 2009 10:47 am
- Location: Almere, The Netherlands
Re: C vs ASM
hgm wrote:That is also crazy. They should be the same... So far the compiler outputs that have been posted here are virtually identical to Ed's ASM code.
Joost Buijs wrote:The fact remains that the handwritten code runs 1.5 times slower on my machine.
Maybe Ed uses an AMD processor? I would like to know why Ed sees something totally different.
-
- Posts: 1563
- Joined: Thu Jul 16, 2009 10:47 am
- Location: Almere, The Netherlands
Re: C vs ASM
hgm wrote:Basically all these compiles are just loops of 8 load and 8 store instructions, using the same addressing modes in all cases. (How could they be anything different, considering the task...)
If they run at different speed (and it seems they run at spectacularly different speed, one way or the other), it doesn't seem to have anything to do with the code. More likely that it is determined by other factors, like how the data set is mapped into memory, and whether this causes cache flushing.
I don't see why the data set would be mapped differently into memory when I call the ASM bubble instead of the C bubble.
Anyway, the claim that handcrafted ASM runs faster than C code is still not proven.
-
- Posts: 6991
- Joined: Thu Aug 18, 2011 12:04 pm
Re: C vs ASM
Rebel wrote:Following the heated discussion (C vs ASM) a small example that ASM still can pay off. You can compile (and run) the below program yourself. At program start it creates 6000 random 32-bit numbers and then we are going to bubble the classic way.
My Digital Mars compiler takes 22.1 secs to complete.
GCC 4.6.1 (32-bit) takes 22.2 secs
GCC 4.6.1 (64-bit) takes 22.0 secs
The hand-tuned ASM version (BUBBLES=1) takes 14.7 seconds.
I compiled GCC with the "-Ofast" option. Perhaps that's not the right one as I am new to GCC.
Joost Buijs wrote:I took your code (unmodified) and ran it under MSVC-2012 and Intel C++ v13.0. The only thing I had to replace was 'asm {' with '__asm {'.
I just used basic settings for both compilers, nothing fancy.
My timings are totally different:
MSVC 12.126 sec.
Intel 12.075 sec.
ASM 18.034 sec.
So on my machine (Sandy Bridge) your ASM code is actually a lot slower.
I get the feeling it's perhaps an alignment issue?
Two thoughts for that:
1. When I activate the debug code the speed gain totally disappears, from 14.7 to 22.2 secs
2. The same happens when I make a small (in principle meaningless) change in the ASM code:
Code: Select all
old
mov EAX,dword ptr key1[EBX*4] // eax=key1[r2]
cmp EAX,dword ptr key1[EDX*4]
jbe sort10
mov ECX,dword ptr key1[EDX*4] // ecx=key1[r1]
Code: Select all
new
mov EAX,dword ptr key1[EBX*4] // eax=key1[r2]
mov ECX,dword ptr key1[EDX*4] // ecx=key1[r1]
cmp EAX,ECX
jbe sort10
mov ECX,dword ptr key1[EDX*4] // ecx=key1[r1]
Hmm...
-
- Posts: 6991
- Joined: Thu Aug 18, 2011 12:04 pm
Re: C vs ASM
Joost Buijs wrote:Maybe Ed uses an AMD processor? I would like to know why Ed sees something totally different.
An Intel i7. But perhaps my answer above explains more.
-
- Posts: 27792
- Joined: Fri Mar 10, 2006 10:06 am
- Location: Amsterdam
- Full name: H G Muller
Re: C vs ASM
I have also seen such weird things when I was optimizing qperft, where deleting a dead piece of C-code (behind a break) would cause a slowdown of nearly 20%.
-
- Posts: 1563
- Joined: Thu Jul 16, 2009 10:47 am
- Location: Almere, The Netherlands
Re: C vs ASM
Rebel wrote:I get the feeling it's perhaps an alignment issue?
Two thoughts for that:
1. When I activate the debug code the speed gain totally disappears, from 14.7 to 22.2 secs.
2. The same happens when I make a small (in principle meaningless) change in the ASM code:
Code: Select all
old
mov EAX,dword ptr key1[EBX*4] // eax=key1[r2]
cmp EAX,dword ptr key1[EDX*4]
jbe sort10
mov ECX,dword ptr key1[EDX*4] // ecx=key1[r1]
Code: Select all
new
mov EAX,dword ptr key1[EBX*4] // eax=key1[r2]
mov ECX,dword ptr key1[EDX*4] // ecx=key1[r1]
cmp EAX,ECX
jbe sort10
mov ECX,dword ptr key1[EDX*4] // ecx=key1[r1]
And gone is the speed improvement.
Yes, that is very strange. I don't see why adding an extra instruction to your ASM code would change anything in the alignment of the data.
There is also something strange with the bubblesort itself: I replaced it with a standard bubblesort without gotos, and then it sorts the whole array in 40 msec. But that is a different topic.
Code: Select all
void bubbles() {
int i, j;
unsigned int zz;
unsigned char c;
for (j = 0; j < MAX_ENTRIES; j++) {
for (i = 1; i < MAX_ENTRIES - j; i++) {
if(key1[i-1] > key1[i]) {
zz = key1[i]; key1[i] = key1[i-1]; key1[i-1] = zz;
zz = key2[i]; key2[i] = key2[i-1]; key2[i-1] = zz;
c = byte1[i]; byte1[i] = byte1[i-1]; byte1[i-1] = c;
c = byte2[i]; byte2[i] = byte2[i-1]; byte2[i-1] = c;
}
}
}
}
-
- Posts: 6991
- Joined: Thu Aug 18, 2011 12:04 pm
Re: C vs ASM
hgm wrote:I have also seen such weird things when I was optimizing qperft, where deleting a dead piece of C code (behind a break) would cause a slowdown of nearly 20%.
Yep, see my reply to Miguel. At release time, swapping include files, variables, tables and code around (apparently still) makes sense.
-
- Posts: 2250
- Joined: Wed Mar 08, 2006 8:47 pm
- Location: Hattingen, Germany
Re: C vs ASM
Yep, here is a sample of serializing a bitset into a global list; the g++-generated assembly of the loop (below the source) is amazing!
Code: Select all
typedef bit_set<int64_t, 1> bitset;
int movelist[64];
int main()
{
int* p = movelist;
bitset x {17, 31, 61};
for (auto it = x.begin(); it != x.end(); ++it)
*p++ = *it;
return 0;
}
Code: Select all
...
test rax, rax # x$data_
je .L7 #,
mov edx, OFFSET FLAT:movelist # p,
.L3:
bsf rcx, rax # tmp123, it$mask_
mov DWORD PTR [rdx], ecx # MEM[base: p_91, offset: 0B], tmp123
lea rcx, [rax-1] # D.56459,
add rdx, 4 # p,
and rax, rcx # it$mask_, D.56459
jne .L3 #,
.L7: