C vs ASM

hgm · Post by **hgm** » Tue Mar 05, 2013 7:30 pm

Basically all these compiles are just loops of 8 load and 8 store instructions, using the same addressing modes in all cases. (How could they be anything different, considering the task...)

If they run at different speed (and it seems they run at spectacularly different speed, one way or the other), it doesn't seem to have anything to do with the code. More likely that it is determined by other factors, like how the data set is mapped into memory, and whether this causes cache flushing.
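One way to look at the "how the data set is mapped into memory" idea is simply to print where the four arrays ended up. The sketch below is only an illustration: the array names and the 6000-entry size come from the test program in this thread, while the helper names `offset_in_block` and `report_layout`, the 64-byte cache line and the 4096-byte page are assumptions (typical for current x86 parts, but not stated anywhere in the thread).

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define MAX_ENTRIES 6000            /* the thread sorts 6000 32-bit keys */

static unsigned int  key1[MAX_ENTRIES], key2[MAX_ENTRIES];
static unsigned char byte1[MAX_ENTRIES], byte2[MAX_ENTRIES];

/* Offset of an address inside a power-of-two sized block,
   e.g. a 64-byte cache line or a 4096-byte page (both assumed sizes). */
unsigned long offset_in_block(const void *p, unsigned long block)
{
    assert(block != 0 && (block & (block - 1)) == 0);   /* power of two only */
    return (unsigned long)((uintptr_t)p & (block - 1));
}

/* Print the start address of each array and its line/page offsets. */
void report_layout(void)
{
    const void *base[4] = { key1, key2, byte1, byte2 };
    const char *name[4] = { "key1", "key2", "byte1", "byte2" };
    int i;

    for (i = 0; i < 4; i++)
        printf("%-5s %p  line offset %2lu  page offset %4lu\n",
               name[i], (void *)base[i],
               offset_in_block(base[i], 64),
               offset_in_block(base[i], 4096));
}
```

Calling `report_layout()` once in the C build and once in the ASM build would show whether the two builds really place the data differently. If the offsets are identical, the data layout is probably not the explanation.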

Joost Buijs · Post by **Joost Buijs** » Tue Mar 05, 2013 7:43 pm

hgm wrote:That is also crazy. They should be the same... So far the compiler outputs that have been posted here are virtually identical to Ed's ASM code.

The fact remains that the handwritten code runs 1.5 times slower on my machine.

This is the Intel compiler:

Code: Select all

?bubbles@@YAXXZ	PROC NEAR
.B3.1:                          ; Preds .B3.0
$LN300:
        push      esi                                           ;123.2
$LN301:
        push      edi                                           ;123.2
$LN302:
        push      ebx                                           ;123.2
$LN303:
        jmp       .B3.2         ; Prob 100%                     ;123.2
$LN304:
                                ; LOE ebp
.B3.5:                          ; Preds .B3.4
$LN305:
        mov       DWORD PTR [?key1.0@@4PAIA+eax*4], ebx         ;129.23
$LN306:
        mov       DWORD PTR [?key1.0@@4PAIA+edx*4], ecx         ;129.42
$LN307:
        movzx     ecx, BYTE PTR [?byte1.0@@4PAEA+edx]           ;131.24
$LN308:
        movzx     ebx, BYTE PTR [?byte1.0@@4PAEA+eax]           ;131.10
$LN309:
        mov       BYTE PTR [?byte1.0@@4PAEA+eax], cl            ;131.24
$LN310:
        mov       esi, DWORD PTR [?key2.0@@4PAIA+edx*4]         ;130.23
$LN311:
        mov       BYTE PTR [?byte1.0@@4PAEA+edx], bl            ;131.45
$LN312:
        movzx     ecx, BYTE PTR [?byte2.0@@4PAEA+edx]           ;132.24
$LN313:
        mov       edi, DWORD PTR [?key2.0@@4PAIA+eax*4]         ;130.10
$LN314:
        movzx     ebx, BYTE PTR [?byte2.0@@4PAEA+eax]           ;132.10
$LN315:
        mov       DWORD PTR [?key2.0@@4PAIA+eax*4], esi         ;130.23
$LN316:
        mov       BYTE PTR [?byte2.0@@4PAEA+eax], cl            ;132.24
$LN317:
        mov       DWORD PTR [?key2.0@@4PAIA+edx*4], edi         ;130.42
$LN318:
        mov       BYTE PTR [?byte2.0@@4PAEA+edx], bl            ;132.45
$LN319:
                                ; LOE ebp
.B3.2:                          ; Preds .B3.5 .B3.1
        xor       eax, eax                                      ;
        mov       edx, -1                                       ;
$LN320:
                                ; LOE eax edx ebp
.B3.3:                          ; Preds .B3.4 .B3.2
$LN321:
        inc       eax                                           ;126.10
$LN322:
        inc       edx                                           ;126.16
$LN323:
        cmp       eax, 6000                                     ;127.10
$LN324:
        jge       .B3.6         ; Prob 4%                       ;127.10
$LN325:
                                ; LOE eax edx ebp
.B3.4:                          ; Preds .B3.3
$LN326:
        mov       ebx, DWORD PTR [?key1.0@@4PAIA+edx*4]         ;128.10
$LN327:
        mov       ecx, DWORD PTR [?key1.0@@4PAIA+eax*4]         ;128.10
$LN328:
        cmp       ebx, ecx                                      ;128.10
$LN329:
        jbe       .B3.3         ; Prob 82%                      ;128.10
$LN330:
        jmp       .B3.5         ; Prob 100%                     ;128.10
$LN331:
                                ; LOE eax edx ecx ebx ebp
.B3.6:                          ; Preds .B3.3                   ; Infreq
$LN332:
        pop       ebx                                           ;127.31
$LN333:
        pop       edi                                           ;127.31
$LN334:
        pop       esi                                           ;127.31
$LN335:
        ret                                                     ;127.31
        ALIGN     16
$LN336:
                                ; LOE
$LN337:
; mark_end;
?bubbles@@YAXXZ ENDP

Joost Buijs · Post by **Joost Buijs** » Tue Mar 05, 2013 7:54 pm

Joost Buijs wrote:
hgm wrote:That is also crazy. They should be the same... So far the compiler outputs that have been posted here are virtually identical to Ed's ASM code.
The fact remains that the handwritten code runs 1.5 times slower on my machine.

Maybe Ed uses an AMD processor? I would like to know why Ed sees something totally different.

Joost Buijs · Post by **Joost Buijs** » Tue Mar 05, 2013 8:17 pm

hgm wrote:Basically all these compiles are just loops of 8 load and 8 store instructions, using the same addressing modes in all cases. (How could they be anything different, considering the task...)

If they run at different speed (and it seems they run at spectacularly different speed, one way or the other), it doesn't seem to have anything to do with the code. More likely that it is determined by other factors, like how the data set is mapped into memory, and whether this causes cache flushing.

I don't see why the data set would be mapped differently into memory when I call the ASM bubble instead of the C bubble.

Anyway, the claim that handcrafted ASM runs faster than C code is still not proven.

Rebel · Post by **Rebel** » Tue Mar 05, 2013 8:20 pm

Joost Buijs wrote:
Rebel wrote:Following the heated discussion (C vs ASM) a small example that ASM still can pay off. You can compile (and run) the below program yourself. At program start it creates 6000 random 32-bit numbers and then we are going to bubble the classic way.

My Digital Mars compiler takes 22.1 secs to complete.
GCC 4.6.1 (32 bit) takes 22.2 secs
GCC 4.6.1 (64-bit) takes 22.0 secs

The hand tuned ASM version (BUBBLES=1) takes 14.7 seconds.

I compiled GCC with the "-Ofast" option. Perhaps that's not the right one as I am new to GCC.
I took your code (unmodified) and ran it under MSVC-2012 and Intel C++ v13.0. The only thing I had to replace was 'asm {' with '__asm {'
I just used basic settings for both compilers, nothing fancy.

My timings are totally different:
MSVC 12.126 sec.
Intel 12.075 sec.
ASM 18.034 sec.

So on my machine (Sandy-Bridge) your ASM code is actually a lot slower.

I get the feeling it's perhaps an alignment issue?

Two thoughts for that:

1. When I activate the debug code the speed gain totally disappears, from 14.7 to 22.2 secs

2. The same happens when I make a small (in principle meaningless) change in the ASM code:

Code: Select all

old
        mov EAX,dword ptr key1[EBX*4]       // eax=key1[r2]
        cmp EAX,dword ptr key1[EDX*4]
        jbe sort10
        mov ECX,dword ptr key1[EDX*4]       // ecx=key1[r1]

Code: Select all

new
        mov EAX,dword ptr key1[EBX*4]       // eax=key1[r2]
        mov ECX,dword ptr key1[EDX*4]       // ecx=key1[r1]
        cmp EAX,ECX
        jbe sort10
        mov ECX,dword ptr key1[EDX*4]       // ecx=key1[r1]

And gone is the speed improvement.

Hmm...

Rebel · Post by **Rebel** » Tue Mar 05, 2013 8:23 pm

Joost Buijs wrote:
Joost Buijs wrote:
hgm wrote:That is also crazy. They should be the same... So far the compiler outputs that have been posted here are virtually identical to Ed's ASM code.
The fact remains that the handwritten code runs 1.5 times slower on my machine.
Maybe Ed uses an AMD processor? I would like to know why Ed sees something totally different.

Intel I7

But perhaps my above answer explains more.

hgm · Post by **hgm** » Tue Mar 05, 2013 8:27 pm

I have also seen such weird things when I was optimizing qperft, where deleting a dead piece of C-code (behind a break) would cause a slowdown of nearly 20%.

Joost Buijs · Post by **Joost Buijs** » Tue Mar 05, 2013 8:36 pm

Rebel wrote:
Joost Buijs wrote:
Rebel wrote:Following the heated discussion (C vs ASM) a small example that ASM still can pay off. You can compile (and run) the below program yourself. At program start it creates 6000 random 32-bit numbers and then we are going to bubble the classic way.

My Digital Mars compiler takes 22.1 secs to complete.
GCC 4.6.1 (32 bit) takes 22.2 secs
GCC 4.6.1 (64-bit) takes 22.0 secs

The hand tuned ASM version (BUBBLES=1) takes 14.7 seconds.

I compiled GCC with the "-Ofast" option. Perhaps that's not the right one as I am new to GCC.
I took your code (unmodified) and ran it under MSVC-2012 and Intel C++ v13.0. The only thing I had to replace was 'asm {' with '__asm {'
I just used basic settings for both compilers, nothing fancy.

My timings are totally different:
MSVC 12.126 sec.
Intel 12.075 sec.
ASM 18.034 sec.

So on my machine (Sandy-Bridge) your ASM code is actually a lot slower.

I get the feeling it's perhaps an alignment issue?

Two thoughts for that:

1. When I activate the debug code the speed gain totally disappears, from 14.7 to 22.2 secs

2. The same happens when I make a small (in principle meaningless) change in the ASM code:
Code: Select all
old
        mov EAX,dword ptr key1[EBX*4]       // eax=key1[r2]
        cmp EAX,dword ptr key1[EDX*4]
        jbe sort10
        mov ECX,dword ptr key1[EDX*4]       // ecx=key1[r1]

Code: Select all
new
        mov EAX,dword ptr key1[EBX*4]       // eax=key1[r2]
        mov ECX,dword ptr key1[EDX*4]       // ecx=key1[r1]
        cmp EAX,ECX
        jbe sort10
        mov ECX,dword ptr key1[EDX*4]       // ecx=key1[r1]
And gone is the speed improvement.

Hmm...

Yes that is very strange. I don't see why adding an extra instruction to your ASM code would change anything in the alignment of the data.

There is also something strange with the bubblesort itself: I replaced it with a standard bubblesort without gotos etc., and then it sorts the whole array in 40 msec. But that is a different topic.

Code: Select all

void bubbles() {
    int i, j;
    unsigned int zz;
    unsigned char c;

    for (j = 0; j < MAX_ENTRIES; j++) {
        for (i = 1; i < MAX_ENTRIES - j; i++) {
            if (key1[i-1] > key1[i]) {
                zz = key1[i]; key1[i] = key1[i-1]; key1[i-1] = zz;
                zz = key2[i]; key2[i] = key2[i-1]; key2[i-1] = zz;
                c = byte1[i]; byte1[i] = byte1[i-1]; byte1[i-1] = c;
                c = byte2[i]; byte2[i] = byte2[i-1]; byte2[i-1] = c;
            }
        }
    }
}
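For anyone who wants to reproduce the 40 msec figure, a small harness around that routine could look like the sketch below. It is only a sketch under assumptions: `fill_random`, `check_sorted` and `time_one_sort` are hypothetical helper names, the payload arrays are filled with values derived from each key purely so the check can confirm that all four arrays moved together, and `clock()` measures CPU time at coarse resolution. It makes no attempt to reproduce the driver of the original test program.

```c
#include <assert.h>
#include <stdlib.h>
#include <time.h>

#define MAX_ENTRIES 6000

unsigned int  key1[MAX_ENTRIES], key2[MAX_ENTRIES];
unsigned char byte1[MAX_ENTRIES], byte2[MAX_ENTRIES];

/* The plain bubble sort from the post above, unchanged in behavior. */
void bubbles(void)
{
    int i, j;
    unsigned int zz;
    unsigned char c;

    for (j = 0; j < MAX_ENTRIES; j++) {
        for (i = 1; i < MAX_ENTRIES - j; i++) {
            if (key1[i-1] > key1[i]) {
                zz = key1[i]; key1[i] = key1[i-1]; key1[i-1] = zz;
                zz = key2[i]; key2[i] = key2[i-1]; key2[i-1] = zz;
                c = byte1[i]; byte1[i] = byte1[i-1]; byte1[i-1] = c;
                c = byte2[i]; byte2[i] = byte2[i-1]; byte2[i-1] = c;
            }
        }
    }
}

/* Fill key1 with pseudo-random 32-bit values; derive the other arrays
   from each key so that the pairing can be verified after sorting. */
void fill_random(unsigned int seed)
{
    int i;

    srand(seed);
    for (i = 0; i < MAX_ENTRIES; i++) {
        /* rand() may deliver only 15 bits, so combine three calls */
        key1[i]  = ((unsigned int)rand() << 30) ^
                   ((unsigned int)rand() << 15) ^ (unsigned int)rand();
        key2[i]  = ~key1[i];
        byte1[i] = (unsigned char)(key1[i] & 0xFF);
        byte2[i] = (unsigned char)(key1[i] >> 24);
    }
}

/* Returns 1 if key1 is ascending and every payload still matches its key. */
int check_sorted(void)
{
    int i;

    for (i = 0; i < MAX_ENTRIES; i++) {
        if (i > 0 && key1[i-1] > key1[i]) return 0;
        if (key2[i] != ~key1[i]) return 0;
        if (byte1[i] != (unsigned char)(key1[i] & 0xFF)) return 0;
        if (byte2[i] != (unsigned char)(key1[i] >> 24)) return 0;
    }
    return 1;
}

/* CPU seconds for one sort of freshly randomized data. */
double time_one_sort(void)
{
    clock_t start;

    fill_random(12345u);
    start = clock();
    bubbles();
    assert(check_sorted());
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

Printing the result of `time_one_sort()` gives the time for a single pass over 6000 random keys, which is the number to compare against the 40 msec mentioned above.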

Rebel · Post by **Rebel** » Tue Mar 05, 2013 9:42 pm

hgm wrote:I have also seen such weird things when I was optimizing qperft, where deleting a dead piece of C-code (behind a break) would cause a slowdown of nearly 20%.

Yep, see my reply to Miguel. At release time swapping include files, variables, tables and code (apparently still) makes sense.

Gerd Isenberg · Post by **Gerd Isenberg** » Tue Mar 05, 2013 10:26 pm

Yep, here is a sample of serializing the bitset into a global list.

Code: Select all

typedef bit_set<int64_t, 1> bitset;

int movelist[64];
int main()
{
  int* p = movelist;
  bitset x {17, 31, 61};
  for (auto it = x.begin(); it != x.end(); ++it)
    *p++ = *it;
  return 0;
}

g++ generated assembly of the loop:

Code: Select all

	...
	test	rax, rax	# x$data_
	je	.L7	#,
	mov	edx, OFFSET FLAT:movelist	# p,
.L3:
	bsf	rcx, rax	# tmp123, it$mask_
	mov	DWORD PTR [rdx], ecx	# MEM[base: p_91, offset: 0B], tmp123
	lea	rcx, [rax-1]	# D.56459,
	add	rdx, 4	# p,
	and	rax, rcx	# it$mask_, D.56459
	jne	.L3	#,
.L7:

amazing!
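For comparison, the loop g++ produced there corresponds roughly to the hand-written C below. This is a sketch, not the `bit_set` implementation: `serialize_bits` is a hypothetical name, `__builtin_ctzll` is a GCC/Clang builtin (other compilers need their own bit-scan intrinsic), and reading `{17, 31, 61}` as "bits 17, 31 and 61 are set" is an assumption about what the initializer means.

```c
#include <assert.h>
#include <stdint.h>

/* Store the index of every set bit of x in out[], lowest bit first.
   Returns the number of indices written (at most 64). */
int serialize_bits(uint64_t x, int *out)
{
    int n = 0;

    assert(out != 0);
    while (x != 0) {
        out[n++] = __builtin_ctzll(x);  /* index of lowest set bit, cf. bsf  */
        x &= x - 1;                     /* clear that bit, cf. lea + and     */
    }
    return n;
}
```

With `x = (1ULL << 17) | (1ULL << 31) | (1ULL << 61)` the function writes 17, 31, 61 and returns 3. The point of the assembly listing is that the iterator-based C++ compiles down to the same handful of instructions as this explicit loop.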
