C vs ASM

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

User avatar
hgm
Posts: 27792
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: C vs ASM

Post by hgm »

Basically all these compiles are just loops of 8 load and 8 store instructions, using the same addressing modes in all cases. (How could they be anything different, considering the task...)

If they run at different speed (and it seems they run at spectacularly different speed, one way or the other), it doesn't seem to have anything to do with the code. More likely that it is determined by other factors, like how the data set is mapped into memory, and whether this causes cache flushing.
Joost Buijs
Posts: 1563
Joined: Thu Jul 16, 2009 10:47 am
Location: Almere, The Netherlands

Re: C vs ASM

Post by Joost Buijs »

hgm wrote:That is also crazy. They should be the same... So far the compiler outputs that have been posted here are virtually identical to Ed's ASM code.
The fact remains that the handwritten code runs 1.5 times slower on my machine.

This is the Intel compiler:

Code: Select all

?bubbles@@YAXXZ	PROC NEAR 
.B3.1:                          ; Preds .B3.0
$LN300:
        push      esi                                           ;123.2
$LN301:
        push      edi                                           ;123.2
$LN302:
        push      ebx                                           ;123.2
$LN303:
        jmp       .B3.2         ; Prob 100%                     ;123.2
$LN304:
                                ; LOE ebp
.B3.5:                          ; Preds .B3.4
$LN305:
        mov       DWORD PTR [?key1.0@@4PAIA+eax*4], ebx         ;129.23
$LN306:
        mov       DWORD PTR [?key1.0@@4PAIA+edx*4], ecx         ;129.42
$LN307:
        movzx     ecx, BYTE PTR [?byte1.0@@4PAEA+edx]           ;131.24
$LN308:
        movzx     ebx, BYTE PTR [?byte1.0@@4PAEA+eax]           ;131.10
$LN309:
        mov       BYTE PTR [?byte1.0@@4PAEA+eax], cl            ;131.24
$LN310:
        mov       esi, DWORD PTR [?key2.0@@4PAIA+edx*4]         ;130.23
$LN311:
        mov       BYTE PTR [?byte1.0@@4PAEA+edx], bl            ;131.45
$LN312:
        movzx     ecx, BYTE PTR [?byte2.0@@4PAEA+edx]           ;132.24
$LN313:
        mov       edi, DWORD PTR [?key2.0@@4PAIA+eax*4]         ;130.10
$LN314:
        movzx     ebx, BYTE PTR [?byte2.0@@4PAEA+eax]           ;132.10
$LN315:
        mov       DWORD PTR [?key2.0@@4PAIA+eax*4], esi         ;130.23
$LN316:
        mov       BYTE PTR [?byte2.0@@4PAEA+eax], cl            ;132.24
$LN317:
        mov       DWORD PTR [?key2.0@@4PAIA+edx*4], edi         ;130.42
$LN318:
        mov       BYTE PTR [?byte2.0@@4PAEA+edx], bl            ;132.45
$LN319:
                                ; LOE ebp
.B3.2:                          ; Preds .B3.5 .B3.1
        xor       eax, eax                                      ;
        mov       edx, -1                                       ;
$LN320:
                                ; LOE eax edx ebp
.B3.3:                          ; Preds .B3.4 .B3.2
$LN321:
        inc       eax                                           ;126.10
$LN322:
        inc       edx                                           ;126.16
$LN323:
        cmp       eax, 6000                                     ;127.10
$LN324:
        jge       .B3.6         ; Prob 4%                       ;127.10
$LN325:
                                ; LOE eax edx ebp
.B3.4:                          ; Preds .B3.3
$LN326:
        mov       ebx, DWORD PTR [?key1.0@@4PAIA+edx*4]         ;128.10
$LN327:
        mov       ecx, DWORD PTR [?key1.0@@4PAIA+eax*4]         ;128.10
$LN328:
        cmp       ebx, ecx                                      ;128.10
$LN329:
        jbe       .B3.3         ; Prob 82%                      ;128.10
$LN330:
        jmp       .B3.5         ; Prob 100%                     ;128.10
$LN331:
                                ; LOE eax edx ecx ebx ebp
.B3.6:                          ; Preds .B3.3                   ; Infreq
$LN332:
        pop       ebx                                           ;127.31
$LN333:
        pop       edi                                           ;127.31
$LN334:
        pop       esi                                           ;127.31
$LN335:
        ret                                                     ;127.31
        ALIGN     16
$LN336:
                                ; LOE
$LN337:
; mark_end;
?bubbles@@YAXXZ ENDP
Joost Buijs
Posts: 1563
Joined: Thu Jul 16, 2009 10:47 am
Location: Almere, The Netherlands

Re: C vs ASM

Post by Joost Buijs »

Joost Buijs wrote:
hgm wrote:That is also crazy. They should be the same... So far the compiler outputs that have been posted here are virtually identical to Ed's ASM code.
The fact remains that the handwritten code runs 1.5 times slower on my machine.
Maybe Ed uses an AMD processor? I would like to know why Ed sees something totally different.
Joost Buijs
Posts: 1563
Joined: Thu Jul 16, 2009 10:47 am
Location: Almere, The Netherlands

Re: C vs ASM

Post by Joost Buijs »

hgm wrote:Basically all these compiles are just loops of 8 load and 8 store instructions, using the same addressing modes in all cases. (How could they be anything different, considering the task...)

If they run at different speed (and it seems they run at spectacularly different speed, one way or the other), it doesn't seem to have anything to do with the code. More likely that it is determined by other factors, like how the data set is mapped into memory, and whether this causes cache flushing.
I don't see why the data set would be mapped differently into memory when I call the ASM bubble instead of the C bubble.

Anyway the claim that handcrafted ASM runs faster then C code is still not proven.
User avatar
Rebel
Posts: 6991
Joined: Thu Aug 18, 2011 12:04 pm

Re: C vs ASM

Post by Rebel »

Joost Buijs wrote:
Rebel wrote:Following the heated discussion (C vs ASM) a small example that ASM still can pay off. You can compile (and run) the below program yourself. At program start it creates 6000 random 32-bit numbers and then we are going to bubble the classic way.

My Digital Mars compiler takes 22.1 secs to complete.
GCC 4.6.1 (32 bit) takes 22.2 secs
GCC 4.6.1 (64-bit) takes 22.0 secs

The hand tuned ASM version (BUBBLES=1) takes 14.7 seconds.

I compiled GCC with the "-Ofast" option. Perhaps that's not the right one as I am new to GCC.
I took your code (unmodified) and run it under MSVC-2012 and Intel C++ v13.0. The only thing I had to replace was 'asm {' with '__asm {'
I just used basic settings for both compilers, nothing fancy.

My timings are totally different:
MSVC 12.126 sec.
Intel 12.075 sec.
ASM 18.034 sec.

So on my machine (Sandy-Bridge) your ASM code is actually a lot slower.
:lol:

I get the feeling it's perhaps an alignment issue?

Two thoughts for that:

1. When I activate the debug code the speed gain totally disappears, from 14.7 to 22.2 secs

2. The same happens when I make a small (in principle meaningless) change in the ASM code:

Code: Select all

old
        mov EAX,dword ptr key1[EBX*4]       // eax=key1[r2] 
        cmp EAX,dword ptr key1[EDX*4] 
        jbe sort10 
        mov ECX,dword ptr key1[EDX*4]       // ecx=key1[r1] 
 

Code: Select all

new
        mov EAX,dword ptr key1[EBX*4]       // eax=key1[r2]
        mov ECX,dword ptr key1[EDX*4]       // ecx=key1[r1]
        cmp EAX,ECX
        jbe sort10
        mov ECX,dword ptr key1[EDX*4]       // ecx=key1[r1]
And gone is the speed improvement.

Hmm...
User avatar
Rebel
Posts: 6991
Joined: Thu Aug 18, 2011 12:04 pm

Re: C vs ASM

Post by Rebel »

Joost Buijs wrote:
Joost Buijs wrote:
hgm wrote:That is also crazy. They should be the same... So far the compiler outputs that have been posted here are virtually identical to Ed's ASM code.
The fact remains that the handwritten code runs 1.5 times slower on my machine.
Maybe Ed uses an AMD processor? I would like to know why Ed sees something totally different.
Intel I7

But perhaps my above answer explains more.
User avatar
hgm
Posts: 27792
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: C vs ASM

Post by hgm »

I have also seen such weird things when I was optimizing qperft, where deleting a dead piece of C-code (behind a break) would cause a slowdown of nearly 20%.
Joost Buijs
Posts: 1563
Joined: Thu Jul 16, 2009 10:47 am
Location: Almere, The Netherlands

Re: C vs ASM

Post by Joost Buijs »

Rebel wrote:
Joost Buijs wrote:
Rebel wrote:Following the heated discussion (C vs ASM) a small example that ASM still can pay off. You can compile (and run) the below program yourself. At program start it creates 6000 random 32-bit numbers and then we are going to bubble the classic way.

My Digital Mars compiler takes 22.1 secs to complete.
GCC 4.6.1 (32 bit) takes 22.2 secs
GCC 4.6.1 (64-bit) takes 22.0 secs

The hand tuned ASM version (BUBBLES=1) takes 14.7 seconds.

I compiled GCC with the "-Ofast" option. Perhaps that's not the right one as I am new to GCC.
I took your code (unmodified) and run it under MSVC-2012 and Intel C++ v13.0. The only thing I had to replace was 'asm {' with '__asm {'
I just used basic settings for both compilers, nothing fancy.

My timings are totally different:
MSVC 12.126 sec.
Intel 12.075 sec.
ASM 18.034 sec.

So on my machine (Sandy-Bridge) your ASM code is actually a lot slower.
:lol:

I get the feeling it's perhaps an alignment issue?

Two thoughts for that:

1. When I activate the debug code the speed gain totally disappears, from 14.7 to 22.2 secs

2. The same happens when I make a small (in principle meaningless) change in the ASM code:

Code: Select all

old
        mov EAX,dword ptr key1[EBX*4]       // eax=key1[r2] 
        cmp EAX,dword ptr key1[EDX*4] 
        jbe sort10 
        mov ECX,dword ptr key1[EDX*4]       // ecx=key1[r1] 
 

Code: Select all

new
        mov EAX,dword ptr key1[EBX*4]       // eax=key1[r2]
        mov ECX,dword ptr key1[EDX*4]       // ecx=key1[r1]
        cmp EAX,ECX
        jbe sort10
        mov ECX,dword ptr key1[EDX*4]       // ecx=key1[r1]
And gone is the speed improvement.

Hmm...
Yes that is very strange. I don't see why adding an extra instruction to your ASM code would change anything in the alignment of the data.

There is also something strange with the bubblesort itself, I replaced it with a standard bubblesort without goto's etc. and then it sorts the whole array in 40 msec. But that is a different topic.

Code: Select all

 void bubbles() {
    int i, j;
	unsigned int zz;
	unsigned char c;

    for &#40;j = 0; j < MAX_ENTRIES; j++) &#123;
       for &#40;i = 1; i < MAX_ENTRIES - j; i++) &#123;
          if&#40;key1&#91;i-1&#93; > key1&#91;i&#93;) &#123;
             zz = key1&#91;i&#93;; key1&#91;i&#93; = key1&#91;i-1&#93;; key1&#91;i-1&#93; = zz;
			 zz = key2&#91;i&#93;; key2&#91;i&#93; = key2&#91;i-1&#93;; key2&#91;i-1&#93; = zz;
			 c = byte1&#91;i&#93;; byte1&#91;i&#93; = byte1&#91;i-1&#93;; byte1&#91;i-1&#93; = c;
			 c = byte2&#91;i&#93;; byte2&#91;i&#93; = byte2&#91;i-1&#93;; byte2&#91;i-1&#93; = c;
          &#125;
       &#125;
    &#125;
 &#125;
User avatar
Rebel
Posts: 6991
Joined: Thu Aug 18, 2011 12:04 pm

Re: C vs ASM

Post by Rebel »

hgm wrote:I have also seen such weird things when I was optimizing qperft, where deleting a dead piece of C-code (behind a break) would cause a slowdown of nearly 20%.
Yep, see my reply to Miguel. At release time swapping include files, variables, tables and code (apparently still) makes sense.
Gerd Isenberg
Posts: 2250
Joined: Wed Mar 08, 2006 8:47 pm
Location: Hattingen, Germany

Re: C vs ASM

Post by Gerd Isenberg »

Yep, sample of serializing the bitset to a global lst.

Code: Select all

typedef bit_set<int64_t, 1> bitset;

int movelist&#91;64&#93;;
int main&#40;)
&#123; 
  int* p = movelist;
  bitset x &#123;17, 31, 61&#125;;
  for &#40;auto it = x.begin&#40;); it != x.end&#40;); ++it&#41; 
    *p++ = *it;
  return 0;
&#125;
g++ generated assembly of the loop:

Code: Select all

	...
	test	rax, rax	# x$data_
	je	.L7	#,
	mov	edx, OFFSET FLAT&#58;movelist	# p,
.L3&#58;
	bsf	rcx, rax	# tmp123, it$mask_
	mov	DWORD PTR &#91;rdx&#93;, ecx	# MEM&#91;base&#58; p_91, offset&#58; 0B&#93;, tmp123
	lea	rcx, &#91;rax-1&#93;	# D.56459,
	add	rdx, 4	# p,
	and	rax, rcx	# it$mask_, D.56459
	jne	.L3	#,
.L7&#58;
amazing!