Gerd, does the STL std::copy call generate the same assembly as your handwritten loop?

Gerd Isenberg wrote: Yep, sample of serializing the bitset to a global list.

Code: Select all
#include <algorithm> // for std::copy

typedef bit_set<int64_t, 1> bitset;

int movelist[64];

int main()
{
    int* p = movelist;
    bitset x {17, 31, 61};
    std::copy(x.begin(), x.end(), p); // equivalent to hand-written loop?
    return 0;
}
C vs ASM
Moderators: hgm, Rebel, chrisw
-
- Posts: 741
- Joined: Tue May 22, 2007 11:13 am
Re: C vs ASM
-
- Posts: 2250
- Joined: Wed Mar 08, 2006 8:47 pm
- Location: Hattingen, Germany
Re: C vs ASM
Rein Halbersma wrote: Gerd, does the STL std::copy call generate the same assembly as your handwritten loop?

Doesn't work actually ...
Killed - processing time exceeded
-
- Posts: 741
- Joined: Tue May 22, 2007 11:13 am
Re: C vs ASM
Gerd Isenberg wrote: Doesn't work actually ... Killed - processing time exceeded

Oh, it works for me on gcc.godbolt.org (did you perhaps forget to copy the int movelist[64] declaration?). I seem to get the same assembly from the std::copy call:
Code: Select all
test rax, rax
je .L13
mov edx, OFFSET FLAT:movelist
.L9:
bsf rcx, rax
mov DWORD PTR [rdx], ecx
lea rcx, [rax-1]
add rdx, 4
and rax, rcx
jne .L9
.L13:
-
- Posts: 2250
- Joined: Wed Mar 08, 2006 8:47 pm
- Location: Hattingen, Germany
Re: C vs ASM
Rein Halbersma wrote: Oh, it works for me on gcc.godbolt.org (did you perhaps forget to copy the int movelist[64] declaration?). Seems the same assembly from the std::copy call.

Great. Still not working for me; trying it with only the loop replaced makes no difference: Killed - processing time exceeded
Code: Select all
test rax, rax
je .L13
mov edx, OFFSET FLAT:movelist
.L9:
bsf rcx, rax
mov DWORD PTR [rdx], ecx
lea rcx, [rax-1]
add rdx, 4
and rax, rcx
jne .L9
.L13:
-
- Posts: 6991
- Joined: Thu Aug 18, 2011 12:04 pm
Re: C vs ASM
I imported the Intel code and on my PC (i7-860) I am back at square one: the standard 22.1 secs, no improvement. But there is something special with this KEY1 compare.

Joost Buijs wrote: This is the Intel compiler:
Code: Select all
?bubbles@@YAXXZ PROC NEAR
Code: Select all
Intel original 22.1 secs
B34: mov ebx, DWORD PTR key1[edx*4]
mov ecx, DWORD PTR key1[eax*4]
cmp ebx, ecx
jbe B33
jmp B35
Code: Select all
Intel modified 17.4 secs
B34: mov ebx, DWORD PTR key1[edx*4]
cmp ebx, DWORD PTR key1[eax*4] // ***
jbe B33
mov ecx, DWORD PTR key1[eax*4] // ***
jmp B35
Code: Select all
void bubbles()
{ asm {
jmp B32
B35: mov DWORD PTR key1[eax*4], ebx
mov DWORD PTR key1[edx*4], ecx
movzx ecx, BYTE PTR byte1[edx]
movzx ebx, BYTE PTR byte1[eax]
mov BYTE PTR byte1[eax], cl
mov esi, DWORD PTR key2[edx*4]
mov BYTE PTR byte1[edx], bl
movzx ecx, BYTE PTR byte2[edx]
mov edi, DWORD PTR key2[eax*4]
movzx ebx, BYTE PTR byte2[eax]
mov DWORD PTR key2[eax*4], esi
mov BYTE PTR byte2[eax], cl
mov DWORD PTR key2[edx*4], edi
mov BYTE PTR byte2[edx], bl
B32: xor eax, eax
mov edx, -1
B33: inc eax
inc edx
cmp eax, 6000
jge B36
B34: mov ebx, DWORD PTR key1[edx*4]
mov ecx, DWORD PTR key1[eax*4]
cmp ebx, ecx
jbe B33
jmp B35
}
B36: return;
}
-
- Posts: 27790
- Joined: Fri Mar 10, 2006 10:06 am
- Location: Amsterdam
- Full name: H G Muller
Re: C vs ASM
Well, it seems that results are more random than anything else.
The 'modification' you show should in principle be a degradation on most CPUs. An ALU instruction with a memory source does not fall in the 'RISC' class. It is decoded into 2 uOps (exactly the same uOps that the pair mov reg,mem; cmp reg,reg would decode to), and only decoder 1 can do that. So the beautified code creates extra decoder load. Which is especially bad if you realize this could be jumped to through the label: after fetching the first block of instructions there, the first instruction (mov, going to decoder 1) is of RISC type, and the second (cmp reg,mem) is not, so decoder 2 cannot handle it. So for the first clock decoders 2-4 will have a bubble, and only on the next cycle can the cmp go to decoder 1. The original Intel code prevents that, and can push the first 4 instructions through the decoders (as they are all RISC type), provided the alignment was such that they could all be fetched in the first code block. (Not sure if this is 32 bytes or 64 bytes on i7.)
Decoding could potentially be a bottleneck, as the loop contains many store instructions, which translate to 2 uOps. (Or has this been improved in i7?) So they can only be decoded by decoder 1, and since every other instruction seems to be one, that means you are limited to decoding two instructions per clock, and decoders 3 and 4 will almost always be idle.
But the execution itself will be limited by the load and store units anyway; I don't think you can do more than one load plus one store per clock.
What is the theoretical optimum for this task anyway? That is, to how many clock cycles does this 22 (or 17) sec translate? And how many times is the loop executed? Can you put a counter in there to measure that once and for all?
-
- Posts: 481
- Joined: Thu Apr 16, 2009 12:00 pm
- Location: Slovakia, EU
Re: C vs ASM
The slowdown is probably due to the same reason why I really _hate_ x86 asm: jump target alignment.

Rebel wrote: I get the feeling it's perhaps an alignment issue?
Two thoughts for that:
1. When I activate the debug code the speed gain totally disappears, from 14.7 to 22.2 secs
2. The same happens when I make a small (in principle meaningless) change in the ASM code:
Code: Select all
old
mov EAX,dword ptr key1[EBX*4]  // eax=key1[r2]
cmp EAX,dword ptr key1[EDX*4]
jbe sort10
mov ECX,dword ptr key1[EDX*4]  // ecx=key1[r1]
And gone is the speed improvement.

Code: Select all
new
mov EAX,dword ptr key1[EBX*4]  // eax=key1[r2]
mov ECX,dword ptr key1[EDX*4]  // ecx=key1[r1]
cmp EAX,ECX
jbe sort10
mov ECX,dword ptr key1[EDX*4]  // ecx=key1[r1]
Hmm...
I am not aware of any explicitly documented rule about this, but sometimes a 16-byte jump target alignment can improve performance by as much as 50% in very tight loops. Other times it does not matter at all. (If anyone ever wondered why MSVC outputs code with multi-byte NOP equivalents like LEA RAX, [RAX + 0] with redundant REX prefixes and a 32-bit displacement.)
Slightly off-topic:
Didn't check with latest cpus, but the original K7 arch (iirc P6 too) had a branch-predictor limitation about the number of branch targets per I-cache line.
Code like this pretty much screws up branch prediction in Athlons:
Code: Select all
xor edx, edx
mov eax, dword ptr [somevar]
cmp eax, 0xdeadc0de
jz label1
cmp eax, 0xc001babe
jz label2
cmp eax, 0x42
jz label3
cmp eax, 0xbadf00d
jz label4
jmp label5
label1:
inc edx
label2:
inc edx
label3:
inc edx
label4:
inc edx
label5:
mov dword ptr[othervar], edx
-
- Posts: 481
- Joined: Thu Apr 16, 2009 12:00 pm
- Location: Slovakia, EU
Re: C vs ASM
rvida wrote: Slightly off-topic: Didn't check with latest cpus, but the original K7 arch (iirc P6 too) had a branch-predictor limitation about the number of branch targets per I-cache line.

Ooops. Nothing to do with the I-cache. I meant the instruction fetch buffer.
-
- Posts: 481
- Joined: Thu Apr 16, 2009 12:00 pm
- Location: Slovakia, EU
Re: C vs ASM
hgm wrote: Well, it seems that results are more random than anything else. The 'modification' you show should in principle be a degradation on most CPUs. An ALU instruction with a memory source does not fall in the 'RISC' class. It is decoded into 2 uOps, and only decoder 1 can do that. [...] Decoding could potentially be a bottleneck, as the loop contains many store instructions, which translate to 2 uOps.

Most 2-uOp Load+ALU or ALU+Store instructions are fast-path (in AMD terminology) and are decodable by any of the decoders. Only Load+ALU+Store (e.g. add [mem], reg) instructions are somewhat problematic, and they decode to more than 2 uOps anyway.
Edit: ... with recent CPUs. I realized you are talking about Pentium Pro/Pentium II - with these, you are right, they can decode up to 3 instructions per cycle only if you follow the complex/simple/simple pattern.
-
- Posts: 27790
- Joined: Fri Mar 10, 2006 10:06 am
- Location: Amsterdam
- Full name: H G Muller
Re: C vs ASM
Actually we were talking about Intel CPUs, not about AMD, which makes even more of a difference than P6 vs i7. AFAIK the i7 still requires the complex/simple/simple/simple pattern (with one extra simple, as there now are 4 decoders). No such thing as fast path there.