Gerd, does the STL std::copy call generate the same assembly as your handwritten loop?

Gerd Isenberg wrote: Yep, sample of serializing the bitset to a global list.

Code: Select all
#include <algorithm> // for std::copy

typedef bit_set<int64_t, 1> bitset;

int movelist[64];

int main()
{
    int* p = movelist;
    bitset x {17, 31, 61};
    std::copy(x.begin(), x.end(), p); // equivalent to hand-written loop?
    return 0;
}
C vs ASM
Moderators: hgm, Rebel, chrisw
-
- Posts: 741
- Joined: Tue May 22, 2007 11:13 am
Re: C vs ASM
-
- Posts: 2250
- Joined: Wed Mar 08, 2006 8:47 pm
- Location: Hattingen, Germany
Re: C vs ASM
Rein Halbersma wrote: Gerd, does the STL std::copy call generate the same assembly as your handwritten loop?

Doesn't work actually ...
Killed - processing time exceeded
-
- Posts: 741
- Joined: Tue May 22, 2007 11:13 am
Re: C vs ASM
Gerd Isenberg wrote: Doesn't work actually ... Killed - processing time exceeded

Oh, it works for me on gcc.godbolt.org (did you perhaps forget to copy the int movelist[64] declaration?). I seem to get the same assembly from the std::copy call:
Code: Select all
test rax, rax
je .L13
mov edx, OFFSET FLAT:movelist
.L9:
bsf rcx, rax
mov DWORD PTR [rdx], ecx
lea rcx, [rax-1]
add rdx, 4
and rax, rcx
jne .L9
.L13:
-
- Posts: 2250
- Joined: Wed Mar 08, 2006 8:47 pm
- Location: Hattingen, Germany
Re: C vs ASM
Rein Halbersma wrote: Oh, it works for me on gcc.godbolt.org (did you perhaps forget to copy the int movelist[64] declaration?). Seems the same assembly from the std::copy call.

Great. Still not working for me; trying it with only the loop replaced makes no difference: Killed - processing time exceeded
Code: Select all
test rax, rax
je .L13
mov edx, OFFSET FLAT:movelist
.L9:
bsf rcx, rax
mov DWORD PTR [rdx], ecx
lea rcx, [rax-1]
add rdx, 4
and rax, rcx
jne .L9
.L13:
-
- Posts: 6991
- Joined: Thu Aug 18, 2011 12:04 pm
Re: C vs ASM
I imported the Intel code and on my PC (i7-860) I am back at square one: the standard 22.1 secs, no improvement. But there is something special with this KEY1 compare.

Joost Buijs wrote: This is the Intel compiler:
Code: Select all
?bubbles@@YAXXZ PROC NEAR
Code: Select all
Intel original 22.1 secs
B34: mov ebx, DWORD PTR key1[edx*4]
mov ecx, DWORD PTR key1[eax*4]
cmp ebx, ecx
jbe B33
jmp B35
Code: Select all
Intel modified 17.4 secs
B34: mov ebx, DWORD PTR key1[edx*4]
cmp ebx, DWORD PTR key1[eax*4] // ***
jbe B33
mov ecx, DWORD PTR key1[eax*4] // ***
jmp B35
Code: Select all
void bubbles()
{ asm {
jmp B32
B35: mov DWORD PTR key1[eax*4], ebx
mov DWORD PTR key1[edx*4], ecx
movzx ecx, BYTE PTR byte1[edx]
movzx ebx, BYTE PTR byte1[eax]
mov BYTE PTR byte1[eax], cl
mov esi, DWORD PTR key2[edx*4]
mov BYTE PTR byte1[edx], bl
movzx ecx, BYTE PTR byte2[edx]
mov edi, DWORD PTR key2[eax*4]
movzx ebx, BYTE PTR byte2[eax]
mov DWORD PTR key2[eax*4], esi
mov BYTE PTR byte2[eax], cl
mov DWORD PTR key2[edx*4], edi
mov BYTE PTR byte2[edx], bl
B32: xor eax, eax
mov edx, -1
B33: inc eax
inc edx
cmp eax, 6000
jge B36
B34: mov ebx, DWORD PTR key1[edx*4]
mov ecx, DWORD PTR key1[eax*4]
cmp ebx, ecx
jbe B33
jmp B35
}
B36: return;
}
-
- Posts: 27790
- Joined: Fri Mar 10, 2006 10:06 am
- Location: Amsterdam
- Full name: H G Muller
Re: C vs ASM
Well, it seems that results are more random than anything else.
The 'modification' you show should in principle be a degradation on most CPUs. An ALU instruction with a memory source does not fall in the 'RISC' class. It is decoded into 2 uOps (exactly the same uOps that the pair mov reg,mem; cmp reg,reg would decode to), and only decoder 1 can do that. So the beautified code creates extra decoder load. Which is especially bad if you realize this could be jumped to through the label: after fetching the first block of instructions there, the first instruction (mov, going to decoder 1) is of RISC type, and the second (cmp reg,mem) is not, so decoder 2 cannot handle it. So for the first clock decoders 2-4 will have a bubble, and only on the next cycle can the cmp go to decoder 1. The original Intel code prevents that, and can push the first 4 instructions through the decoders (as they are all RISC type), provided the alignment was such that they could all be fetched in the first code block. (Not sure if this is 32 bytes or 64 bytes on i7.)
Decoding could potentially be a bottleneck, as the loop contains many store instructions, which translate to 2 uOps. (Or has this been improved in i7?) So they can only be decoded by decoder 1, and since every other instruction seems to be one, that means you are limited to decoding two instructions per clock, and decoders 3 and 4 will almost always be idle.
But the execution itself will be limited by the load and store units anyway; I don't think you can do more than one load plus one store per clock.
What is the theoretical optimum for this task anyway? That is, to how many clock cycles does this 22 (or 17) sec translate? And how many times is the loop executed? Can you put a counter in there to measure that once and for all?
-
- Posts: 481
- Joined: Thu Apr 16, 2009 12:00 pm
- Location: Slovakia, EU
Re: C vs ASM
The slowdown is probably due to the same reason why I really _hate_ x86 asm: jump target alignment.

Rebel wrote: I get the feeling it's perhaps an alignment issue?
Two thoughts for that:
1. When I activate the debug code the speed gain totally disappears, from 14.7 to 22.2 secs
2. The same happens when I make a small (in principle meaningless) change in the ASM code:
Code: Select all
old
mov EAX,dword ptr key1[EBX*4]  // eax=key1[r2]
cmp EAX,dword ptr key1[EDX*4]
jbe sort10
mov ECX,dword ptr key1[EDX*4]  // ecx=key1[r1]
And gone is the speed improvement.

Code: Select all
new
mov EAX,dword ptr key1[EBX*4]  // eax=key1[r2]
mov ECX,dword ptr key1[EDX*4]  // ecx=key1[r1]
cmp EAX,ECX
jbe sort10
mov ECX,dword ptr key1[EDX*4]  // ecx=key1[r1]
Hmm...
I am not aware of any explicitly documented rule about this, but sometimes a 16-byte jump target alignment can improve performance by as much as 50% in very tight loops. Other times it does not matter at all. (If anyone ever wondered why MSVC outputs code with multi-byte NOP equivalents like LEA RAX, [RAX + 0] with redundant REX prefixes and a 32-bit displacement.)
Slightly off-topic:
Didn't check with latest cpus, but the original K7 arch (iirc P6 too) had a branch-predictor limitation about the number of branch targets per I-cache line.
Code like this pretty much screws up branch prediction in Athlons:
Code: Select all
xor edx, edx
mov eax, dword ptr [somevar]
cmp eax, 0xdeadc0de
jz label1
cmp eax, 0xc001babe
jz label2
cmp eax, 0x42
jz label3
cmp eax, 0xbadf00d
jz label4
jmp label5
label1:
inc edx
label2:
inc edx
label3:
inc edx
label4:
inc edx
label5:
mov dword ptr[othervar], edx
-
- Posts: 481
- Joined: Thu Apr 16, 2009 12:00 pm
- Location: Slovakia, EU
Re: C vs ASM
rvida wrote: Slightly off-topic: Didn't check with latest cpus, but the original K7 arch (iirc P6 too) had a branch-predictor limitation about the number of branch targets per I-cache line.

Ooops. Nothing to do with the I-cache. I meant the instruction fetch buffer.
-
- Posts: 481
- Joined: Thu Apr 16, 2009 12:00 pm
- Location: Slovakia, EU
Re: C vs ASM
hgm wrote: Well, it seems that results are more random than anything else. The 'modification' you show should in principle be a degradation on most CPUs. An ALU instruction with a memory source does not fall in the 'RISC' class. It is decoded into 2 uOps, and only decoder 1 can do that. [...] Decoding could potentially be a bottleneck, as the loop contains many store instructions, which translate to 2 uOps.

Most 2-uOp Load+ALU or ALU+Store instructions are fast-path (in AMD terminology) and are decodable by any of the decoders. Only Load+ALU+Store (e.g. add [mem], reg) instructions are somewhat problematic, and they decode to more than 2 uOps anyway.
Edit: ... with recent CPUs. I realized you are talking about Pentium Pro/Pentium II - with these, you are right, they can decode up to 3 instructions per cycle only if you follow the complex/simple/simple pattern.
-
- Posts: 27790
- Joined: Fri Mar 10, 2006 10:06 am
- Location: Amsterdam
- Full name: H G Muller
Re: C vs ASM
Actually we were talking about Intel CPUs, not about AMD, which makes even more of a difference than P6 vs i7. AFAIK the i7 still requires the complex/simple/simple/simple pattern (with one extra simple, as there now are 4 decoders). No such thing as fast path there.