C vs ASM

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

User avatar
hgm
Posts: 27808
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: C vs ASM

Post by hgm »

I think the rule is that each simple variable should be aligned on a multiple of its size. So if you define struct { char c1, c2; int i; } two bytes of padding will be necessary between c2 and i to align the latter (assuming sizeof(int) = 4). So one should always define the largest elements first.
Joost Buijs
Posts: 1563
Joined: Thu Jul 16, 2009 10:47 am
Location: Almere, The Netherlands

Re: C vs ASM

Post by Joost Buijs »

hgm wrote:I think the rule is that each simple variable should be aligned on a multiple of its size. So if you define struct { char c1, c2; int i; } two bytes of padding will be necessary between c2 and i to align the latter (assuming sizeof(int) = 4). So one should always define the largest elements first.
Yes you are right. I never looked at it in detail, but I always followed the rule to define the largest sized variables first and the smallest last.
I have no idea if this also holds for a big-endian architecture.
User avatar
hgm
Posts: 27808
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: C vs ASM

Post by hgm »

I expect it would. The layout of structs in terms of byte addresses should be independent of endianness. What would change is whether you saw padded stretches of char fields left-aligned or right-aligned (and indeed, in what order) when you read them as a larger data type.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: C vs ASM

Post by bob »

Rebel wrote:
lucasart wrote: (*) Ed please don't take this as a personal attack. I write sucky code too, and so does everyone (then we fix it, programming is often an iterative process). And I would like to thank you for your efforts and time on this case study.
I don't feel offended, instead I blame myself for (unconsciously) cherry-picking too small a piece of code that performed faster in ASM than in C on my PC.
It certainly helped to debunk the myth that putting assembly code in a chess program is a good idea: it looks tempting at first, until you do it and realize that it's a bloody stupid idea... :wink:
Certainly my respect for the compiler has grown.

However when I was porting my ASM engine back to C using MSVC I ran into several problems causing speed losses. One of the examples:

In my eval I have a bunch of variables that need zeroing before starting. For instance, when I declare them as follows:

Code: Select all

static char a1,a2,a3,a4,a5,a6,a7,a8;
static char b1,b2,b3,b4,b5,b6,b7,b8;
Then using "Digital Mars" in ASM and C I could clear those 16 variables in 4 instructions:

Code: Select all

ASM
mov dword ptr a1,0
mov dword ptr a5,0
mov dword ptr b1,0
mov dword ptr b5,0

Code: Select all

C
long *p_a1 = (long *) &a1;       // 32-bit redefinition
long *p_b1 = (long *) &b1;       // 32-bit redefinition
p_a1[0] = p_a1[1] = p_b1[0] = p_b1[1]=0;
This was (still is in the 2012 version?) impossible with MSVC, because the compiler apparently has its own philosophy for organizing a1-a8 and b1-b8 in memory, while Digital Mars leaves the chain intact, as declared by the programmer.
Simple solution: Put 'em in a struct. Won't slow a thing down, although it adds a bit to the typing "mystruct.a1 as opposed to just a1". But inside a struct, the compiler's hands are tied, it MUST keep things in the order you specified, while outside a struct, it can, as you pointed out, order variables in any way it wants to.
wgarvin
Posts: 838
Joined: Thu Jul 05, 2007 5:03 pm
Location: British Columbia, Canada

Re: C vs ASM

Post by wgarvin »

Joost Buijs wrote:
hgm wrote:I thought the C standard for aligning char was on 1-byte boundaries? I am pretty sure it must be, as in Fairy-Max my hash entry is defined as

Code: Select all

struct _ { int signature, score; char from, to, depth, flags; } *hashTable;
and I know from the memory footprint that this measures 12 bytes.
Well, I don't know which compiler you use, but with MSVC and Intel C++ I think this is not true.

But of course I can have it wrong. It is something I read in the documentation a long time ago, and since that time I always used the pragma. Now you make me curious, and I'm going to check it immediately.
The rule followed by virtually all real-world compilers, in the absence of overriding things like #pragma pack, is "natural alignment": primitive types (ints, pointers, floats, char, etc.) need to be aligned to their size. So one-byte primitive types need one-byte alignment, 4 byte primitive types need 4 byte alignment, etc.

The original reason for this rule-of-thumb was that some architectures would raise a hardware exception for misaligned accesses (they couldn't handle a 4-byte memory access properly unless it was to a 4-byte-aligned memory address). Even though x86 has always allowed misaligned access, there was a sorta-small penalty for it. For the last several iterations of x86 designs the penalty is pretty much zero if you stay within a single cacheline; when a misaligned access overlaps the boundary between cachelines, there is a small/moderate penalty of at least several cycles plus the extra cache traffic, possible cache miss, etc. Although it's minor, all x86 compilers I am aware of continue to pad structure members to follow the "natural alignment" rule by default, so that the compiled code never has to pay those penalties, and for backward-compatibility reasons: there is a lot of non-portable code out there that actually assumes this kind of padding. :P

Alignment of structures is generally the alignment of the largest member inside of them. So padding gets inserted before members (to align them properly) but sometimes also at the end of the structure (to round its size up to a multiple of the structure's alignment). This is required because otherwise arrays wouldn't work properly: with struct T { ... }; and a T* ptr, whether you access it at ptr[0] or ptr[1] or ptr[whatever] it needs to make sure the T you access will be properly aligned. So if T contains an 8-byte double (or 8-byte __int64, or whatever) and the alignment requirement of the double is 8, then T's alignment requirement is also at least 8, so (ptr+0) and (ptr+1) need to both be multiples of 8, so sizeof(T) simply must be a multiple of 8.

Also, the 16-byte SSE types (and newer 32-byte AVX types) are an exception to the general rule that "x86 tolerates misalignment very well". If you use 16-byte vector types on x86 they HAVE to be aligned. It does have an unaligned-load instruction, but on most hardware it's actually faster not to use it (use two 8-byte loads instead, since those are just as cheap as reading a misaligned 64-bit int, and at least one of them will be fully within a single cacheline).

Anyway: to minimize the wasted padding in structures, a good rule of thumb is to put the biggest types (or the ones with the biggest alignment requirement, at least) at the beginning of the struct. Put your 16-byte-aligned vectors and matrices and things first, and then any 8-byte-aligned doubles, and then pointers (which are either 8-byte or 4-byte on most target platforms), and finally the 4-byte things, then 2-byte then chars and bools and stuff last.

C++ virtual methods, multiple inheritance, virtual base classes etc. can all throw a wrench in this. Different compilers do different things: some force the _first field_ to have the same alignment as the class itself even if that isn't necessary, so a 4-byte vtable ptr followed by a 16-byte-aligned vector as the first field might mean 12 wasted bytes, but putting 3 four-byte variables there instead and then following them with the 16-byte-aligned vector would _still_ mean 16 wasted bytes. On other compilers, the object needs 16-byte alignment but the vtable ptr only needs 4, so that trick will save you those 12 or 16 bytes. If you're writing the kind of non-portable code that cares about any of this stuff (a cross-platform binary data serialization system, for example) then you really need to test those kinds of things on your target compiler+platform to be sure what it does. But for basic POD structs, "natural alignment" is very portable and reliable in practice, even between compilers and target platforms, as long as they are relatively mainstream 32- or 64-bit platforms, and not some ancient thing from the 70's or some weird embedded microcontroller!
mar
Posts: 2559
Joined: Fri Nov 26, 2010 2:00 pm
Location: Czech Republic
Full name: Martin Sedlak

Re: C vs ASM

Post by mar »

wgarvin wrote:The original reason for this rule-of-thumb, was that some architectures would raise a hardware exception for misaligned accesses (they couldn't handle a 4-byte memory access properly unless it was to a 4-byte-aligned memory address). Even though x86 has always allowed misaligned access, there was a sorta-small penalty for it.
I think it's always good to have aligned data. Of course, on ARM your code will crash when dereferencing a misaligned pointer to, say, an int. There are special instructions for reading from misaligned addresses, but one has to combine, say, two of them to read a misaligned word.
Back in the old times (on my ancient 366 MHz Celeron) I wrote some MMX code (I don't remember what for anymore - whether it was only some fill test or so). I used misaligned access (a 4-byte-aligned buffer instead of 8) and the code ran almost twice as slowly (yes, I was shocked too) as when I aligned the buffer to 8 bytes. So back in those times the penalty was actually huge.
Perhaps this has changed today, but I still think it's a good idea to have everything aligned (let the compiler do it) and not assume anything about packing. I always consider #pragma pack(push,1) a hack - even when designing a binary file format it's better to follow the natural packing rules and align the data. Endianness is not a problem today, as the mainstream is usually little-endian. Anyway, C++ compilers are so good these days that writing endianness-conversion templates (specialized or simply overloaded) along with intrinsics can produce very fast code (nothing, or a single bswap, on x86/x64).
User avatar
lucasart
Posts: 3232
Joined: Mon May 31, 2010 1:29 pm
Full name: lucasart

Re: C vs ASM

Post by lucasart »

This may be a useful read:
http://www.mjmwired.net/kernel/Document ... access.txt

If I remember correctly, RISC processors simply crash on unaligned memory access, and x86 throw an interruption every time you have an unaligned memory access, which might explain the significant performance cost.

In particular, struct padding done by compilers, is there for a reason: avoid unaligned memory access. For example:

Code: Select all

struct {
    int8_t a,b,c;
    int16_t d,e;
};
So what should an intelligent compiler do with that ? Let's look at the constraints:
1/ we need to be able to access all the elements of the struct separately, without causing an unaligned memory access
2/ we need the struct itself to be a multiple of a machine word, like 32 or 64 bits depending on the architecture these days.

So I'm guessing that a,b,c will actually use 2*4 bytes, and not 2*3, because accessing d or e would otherwise be unaligned. Also, the struct itself needs to be a multiple of the machine word size so as to be copyable or accessible at once in an aligned manner.

So the rule of thumb explained by Wylie Garvin makes a lot of sense: start with the bigger ones (16-bit), then the smaller ones (8-bit), to ensure a compact struct. And if the sizeof(struct) doesn't suit what you want, try to understand why the compiler aligns it the way it does, and reorganize the struct better. If you still can't get the sizeof(struct) you want, then you will need to force the compiler to generate alignment-violating code, which is ugly as hell and will destroy performance. But I suppose there are cases where performance doesn't matter and compactness needs to be achieved at all costs, so it must be the right thing to do (otherwise pragma pack wouldn't exist). But in general it's a stupid idea to use pragma pack.
Theory and practice sometimes clash. And when that happens, theory loses. Every single time.
User avatar
rvida
Posts: 481
Joined: Thu Apr 16, 2009 12:00 pm
Location: Slovakia, EU

Re: C vs ASM

Post by rvida »

lucasart wrote:and x86 throw an interruption every time you have an unaligned memory access, which might explain the significant performance cost.
Although x86 can throw an exception on unaligned data access, this feature is usually disabled except when using code-profiling software. The exception is thrown only if the AC (alignment check) flag of EFLAGS is set. By default it is unset.

The performance cost is not that big on x86 as long as the access does not cross a cache line boundary. (SSE/AVX instructions being an exception - with these the misalignment hurts quite badly)
syzygy
Posts: 5566
Joined: Tue Feb 28, 2012 11:56 pm

Re: C vs ASM

Post by syzygy »

rvida wrote:
lucasart wrote:and x86 throw an interruption every time you have an unaligned memory access, which might explain the significant performance cost.
Although x86 can throw an exception on unaligned data access, this feature is usually disabled except when using code-profiling software. The exception is thrown only if the AC (alignment check) flag of EFLAGS is set. By default it is unset.

The performance cost is not that big on x86 as long as the access does not cross a cache line boundary. (SSE/AVX instructions being an exception - with these the misalignment hurts quite badly)
And it's the RISC processors that throw an exception on unaligned data access. Of course those processors do not "simply crash", as Lucas phrases it. (The software does crash, unless it knows how to handle the exception.)

I think this thread nicely shows that C programmers can benefit a lot from studying computer architecture and assembly language.
User avatar
lucasart
Posts: 3232
Joined: Mon May 31, 2010 1:29 pm
Full name: lucasart

Re: C vs ASM

Post by lucasart »

syzygy wrote:And it's the RISC processors that throw an exception on unaligned data access. Of course those processors do not "simply crash", as Lucas phrases it. (The software does crash, unless it knows how to handle the exception.)
I phrased it loosely (IIRC), because I actually don't have any experience in RISC assembly (only 80386 long long ago, and some basics of x86-64). I just heard that somewhere.

Anyway, the bottom line is that alignment is a crucial thing to understand, and it's not obvious to someone who knows only high level programming languages. So it deserved a little "aside". Here is what the Linux kernel documentation has to say about it (and these guys know what they are talking about...)
Linux documentation wrote:The effects of performing an unaligned memory access vary from architecture to architecture. It would be easy to write a whole document on the differences here; a summary of the common scenarios is presented below:

- Some architectures are able to perform unaligned memory accesses transparently, but there is usually a significant performance cost.
- Some architectures raise processor exceptions when unaligned accesses happen. The exception handler is able to correct the unaligned access, at significant cost to performance.
- Some architectures raise processor exceptions when unaligned accesses happen, but the exceptions do not contain enough information for the unaligned access to be corrected.
- Some architectures are not capable of unaligned memory access, but will silently perform a different memory access to the one that was requested, resulting in a subtle code bug that is hard to detect!

It should be obvious from the above that if your code causes unaligned memory accesses to happen, your code will not work correctly on certain platforms and will cause performance problems on others.
syzygy wrote:I think this thread nicely shows that C programmers can benefit a lot from studying computer architecture and assembly language.
Indeed! Although I haven't done much assembly, and it was a long time ago, when processors were so much simpler (80386), it certainly helped me to understand things like: when to pass a pointer (or reference) and when to pass a value, what a stack frame is and how to avoid having one (especially in x86-64 with the new registers), being cache-friendly, not doing unaligned memory access, etc.

It certainly helps to do some basics of assembler before learning C, and to learn C before learning C++. It's so deceptively easy now for C++ newbies to write utter crap without having the slightest clue why their code sucks (especially with these new mystical template libraries that really no one actually understands anymore).
Theory and practice sometimes clash. And when that happens, theory loses. Every single time.