C++ Memory Alignment – Why Is Alignment the Same on 32-bit and 64-bit Systems?


I was wondering whether the compiler would use different padding on 32-bit and 64-bit systems, so I wrote the code below in a simple VS2019 C++ console project:

#include <iostream>

struct Z
{
    char s;
    __int64 i;   // MSVC-specific 64-bit integer type
};

int main()
{
    std::cout << sizeof(Z) << "\n";
}

What I expected on each "Platform" setting:

x86: 12
X64: 16

Actual result:

x86: 16
X64: 16

Since the memory word size on x86 is 4 bytes, this means it has to store the bytes of i in two different words. So I thought the compiler would do padding this way:

struct Z
{
    char s;
    char _pad[3];
    __int64 i;
};

So may I know what the reason behind this is?

For forward-compatibility with the 64-bit system?
Due to the limitation of supporting 64-bit numbers on the 32-bit processor?

Best Answer

Size and alignof() (minimum alignment that any object of that type must have) for each primitive type is an ABI¹ design choice separate from the register width of the architecture.

Struct-packing rules can also be more complicated than just aligning each struct member to its minimum alignment inside the struct; that's another part of the ABI.
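One way to see what a particular compiler and target actually chose is to ask it directly. The following is a minimal sketch, not part of the original test: it swaps __int64 for std::int64_t so it builds outside MSVC, and the names Layout, describe_layout and print_layout are made up for illustration. The reported numbers depend on the ABI, so none are hard-coded.

```cpp
#include <cstddef>   // offsetof, std::size_t
#include <cstdint>   // std::int64_t
#include <iostream>

struct Z {
    char s;
    std::int64_t i;
};

// Hypothetical helper type: the three numbers that describe Z's layout.
struct Layout {
    std::size_t size;         // sizeof(Z), including any padding
    std::size_t align;        // alignof(Z) as reported by the compiler
    std::size_t offset_of_i;  // where i starts relative to the struct
};

constexpr Layout describe_layout() {
    return Layout{sizeof(Z), alignof(Z), offsetof(Z, i)};
}

// These hold on any conforming implementation, whatever padding it picks.
static_assert(sizeof(Z) % alignof(Z) == 0,
              "size is a multiple of the struct's alignment");
static_assert(sizeof(Z) >= sizeof(char) + sizeof(std::int64_t),
              "padding can only add bytes");

// Prints the layout chosen by the current compiler and target.
inline void print_layout(std::ostream& os = std::cout) {
    const Layout l = describe_layout();
    os << "sizeof(Z) = " << l.size
       << ", alignof(Z) = " << l.align
       << ", offsetof(Z, i) = " << l.offset_of_i << "\n";
}
```

Calling print_layout() from main under each "Platform" setting shows the offset of i directly, which is more informative than sizeof(Z) alone.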

MSVC targeting 32-bit x86 gives __int64 a minimum alignment of 4, but its default struct-packing rules align types within structs to min(8, sizeof(T)) relative to the start of the struct (for non-aggregate types only). That's not a direct quote; it's my paraphrase of the MSVC docs link from @P.W's answer, based on what MSVC seems to actually do. (I suspect the "whichever is less" in the text is supposed to be outside the parens, but maybe they're making a different point about the interaction between the pragma and the command-line option?)

(An 8-byte struct containing a char[8] still only gets 1-byte alignment inside another struct, or a struct containing an alignas(16) member still gets 16-byte alignment inside another struct.)
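To make the min(limit, sizeof(T)) rule concrete, here is a small compile-time model of it for the struct in the question. This is only a sketch of the paraphrased rule for scalar members, not MSVC's actual layout algorithm; the names round_up, ModelLayout and model_char_then_i64 are invented for this example.

```cpp
#include <cstddef>

constexpr std::size_t min_sz(std::size_t a, std::size_t b) {
    return a < b ? a : b;
}

// Round offset up to the next multiple of align (align must be nonzero).
constexpr std::size_t round_up(std::size_t offset, std::size_t align) {
    return (offset + align - 1) / align * align;
}

struct ModelLayout {
    std::size_t offset_of_i;
    std::size_t size;
};

// Models struct { char s; <8-byte integer> i; } when each scalar member is
// aligned to min(pack_limit, sizeof(member)) relative to the struct start.
// The struct's own size is rounded up to its largest member alignment.
constexpr ModelLayout model_char_then_i64(std::size_t pack_limit) {
    return ModelLayout{
        round_up(1, min_sz(pack_limit, 8)),
        round_up(round_up(1, min_sz(pack_limit, 8)) + 8,
                 min_sz(pack_limit, 8))};
}

// Limit 8 (the 32-bit default described above): 7 padding bytes, size 16.
static_assert(model_char_then_i64(8).offset_of_i == 8, "");
static_assert(model_char_then_i64(8).size == 16, "");

// Limit 4 (what the question expected): 3 padding bytes, size 12.
static_assert(model_char_then_i64(4).offset_of_i == 4, "");
static_assert(model_char_then_i64(4).size == 12, "");
```

Under this model the question's expected answer of 12 corresponds to a packing limit of 4, while the observed 16 corresponds to a limit of 8.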

Note that ISO C++ doesn't guarantee that primitive types have alignof(T) == sizeof(T). Also note that MSVC's definition of alignof() doesn't match the ISO C++ standard: MSVC says alignof(__int64) == 8, but some __int64 objects have less than that alignment².

So surprisingly, we get extra padding even though MSVC doesn't always bother to make sure the struct itself has any more than 4-byte alignment, unless you specify that with alignas() on the variable, or on a struct member to imply that for the type. (e.g. a local struct Z tmp on the stack inside a function will only have 4-byte alignment, because MSVC doesn't use extra instructions like and esp, -8 to round the stack pointer down to an 8-byte boundary.)

However, new / malloc does give you 8-byte-aligned memory in 32-bit mode, so this makes a lot of sense for dynamically-allocated objects (which are common). Forcing locals on the stack to be fully aligned would add cost to align the stack pointer, but by setting struct layout to take advantage of 8-byte-aligned storage, we get the advantage for static and dynamic storage.
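The heap-alignment claim is easy to check on a given toolchain. A minimal sketch follows, with hypothetical helper names is_aligned and malloc_block_is_8_aligned. Whether the second function returns true is a property of the C runtime in use, not something this code guarantees.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// True if the address in p is a multiple of alignment bytes.
inline bool is_aligned(const void* p, std::size_t alignment) {
    assert(alignment != 0);
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}

// Allocates a 16-byte block (the size of struct Z in the question) and
// reports whether the allocator happened to return an 8-byte-aligned block.
inline bool malloc_block_is_8_aligned() {
    void* p = std::malloc(16);
    if (p == nullptr) {
        return false;  // allocation failed, nothing to measure
    }
    const bool ok = is_aligned(p, 8);
    std::free(p);
    return ok;
}
```

The same is_aligned helper can be pointed at the address of a local variable to observe the stack behaviour described above.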

This might also be designed to get 32 and 64-bit code to agree on some struct layouts for shared memory. (But note that the default for x86-64 is min(16, sizeof(T)), so they still don't fully agree on struct layout if there are any 16-byte types that aren't aggregates (struct/union/array) and don't have an alignas.)
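If a struct really does have to match between 32-bit and 64-bit builds, one common approach is to write the padding out explicitly and let the compiler verify the result. The sketch below assumes a hypothetical SharedRecord type; it is not from the MSVC documentation.

```cpp
#include <cstddef>      // offsetof
#include <cstdint>      // std::int64_t
#include <type_traits>  // std::is_standard_layout

// Hypothetical record placed in memory shared by differently built programs.
struct SharedRecord {
    char tag;
    char reserved[7];    // explicit padding instead of compiler-chosen padding
    std::int64_t value;  // lands at offset 8 whether alignof is 4 or 8
};

// Fail the build, rather than corrupt data, if any target disagrees.
static_assert(std::is_standard_layout<SharedRecord>::value,
              "offsetof is only well-defined for standard-layout types");
static_assert(offsetof(SharedRecord, value) == 8, "unexpected offset");
static_assert(sizeof(SharedRecord) == 16, "unexpected size");
```

Because offset 8 is a multiple of both 4 and 8, the layout does not depend on which of the two alignment choices the ABI made for 64-bit integers.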

The minimum absolute alignment of 4 comes from the 4-byte stack alignment that 32-bit code can assume. In static storage, compilers will choose natural alignment up to maybe 8 or 16 bytes for vars outside of structs, for efficient copying with SSE2 vectors.

In larger functions, MSVC may decide to align the stack by 8 for performance reasons, e.g. for double vars on the stack which actually can be manipulated with single instructions, or maybe also for int64_t with SSE2 vectors. See the Stack Alignment section in this 2006 article: Windows Data Alignment on IPF, x86, and x64. So in 32-bit code you can't depend on an int64_t* or double* being naturally aligned.

(I'm not sure if MSVC will ever create even less aligned int64_t or double objects on its own. Certainly yes if you use #pragma pack(1) or /Zp1, but that changes the ABI. Otherwise probably not, unless you manually carve space for an int64_t out of a buffer and don't bother to align it. But assuming alignof(int64_t) is still 8, that would be C++ undefined behaviour.)
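When a 64-bit value does live at an arbitrary offset in a byte buffer, the usual way to avoid the undefined behaviour is to copy the bytes rather than dereference a misaligned int64_t*. A minimal sketch, with made-up helper names load_i64, store_i64 and self_check:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Reads an int64_t from storage that may not be aligned for int64_t.
// memcpy has no alignment requirement, so this is well-defined.
inline std::int64_t load_i64(const unsigned char* bytes) {
    std::int64_t v;
    std::memcpy(&v, bytes, sizeof v);
    return v;
}

// Writes an int64_t to possibly misaligned storage.
inline void store_i64(unsigned char* bytes, std::int64_t v) {
    std::memcpy(bytes, &v, sizeof v);
}

// Round-trips a value through an odd offset inside a local buffer.
inline void self_check() {
    unsigned char buffer[16] = {};
    const std::int64_t value = -1234567890123LL;
    store_i64(buffer + 1, value);  // offset 1 is misaligned for any alignof > 1
    assert(load_i64(buffer + 1) == value);
}
```

Compilers targeting x86 typically turn these fixed-size memcpy calls into plain loads and stores, so the safe form usually costs nothing.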

If you use alignas(8) int64_t tmp, MSVC emits extra instructions to and esp, -8. If you don't, MSVC doesn't do anything special, so it's a matter of luck whether tmp ends up 8-byte aligned.

Other designs are possible, for example the i386 System V ABI (used on most non-Windows OSes) has alignof(long long) = 4 but sizeof(long long) = 8.

Outside of structs (e.g. global vars or locals on the stack), modern compilers in 32-bit mode do choose to align int64_t to an 8-byte boundary for efficiency (so it can be loaded / copied with MMX or SSE2 64-bit loads, or x87 fild to do int64_t -> double conversion).

This is one reason why modern versions of the i386 System V ABI maintain 16-byte stack alignment: so 8-byte and 16-byte aligned local vars are possible.

When the 32-bit Windows ABI was being designed, Pentium CPUs were at least on the horizon. Pentium has 64-bit wide data busses, so its FPU really can load a 64-bit double in a single cache access if it's 64-bit aligned.

Or for fild / fistp, load/store a 64-bit integer when converting to/from double. Fun fact: naturally aligned accesses up to 64 bits are guaranteed atomic on x86, since Pentium: Why is integer assignment on a naturally aligned variable atomic on x86?

Footnote 1: An ABI also includes a calling convention (or, in the case of MS Windows, a choice of various calling conventions which you can declare with function attributes like __fastcall), but the sizes and alignment requirements for primitive types like long long are also something that compilers have to agree on to make functions that can call each other. (The ISO C++ standard only talks about a single "C++ implementation"; ABI standards are how "C++ implementations" make themselves compatible with each other.)

Note that struct-layout rules are also part of the ABI: compilers have to agree with each other on struct layout to create compatible binaries that pass around structs or pointers to structs. Otherwise s.x = 10; foo(&s); might write to a different offset relative to the base of the struct than separately-compiled foo() (maybe in a DLL) was expecting to read it at.

Footnote 2:

GCC had this C++ alignof() bug, too, until it was fixed in 2018 for g++8 some time after being fixed for C11 _Alignof(). See that bug report for some discussion based on quotes from the standard which conclude that alignof(T) should really report the minimum guaranteed alignment you can ever see, not the preferred alignment you want for performance. i.e. that using an int64_t* with less than alignof(int64_t) alignment is undefined behaviour.

(It will usually work fine on x86, but vectorization that assumes a whole number of int64_t iterations will reach a 16 or 32-byte alignment boundary can fault. See Why does unaligned access to mmap'ed memory sometimes segfault on AMD64? for an example with gcc.)

The gcc bug report discusses the i386 System V ABI, which has different struct-packing rules than MSVC: based on minimum alignment, not preferred. But modern i386 System V maintains 16-byte stack alignment, so it's only inside structs (because of struct-packing rules that are part of the ABI) that the compiler ever creates int64_t and double objects that are less than naturally aligned. Anyway, that's why the GCC bug report was discussing struct members as the special case.

This is kind of the opposite of 32-bit Windows with MSVC, where the struct-packing rules are compatible with an alignof(int64_t) == 8 but locals on the stack are always potentially under-aligned unless you use alignas() to specifically request alignment.

32-bit MSVC has the bizarre behaviour that alignas(int64_t) int64_t tmp is not the same as int64_t tmp;, and emits extra instructions to align the stack. That's because alignas(int64_t) is like alignas(8), which is more aligned than the actual minimum.

void extfunc(int64_t *);

void foo_align8(void) {
    alignas(int64_t) int64_t tmp;
    extfunc(&tmp);
}

(32-bit) x86 MSVC 19.20 -O2 compiles it like so (on Godbolt, also includes 32-bit GCC and the struct test-case):

_tmp$ = -8                                          ; size = 8
void foo_align8(void) PROC                       ; foo_align8, COMDAT
        push    ebp
        mov     ebp, esp
        and     esp, -8                             ; fffffff8H  align the stack
        sub     esp, 8                                  ; and reserve 8 bytes
        lea     eax, DWORD PTR _tmp$[esp+8]             ; get a pointer to those 8 bytes
        push    eax                                     ; pass the pointer as an arg
        call    void extfunc(__int64 *)           ; extfunc
        add     esp, 4
        mov     esp, ebp
        pop     ebp
        ret     0

But without the alignas(), or with alignas(4), we get the much simpler

_tmp$ = -8                                          ; size = 8
void foo_noalign(void) PROC                                ; foo_noalign, COMDAT
        sub     esp, 8                             ; reserve 8 bytes
        lea     eax, DWORD PTR _tmp$[esp+8]        ; "calculate" a pointer to it
        push    eax                                ; pass the pointer as a function arg
        call    void extfunc(__int64 *)           ; extfunc
        add     esp, 12                             ; 0000000cH
        ret     0

It could just push esp instead of LEA/push; that's a minor missed optimization.

Passing a pointer to a non-inline function proves that it's not just locally bending the rules. Some other function that just gets an int64_t* as an arg has to deal with this potentially under-aligned pointer, without having gotten any information about where it came from.

If alignof(int64_t) was really 8, that function could be hand-written in asm in a way that faulted on misaligned pointers. Or it could be written in C with SSE2 intrinsics like _mm_load_si128() that require 16-byte alignment, after handling 0 or 1 elements to reach an alignment boundary.

But with MSVC's actual behaviour, it's possible that none of the int64_t array elements are aligned by 16, because they all span an 8-byte boundary.
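The "handle 0 or 1 elements first" reasoning can be written down as a tiny function of the array's start address. This is an illustrative sketch only; scalar_elements_before_16 is a hypothetical name, and real vectorized code would apply the same arithmetic to the actual pointer.

```cpp
#include <cstdint>

// How many 8-byte elements must be processed one at a time before an array
// starting at address reaches a 16-byte boundary. Returns -1 when stepping
// by 8 bytes can never land on such a boundary, which is the case for a
// start address that is only 4-byte aligned.
constexpr int scalar_elements_before_16(std::uintptr_t address) {
    return address % 8 != 0   ? -1
           : address % 16 == 0 ? 0
                               : 1;
}

// Naturally aligned arrays need at most one scalar step.
static_assert(scalar_elements_before_16(0x1000) == 0, "");
static_assert(scalar_elements_before_16(0x1008) == 1, "");

// An array at 4 mod 8: every element straddles an 8-byte boundary,
// so no element ever starts on a 16-byte boundary.
static_assert(scalar_elements_before_16(0x1004) == -1, "");
```

Code that assumes the result is always 0 or 1 is exactly what breaks when it receives a pointer from a caller that only provided 4-byte alignment.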

BTW, I wouldn't recommend using compiler-specific types like __int64 directly. You can write portable code by using int64_t from <cstdint>, aka <stdint.h>.

In MSVC, int64_t will be the same type as __int64.

On other platforms, it will typically be long or long long. int64_t is guaranteed to be exactly 64 bits with no padding, and 2's complement, if provided at all. (It is by all sane compilers targeting normal CPUs. C99 and C++ require long long to be at least 64-bit, and on machines with 8-bit bytes and registers that are a power of 2, long long is normally exactly 64 bits and can be used as int64_t. Or if long is a 64-bit type, then <cstdint> might use that as the typedef.)
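These properties can be checked at compile time. A short sketch follows; int64_underlying_name is a made-up helper, and the string it returns depends on the implementation's typedef choice.

```cpp
#include <climits>      // CHAR_BIT
#include <cstdint>      // std::int64_t
#include <type_traits>  // std::is_same, std::is_signed

// Guaranteed whenever std::int64_t is provided at all.
static_assert(sizeof(std::int64_t) * CHAR_BIT == 64, "exactly 64 bits");
static_assert(std::is_signed<std::int64_t>::value, "signed type");

// Guaranteed for long long by the language itself.
static_assert(sizeof(long long) * CHAR_BIT >= 64, "at least 64 bits");

// Reports which built-in type the typedef refers to on this implementation.
constexpr const char* int64_underlying_name() {
    return std::is_same<std::int64_t, long long>::value ? "long long"
           : std::is_same<std::int64_t, long>::value    ? "long"
                                                        : "another type";
}
```

Printing int64_underlying_name() on a few targets shows the typedef choice described above without affecting any code that simply uses std::int64_t.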

I assume __int64 and long long are the same type in MSVC, but MSVC doesn't enforce strict-aliasing anyway so it doesn't matter whether they're the exact same type or not, just that they use the same representation.
