GCC only does this extra stack alignment in main; that function is special. You won't see it if you look at code-gen for any other function, unless you have a local with alignas(32) or something.
GCC is just taking a defensive approach with -m32, by not assuming that main is called with a properly 16B-aligned stack. Or this special treatment is left over from when -mpreferred-stack-boundary=4 was only a good idea, not the law (see Footnote 1).
The i386 System V ABI has guaranteed/required for years that ESP+4 is 16B-aligned on entry to a function (i.e. ESP must be 16B-aligned before a CALL instruction, so args on the stack start at a 16B boundary; this is the same as for x86-64 System V). In other words, ESP % 16 == 0 before a call, and ESP % 16 == 12 on function entry, after a call.
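The arithmetic can be checked with a few lines of C. This is only a sketch that models ESP as a plain 32-bit integer; the helper names esp_after_call and entry_alignment_ok are made up for illustration, and the 4-byte return address assumes 32-bit mode.

```c
#include <stdbool.h>
#include <stdint.h>

/* Model of a 32-bit CALL: it pushes a 4-byte return address,
   so ESP drops by 4. (Hypothetical helper, not a real API.) */
uint32_t esp_after_call(uint32_t esp_before_call)
{
    return esp_before_call - 4u;
}

/* True if ESP on function entry satisfies the rule that
   ESP+4 is 16-byte aligned. */
bool entry_alignment_ok(uint32_t esp_on_entry)
{
    return (esp_on_entry + 4u) % 16u == 0u;
}
```

For example, starting from an aligned ESP of 0xffffd000, esp_after_call gives 0xffffcffc, and 0xffffcffc % 16 is 12.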
The ABI also guarantees that new 32-bit processes start with ESP aligned on a 16B boundary (e.g. at _start, the ELF entry point, where ESP points at argc, not a return address), and the glibc CRT code maintains that alignment.
As far as the calling convention is concerned, EBP is just another call-preserved register. But yes, compiler output with -fno-omit-frame-pointer does take care to push ebp before other call-preserved registers (like EBX) so the saved EBP values form a linked list. (Because it also does the mov ebp, esp part of setting up a frame pointer after that push.)
Perhaps gcc is defensive because an extremely ancient Linux kernel (from before that revision to the i386 ABI, when the required alignment was only 4B) could violate that assumption, and it's only an extra couple of instructions that run once in the lifetime of the process (assuming the program doesn't call main recursively).
Unlike gcc, clang assumes the stack is properly aligned on entry to main. (clang also assumes that narrow args have been sign- or zero-extended to 32 bits, even though the current ABI revision doesn't specify that behaviour (yet). gcc and clang both emit code that does so on the caller side, but only clang depends on it in the callee. This happens in 64-bit code, but I didn't check 32-bit.)
Look at compiler output on http://gcc.godbolt.org/ for main and functions other than main if you're curious.
I just updated the ABI links in the x86 tag wiki the other day. http://x86-64.org/ is still dead and seems to be not coming back, so I updated the System V links to point to the PDFs of the current revision in HJ Lu's github repo, and his page with links.
Note that the last version on SCO's site is not the current revision, and doesn't include the 16B-stack-alignment requirement.
History of the ABI change from 4 to 16-byte alignment
Footnote 1:
Adding a 16-byte alignment requirement to the i386 SysV ABI was sort of an accident; GCC maintained 16-byte alignment for performance reasons (so for example an 8-byte double would never be split across a cache line boundary).
See also a section at the bottom of my answer on Why does the x86-64 / AMD64 System V ABI mandate a 16 byte stack alignment? for more detail.
In some GCC version, SSE/SSE2 code-gen started using movaps to spill/reload __m128 variables to the stack, without manually aligning the incoming ESP. This turned the tuning choice into a requirement, but it wasn't detected until libraries with code like that were widely deployed in some long-term-stable Linux distros.
Faced with this choice, GCC devs / ABI maintainers chose the least-bad path of making it an official requirement. This broke existing hand-written asm that calls other functions.
See https://sourceforge.net/p/fbc/bugs/659/ for some history, and my comment on https://gcc.gnu.org/bugzilla/show_bug.cgi?id=40838#c91 for an attempt at summarizing the unfortunate history of how i386 GNU/Linux + GCC accidentally got into a situation where a backwards-incompat change to the i386 System V ABI was the lesser of two evils.
Most BSD versions and i386 MacOS did not adopt this ABI change, and still don't require 16-byte stack alignment. GCC may default to -mpreferred-stack-boundary=4 for those targets, but code-gen for alignas(16) char buf[16]; (or __m128 locals that get spilled from regs) needs to manually align ESP inside functions in case it wasn't to start with.
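What "manually align ESP" amounts to can be sketched in C on a plain integer. GCC's extra code in main does this with an and esp, -16 instruction, which rounds the stack pointer down to a multiple of 16. The function name align_down is only for illustration.

```c
#include <stdint.h>

/* Round addr down to a multiple of align (align must be a power of two).
   With align == 16 this has the same effect as `and esp, -16`. */
uintptr_t align_down(uintptr_t addr, uintptr_t align)
{
    return addr & ~(align - 1u);
}
```

Because the stack grows downward, rounding down only moves ESP further into unused stack space, so nothing already on the stack is overwritten. The function has to save the original ESP somewhere (typically via EBP) to restore it before returning.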
So really this bump from 4 to 16-byte alignment was a change for Linux, mostly not other OSes. That might be another reason to simplify GCC's source code and always include the extra stack-alignment code in main for 32-bit targets. At this point 32-bit x86 for Linux is obsolete enough that it's not worth changing now.
The examples given in the book are highly dependent on the compiler and computer architecture used. If you test them in your own program you may get totally different results from the author's. I will assume a 64-bit architecture, because the author does too, from what I've read in the description.
Let's look at the examples one by one:
ReallySlowStruct
If the compiler used supports non-byte-aligned struct members, the start of "d" will be at the seventh bit of the first byte of the struct. That sounds very good for saving memory. The problem is that C does not allow bit addressing. So to save newValue to the "d" member, the compiler must do a whole lot of bit-shifting operations: save the first two bits of "newValue" in byte 0, shifted 6 bits to the right; then shift "newValue" two bits to the left and save it starting at byte 1. Byte 1 is a non-aligned memory location, which means the bulk memory transfer instructions won't work, so the compiler must save one byte at a time.
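To make the cost of such a write concrete, here is a minimal sketch of storing a 32-bit value at an arbitrary bit offset in a byte buffer. The function names and the bit numbering (bit 0 is the least significant bit of buf[0]) are assumptions chosen for illustration; a real compiler picks its own bit-field layout and would use wider shifts and masks rather than a bit-by-bit loop.

```c
#include <stddef.h>
#include <stdint.h>

/* Store the 32 bits of value into buf, starting at bit position
   bit_offset. Bits outside that range are left untouched.
   (Hypothetical helper for illustration only.) */
void write_bits32(uint8_t *buf, size_t bit_offset, uint32_t value)
{
    for (size_t i = 0; i < 32; i++) {
        size_t pos = bit_offset + i;
        uint8_t mask = (uint8_t)(1u << (pos % 8));
        if ((value >> i) & 1u)
            buf[pos / 8] |= mask;            /* set the bit */
        else
            buf[pos / 8] &= (uint8_t)~mask;  /* clear the bit */
    }
}

/* Read the 32 bits back from the same position. */
uint32_t read_bits32(const uint8_t *buf, size_t bit_offset)
{
    uint32_t value = 0;
    for (size_t i = 0; i < 32; i++) {
        size_t pos = bit_offset + i;
        value |= (uint32_t)((buf[pos / 8] >> (pos % 8)) & 1u) << i;
    }
    return value;
}
```

With bit_offset 6 (the seventh bit of the first byte), a single 32-bit store touches five different bytes, and every one of them needs masking so that neighbouring members are not clobbered.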
SlowStruct
It gets better. The compiler can get rid of all the bit-fiddling. But writing "d" will still require writing one byte at a time, because it is not aligned to the native "int" size. The native size on a 64-bit system is 8, so every memory address not divisible by 8 can only be accessed one byte at a time. And worse, if I switch off packing, I will waste a lot of memory space: every member which is followed by an int will be padded with enough bytes to let the integer start at a memory location divisible by 8. In this case, char a and c will both take up 8 bytes.
FastStruct
This is aligned to the size of int on the target machine. "d" takes up 8 bytes as it should. Because the chars are all bundled in one place, the compiler does not pad them and does not waste space. chars are only 1 byte each, so we do not need to pad them. The complete structure adds up to an overall size of 16 bytes, which is divisible by 8, so no extra padding is needed.
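The effect of member order on padding can be measured with sizeof. The two structs below are hypothetical stand-ins, since the book's actual definitions are not reproduced here; long long plays the role of the 8-byte "d", and the exact sizes depend on the target ABI.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the book's structs:
   same members, different order. */
struct interleaved {
    char a;
    long long d;   /* usually needs padding in front of it */
    char b;
    char c;
};

struct grouped {
    long long d;   /* widest member first */
    char a;
    char b;
    char c;
};

/* Bytes lost to padding = total size minus the sum of member sizes. */
size_t padding_interleaved(void)
{
    return sizeof(struct interleaved) - (sizeof(long long) + 3 * sizeof(char));
}

size_t padding_grouped(void)
{
    return sizeof(struct grouped) - (sizeof(long long) + 3 * sizeof(char));
}
```

On a typical x86-64 ABI, struct interleaved takes 24 bytes and struct grouped takes 16, but it is worth printing the values on your own target rather than relying on those numbers.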
In most scenarios, you never have to be concerned with alignment because the default alignment is already optimal. In some cases, however, you can achieve significant performance improvements, or memory savings, by specifying a custom alignment for your data structures.
In terms of memory space, the compiler pads the structure in a way that naturally aligns each element of the structure.
struct x_
{
    char a;     // 1 byte
    int b;      // 4 bytes
    short c;    // 2 bytes
    char d;     // 1 byte
} bar[3];
struct x_ is padded by the compiler and thus becomes:
// Shows the actual memory layout
struct x_
{
    char a;            // 1 byte
    char _pad0[3];     // padding to put 'b' on 4-byte boundary
    int b;             // 4 bytes
    short c;           // 2 bytes
    char d;            // 1 byte
    char _pad1[1];     // padding to make sizeof(x_) multiple of 4
} bar[3];
Source: https://learn.microsoft.com/en-us/cpp/cpp/alignment-cpp-declarations?view=vs-2019
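The padding described for struct x_ can be observed with offsetof and sizeof instead of being taken on trust. This is a small sketch; the helper names pad_before_b and tail_pad are made up for illustration, and the values 3 and 1 assume a target where int is 4 bytes with 4-byte alignment.

```c
#include <stddef.h>

struct x_ {
    char a;
    int b;
    short c;
    char d;
};

/* Padding bytes the compiler inserted between 'a' and 'b'. */
size_t pad_before_b(void)
{
    return offsetof(struct x_, b) - (offsetof(struct x_, a) + sizeof(char));
}

/* Padding bytes at the end of the struct, after 'd'. */
size_t tail_pad(void)
{
    return sizeof(struct x_) - (offsetof(struct x_, d) + sizeof(char));
}
```

The tail padding matters for arrays like bar[3]: it keeps 'b' correctly aligned in every element, not just the first one.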
What has changed is SSE, which requires 16-byte alignment. This is covered in this older gcc document for -mpreferred-stack-boundary=num, which says (emphasis mine):
This is also backed up by the paper Smashing The Modern Stack For Fun And Profit, which covers this and other modern changes that break Smashing the Stack for Fun and Profit.