GCC only does this extra stack alignment in main; that function is special. You won't see it if you look at code-gen for any other function, unless you have a local with alignas(32) or something.
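As a rough illustration of the alignas(32) case, here is a minimal C sketch. The function name is hypothetical, not something from GCC or the ABI. Because the ABI only promises 16-byte alignment on entry, a compiler has to realign the stack pointer inside this function to honour the 32-byte request; compiling with something like gcc -m32 -O2 -S should show that, although the exact instructions depend on the compiler version.

```c
#include <stdalign.h>
#include <stdint.h>

/* Hypothetical helper, for illustration only.
 * The local below asks for more alignment (32) than the ABI guarantees
 * for the incoming stack (16), so the compiler must realign the stack
 * pointer in this function's prologue, much like GCC does in main. */
unsigned over_aligned_local_mod32(void)
{
    alignas(32) volatile char buf[32];   /* volatile: keep it on the stack */
    buf[0] = 1;
    /* Address of the buffer modulo 32; 0 if the request was honoured. */
    return (unsigned)((uintptr_t)buf % 32);
}
```

Calling over_aligned_local_mod32() from any function should return 0 regardless of how the caller's stack happened to be aligned.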
GCC is just taking a defensive approach with -m32, by not assuming that main is called with a properly 16B-aligned stack. Or this special treatment is left over from when -mpreferred-stack-boundary=4 was only a good idea, not the law (see footnote 1).
The i386 System V ABI has guaranteed/required for years that ESP+4 is 16B-aligned on entry to a function. (i.e. ESP must be 16B-aligned before a CALL instruction, so args on the stack start at a 16B boundary. This is the same as for x86-64 System V.) In other words, ESP % 16 == 0 before a call, and ESP % 16 == 12 on function entry, after the call has pushed the 4-byte return address.
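The arithmetic behind those two remainders can be sketched in C. This only models what a CALL does to the stack pointer (it pushes a return address); the function names and the sample addresses are made up for the illustration.

```c
#include <stdint.h>

/* Model of CALL in 32-bit mode: pushes a 4-byte return address. */
uint32_t esp_after_call(uint32_t esp_before_call)
{
    return esp_before_call - 4;
}

/* Model of CALL in 64-bit mode: pushes an 8-byte return address. */
uint64_t rsp_after_call(uint64_t rsp_before_call)
{
    return rsp_before_call - 8;
}

/* With a 16-byte aligned stack pointer before the call:
 *   esp_after_call(0xffffd000) % 16 == 12   (so ESP+4 is 16B-aligned)
 *   rsp_after_call(0x7ffffffde000) % 16 == 8
 */
```

The same reasoning gives the x86-64 rule: one 8-byte push after entry brings RSP back to a 16-byte boundary.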
The ABI also guarantees that new 32-bit processes start with ESP aligned on a 16B boundary (e.g. at _start, the ELF entry point, where ESP points at argc, not a return address), and the glibc CRT code maintains that alignment.
As far as the calling convention is concerned, EBP is just another call-preserved register. But yes, compiler output with -fno-omit-frame-pointer does take care to push ebp before other call-preserved registers (like EBX) so the saved EBP values form a linked list. (Because it also does the mov ebp, esp part of setting up a frame pointer after that push.)
Perhaps gcc is defensive because an extremely ancient Linux kernel (from before that revision to the i386 ABI, when the required alignment was only 4B) could violate that assumption, and it's only a couple of extra instructions that run once in the lifetime of the process (assuming the program doesn't call main recursively).
Unlike gcc, clang assumes the stack is properly aligned on entry to main. (clang also assumes that narrow args have been sign- or zero-extended to 32 bits, even though the current ABI revision doesn't specify that behaviour (yet). gcc and clang both emit code that does this extension on the caller side, but only clang depends on it in the callee. This happens in 64-bit code, but I didn't check 32-bit.)
Look at compiler output on http://gcc.godbolt.org/ for main and functions other than main if you're curious.
I just updated the ABI links in the x86 tag wiki the other day. http://x86-64.org/ is still dead and seems to be not coming back, so I updated the System V links to point to the PDFs of the current revision in HJ Lu's github repo, and his page with links.
Note that the last version on SCO's site is not the current revision, and doesn't include the 16B-stack-alignment requirement.
History of the ABI change from 4 to 16-byte alignment
Footnote 1:
Adding a 16-byte alignment requirement to the i386 SysV ABI was sort of an accident; GCC maintained 16-byte alignment for performance reasons (so for example an 8-byte double would never be split across a cache line boundary).
See also a section at the bottom of my answer on Why does the x86-64 / AMD64 System V ABI mandate a 16 byte stack alignment? for more detail.
In some GCC version, SSE/SSE2 code-gen started using movaps to spill/reload __m128 variables to the stack, without manually aligning the incoming ESP. This turned the tuning choice into a requirement, but it wasn't detected until libraries with code like that were widely deployed in some long-term-stable Linux distros.
Faced with this choice, GCC devs / ABI maintainers chose the least-bad path of making it an official requirement. This broke existing hand-written asm that calls other functions.
See https://sourceforge.net/p/fbc/bugs/659/ for some history, and my comment on https://gcc.gnu.org/bugzilla/show_bug.cgi?id=40838#c91 for an attempt at summarizing the unfortunate history of how i386 GNU/Linux + GCC accidentally got into a situation where a backwards-incompat change to the i386 System V ABI was the lesser of two evils.
Most BSD versions and i386 MacOS did not adopt this ABI change, and still don't require 16-byte stack alignment. GCC may default to -mpreferred-stack-boundary=4 for those targets, but code-gen for alignas(16) char buf[16]; (or __m128 locals that get spilled from regs) needs to manually align ESP inside functions in case it wasn't to start with.
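Manually aligning ESP boils down to rounding the stack pointer down to a multiple of 16, which is what an and esp, -16 instruction does. A minimal C model of that bit operation follows; the function name and sample values are assumptions made for the illustration, and a real prologue would also save the original ESP (typically in EBP) so it can be restored before returning.

```c
#include <stdint.h>

/* Round a stack-pointer value down to a 16-byte boundary.
 * Clearing the low 4 bits is the same operation as AND with -16
 * in two's complement. The stack grows downward, so rounding down
 * only reserves a few extra bytes and never touches the caller's data. */
uintptr_t align_down_16(uintptr_t sp)
{
    return sp & ~(uintptr_t)15;
}

/* Examples:
 *   align_down_16(0xffffcffc) == 0xffffcff0   (12 bytes skipped)
 *   align_down_16(0xffffcff0) == 0xffffcff0   (already aligned, unchanged)
 */
```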
So really this bump from 4 to 16-byte alignment was a change for Linux, mostly not other OSes. That might be another reason to simplify GCC's source code and always include the extra stack-alignment code in main for 32-bit targets. At this point 32-bit x86 for Linux is obsolete enough that it's not worth changing now.
Best Answer
rsp % 16 == 0 at _start - that's the OS entry point. It's not a function (there's no return address on the stack; instead RSP points at argc). Unlike functions, RSP is aligned by 16 on entry to _start, as specified by the x86-64 System V ABI.

From _start, you're ready to call a function right away, without having to adjust the stack, because the stack should be aligned before call. call itself will add 8B of return address, so you can expect rsp % 16 == 8 upon entry, one more push away from 16-byte alignment. That's guaranteed upon entry to any function (see footnote 1).

Upon app entry, you can trust the kernel to give you 16-byte RSP alignment, or you could align the stack manually with an and rsp, -16 instruction before calling any other code conforming to the ABI. (Or if you plan to use the C runtime library, then the entry point of your app code should be main, and you let libc's CRT startup code run as _start. main is a normal function like any other, so RSP & 0xF == 0x8 on entry to it when it's eventually called.)

Footnote 1: Unless you build with special options that change the ABI, like -mpreferred-stack-boundary=3 instead of the default 4. But that would make it unsafe to call functions in any code compiled without that option. For example, see "glibc scanf Segmentation faults when called from a function that doesn't align RSP".

Yes, if at that point you call some more complex function, for example printf with non-trivial arguments (so its implementation would use SSE instructions), it will very likely segfault.

About
push byte 0xFF:

That's not a legal instruction in 64b mode (nor in 16 and 32 bit modes). It's not legal in the sense of a byte operand target size: a byte immediate as the source value is legal, but the operand size can only be 16, 32 or 64 bits. So NASM will guess the target size (one of the legal ones, naturally picking qword in 64b mode), and use the guessed target size with the imm8 from the source.

BTW, use the -w+all option to make NASM emit a (sort of weird, but at least you can investigate) warning in such a case.

For example, a legit push word 0xFF would push only two bytes to the stack, of word value 0x00FF.

How to align the stack: if you already know the initial alignment, just adjust as needed before calling some subroutine that requires ABI alignment (in common 64b code that is usually as simple as either not pushing anything, or doing one more redundant push, like push rbp).

If you are not sure about the alignment, use some spare register to store the original rsp (often rbp is used, so it also functions as the stack frame pointer), and then use and rsp, -16 to clear the bottom bits.

Keep in mind, when creating your own ABI-conforming subroutines, that the stack was aligned before call, so it is off by 8B upon entry. Again, a simple push rbp is often enough to resolve several issues at the same time, preserving the rbp value (so mov rbp, rsp is possible "for free") and aligning the stack for the rest of the subroutine.

EDIT: about encoding, source size, and immediate size...
Unfortunately I'm not 100% sure about how exactly this is supposed to be defined in NASM, but I think the push definition is actually so complex that it breaks NASM syntax a bit (exhausting the current syntax to the point where you can't specify whether you mean operand size or source immediate size, so it is silently assumed that the size specifier is mainly the operand size, and affects the immediate only in certain cases).

By using push byte 0xFF, NASM will take the byte part ALSO as "operand size", not just as immediate size. And byte is not a legal operand size for push, so NASM will instead choose qword, the default in 64b mode. Then it will also consider the byte as the immediate size, and sign-extend the 0xFF to qword. I.e. this looks to me like a bit of undefined behaviour. NASM's creators probably don't expect you to specify the immediate size, because NASM optimizes for size, so when you do push word -1, it will assemble that as "push word operand imm8". You can override that the other way, to make sure you get imm16, with push strict word -1.

See the machine code produced by the various combinations (in 64b mode). Strictly speaking, some of them deserve at least a warning, or even an error, like "strict qword" producing only imm32, not imm64 (as an imm64 opcode does not exist, of course), not to mention that the dword variants are effectively qword operand sizes, since you can't use 32b operand size in 64b mode.

Anyway, I guess not too many people are bothered by this, as in 64b mode you usually want a qword push (rsp -= 8) with the immediate encoded in the shortest possible way, so you just write push -1 and let NASM handle the imm8 optimization itself, expecting rsp to change by -8 of course. And in other cases, they probably expect you to know the legal operand sizes, and not to use byte at all.

If you think this is not acceptable, I would raise this on the NASM forum/bugzilla/somewhere, asking how it is supposed to work exactly. As far as I'm personally concerned, the current behaviour is "good enough" for me (it makes sense, plus I give a quick look to the listing file from time to time to verify there's no nasty surprise in the machine code bytes and that it landed as expected). That said, I mostly code size intros, so I know about every byte produced and its purpose. If NASM suddenly produced imm16 instead of the expected imm8, I would see it in the binary size and investigate.
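To make the sign-extension of the imm8 form described in the EDIT concrete, here is a small C model of the value that ends up on the stack when a one-byte immediate is pushed with qword operand size. It is only a sketch of the arithmetic, not of NASM itself, and the function name is hypothetical.

```c
#include <stdint.h>

/* Value stored on the stack by a 64-bit push of a sign-extended imm8.
 * Bit 7 of the immediate is copied into all the upper bits. */
uint64_t pushed_qword_from_imm8(uint8_t imm8)
{
    if (imm8 & 0x80)
        return UINT64_C(0xFFFFFFFFFFFFFF00) | imm8;  /* negative: fill with ones */
    return imm8;                                     /* non-negative: fill with zeros */
}

/* Examples:
 *   pushed_qword_from_imm8(0xFF) == 0xFFFFFFFFFFFFFFFF   (i.e. -1)
 *   pushed_qword_from_imm8(0x7F) == 0x000000000000007F
 */
```

So push byte 0xFF, as NASM treats it in 64b mode, pushes 8 bytes of 0xFF rather than a single byte.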