Understanding Stack Alignment in x86-64 Assembly


I'm reading the Intel manual's section on stack frames. It notes that:

The end of the input argument area shall be aligned on a 16 (32, if
__m256 is passed on stack) byte boundary.

I don't quite understand what it means. Does it mean that rsp should point to the address that is always aligned on 16?

I tried to experiment with it and wrote very simple program:

section .text
    global _start

_start:
    push byte 0xFF

    ;SYS_exit syscall
    mov rax, 60
    mov rdi, 0
    syscall

I ran it under gdb and noted that before executing the push instruction, rsp = 0x7fffffffdcf0, so it really was aligned on 16. x/1xg $rsp returned 0x0000000000000001.

Now, after pushing the content of rsp became 0x7fffffffdce8. Is it a violation of the alignment requirements?

I also noticed that x/1xg $rsp returned 0xffffffffffffffff. That means every bit in the next 8 bytes was set to 1, not just the one byte specified in the push instruction. Why? I expected the output of x/1xg $rsp after pushing to be 0x00000000000000FF (we pushed just one byte).

Best Answer

rsp % 16 == 0 at _start - that's the OS entry point. It's not a function (there's no return address on the stack, instead RSP points at argc). Unlike functions, RSP is aligned by 16 on entry to _start, as specified by the x86-64 System V ABI.

From _start, you're ready to call a function right away, without having to adjust the stack, because the stack should be aligned before a call. The call itself pushes an 8-byte return address, so you can expect rsp % 16 == 8 upon entry, one more push away from 16-byte alignment. That's guaranteed upon entry to any function (see footnote 1).

Upon app entry, you can trust the kernel to give you 16-byte RSP alignment, or you could align the stack manually with and rsp, -16 before calling any other code that conforms to the ABI. (Or, if you plan to use the C runtime library, the entry point of your app code should be main, and you let libc's crt startup code run as _start. main is a normal function like any other, so RSP & 0xF == 0x8 on entry to it when it's eventually called.)
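To make the two entry conditions concrete, here is a minimal sketch for Linux x86-64, assuming a static executable with no libc. The function name helper is hypothetical and exists only for illustration.

```nasm
; Sketch: calling a function straight from _start (Linux x86-64, NASM syntax).
; "helper" is a hypothetical leaf function used only for illustration.
section .text
    global _start

helper:                     ; on entry: rsp % 16 == 8 (call pushed the return address)
    ret                     ; leaf function, makes no further calls

_start:                     ; on entry: rsp % 16 == 0 and [rsp] holds argc
    and     rsp, -16        ; redundant here (the kernel already aligned rsp), shown for illustration
    call    helper          ; pushes 8 bytes, so helper sees rsp % 16 == 8

    mov     eax, 60         ; SYS_exit
    xor     edi, edi        ; exit status 0
    syscall
```

Assuming the file is saved as start.asm (a made-up name), it can be built with nasm -f elf64 start.asm -o start.o followed by ld start.o -o start, and the rsp values checked in gdb with p $rsp % 16 at each label.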

Footnote 1: Unless you build with special options that change the ABI, like -mpreferred-stack-boundary=3 instead of the default 4. But that would make it unsafe to call functions in any code compiled without that option. For an example, see the related question "glibc scanf Segmentation faults when called from a function that doesn't align RSP".

Now, after pushing the content of rsp became 0x7fffffffdce8. Is it a violation of the alignment requirements?

Yes. If at that point you called a more complex function, for example printf with non-trivial arguments (so that its implementation uses SSE instructions), it would very likely segfault.

About push byte 0xFF:

That's not a legal instruction in 64-bit mode (nor in 16-bit or 32-bit mode). A byte immediate is legal as the source value, but the operand size of push can only be 16, 32 or 64 bits, depending on the mode; a byte operand size does not exist. So NASM guesses the operand size (picking from the legal ones, naturally qword in 64-bit mode) and combines that guessed operand size with the imm8 from the source.

BTW, use the -w+all option to make NASM emit a warning in such cases (a somewhat odd one, but at least it gives you something to investigate):

warning: signed byte value exceeds bounds

For example, the legitimate push word 0xFF would push only two bytes onto the stack, holding the word value 0x00FF.
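As a rough illustration of how the size specifier and the immediate interact, the sketch below pushes three values in 64-bit mode. The encodings in the comments are the ones NASM reports for these same instructions in its listing output; the stored values follow from sign extension of the immediate.

```nasm
; Sketch: what different push forms store in 64-bit mode (Linux x86-64, NASM syntax).
section .text
    global _start

_start:
    push    qword 0xFF      ; 68 FF 00 00 00: rsp -= 8, stores 0x00000000000000FF
    push    qword -1        ; 6A FF: rsp -= 8, imm8 sign-extended to 0xFFFFFFFFFFFFFFFF
    push    word 0xFF       ; 66 68 FF 00: rsp -= 2, stores 0x00FF
    add     rsp, 18         ; discard the 8 + 8 + 2 bytes pushed above

    mov     eax, 60         ; SYS_exit
    xor     edi, edi        ; exit status 0
    syscall
```

So to get 0x00000000000000FF on the stack, push qword 0xFF (imm32, zero in the upper bits after sign extension) is the form to use. Note that the word push leaves rsp misaligned even for 8-byte purposes, which is one reason it is rarely used in 64-bit code.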

How to align the stack: if you already know the initial alignment, just adjust as needed before calling a subroutine that requires ABI conformance (in common 64-bit code that is usually as simple as either not pushing anything, or doing one more redundant push, like push rbp).

If you are not sure about the alignment, use a spare register to store the original rsp (rbp is often used, so that it also serves as the stack frame pointer), and then use and rsp, -16 to clear the low four bits.
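A minimal sketch of that save-and-mask approach follows, assuming Linux x86-64 and a hypothetical function named worker. The initial push exists only to simulate an unknown alignment.

```nasm
; Sketch: aligning rsp when the current alignment is unknown (Linux x86-64, NASM syntax).
; "worker" is a hypothetical ABI-conforming function.
section .text
    global _start

worker:
    ret

_start:
    push    rax             ; deliberately disturb the alignment: rsp % 16 == 8 now
    mov     rbp, rsp        ; remember the current rsp in a spare register
    and     rsp, -16        ; round rsp down to a multiple of 16
    call    worker          ; stack is 16-byte aligned at the call
    mov     rsp, rbp        ; restore the remembered rsp
    pop     rax             ; undo the initial push

    mov     eax, 60         ; SYS_exit
    xor     edi, edi        ; exit status 0
    syscall
```

Inside a normal function, rbp is callee-saved under the System V ABI, so it would have to be pushed before being reused this way; _start has no caller, so the sketch skips that step.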

Keep in mind, when writing your own ABI-conforming subroutines, that the stack was aligned before the call, so on entry rsp is 8 bytes away from 16-byte alignment (rsp % 16 == 8). Again, a simple push rbp is often enough to resolve several issues at the same time: it preserves the rbp value (so mov rbp, rsp is possible "for free") and aligns the stack for the rest of the subroutine.
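The prologue described above might look like this in a non-leaf function. This is only a sketch for Linux x86-64; the names outer and leaf are hypothetical.

```nasm
; Sketch: push rbp both preserves rbp and realigns the stack (Linux x86-64, NASM syntax).
; "outer" and "leaf" are hypothetical function names.
section .text
    global _start

leaf:
    ret

outer:                      ; on entry: rsp % 16 == 8
    push    rbp             ; preserve caller's rbp; rsp % 16 == 0 now
    mov     rbp, rsp        ; set up the frame pointer
    call    leaf            ; stack is 16-byte aligned at the call
    pop     rbp             ; restore caller's rbp; rsp % 16 == 8 again
    ret

_start:                     ; on entry: rsp % 16 == 0
    call    outer

    mov     eax, 60         ; SYS_exit
    xor     edi, edi        ; exit status 0
    syscall
```

If the function also needs local storage, the amount subtracted from rsp after the prologue should be a multiple of 16 to keep this alignment.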

EDIT: about encoding, source size, and immediate size...

Unfortunately I'm not 100% sure how exactly this is supposed to be defined in NASM, but I think the push definition is so complex that it stretches NASM syntax past what it can express: you can't specify whether you mean the operand size or the source immediate size, so NASM silently assumes that the size specifier is mainly the operand size and lets it affect the immediate in certain cases.

With push byte 0xFF, NASM takes the byte part ALSO as the "operand size", not just as the immediate size. Since byte is not a legal operand size for push, NASM instead chooses qword, the default in 64-bit mode. It then also treats byte as the immediate size and sign-extends 0xFF to a qword. To me this looks like a bit of undefined behaviour. The NASM authors probably don't expect you to specify the immediate size, because NASM optimizes for size: when you write push word -1, it assembles that as "push word operand, imm8". You can override that in the other direction with push strict word -1, which makes sure you get an imm16.

See the machine code produced by the various combinations in 64-bit mode. Strictly speaking, some of them deserve at least a warning, or even an error: for example, "strict qword" produces only an imm32, not an imm64 (because no push imm64 encoding exists). Also note that the dword variants are effectively qword operand sizes, since you can't use a 32-bit operand size for push in 64-bit mode:

 6 00000000 6AFF                            push    -1
 7 00000002 6AFF                            push    strict byte 0xFF
 8          ******************       warning: signed byte value exceeds bounds
 9 00000004 6AFF                            push    byte 0xFF
10          ******************       warning: signed byte value exceeds bounds
11 00000006 6AFF                            push    strict byte -1
12 00000008 6AFF                            push    byte -1
13 0000000A 6668FF00                        push    strict word 0xFF
14 0000000E 6668FF00                        push    word 0xFF
15 00000012 6668FFFF                        push    strict word -1
16 00000016 666AFF                          push    word -1
17 00000019 68FF000000                      push    strict dword 0xFF
18 0000001E 68FF000000                      push    dword 0xFF
19 00000023 68FFFFFFFF                      push    strict dword -1
20 00000028 6AFF                            push    dword -1
21 0000002A 68FF000000                      push    strict qword 0xFF
22 0000002F 68FF000000                      push    qword 0xFF
23 00000034 68FFFFFFFF                      push    strict qword -1
24 00000039 6AFF                            push    qword -1

Anyway, I guess not many people are bothered by this, because in 64-bit mode you usually want a qword push (rsp -= 8) with the immediate encoded in the shortest possible way. So you just write push -1 and let NASM handle the imm8 optimization itself, expecting rsp to change by -8 of course. In the other cases, NASM probably expects you to know the legal operand sizes and not to use byte at all.

If you think this is not acceptable, I would ask on the NASM forum, bugzilla, or a similar channel how exactly it is supposed to work. As far as I'm personally concerned, the current behaviour is "good enough" (it makes sense, and I take a quick look at the listing file from time to time to verify that there's no nasty surprise in the machine code bytes and that everything landed as expected). That said, I mostly write size-coded intros, so I know about every byte produced and its purpose. If NASM suddenly produced an imm16 instead of the expected imm8, I would notice it in the binary size and investigate.