Memory Alignment – Comparison of Today and 20 Years Ago


In the famous paper "Smashing the Stack for Fun and Profit", its author takes a C function

void function(int a, int b, int c) {
  char buffer1[5];
  char buffer2[10];
}

and generates the corresponding assembly code output

pushl %ebp
movl %esp,%ebp
subl $20,%esp

The author explains that since computers address memory in multiples of word size, the compiler reserved 20 bytes on the stack (8 bytes for buffer1, 12 bytes for buffer2).

I tried to recreate this example and got the following

pushl   %ebp
movl    %esp, %ebp
subl    $16, %esp

A different result! I tried various combinations of sizes for buffer1 and buffer2, and it seems that modern gcc no longer pads each buffer to a multiple of the word size. Instead it abides by the -mpreferred-stack-boundary option.

As an illustration: using the paper's arithmetic rules, for buffer1[5] and buffer2[13] I'd expect 8+16 = 24 bytes reserved on the stack. In reality I got 32 bytes.

The paper is quite old and a lot has happened since. What exactly motivated this change of behavior? Is it the move towards 64-bit machines, or something else?

Edit

The code is compiled on an x86_64 machine using gcc version 4.8.2 (Ubuntu 4.8.2-19ubuntu1) like this:

$ gcc -S -o example1.s example1.c -fno-stack-protector -m32

Best Answer

What has changed is SSE, which requires 16-byte alignment. This is covered in an older gcc document for -mpreferred-stack-boundary=num, which says:

On Pentium and PentiumPro, double and long double values should be aligned to an 8 byte boundary (see -malign-double) or suffer significant run time performance penalties. On Pentium III, the Streaming SIMD Extension (SSE) data type __m128 suffers similar penalties if it is not 16 byte aligned.

This is also backed up by the paper Smashing The Modern Stack For Fun And Profit, which covers this and other modern changes that break Smashing the Stack for Fun and Profit.

History of the ABI change from 4 to 16-byte alignment

Adding a 16-byte alignment requirement to the i386 SysV ABI was sort of an accident; GCC maintained 16-byte alignment for performance reasons (so that, for example, an 8-byte double would never be split across a cache line boundary).

See also a section at the bottom of my answer on Why does the x86-64 / AMD64 System V ABI mandate a 16 byte stack alignment? for more detail.

In some GCC version, SSE/SSE2 code-gen started using movaps to spill/reload __m128 variables to the stack, without manually aligning the incoming ESP. This turned the tuning choice into a requirement, but it wasn't detected until libraries with code like that were widely deployed in some long-term-stable Linux distros.

Faced with this choice, GCC devs / ABI maintainers chose the least-bad path of making it an official requirement. This broke existing hand-written asm that calls other functions.

See https://sourceforge.net/p/fbc/bugs/659/ for some history, and my comment on https://gcc.gnu.org/bugzilla/show_bug.cgi?id=40838#c91 for an attempt at summarizing the unfortunate history of how i386 GNU/Linux + GCC accidentally got into a situation where a backwards-incompat change to the i386 System V ABI was the lesser of two evils.

Most BSD versions and i386 MacOS did not adopt this ABI change, and still don't require 16-byte stack alignment. GCC may default to -mpreferred-stack-boundary=4 for those targets, but code-gen for alignas(16) char buf[16]; (or __m128 locals that get spilled from regs) needs to manually align ESP inside functions in case it wasn't to start with.

So really this bump from 4 to 16-byte alignment was a change for Linux, mostly not other OSes. That might be another reason to simplify GCC's source code and always include the extra stack-alignment code in main for 32-bit targets. At this point 32-bit x86 for Linux is obsolete enough that it's not worth changing now.

C/C++ – Understanding Memory Alignment

The examples given in the book depend heavily on the compiler and the computer architecture. If you test them in your own program you may get totally different results from the author's. I will assume a 64-bit architecture, because the author does too, from what I've read in the description. Let's look at the examples one by one:

ReallySlowStruct: if the compiler supports struct members that are not byte-aligned, "d" will start at the seventh bit of the first byte of the struct. That sounds very good for saving memory. The problem is that C does not allow bit addressing. So to store newValue in the "d" member, the compiler must do a whole lot of bit-shifting operations: store the first two bits of newValue in byte 0, shifted 6 bits to the right, then shift newValue two bits to the left and store it starting at byte 1. Byte 1 is a non-aligned memory location, which means the bulk memory transfer instructions won't work and the compiler must store one byte at a time.

SlowStruct: it gets better. The compiler can get rid of all the bit-fiddling. But writing "d" will still require writing one byte at a time, because it is not aligned to the native "int" size. The native size on a 64-bit system is 8, so a memory address not divisible by 8 can only be accessed one byte at a time (at least on hardware that does not support unaligned accesses). And worse, if I switch off packing, I will waste a lot of memory space: every member that is followed by an int will be padded with enough bytes to let the integer start at a memory location divisible by 8. In this case, char a and c will both take up 8 bytes.

FastStruct: this is aligned to the size of int on the target machine. "d" takes up 8 bytes, as it should. Because the chars are all bundled in one place, and each is only 1 byte, the compiler does not pad them and no space is wasted. The complete structure adds up to an overall size of 16 bytes, which is divisible by 8, so no padding is needed.

In most scenarios, you never have to be concerned with alignment because the default alignment is already optimal. In some cases, however, you can achieve significant performance improvements, or memory savings, by specifying a custom alignment for your data structures.

In terms of memory space, the compiler pads the structure in a way that naturally aligns each element of the structure.

struct x_
{
   char a;     // 1 byte
   int b;      // 4 bytes
   short c;    // 2 bytes
   char d;     // 1 byte
} bar[3];

struct x_ is padded by the compiler and thus becomes:

// Shows the actual memory layout
struct x_
{
   char a;           // 1 byte
   char _pad0[3];    // padding to put 'b' on 4-byte boundary
   int b;            // 4 bytes
   short c;          // 2 bytes
   char d;           // 1 byte
   char _pad1[1];    // padding to make sizeof(x_) multiple of 4
} bar[3];

Source: https://learn.microsoft.com/en-us/cpp/cpp/alignment-cpp-declarations?view=vs-2019
