x86 Assembly – Opcode Alignment References and Guidelines

Tags: assembly, memory-alignment, micro-optimization, x86, x86-64

I'm generating some opcodes dynamically in a JIT compiler and I'm looking for guidelines for opcode alignment.

1) I've read comments that briefly "recommend" alignment by adding nops after calls.

2) I've also read about using nops to optimize instruction sequences for parallelism.

3) I've read that alignment of ops is good for "cache" performance.

Usually these comments don't give any supporting references. It's one thing to read a blog or a comment that says, "it's a good idea to do such and such", but it's another to actually write a compiler that implements specific op sequences and realize that most material online, especially blogs, is not useful for practical application. So I'm a believer in finding things out myself (disassembly, etc., to see what real-world apps do). This is one case where I need some outside info.

I notice compilers will usually start an odd-byte-length instruction immediately after whatever instruction sequence preceded it, so in most cases the compiler is not taking any special care. I see a "nop" here or there, but it seems nops are used sparingly, if at all. How critical is opcode alignment? Can you provide references for cases that I can actually use for implementation? Thanks.

Best Answer

I would recommend against inserting nops except for the alignment of branch targets. On some specific CPUs, branch prediction algorithms may penalize control transfers to control transfers, and so a nop may be able to act as a flag and invert the prediction, but otherwise it is unlikely to help.
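When a JIT does pad for alignment, it is better to emit a few long NOPs than a run of single-byte `0x90`s, since fewer instructions means less decode work. The sketch below is a hypothetical helper (the name `emit_nop_pad` is mine, not from the question) that fills a gap using the multi-byte NOP encodings Intel documents in the SDM:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical JIT helper: fill `len` bytes at `buf` with the
 * multi-byte NOP encodings recommended in the Intel SDM (Vol. 2,
 * "Recommended Multi-Byte Sequence of NOP Instruction").  A few long
 * NOPs decode to fewer uops than many single-byte 0x90s. */
static void emit_nop_pad(uint8_t *buf, size_t len)
{
    /* Canonical 1..8-byte NOP encodings, indexed by length - 1. */
    static const uint8_t nops[8][8] = {
        { 0x90 },                                           /* nop            */
        { 0x66, 0x90 },                                     /* 66 nop         */
        { 0x0F, 0x1F, 0x00 },                               /* nop [eax]      */
        { 0x0F, 0x1F, 0x40, 0x00 },                         /* nop [eax+0]    */
        { 0x0F, 0x1F, 0x44, 0x00, 0x00 },                   /* nop [eax+eax]  */
        { 0x66, 0x0F, 0x1F, 0x44, 0x00, 0x00 },
        { 0x0F, 0x1F, 0x80, 0x00, 0x00, 0x00, 0x00 },
        { 0x0F, 0x1F, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00 },
    };
    while (len > 0) {
        size_t n = len < 8 ? len : 8;   /* longest NOP that still fits */
        memcpy(buf, nops[n - 1], n);
        buf += n;
        len -= n;
    }
}
```

For example, padding a 10-byte gap emits one 8-byte NOP followed by one 2-byte NOP, rather than ten `0x90`s.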

Modern CPUs are going to translate your ISA ops into micro-ops anyway. This may make classical alignment techniques less important, as presumably the micro-op decoder will drop the nops and change both the size and alignment of the internal, true machine ops.

However, by the same token, optimizations based on first principles should do little or no harm.

The theory is that one makes better use of the cache by starting loops at cache line boundaries. If a loop were to start in the middle of a cache line, then the first half of the cache line would be unavoidably loaded and kept loaded during the loop, and this would be wasted space in the cache if the loop is longer than 1/2 of a cache line.
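The padding a JIT needs to apply this is just the distance from the current emit address to the next line boundary. A minimal sketch, assuming a 64-byte cache line (real code should query the line size at runtime, e.g. via CPUID, rather than hard-code it):

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed cache-line size; 64 bytes is typical on current x86 parts,
 * but this should really be detected at runtime. */
#define CACHE_LINE 64

/* Bytes of padding to insert so the next emitted instruction (e.g. a
 * loop head) starts on a cache-line boundary.  Returns 0 when the
 * address is already aligned. */
static size_t pad_to_line(uintptr_t emit_addr)
{
    return (CACHE_LINE - (emit_addr % CACHE_LINE)) % CACHE_LINE;
}
```

The outer `% CACHE_LINE` keeps an already-aligned address from being padded by a full line; the returned count would then be filled with nops before emitting the loop head.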

Also, for branch targets, the initial load of the cache line loads the largest forward window of instruction stream when the target is aligned.

Regarding separating in-line instructions that are not branch targets with nops, there is little reason to do this on modern CPUs. (There was a time when RISC machines had delay slots, which often led to inserting nops after control transfers.) Decoding the instruction stream is easy to pipeline, and if an architecture has odd-byte-length ops you can be assured that they are decoded reasonably efficiently.
