TL;DR: Sandybridge-family store-forwarding has lower latency if the reload doesn't try to happen "right away". Adding useless code can speed up a debug-mode loop because loop-carried latency bottlenecks in -O0 anti-optimized code almost always involve store/reload of some C variables.
Other examples of this slowdown in action: hyperthreading, calling an empty function, accessing vars through pointers.
And apparently also on low-power Goldmont, unless there's a different cause there for an extra load helping.
None of this is relevant for optimized code. Bottlenecks on store-forwarding latency can occasionally happen, but adding useless complications to your code won't speed it up.
You're benchmarking a debug build, which is basically useless. Debug builds have different bottlenecks than optimized code, not a uniform slowdown.
But obviously there is a real reason for the debug build of one version running slower than the debug build of the other version. (Assuming you measured correctly and it wasn't just CPU frequency variation (turbo / power-saving) leading to a difference in wall-clock time.)
If you want to get into the details of x86 performance analysis, we can try to explain why the asm performs the way it does in the first place, and why the asm from an extra C statement (which with -O0 compiles to extra asm instructions) could make it faster overall. This will tell us something about asm performance effects, but nothing useful about optimizing C.
You haven't shown the whole inner loop, only some of the loop body, but gcc -O0 is pretty predictable. Every C statement is compiled separately from all the others, with all C variables spilled/reloaded between the blocks for each statement. This lets you change variables with a debugger while single-stepping, or even jump to a different line in the function, and have the code still work. The performance cost of compiling this way is catastrophic. For example, your loop has no side-effects (none of the results are used), so the entire triple-nested loop could and would compile to zero instructions in a real build, running infinitely faster. Or, more realistically, running 1 cycle per iteration instead of ~6, even without optimizing it away or doing major transformations.
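As a concrete stand-in (hypothetical code, since the full inner loop wasn't shown), imagine something of this shape:

    void triple_loop(void)
    {
        /* No side effects: at -O2/-O3 the whole thing can legally compile to nothing.
           At -O0, i, j and k live in stack slots, so every increment and compare
           is a reload and re-store of memory. */
        for (int i = 0; i < 1000; i++)
            for (int j = 0; j < 1000; j++)
                for (int k = 0; k < 1000; k++) {
                    /* results never used */
                }
    }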
The bottleneck is probably the loop-carried dependency on k, with a store/reload and an add to increment. Store-forwarding latency is typically around 5 cycles on most CPUs, and thus your inner loop is limited to running once per ~6 cycles, the latency of a memory-destination add.
If you're on an Intel CPU, store/reload latency can actually be lower (better) when the reload can't try to execute right away. Having more independent loads/stores in between the dependent pair may explain it in your case. See Loop with function call faster than an empty loop.
So with more work in the loop, that addl $1, -12(%rbp), which can sustain a throughput of one per 6 cycles when run back-to-back, might instead only create a bottleneck of one iteration per 4 or 5 cycles.
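As a hedged illustration of the kind of "useless" extra statement in question (hypothetical code; the original loop body wasn't shown):

    void loop_with_filler(void)
    {
        int sink = 0;
        for (int k = 0; k < 100000000; k++) {
            sink = k;   /* extra independent load+store at -O0: more unrelated memory work
                           between the store/reload pairs that keep k's value in its stack slot */
        }
        (void)sink;
    }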
This effect apparently happens on Sandybridge and Haswell (not just Skylake), according to measurements from a 2013 blog post, so yes, this is the most likely explanation on your Broadwell i5-5257U, too. It appears that this effect happens on all Intel Sandybridge-family CPUs.
Without more info on your test hardware, compiler version (or asm source for the inner loop), and absolute and/or relative performance numbers for both versions, this is my best low-effort guess at an explanation. Benchmarking / profiling gcc -O0 on my Skylake system isn't interesting enough to actually try it myself. Next time, include timing numbers.
The latency of the stores/reloads for all the work that isn't part of the loop-carried dependency chain doesn't matter, only the throughput. The store queue in modern out-of-order CPUs does effectively provide memory renaming, eliminating write-after-write and write-after-read hazards from reusing the same stack memory for p being written and then read and written somewhere else. (See https://en.wikipedia.org/wiki/Memory_disambiguation#Avoiding_WAR_and_WAW_dependencies for more about memory hazards specifically, and this Q&A for more about latency vs. throughput and reusing the same register / register renaming.)
Multiple iterations of the inner loop can be in flight at once, because the memory-order buffer (MOB) keeps track of which store each load needs to take data from, without requiring a previous store to the same location to commit to L1D and get out of the store queue. (See Intel's optimization manual and Agner Fog's microarch PDF for more about CPU microarchitecture internals. The MOB is a combination of the store buffer and load buffer)
Does this mean adding useless statements will speed up real programs? (with optimization enabled)
In general, no, it doesn't. Compilers keep loop variables in registers for the innermost loops. And useless statements will actually optimize away with optimization enabled.
Tuning your source for gcc -O0 is useless. Measure with -O3, or whatever options the default build scripts for your project use.
Also, this store-forwarding speedup is specific to Intel Sandybridge-family, and you won't see it on other microarchitectures like Ryzen, unless they also have a similar store-forwarding latency effect.
Store-forwarding latency can be a problem in real (optimized) compiler output, especially if you didn't use link-time optimization (LTO) to let tiny functions inline, especially functions that pass or return anything by reference (so it has to go through memory instead of registers). Mitigating the problem may require hacks like volatile if you really want to just work around it on Intel CPUs, and may make things worse on some other CPUs. See discussion in comments.
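As a hedged sketch of that situation (illustrative code, not from the question): a tiny out-of-line helper that returns its result through a pointer forces a store in the callee and a reload in the caller, so a chain of such calls is serialized on store-forwarding latency unless the function inlines.

    #include <stdint.h>

    void step(const uint64_t *in, uint64_t *out)    /* "returns" by reference */
    {
        *out = *in * 3 + 1;
    }

    uint64_t run(uint64_t x, int n)
    {
        uint64_t a = x, b;
        for (int i = 0; i < n; i++) {
            step(&a, &b);   /* store in step(), reload here: roughly one store-forwarding
                               round trip per call if step() doesn't inline */
            a = b;
        }
        return a;
    }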
Best Answer
As @RossRidge says, the overhead will mostly be lost in the noise of servicing an interrupt on a modern x86 (probably at least 100 clock cycles, and many many more if this is part of a modern OS with Meltdown + Spectre mitigation set up).
If MAX_VAL is a power of 2, counter %= MAX_VAL is excellent, especially if counter is unsigned (in which case it's just a simple and, or in this case a movzx from byte to dword, which can have zero latency on Intel CPUs. It still has a throughput cost of course: Can x86's MOV really be "free"? Why can't I reproduce this at all?)

Is it possible to fill the last 255-240 entries with something harmless, or repeats of something?

As long as MAX_VAL is a compile-time constant, though, counter %= MAX_VAL will compile efficiently to just a couple of multiplies, shifts, and adds. (Again, more efficient for unsigned.) Why does GCC use multiplication by a strange number in implementing integer division?

But a check for wrap-around is even better. A branchless check (using cmov) has lower latency than the remainder using a multiplicative inverse, and costs fewer uops for throughput. As you say, a branchy check can take the check off the critical path entirely, at the cost of a mispredict sometimes.
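For concreteness, here's a minimal sketch of the two C versions being compared (my own illustrative code with assumed names; not necessarily the exact source behind the Godbolt link below):

    #include <stdint.h>
    #define MAX_VAL 240                 /* table size from the question */
    extern uint32_t table[MAX_VAL];
    extern void output(uint32_t v);
    static unsigned counter;

    void isr_modulo(void)               /* remainder via multiplicative inverse: more uops, higher latency */
    {
        counter = (counter + 1) % MAX_VAL;
        output(table[counter]);
    }

    void isr_wrap_check(void)           /* add + cmp + cmov (or a predictable branch) */
    {
        unsigned next = counter + 1;
        counter = (next >= MAX_VAL) ? 0 : next;
        output(table[counter]);
    }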
I compiled many versions of this on the Godbolt compiler explorer, with recent gcc and clang.
(For more about static performance analysis of throughput and latency for short blocks of x86 asm, see What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand?, and other links in the x86 tag wiki, especially Agner Fog's guides.)
clang uses branchless cmov for both versions. I compiled with -fPIE in case you're using that in your kernels. If you can use -fno-pie, then the compiler can save an LEA and use mov edi, [table + 4*rcx], assuming you're on a target where static addresses in position-dependent code fit in 32-bit sign-extended constants (e.g. true in the Linux kernel, but I'm not sure if they compile with -fPIE or do kernel ASLR with relocations when the kernel is loaded).

The block of 8 instructions starting at the load of the old counter value is a total of 8 uops (on AMD, or on Intel Broadwell and later, where cmov is only 1 uop). The critical-path latency from counter being ready to table[++counter % MAX_VAL] being ready is 1 (add) + 1 (cmp) + 1 (cmov) + load-use latency for the load, i.e. 3 extra cycles. That's the latency of one mul instruction. Or 1 extra cycle on older Intel where cmov is 2 uops.

By comparison, the version using modulo is 14 uops for that block with gcc, including a 3-uop mul r32. The latency is at least 8 cycles; I didn't count exactly. (For throughput it's only a little bit worse, though, unless the higher latency reduces the ability of out-of-order execution to overlap execution of stuff that depends on the counter.)

Other optimizations
Use the old value of counter, and prepare a value for next time (taking the calculation off the critical path).

Use a pointer instead of a counter. Saves a couple instructions, at the cost of using 8 bytes instead of 1 or 4 of cache footprint for the variable. (uint8_t counter compiles nicely with some versions, just using movzx to 64-bit.)

This counts upward, so the table can be in order. It increments after loading, taking that logic off the critical-path dependency chain for out-of-order execution.
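A hedged sketch of what the pointer version could look like (illustrative code with assumed names; not necessarily the exact source on Godbolt):

    #include <stdint.h>
    #define MAX_VAL 240
    extern uint32_t table[MAX_VAL];
    extern void output(uint32_t v);
    static const uint32_t *ptr = table;     /* 8 bytes of state instead of 1 or 4 */

    void isr_pointer(void)
    {
        uint32_t v = *ptr;                               /* load first, via a simple [reg] addressing mode */
        const uint32_t *next = ptr + 1;                  /* increment after loading */
        ptr = (next == table + MAX_VAL) ? table : next;  /* branchless wrap for next time */
        output(v);
    }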
This compiles really nicely with both gcc and clang, especially if you're using -fPIE, so the compiler needs the table address in a register anyway.

Again, 8 uops (assuming cmov is 1 uop). This has even lower latency than the counter version possibly could, because the [rax] addressing mode (with RAX coming from a load) has 1 cycle lower latency than an indexed addressing mode on Sandybridge-family. With no displacement, it never suffers the penalty described in Is there a penalty when base+offset is in a different page than the base?

Or (with a counter) count down towards zero: that potentially saves an instruction if the compiler can use the flags set by the decrement to detect the value becoming negative. Or we can always use the decremented value and do the wrap-around after that: so when counter is 1, we'd use table[--counter] (table[0]), but store counter=MAX_VAL
. Again, takes the wrap check off the critical path.If you wanted a branch version, you'd want it to branch on the carry flag, because
sub eax,1 / jc can macro-fuse into 1 uop, but sub eax,1 / js can't macro-fuse on Sandybridge-family. x86_64 - Assembly - loop conditions and out of order. But with branchless code it's fine: cmovs (mov if the sign flag is set, i.e. if the last result was negative) is just as efficient as cmovc (mov if the carry flag is set).

It was tricky to get gcc to use the flag result from dec or sub without also doing a cdqe to sign-extend the index to pointer width. I guess I could use an intptr_t counter, but that would be silly; might as well just use a pointer. With an unsigned counter, gcc and clang both want to do another cmp eax, 239 or something after the decrement, even though the flags are already set just fine by the decrement. But we can get gcc to use SF by checking (int)counter < 0, for example:
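Something along these lines (a hedged, illustrative reconstruction; the exact Godbolt source isn't reproduced here):

    #include <stdint.h>
    #define MAX_VAL 240
    extern uint32_t table[MAX_VAL];
    extern void output(uint32_t v);
    static unsigned counter = MAX_VAL - 1;

    void isr_countdown(void)
    {
        uint32_t v = table[counter];    /* load with the current value; not gated by the wrap check */
        counter--;                      /* the sub/dec sets SF if the value went negative */
        if ((int)counter < 0)           /* reuses SF instead of another cmp; typically a cmov, not a branch */
            counter = MAX_VAL - 1;      /* wrap for next time, off the critical path */
        output(v);
    }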
Still 8 uops (should be 7), but no extra latency on the critical path. So all of the extra decrement and wrap instructions are juicy instruction-level parallelism for out-of-order execution.