What
So @Peter Cordes was along the right lines in the comments: this is caused by stalls due to attempting to read a memory location less than three cycles after writing to it.
Here's a smaller program which produces the same effects – basically a minimal example of the program above:
.bss
.align 256
a:
.zero 4
.text
.p2align 8, 0
.globl _start
_start:
mov $0x80000000, %ecx
1:
mov a(%rip), %edx
mov %edx, a(%rip)
.byte 0x90, 0x90, 0x90, 0x90, 0x90, 0x90, 0x90, 0x90
add $1, %ecx
jno 1b
mov $60, %rax
mov $0, %rdi
syscall
The eight no-op instructions inside the loop body slow the loop down to the point where it runs at a rate of exactly one iteration per three cycles, causing it to run in around 3½ seconds (with the exact time depending on frequency scaling). If you remove them, the loop takes more like five seconds to run.
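As a quick sanity check on those timings (the ~3.7 GHz clock speed here is my own assumption; as noted above, the exact time depends on frequency scaling):

```python
# The loop counter starts at 0x80000000 and runs until signed overflow,
# i.e. 2**32 iterations in total.
iterations = 2**32

# At exactly one iteration per three cycles:
cycles = iterations * 3          # 12,884,901,888 cycles

# Assumed clock speed of ~3.7 GHz (hypothetical; the real value
# depends on frequency scaling).
seconds = cycles / 3.7e9
print(round(seconds, 2))         # close to the observed ~3.5 seconds
```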
Why
Looking at the small program above
That answers the question of what the trigger is, but it still doesn't really explain what's actually going on to cause the issue, or why it creates such a large performance deviation. The key clue to this was looking at the number of µops dispatched on each port. Here are the numbers for the minimized program above, together with a few other interesting statistics:
slow fast
1,156,849,106 1,426,795,253 uops_dispatched_port.port_0:u
1,386,792,998 1,432,967,110 uops_dispatched_port.port_1:u
3,635,449,011 3,582,057,598 uops_dispatched_port.port_2:u
4,169,272,248 5,003,425,117 uops_dispatched_port.port_3:u
18,538,282,061 6,427,027,115 uops_dispatched_port.port_4:u
1,763,689,624 1,435,813,046 uops_dispatched_port.port_5:u
4,299,495,754 4,286,287,760 uops_dispatched_port.port_6:u
784,337,695 12,429,585 uops_dispatched_port.port_7:u
19,393,649,837 12,969,389,151 cpu-cycles:u
17,197,960,860 51,654,810,808 uops_issued.any:u
21,677,912,988 21,641,546,006 uops_executed.core:u
5.245720659 3.491356466 seconds time elapsed
5.220800000 3.483496000 seconds user
The only difference between the two programs is that the "fast" program is running an extra eight nops per iteration; with 2^32 iterations, that's an extra 34,359,738,368 nops. At the µop level, a nop is a µop that gets issued and then just disappears, so to get a fair comparison between the programs, we can subtract 34,359,738,368 from 51,654,810,808 to get an "adjusted" count of 17,295,072,440 µops issued for the "fast" program.
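The adjustment arithmetic, spelled out (numbers taken from the table above):

```python
# The "fast" program issues eight extra nops on each of its 2**32 iterations.
iterations = 2**32
extra_nops = 8 * iterations                    # 34,359,738,368

fast_uops_issued = 51_654_810_808              # uops_issued.any:u, fast run
adjusted = fast_uops_issued - extra_nops
print(adjusted)                                # 17,295,072,440

# ...which is within about 0.6% of the slow run's issued-µop count.
slow_uops_issued = 17_197_960_860
print(adjusted - slow_uops_issued)
```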
This reveals some pretty strange results when it comes to the µop lifecycle. After removing the nops, the number of µops issued is pretty much the same between the two programs – unsurprising, because they have the same code in that case. The number of µops executed is also pretty much the same, which is again unsurprising. (The issue and execution counts don't match each other because µops have a tendency to be split up, fused, rearranged etc. during execution, but this should affect both programs in the same way.)
On the other hand, the µop dispatch counts don't seem to match at all, which initially seems like it should be impossible. The trick is that, despite the naming, the performance counters for dispatch aren't counting µops directly – rather, they're counting the number of cycles that the port is busy handling a µop. So what the crazy mismatch on port 4 is showing is that the µops are getting "stuck" in port 4 and taking a while to execute (around 3 cycles, based on the statistics).
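We can make the "around 3 cycles" estimate concrete: each loop iteration performs exactly one store, so dividing the port 4 busy-cycle counts from the table by the iteration count gives the average number of cycles each store keeps the port busy (my own arithmetic on the numbers above):

```python
iterations = 2**32   # one store per iteration in both runs

slow_port4 = 18_538_282_061   # uops_dispatched_port.port_4:u, slow
fast_port4 = 6_427_027_115    # uops_dispatched_port.port_4:u, fast

print(slow_port4 / iterations)                  # ~4.3 busy cycles per store
print(fast_port4 / iterations)                  # ~1.5 busy cycles per store
print((slow_port4 - fast_port4) / iterations)   # ~2.8 extra cycles: the ~3-cycle jam
```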
On the processor I'm using, port 4 is the only execution port capable of doing stores to memory (the store itself takes up two ports: port 2, 3 or 7 calculates the address, and port 4 does the actual storing). The mental model I had of the processor oddity in question was that reading from a memory location less than 3 cycles after storing to it would just make that read take longer than normal. However, the actual effect appears to be worse than that: if you attempt to read from a memory location less than 3 cycles after storing to it, all writes on that processor thread get blocked for 3 cycles.
Looking at the program in the question
Now let's look at the program in "Further experiments #2" above, in terms of port usage (a side note: for some reason I had to add an extra three nops to get the "fast" behaviour, so maybe there's been some extra optimisation in a microcode update):
slow fast
8,881,288,565 3,278,012,510 uops_dispatched_port.port_0:u
8,852,984,797 4,992,142,029 uops_dispatched_port.port_1:u
10,385,944,986 10,726,029,245 uops_dispatched_port.port_2:u
11,048,867,603 10,721,449,693 uops_dispatched_port.port_3:u
25,015,457,606 11,608,396,673 uops_dispatched_port.port_4:u
9,255,886,026 5,102,583,196 uops_dispatched_port.port_5:u
11,265,935,146 7,295,609,724 uops_dispatched_port.port_6:u
4,333,803,057 4,304,364,919 uops_dispatched_port.port_7:u
42,554,124,800 26,024,822,032 cpu-cycles:u
38,686,833,876 103,170,056,257 uops_issued.any:u
52,058,581,194 51,940,941,322 uops_executed.core:u
11.504080112 7.024786187 seconds time elapsed
11.488620000 7.018980000 seconds user
With fifteen nops being issued per iteration in the "fast" version, the number of non-nop µops issued there is 38,745,546,817. So we're seeing much the same results: not counting the nops, the µop issue and execution rates are the same between the two programs, but some µops are taking much longer to execute in the slow version and tying up the ports.
There's one small difference: the instruction that's reading memory too soon after writing it is now the read-modify-write instruction or $1, a(%rip). The write half of that needs to use both port 4 to do the writing, and a port capable of doing an or (0, 1, 5, or 6) to calculate the value to write. So we now see that ports 0, 1, 5, and 6 are all showing stuck µops – the instruction is getting stuck due to the early read, and tying up its ALU port in addition to port 4 during the three-cycle "I can't write" period. The main consequence of this appears to be that the rest of the program slows down a little too; my guess (which is consistent with the numbers above but might be wrong anyway) is that when port 6 ends up stuck, the loop logic has to wait, because port 6 is the only port that can handle the add $1, %ebx; jno 1b macro-fused instruction required to return to the start of the loop. So the program is being slowed by 3 clock cycles per loop iteration, and sometimes more because the loop logic is jammed too.
The interesting effect, though, is how large the performance deviations are. With the small program above, port 4 was stuck for 3 cycles per loop iteration, but that didn't slow down the loop as a whole by that much, because port 4 throughput wasn't the limiting factor. With this more complex program, though, port 4 being stuck for 3 cycles per loop iteration does seem to slow down every loop iteration by 3 cycles (and a little more, perhaps due to port 6 being stuck too), so the port 4 jam is having a close to maximal effect on the program as a whole. It's still not entirely obvious why port 4 being jammed for 3 cycles is causing the program as a whole to stall for 3 cycles (this program isn't obviously limited on port 4 throughput either), but it's at least plausible that an effect like this could be occurring.
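Dividing the cycle counts from the second table by the iteration count puts numbers on "3 cycles (and a little more)" per iteration (my own calculation, which assumes both runs execute the same 2^32 iterations, as the nop arithmetic above implies):

```python
iterations = 2**32

slow_cycles = 42_554_124_800
fast_cycles = 26_024_822_032

print(slow_cycles / iterations)                 # ~9.9 cycles per iteration
print(fast_cycles / iterations)                 # ~6.1 cycles per iteration
print((slow_cycles - fast_cycles) / iterations) # ~3.8 extra cycles per iteration
```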
Anyway, to answer the main practical question in the OP, "In particular, is there a general rule that can be used to work out when adding an extra read to a program will make it run (this much) faster?": the rule is "on this processor, reading a memory address less than three cycles after writing it will jam write instructions for 3 cycles, which will in turn slow down your program by up to 3 cycles, maybe even a little more; you can avoid it by preventing reads that happen that quickly after a write". Unfortunately, with out-of-order processors, it can be hard to know when the processor will attempt to run instructions, so it can be hard to guarantee that two instructions won't run too close together. For programs like the one I was initially working on, the main piece of practical advice is most likely to avoid repeatedly reading and writing the same memory address in a tight loop (as opposed to storing the value in a register and repeatedly reading and writing the register – the original assembly code only ended up looking the way it did because I was calling a function through a function pointer, meaning that the compiler couldn't safely make this optimisation).
Best Answer
It's likely a combination of a longer dependency chain, along with Load Misprediction*.
Longer Dependency Chain:
First, we identify the critical dependency paths. Then we look at the instruction latencies provided by: http://www.agner.org/optimize/instruction_tables.pdf (page 117)
In the unoptimized version, the critical dependency path is:
addsd -72(%rbp), %xmm0
movsd %xmm0, -72(%rbp)
Internally, it probably breaks up into something like:
load  xmm0 <- -72(%rbp)
addsd xmm0, xmm0
store -72(%rbp) <- xmm0
with the load, add, and store latencies on the critical path summing to around 8 cycles. If we look at the optimized version, the critical path is just:
addsd %xmm0, %xmm0
So you have 8 cycles vs. 3 cycles. Almost a factor of 3.
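Putting numbers on that comparison (the individual latencies below are my assumptions, chosen to be consistent with the 8-vs-3 totals and with the Nehalem figures in Agner Fog's tables; I haven't verified them against hardware):

```python
# Unoptimized: the value makes a round trip through memory every iteration,
# so the loop-carried dependency chain is load -> addsd -> store.
load_latency  = 2   # assumed L1 load-use latency
addsd_latency = 3   # ADDSD latency on Nehalem
store_latency = 3   # assumed store-to-load forwarding latency
unoptimized = load_latency + addsd_latency + store_latency
print(unoptimized)              # 8 cycles per iteration

# Optimized: the value stays in %xmm0, so only the addsd is on the chain.
optimized = addsd_latency
print(optimized)                # 3 cycles per iteration
print(unoptimized / optimized)  # ~2.7 – "almost a factor of 3"
```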
I'm not sure how sensitive the Nehalem processor line is to store-load dependencies and how well it does forwarding. But it's reasonable to believe that it's not zero.
Load-store Misprediction:
Modern processors use prediction in more ways than you can imagine. The most famous of these is probably Branch Prediction. One of the lesser-known ones is Load Prediction.
When a processor sees a load, it will perform it immediately, before all pending writes have finished. It assumes that those writes will not conflict with the loaded values.
If an earlier write turns out to conflict with a load, then the load must be re-executed and the computation rolled back to the point of the load. (in much the same way that branch mispredictions roll back)
How this is relevant here:
Needless to say, modern processors will be able to execute multiple iterations of this loop simultaneously. So the processor will be attempting to perform the load (addsd -72(%rbp), %xmm0) before it has finished the store (movsd %xmm0, -72(%rbp)) from the previous iteration. The result? The previous store conflicts with the load – thus a misprediction and a roll-back.
*Note that I'm unsure of the name "Load Prediction". I only read about it in the Intel docs and they didn't seem to give it a name.