C++ Performance – Speedup from Keeping Operand in Register

assemblyc++optimizationperformance

I have been trying to get an idea of the impact of having an array in L1 cache versus memory by timing a routine that scales and sums the elements of an array using the following code (I am aware that I should just scale the result by 'a' at the end; the point is to do both a multiply and an add within the loop – so far, the compiler hasn't figured out to factor out 'a'):

double sum(double a,double* X,int size)
{
    double total = 0.0;
    for(int i = 0;  i < size; ++i)
    {
        total += a*X[i];
    }
    return total;
}

#define KB 1024
int main()
{
    //Approximately half the L1 cache size of my machine
    int operand_size = (32*KB)/(sizeof(double)*2);
    printf("Operand size: %d\n", operand_size);
    double* X = new double[operand_size];
    fill(X,operand_size);

    double seconds = timer();
    double result;
    int n_iterations = 100000;
    for(int i = 0; i < n_iterations; ++i)
    {
        result = sum(3.5,X,operand_size);
        //result += rand();  
    }
    seconds = timer() - seconds; 

    double mflops = 2e-6*double(n_iterations*operand_size)/seconds;
    printf("Vector size %d: mflops=%.1f, result=%.1f\n",operand_size,mflops,result);
    return 0;
}

Note that the timer() and fill() routines are not included for brevity; their full source can be found here if you want to run the code:

http://codepad.org/agPWItZS

Now, here is where it gets interesting. This is the output:

Operand size: 2048
Vector size 2048: mflops=588.8, result=-67.8

This is totally un-cached performance, despite the fact that all elements of X should be held in cache between loop iterations. Looking at the assembly code generated by:

g++ -O3 -S -fno-asynchronous-unwind-tables register_opt_example.cpp

I notice one oddity in the sum function loop:

L55:
    movsd   (%r12,%rax,8), %xmm0
    mulsd   %xmm1, %xmm0
    addsd   -72(%rbp), %xmm0
    movsd   %xmm0, -72(%rbp)
    incq    %rax
    cmpq    $2048, %rax
    jne L55

The instructions:

    addsd   -72(%rbp), %xmm0
    movsd   %xmm0, -72(%rbp)

indicate that it is storing the value of "total" in sum() on the stack, and reading and writing it at every loop iteration. I modified the assembly so that this operand is kept in a a register:

...
addsd   %xmm0, %xmm3
...

This small change creates a huge performance boost:

Operand size: 2048
Vector size 2048: mflops=1958.9, result=-67.8

tl;dr
My question is: why does replacing a single memory location access with a register, speed up the code so much, given that the single location should be stored in L1 cache? What architectural factors make this possible? It seems very strange that writing one stack location repeatedly would completely destroy the effectiveness of a cache.

Appendix

My gcc version is:

Target: i686-apple-darwin10
Configured with: /var/tmp/gcc/gcc-5646.1~2/src/configure --disable-checking --enable-werror --prefix=/usr --mandir=/share/man --enable-languages=c,objc,c++,obj-c++ --program-transform-name=/^[cg][^.-]*$/s/$/-4.2/ --with-slibdir=/usr/lib --build=i686-apple-darwin10 --with-gxx-include-dir=/include/c++/4.2.1 --program-prefix=i686-apple-darwin10- --host=x86_64-apple-darwin10 --target=i686-apple-darwin10
Thread model: posix
gcc version 4.2.1 (Apple Inc. build 5646) (dot 1)

My CPU is:

Intel Xeon X5650

Best Answer

It's likely a combination of a longer dependency chain, along with Load Misprediction*.

Longer Dependency Chain:

First, we identify the critical dependency paths. Then we look at the instruction latencies provided by: http://www.agner.org/optimize/instruction_tables.pdf (page 117)

In the unoptimized version, the critical dependency path is:

addsd -72(%rbp), %xmm0
movsd %xmm0, -72(%rbp)

Internally, it probably breaks up into:

load (2 cycles)
addsd (3 cycles)
store (3 cycles)

If we look at the optimized version, it's just:

addsd (3 cycles)

So you have 8 cycles vs. 3 cycles. Almost a factor of 3.

I'm not sure how sensitive the Nehalem processor line is to store-load dependencies and how well it does forwarding. But it's reasonable to believe that it's not zero.

Load-store Misprediction:

Modern processors use prediction in more ways you can imagine. The most famous of these is probably Branch Prediction. One of the lesser known ones is Load Prediction.

When a processor sees a load, it will immediately load it before all pending writes finish. It will assume that those writes will not conflict with the loaded values.

If an earlier write turns out to conflict with a load, then the load must be re-executed and the computation rolled back to the point of the load. (in much the same way that branch mispredictions roll back)

How it is relevant here:

Needless to say, modern processors will be able to execute multiple iterations of this loop simultaneously. So the processor will be attempting to perform the load (addsd -72(%rbp), %xmm0) before it finishes the store (movsd %xmm0, -72(%rbp)) from the previous iteration.

The result? The previous store conflicts with the load - thus a misprediction and a roll back.

_{*Note that I'm unsure of the name "Load Prediction". I only read about it in the Intel docs and they didn't seem to give it a name.}

What

So @Peter Cordes was along the right lines in the comments: this is caused by stalls due to attempting to read a memory location less than three cycles after writing to it.

Here's a smaller program which produces the same effects, basically a minimal example of the program anove:

    .bss
    .align 256
a:
    .zero   4

    .text
    .p2align 8, 0
    .globl _start
_start:
    mov $0x80000000, %ecx
1:
    mov a(%rip), %edx
    mov %edx, a(%rip)
    .byte 0x90, 0x90, 0x90, 0x90, 0x90, 0x90, 0x90, 0x90
    add $1, %ecx
    jno 1b

    mov $60, %rax
    mov $0, %rdi
    syscall

The eight no-op instructions inside the loop body slow the loop down to the extent that it runs at the rate of exactly one instruction per three cycles, causing the loop to run in around 3½ seconds (with the exact amount based on frequency scaling). If you remove them, the loop takes more like five seconds to run.

Why

Looking at the small program above

That answers the question of what the trigger is, but it still doesn't really explain what's actually going on to cause the issue, or why it creates such a large performance deviation. The key clue to this was looking at the number of µops dispatched on each port. Here are the numbers for the minimized program above, together with a few other interesting statistics:

          slow             fast
 1,156,849,106    1,426,795,253      uops_dispatched_port.port_0:u
 1,386,792,998    1,432,967,110      uops_dispatched_port.port_1:u
 3,635,449,011    3,582,057,598      uops_dispatched_port.port_2:u
 4,169,272,248    5,003,425,117      uops_dispatched_port.port_3:u
18,538,282,061    6,427,027,115      uops_dispatched_port.port_4:u
 1,763,689,624    1,435,813,046      uops_dispatched_port.port_5:u
 4,299,495,754    4,286,287,760      uops_dispatched_port.port_6:u
   784,337,695       12,429,585      uops_dispatched_port.port_7:u
19,393,649,837   12,969,389,151      cpu-cycles:u
17,197,960,860   51,654,810,808      uops_issued.any:u
21,677,912,988   21,641,546,006      uops_executed.core:u
   5.245720659      3.491356466      seconds time elapsed
   5.220800000      3.483496000      seconds user

The only difference between the two programs is that the "fast" program is running an extra eight nops per iteration; with 2³² iterations, that's an extra 34359738368 nops. At the µop level, a nop is a µop that gets issued and then just disappears, so to get a fair comparison between the program, we can subtract 34,359,738,368 from 51,654,810,808 to get an "adjusted" count of 17,295,072,440 µops issued for the "fast" program.

This reveals some pretty strange results when it comes to the µop lifecycle. After removing the nops, the number of µops issued is pretty much the same between the two programs – unsurprising, because they have the same code in that case. The number of µops executed is also pretty much the same, which is again unsurprising. (The issue and execution counts don't match each other because µops have a tendency to be split up, fused, rearranged etc. during execution, but this should affect both programs in the same way.)

On the other hand, the µop dispatch counts don't seem to match at all, which initially seems like it should be impossible. The trick is that, despite the naming, the performance counters for dispatch aren't counting µops directly – rather, they're counting the number of cycles that the port is busy handling a µop. So what the crazy mismatch on port 4 is showing is that the µops are getting "stuck" in port 4 and taking a while to execute (around 3 cycles, based on the statistics).

On the processor I'm using, port 4 is the only execution port capable of doing stores to memory (the MOV instruction itself takes up two ports; 2, 3 or 7 calculate the address, and 4 does the actual storing). I think the mental model I had of the processor oddity in question is that upon trying to read from a memory location less than 3 cycles after storing to it, the read would take longer than normal. However, the actual effects seem to be worse than that: the processor behaviour appears to be that if you attempt to read from a memory location less than 3 cycles after storing to it, all writes on the processor thread get blocked for 3 cycles.

Looking at the program in the question

Now let's look at the program in "Further experiments #2" above, in terms of port usage (a side note: for some reason I had to add an extra three nops to get the "fast" behaviour, so maybe there's been some extra optimisation in a microcode update):

          slow             fast
 8,881,288,565    3,278,012,510      uops_dispatched_port.port_0:u
 8,852,984,797    4,992,142,029      uops_dispatched_port.port_1:u
10,385,944,986   10,726,029,245      uops_dispatched_port.port_2:u
11,048,867,603   10,721,449,693      uops_dispatched_port.port_3:u
25,015,457,606   11,608,396,673      uops_dispatched_port.port_4:u
 9,255,886,026    5,102,583,196      uops_dispatched_port.port_5:u
11,265,935,146    7,295,609,724      uops_dispatched_port.port_6:u
 4,333,803,057    4,304,364,919      uops_dispatched_port.port_7:u
42,554,124,800   26,024,822,032      cpu-cycles:u
38,686,833,876  103,170,056,257      uops_issued.any:u
52,058,581,194   51,940,941,322      uops_executed.core:u
  11.504080112      7.024786187      seconds time elapsed
  11.488620000      7.018980000      seconds user

With fifteen nops being issued per iteration in the "fast" version, the number of non-nop µops issued there is 38,745,546,817. So we're seeing much the same results: not counting the nops, the µop issue and execution rate are the same between the two programs, but some µops are taking much longer to execute in the slow version and tying up the ports.

There's one small difference: the instruction that's reading memory too soon after writing it is now the read-modify-write instruction or $1, a(%rip). The write half of that needs to use both port 4 to do the writing, and a port capable of doing or (0, 1, 5, or 6) to calculate the value to write. So we now see that ports 0, 1, 5, and 6 are all showing as having stuck µops – the instruction is getting stuck due to the early read, and tying up its ALU port in addition to port 4 during the three-cycle "I can't write" period. The main consequence of this appears to be that the rest of the program slows down a little too; my guess at this (which is consistent with the numbers above but might be wrong anyway) is that when port 6 ends up getting stuck, the loop logic has to wait, because port 6 is the only port that can handle the add $1, %ebx; jno 1b macro-fused instruction required to return to the start of the loop. So the program is being slowed by 3 clock cycles per loop iteration, and sometimes more because the loop logic is jammed too.

The interesting effect, though, is how large the performance deviations are. With the small program above, port 4 was stuck for 3 cycles per loop iteration, but that didn't slow down the loop as a whole by that much, because port 4 throughput wasn't the limiting factor. With this more complex program, though, port 4 being stuck for 3 cycles per loop iteration does seem to slow down every loop iteration by 3 cycles (and a little more, perhaps due to port 6 being stuck too), so the port 4 jam is having a close to maximal effect on the program as a whole. It's still not entirely obvious why port 4 being jammed for 3 cycles is causing the program as a whole to stall for 3 cycles (this program isn't obviously limited on port 4 throughput either), but it's at least plausible that an effect like this could be occurring.

Anyway, to answer the main practical question in the OP, "In particular, is there a general rule that can be used to work out when adding an extra read to a program will make it run (this much) faster?": the rule is "on this processor, reading a memory address less than three cycles after writing it will jam write instructions for 3 cycles, which will in turn slow down your program by up to 3 cycles, maybe even a little more; you can avoid it by preventing writes that happen that quickly after a read". Unfortunately, with out-of-order processors, it can be hard to know when the processor will attempt to run instructions, so it can be hard to guarantee that two instructions won't run too quickly after each other; for programs like the one I was initially working on, most likely the main piece of practical advice is to avoid repeatedly reading and writing the same memory address in a tight loop (as opposed to storing the value in a register and repeatedly reading and writing from there – the original assembly code only ended up looking the way it did because I was calling a function through a function pointer, meaning that the compiler couldn't safely make this optimisation).

Best Answer

Related Solutions

Performance – Why Function Runs Faster with Extra Memory Read

What

Why

Looking at the small program above

Looking at the program in the question

Related Question