C++ Performance – Speedup from Keeping Operand in Register

Tags: assembly, c++, optimization, performance

I have been trying to get an idea of the impact of having an array in L1 cache versus memory, by timing a routine that scales and sums the elements of an array using the following code. (I am aware that I could just scale the result by 'a' at the end; the point is to do both a multiply and an add within the loop, and so far the compiler hasn't figured out that it can factor 'a' out of the loop.)

double sum(double a,double* X,int size)
{
    double total = 0.0;
    for(int i = 0;  i < size; ++i)
    {
        total += a*X[i];
    }
    return total;
}

#define KB 1024
int main()
{
    //Approximately half the L1 cache size of my machine
    int operand_size = (32*KB)/(sizeof(double)*2);
    printf("Operand size: %d\n", operand_size);
    double* X = new double[operand_size];
    fill(X,operand_size);

    double seconds = timer();
    double result;
    int n_iterations = 100000;
    for(int i = 0; i < n_iterations; ++i)
    {
        result = sum(3.5,X,operand_size);
        //result += rand();  
    }
    seconds = timer() - seconds; 

    double mflops = 2e-6*double(n_iterations*operand_size)/seconds;
    printf("Vector size %d: mflops=%.1f, result=%.1f\n",operand_size,mflops,result);
    return 0;
}

Note that the timer() and fill() routines are not included for brevity; their full source can be found here if you want to run the code:

http://codepad.org/agPWItZS

Now, here is where it gets interesting. This is the output:

Operand size: 2048
Vector size 2048: mflops=588.8, result=-67.8

This is totally un-cached performance, despite the fact that all elements of X should be held in cache between loop iterations. Looking at the assembly code generated by:

g++ -O3 -S -fno-asynchronous-unwind-tables register_opt_example.cpp

I notice one oddity in the sum function loop:

L55:
    movsd   (%r12,%rax,8), %xmm0
    mulsd   %xmm1, %xmm0
    addsd   -72(%rbp), %xmm0
    movsd   %xmm0, -72(%rbp)
    incq    %rax
    cmpq    $2048, %rax
    jne L55

The instructions:

    addsd   -72(%rbp), %xmm0
    movsd   %xmm0, -72(%rbp)

indicate that it is storing the value of "total" from sum() on the stack, reading and writing it on every loop iteration. I modified the assembly so that this operand is kept in a register:

...
addsd   %xmm0, %xmm3
...

This small change creates a huge performance boost:

Operand size: 2048
Vector size 2048: mflops=1958.9, result=-67.8

tl;dr
My question is: why does replacing a single memory-location access with a register speed up the code so much, given that the single location should be stored in L1 cache? What architectural factors make this possible? It seems very strange that repeatedly writing one stack location would completely destroy the effectiveness of a cache.

Appendix

My gcc version is:

Target: i686-apple-darwin10
Configured with: /var/tmp/gcc/gcc-5646.1~2/src/configure --disable-checking --enable-werror --prefix=/usr --mandir=/share/man --enable-languages=c,objc,c++,obj-c++ --program-transform-name=/^[cg][^.-]*$/s/$/-4.2/ --with-slibdir=/usr/lib --build=i686-apple-darwin10 --with-gxx-include-dir=/include/c++/4.2.1 --program-prefix=i686-apple-darwin10- --host=x86_64-apple-darwin10 --target=i686-apple-darwin10
Thread model: posix
gcc version 4.2.1 (Apple Inc. build 5646) (dot 1)

My CPU is:

Intel Xeon X5650

Best Answer

It's likely a combination of a longer dependency chain and Load Misprediction*.


Longer Dependency Chain:

First, we identify the critical dependency paths. Then we look at the instruction latencies provided by: http://www.agner.org/optimize/instruction_tables.pdf (page 117)

In the unoptimized version, the critical dependency path is:

  • addsd -72(%rbp), %xmm0
  • movsd %xmm0, -72(%rbp)

Internally, it probably breaks up into:

  • load (2 cycles)
  • addsd (3 cycles)
  • store (3 cycles)

If we look at the optimized version, it's just:

  • addsd (3 cycles)

So you have 8 cycles vs. 3 cycles. Almost a factor of 3.

I'm not sure how sensitive the Nehalem processor line is to store-to-load dependencies, or how well it does store forwarding. But it's reasonable to believe the cost is not zero.


Load-store Misprediction:

Modern processors use prediction in more ways than you can imagine. The most famous of these is probably Branch Prediction. One of the lesser known ones is Load Prediction.

When a processor sees a load, it may issue it immediately, before all pending stores have finished, speculating that those stores will not overlap the loaded address.

If an earlier store turns out to conflict with the load, then the load must be re-executed and the computation rolled back to the point of the load (in much the same way a branch misprediction is rolled back).

How it is relevant here:

Needless to say, modern processors will be able to execute multiple iterations of this loop simultaneously. So the processor will be attempting to perform the load (addsd -72(%rbp), %xmm0) before it finishes the store (movsd %xmm0, -72(%rbp)) from the previous iteration.

The result? The previous iteration's store conflicts with the load, so you get a misprediction and a rollback.

*Note that I'm unsure of the name "Load Prediction". I only read about it in the Intel docs and they didn't seem to give it a name.
