There was a major bug in the timing function I used for previous benchmarks. This grossly underestimated the bandwidth without vectorization as well as other measurements. Additionally, there was another problem that was overestimating the bandwidth due to COW on the array that was read but not written to. Finally, the maximum bandwidth I used was incorrect. I have updated my answer with the corrections and I have left the old answer at the end of this answer.
Your operation is memory bandwidth bound. This means the CPU is spending most of its time waiting on slow memory reads and writes. An excellent explanation for this can be found here: Why vectorizing the loop does not have performance improvement.
However, I have to disagree slightly with one statement in that answer.
So regardless of how it's optimized, (vectorized, unrolled, etc...) it isn't gonna get much faster.
In fact, vectorization, unrolling, and multiple threads can significantly increase the bandwidth even in memory bandwidth bound operations. The reason is that it is difficult to obtain the maximum memory bandwidth. A good explanation for this can be found here: https://stackoverflow.com/a/25187492/2542702.
The rest of my answer will show how vectorization and multiple threads can get closer to the maximum memory bandwidth.
My test system: Ubuntu 16.10, Skylake ([email protected]), 32GB RAM, dual channel DDR4@2400 GHz. The maximum bandwidth from my system is 38.4 GB/s.
From the code below I produce the following tables. I set the number of thread using OMP_NUM_THREADS e.g. export OMP_NUM_THREADS=4
. The efficiency is bandwidth/max_bandwidth
.
-O2 -march=native -fopenmp
Threads Efficiency
1 59.2%
2 76.6%
4 74.3%
8 70.7%
-O2 -march=native -fopenmp -funroll-loops
1 55.8%
2 76.5%
4 72.1%
8 72.2%
-O3 -march=native -fopenmp
1 63.9%
2 74.6%
4 63.9%
8 63.2%
-O3 -march=native -fopenmp -mprefer-avx128
1 67.8%
2 76.0%
4 63.9%
8 63.2%
-O3 -march=native -fopenmp -mprefer-avx128 -funroll-loops
1 68.8%
2 73.9%
4 69.0%
8 66.8%
After several iterations of running due to uncertainties in the measurements I have formed the following conclusions:
- single threaded scalar operations get more than 50% of the bandwidth.
- two threaded scalar operations get the highest bandwidth.
- single threaded vector operations are faster than single threaded scalar operations.
- single threaded SSE operations are faster than single threaded AVX operations.
- unrolling is not helpful.
- unrolling single-threaded operations is slower than without unrolling.
- more threads than cores (Hyper-Threading) gives a lower bandwidth.
The solution that gives the best bandwidth is scalar operations with two threads.
The code I used to benchmark:
#include <stdlib.h>
#include <string.h>
#include <stdio.h>
#include <omp.h>
#define N 10000000
#define R 100
void mul(double *a, double *b) {
#pragma omp parallel for
for (int i = 0; i<N; i++) a[i] *= b[i];
}
int main() {
double maxbw = 2.4*2*8; // 2.4GHz * 2-channels * 64-bits * 1-byte/8-bits
double mem = 3*sizeof(double)*N*R*1E-9; // GB
double *a = (double*)malloc(sizeof *a * N);
double *b = (double*)malloc(sizeof *b * N);
//due to copy-on-write b must be initialized to get the correct bandwidth
//also, GCC will convert malloc + memset(0) to calloc so use memset(1)
memset(b, 1, sizeof *b * N);
double dtime = -omp_get_wtime();
for(int i=0; i<R; i++) mul(a,b);
dtime += omp_get_wtime();
printf("%.2f s, %.1f GB/s, %.1f%%\n", dtime, mem/dtime, 100*mem/dtime/maxbw);
free(a), free(b);
}
The old solution with the timing bug
The modern solution for inline assembly is to use intrinsics. There are still cases where one needs inline assembly but this is not one of them.
One intrinsics solution for you inline assembly approach is simply:
void mul_SSE(double* a, double* b) {
for (int i = 0; i<N/2; i++)
_mm_store_pd(&a[2*i], _mm_mul_pd(_mm_load_pd(&a[2*i]),_mm_load_pd(&b[2*i])));
}
Let me define some test code
#include <x86intrin.h>
#include <string.h>
#include <stdio.h>
#include <x86intrin.h>
#include <omp.h>
#define N 1000000
#define R 1000
typedef __attribute__(( aligned(32))) double aligned_double;
void (*fp)(aligned_double *a, aligned_double *b);
void mul(aligned_double* __restrict a, aligned_double* __restrict b) {
for (int i = 0; i<N; i++) a[i] *= b[i];
}
void mul_SSE(double* a, double* b) {
for (int i = 0; i<N/2; i++) _mm_store_pd(&a[2*i], _mm_mul_pd(_mm_load_pd(&a[2*i]),_mm_load_pd(&b[2*i])));
}
void mul_SSE_NT(double* a, double* b) {
for (int i = 0; i<N/2; i++) _mm_stream_pd(&a[2*i], _mm_mul_pd(_mm_load_pd(&a[2*i]),_mm_load_pd(&b[2*i])));
}
void mul_SSE_OMP(double* a, double* b) {
#pragma omp parallel for
for (int i = 0; i<N; i++) a[i] *= b[i];
}
void test(aligned_double *a, aligned_double *b, const char *name) {
double dtime;
const double mem = 3*sizeof(double)*N*R/1024/1024/1024;
const double maxbw = 34.1;
dtime = -omp_get_wtime();
for(int i=0; i<R; i++) fp(a,b);
dtime += omp_get_wtime();
printf("%s \t time %.2f s, %.1f GB/s, efficency %.1f%%\n", name, dtime, mem/dtime, 100*mem/dtime/maxbw);
}
int main() {
double *a = (double*)_mm_malloc(sizeof *a * N, 32);
double *b = (double*)_mm_malloc(sizeof *b * N, 32);
//b must be initialized to get the correct bandwidth!!!
memset(a, 1, sizeof *a * N);
memset(b, 1, sizeof *a * N);
fp = mul, test(a,b, "mul ");
fp = mul_SSE, test(a,b, "mul_SSE ");
fp = mul_SSE_NT, test(a,b, "mul_SSE_NT ");
fp = mul_SSE_OMP, test(a,b, "mul_SSE_OMP");
_mm_free(a), _mm_free(b);
}
Now the first test
g++ -O2 -fopenmp test.cpp
./a.out
mul time 1.67 s, 13.1 GB/s, efficiency 38.5%
mul_SSE time 1.00 s, 21.9 GB/s, efficiency 64.3%
mul_SSE_NT time 1.05 s, 20.9 GB/s, efficiency 61.4%
mul_SSE_OMP time 0.74 s, 29.7 GB/s, efficiency 87.0%
So with -O2
which does not vectorize loops we see that the intrinsic SSE version is much faster than the plain C solution mul
. efficiency = bandwith_measured/max_bandwidth
where the max is 34.1 GB/s for my system.
Second test
g++ -O3 -fopenmp test.cpp
./a.out
mul time 1.05 s, 20.9 GB/s, efficiency 61.2%
mul_SSE time 0.99 s, 22.3 GB/s, efficiency 65.3%
mul_SSE_NT time 1.01 s, 21.7 GB/s, efficiency 63.7%
mul_SSE_OMP time 0.68 s, 32.5 GB/s, efficiency 95.2%
With -O3
vectorizes the loop and the intrinsic function offers essentially no advantage.
Third test
g++ -O3 -fopenmp -funroll-loops test.cpp
./a.out
mul time 0.85 s, 25.9 GB/s, efficency 76.1%
mul_SSE time 0.84 s, 26.2 GB/s, efficency 76.7%
mul_SSE_NT time 1.06 s, 20.8 GB/s, efficency 61.0%
mul_SSE_OMP time 0.76 s, 29.0 GB/s, efficency 85.0%
With -funroll-loops
GCC unrolls the loops eight times and we see a significant improvement except for the non-temporal store solution and not real advantage for OpenMP solution.
Before unrolling the loop the assembly for mul
wiht -O3
is
xor eax, eax
.L2:
movupd xmm0, XMMWORD PTR [rsi+rax]
mulpd xmm0, XMMWORD PTR [rdi+rax]
movaps XMMWORD PTR [rdi+rax], xmm0
add rax, 16
cmp rax, 8000000
jne .L2
rep ret
With -O3 -funroll-loops
the assembly for mul
is:
xor eax, eax
.L2:
movupd xmm0, XMMWORD PTR [rsi+rax]
movupd xmm1, XMMWORD PTR [rsi+16+rax]
mulpd xmm0, XMMWORD PTR [rdi+rax]
movupd xmm2, XMMWORD PTR [rsi+32+rax]
mulpd xmm1, XMMWORD PTR [rdi+16+rax]
movupd xmm3, XMMWORD PTR [rsi+48+rax]
mulpd xmm2, XMMWORD PTR [rdi+32+rax]
movupd xmm4, XMMWORD PTR [rsi+64+rax]
mulpd xmm3, XMMWORD PTR [rdi+48+rax]
movupd xmm5, XMMWORD PTR [rsi+80+rax]
mulpd xmm4, XMMWORD PTR [rdi+64+rax]
movupd xmm6, XMMWORD PTR [rsi+96+rax]
mulpd xmm5, XMMWORD PTR [rdi+80+rax]
movupd xmm7, XMMWORD PTR [rsi+112+rax]
mulpd xmm6, XMMWORD PTR [rdi+96+rax]
movaps XMMWORD PTR [rdi+rax], xmm0
mulpd xmm7, XMMWORD PTR [rdi+112+rax]
movaps XMMWORD PTR [rdi+16+rax], xmm1
movaps XMMWORD PTR [rdi+32+rax], xmm2
movaps XMMWORD PTR [rdi+48+rax], xmm3
movaps XMMWORD PTR [rdi+64+rax], xmm4
movaps XMMWORD PTR [rdi+80+rax], xmm5
movaps XMMWORD PTR [rdi+96+rax], xmm6
movaps XMMWORD PTR [rdi+112+rax], xmm7
sub rax, -128
cmp rax, 8000000
jne .L2
rep ret
Fourth test
g++ -O3 -fopenmp -mavx test.cpp
./a.out
mul time 0.87 s, 25.3 GB/s, efficiency 74.3%
mul_SSE time 0.88 s, 24.9 GB/s, efficiency 73.0%
mul_SSE_NT time 1.07 s, 20.6 GB/s, efficiency 60.5%
mul_SSE_OMP time 0.76 s, 29.0 GB/s, efficiency 85.2%
Now the non-intrinsic function is the fastest (excluding the OpenMP version).
So there is no reason to use intrinsics or inline assembly in this case because we can get the best performance with appropriate compiler options (e.g. -O3
, -funroll-loops
, -mavx
).
Test system: Ubuntu 16.10, Skylake ([email protected]), 32GB RAM. Maximum memory bandwidth (34.1 GB/s) https://ark.intel.com/products/88967/Intel-Core-i7-6700HQ-Processor-6M-Cache-up-to-3_50-GHz
Here is another solution worth considering. The cmp
instruction is not necessary if we count from -N up to zero and access the arrays as N+i
. GCC should have fixed this a long time ago. It eliminates one instruction (though due to macro-op fusion the cmp and jmp often count as one micro-op).
void mul_SSE_v2(double* a, double* b) {
for (ptrdiff_t i = -N; i<0; i+=2)
_mm_store_pd(&a[N + i], _mm_mul_pd(_mm_load_pd(&a[N + i]),_mm_load_pd(&b[N + i])));
Assembly with -O3
mul_SSE_v2(double*, double*):
mov rax, -1000000
.L9:
movapd xmm0, XMMWORD PTR [rdi+8000000+rax*8]
mulpd xmm0, XMMWORD PTR [rsi+8000000+rax*8]
movaps XMMWORD PTR [rdi+8000000+rax*8], xmm0
add rax, 2
jne .L9
rep ret
}
This optimization will only possibly be helpful the arrays fit e.g. the L1 cache i.e. not reading from main memory.
I finally found a way to get the plain C solution to not generate the cmp
instruction.
void mul_v2(aligned_double* __restrict a, aligned_double* __restrict b) {
for (int i = -N; i<0; i++) a[i] *= b[i];
}
And then call the function from a separate object file like this mul_v2(&a[N],&b[N])
so this is perhaps the best solution. However, if you call the function from the same object file (translation unit) as the one it's defined in the GCC generates the cmp
instruction again.
Also,
void mul_v3(aligned_double* __restrict a, aligned_double* __restrict b) {
for (int i = -N; i<0; i++) a[N+i] *= b[N+i];
}
still generates the cmp
instruction and generates the same assembly as the mul
function.
The function mul_SSE_NT
is silly. It uses non-temporal stores which are only useful when only writing to memory but since the function reads and writes to the same address non-temporal stores are not only useless they give inferior results.
Previous versions of this answer were getting the wrong bandwidth. The reason was when the arrays were not initialized.
Best Answer
This effect only happens at
-O0
(or withvolatile
), and is a result of the compiler keeping your variables in memory (not registers). You'd expect that to just introduce a fixed amount of extra latency into a loop-carried dependency chains throughi
,x
, andy
, but modern CPUs are not that simple.On Intel Sandybridge-family CPUs, store-forwarding latency is lower when the load uop runs some time after the store whose data it's reloading, not right away. So an empty loop with the loop counter in memory is the worst case. I don't understand what CPU design choices could lead to that micro-architectural quirk, but it's a real thing.
This is basically a duplicate of Adding a redundant assignment speeds up code when compiled without optimization, at least for Intel Sandybridge-family CPUs.
This is is one of the major reasons why you shouldn't benchmark at
-O0
: the bottlenecks are different than in realistically optimized code. See Why does clang produce inefficient asm with -O0 (for this simple floating point sum)? for more about why compilers make such terrible asm on purpose.Micro-benchmarking is hard; you can only measure something properly if you can get compilers to emit realistically optimized asm loops for the thing you're trying to measure. (And even then you're only measuring throughput or latency, not both; those are separate things for single operations on out-of-order pipelined CPUs: What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand?)
See @rcgldr's answer for measurement + explanation of what would happen with loops that keep variables in registers.
With clang,
benchmark::DoNotOptimize(x1 += 31)
also de-optimizes into keepingx
in memory, but with GCC it does just stay in a register. Unfortunately @SashaKnorre's answer used clang on QuickBench, not gcc, to get results similar to your-O0
asm. It does show the cost of lots of short-NOPs being hidden by the bottleneck through memory, and a slight speedup when those NOPs delay the reload next iteration just long enough for store-forwarding to hit the lower latency good case. (QuickBench I think runs on Intel Xeon server CPUs, with the same microarchitecture inside each CPU core as desktop version of the same generation.)Presumably all the x86 machines you tested on had Intel CPUs from the last 10 years, or else there's a similar effect on AMD. It's plausible there's a similar effect on whichever ARM CPU your RPi uses, if your measurements really were meaningful there. Otherwise, maybe another case of seeing what you expected (confirmation bias), especially if you tested with optimization enabled there.
So actually you didn't reproduce this effect for
-O1
or higher, you just saw what you wanted to see (confirmation bias) and mostly made up the claim that the effect was the same. If you'd accurately reported your data (measurable effect at-O0
, empty timed region at-O1
and higher), I could have answered right away.See Idiomatic way of performance evaluation? - if your times don't increase linearly with increasing repeat count, you aren't measuring what you think you're measuring. Also, startup effects (like cold caches, soft page faults, lazy dynamic linking, and dynamic CPU frequency) can easily lead to the first empty timed region being slower than the second.
I assume you only swapped the loops around when testing at
-O0
, otherwise you would have ruled out there being any effect at-O1
or higher with that test code.The loop with optimization enabled:
As you can see on Godbolt, gcc fully removes the loop with optimization enabled. Sometimes GCC leaves empty loops alone, like maybe it thinks the delay was intentional, but here it doesn't even loop at all. Time doesn't scale with anything, and both timed regions look the same like this:
So the only instruction in the timed region is saving
start
to a call-preserved register. You're measuring literally nothing about your source code.With Google Benchmark, we can get asm that doesn't optimize the work away, but which doesn't store/reload to introduce new bottlenecks:
I assume
benchmark::DoNotOptimize
is something likeasm volatile("" : "+rm"(x) )
(GNU C inline asm) to make the compiler materializex
in a register or memory, and to assume the lvalue has been modified by that empty asm statement. (i.e. forget anything it knew about the value, blocking constant-propagation, CSE, and whatever.) That would explain why clang stores/reloads to memory while GCC picks a register: this is a longstanding missed-optimization bug with clang's inline asm support. It likes to pick memory when given the choice, which you can sometimes work around with multi-alternative constraints like"+r,m"
. But not here; I had to just drop the memory alternative; we don't want the compiler to spill/reload to memory anyway.For GNU C compatible compilers, we can use
asm volatile
manually with only"+r"
register constraints to get clang to make good scalar asm (Godbolt), like GCC. We get an essentially identical inner loop, with 3 add instructions, the last one being anadd rbx, -1
/jnz
that can macro-fuse.All of these should run at 1 clock cycle per iteration on modern Intel and AMD CPUs, again see @rcgldr's answer.
Of course this also disables auto-vectorization with SIMD, which compilers would do in many real use cases. Or if you used the result at all outside the loop, it might optimize the repeated increment into a single multiply.
You can't measure the cost of the
+
operator in C++ - it can compile very differently depending on context / surrounding code. Even without considering loop-invariant stuff that hoists work. e.g.x + (y<<2) + 4
can compile to a single LEA instruction for x86.TL:DR: it's not the operations, it's the loop-carried dependency chain through memory that stops the CPU from running the loop at 1 clock cycle per iteration, doing all 3 adds in parallel on separate execution ports.
Note that the loop counter increment is just as much of an operation as what you're doing with
x
(and sometimesy
).