C GCC – Vectorization Difference Between Two Similar Loops

c++compiler-optimizationgccloopsvectorization

When compiling with gcc -O3, why does the following loop not vectorize (automatically):

#define SIZE (65536)

int a[SIZE], b[SIZE], c[SIZE];

int foo () {
  int i, j;

  for (i=0; i<SIZE; i++){
    for (j=i; j<SIZE; j++) {
      a[i] = b[i] > c[j] ? b[i] : c[j];
    }
  }
  return a[0];
}

when the following one does?

#define SIZE (65536)

int a[SIZE], b[SIZE], c[SIZE];

int foov () {
  int i, j;

  for (i=0; i<SIZE; i++){
    for (j=i; j<SIZE; j++) {
      a[i] += b[i] > c[j] ? b[i] : c[j];
    }
  }
  return a[0];
}

The only difference is whether the result of the expression in the inner loop is assigned to a[i], or added to a[i].

For reference -ftree-vectorizer-verbose=6 gives the following output for the first (non-vectorizing) loop.

v.c:8: note: not vectorized: inner-loop count not invariant.
v.c:9: note: Unknown alignment for access: c
v.c:9: note: Alignment of access forced using peeling.
v.c:9: note: not vectorized: live stmt not supported: D.2700_5 = c[j_20];

v.c:5: note: vectorized 0 loops in function.

And the same output for the loop that vectorizes is:

v.c:8: note: not vectorized: inner-loop count not invariant.
v.c:9: note: Unknown alignment for access: c
v.c:9: note: Alignment of access forced using peeling.
v.c:9: note: vect_model_load_cost: aligned.
v.c:9: note: vect_model_load_cost: inside_cost = 1, outside_cost = 0 .
v.c:9: note: vect_model_simple_cost: inside_cost = 1, outside_cost = 1 .
v.c:9: note: vect_model_reduction_cost: inside_cost = 1, outside_cost = 6 .
v.c:9: note: cost model: prologue peel iters set to vf/2.
v.c:9: note: cost model: epilogue peel iters set to vf/2 because peeling for alignment is unknown .
v.c:9: note: Cost model analysis:
  Vector inside of loop cost: 3
  Vector outside of loop cost: 27
  Scalar iteration cost: 3
  Scalar outside cost: 7
  prologue iterations: 2
  epilogue iterations: 2
  Calculated minimum iters for profitability: 8

v.c:9: note:   Profitability threshold = 7

v.c:9: note: Profitability threshold is 7 loop iterations.
v.c:9: note: LOOP VECTORIZED.
v.c:5: note: vectorized 1 loops in function.

Best Answer

In the first case: the code overwrites the same memory location a[i] in each iteration. This inherently sequentializes the loop as the loop iterations are not independent.
(In reality, only the final iteration is actually needed. So the entire inner loop could be taken out.)

In the second case: GCC recognizes the loop as a reduction operation - for which it has special case handling to vectorize.

Compiler vectorization is often implemented as some sort of "pattern matching". Meaning the compiler analyzes code to see if it fits a certain pattern that it's able to vectorize. If it does, it gets vectorized. If it doesn't, then it doesn't.

This seems to be a corner case where the first loop doesn't fit any of the pre-coded patterns that GCC can handle. But the second case fits the "vectorizable reduction" pattern.

Here's the relevant part of GCC's source code that spits out that "not vectorized: live stmt not supported: " message:

http://svn.open64.net/svnroot/open64/trunk/osprey-gcc-4.2.0/gcc/tree-vect-analyze.c

if (STMT_VINFO_LIVE_P (stmt_info))
{
    ok = vectorizable_reduction (stmt, NULL, NULL);

    if (ok)
        need_to_vectorize = true;
    else
        ok = vectorizable_live_operation (stmt, NULL, NULL);

    if (!ok)
    {
        if (vect_print_dump_info (REPORT_UNVECTORIZED_LOOPS))
        {
            fprintf (vect_dump, 
                "not vectorized: live stmt not supported: ");
            print_generic_expr (vect_dump, stmt, TDF_SLIM);
        }
        return false;
    }
}

From just the line:

vectorizable_reduction (stmt, NULL, NULL);

It's clear that GCC is checking to see if it matches a "vectorizable reduction" pattern.

Related Solutions

C++ – Why GCC Does Not Auto-Vectorize This Loop

I may have slightly botched the details, but this is the way you need to restructure your loop to get it to vectorize. The trick is to precompute the number of iterations and iterate from 0 to one short of that number. Do not change the for statement. You may need to fix the two lines before it and the two lines at the top of the loop. They're approximately right. ;)

const unsigned it=(b-a)-1;
const unsigned diff=b-a;
for (unsigned n = 0; n < it; n++)
{
    float d_an = path_res * static_cast<float>(n);
    float d_nb = path_res * static_cast<float>(diff - n);

    float h = elevation_path[n] + (d_an * d_nb) / (2.0f * r_e) - (h_a * d_nb + h_b * d_an) / d_ab;
    float v = h * sqrt((2.0f * d_ab) / (wavelength * d_an * d_nb));

    diff_path[n] = v;
}

C++ – For Loop Performance Difference and Compiler Optimization

My immediate guess would be that the second is faster, not because of the changes you made to the loop, but because it's second, so the cache is already primed when it runs.

To test the theory, I re-arranged your code to reverse the order in which the two calculations were called:

#include <cstdint>
#include <chrono>
#include <cstdio>

using std::uint64_t;

uint64_t superCalculationA(int init, int end)
{
    uint64_t total = 0;
    for (int i = init; i < end; i++)
        total += i;
    return total;
}

uint64_t superCalculationB(int init, int todo)
{
    uint64_t total = 0;
    for (int i = init; i < init + todo; i++)
        total += i;
    return total;
}

int main()
{
    const uint64_t answer = 500000110500000000;

    std::chrono::time_point<std::chrono::high_resolution_clock> start, end;
    double elapsed;

    std::printf("=====================================================\n");

    start = std::chrono::high_resolution_clock::now();
    uint64_t ret2 = superCalculationB(111, 1000000000);
    end = std::chrono::high_resolution_clock::now();
    elapsed = (end - start).count() * ((double) std::chrono::high_resolution_clock::period::num / std::chrono::high_resolution_clock::period::den);
    std::printf("Elapsed time: %.3f s | %.3f ms | %.3f us\n", elapsed, 1e+3*elapsed, 1e+6*elapsed);

    start = std::chrono::high_resolution_clock::now();
    uint64_t ret1 = superCalculationA(111, 1000000111);
    end = std::chrono::high_resolution_clock::now();
    elapsed = (end - start).count() * ((double) std::chrono::high_resolution_clock::period::num / std::chrono::high_resolution_clock::period::den);
    std::printf("Elapsed time: %.3f s | %.3f ms | %.3f us\n", elapsed, 1e+3*elapsed, 1e+6*elapsed);

    if (ret1 == answer)
    {
        std::printf("The first method, i.e. superCalculationA, succeeded.\n");
    }
    if (ret2 == answer)
    {
        std::printf("The second method, i.e. superCalculationB, succeeded.\n");
    }

    std::printf("=====================================================\n");

    return 0;
}

The result I got was:

=====================================================
Elapsed time: 0.286 s | 286.000 ms | 286000.000 us
Elapsed time: 0.271 s | 271.000 ms | 271000.000 us
The first method, i.e. superCalculationA, succeeded.
The second method, i.e. superCalculationB, succeeded.
=====================================================

So, when version A runs first, it's slower. When version B run's first, it's slower.

To confirm, I added an extra call to superCalculationB before doing the timing on either version A or B. After that, I tried running the program three times. For those three runs, I'd judge the results a tie (version A was faster once and version B was faster twice, but neither won dependably nor by a wide enough margin to be meaningful).

That doesn't prove that it's actually a cache situation as such, but does give a pretty strong indication that it's a matter of the order in which the functions are called, not the difference in the code itself.

As far as what the compiler does to make the code faster: the main thing it does is unroll a few iterations of the loop. We can get pretty much the same effect if we unroll a few iterations by hand:

uint64_t superCalculationC(int init, int end)
{
    int f_end = end - ((end - init) & 7);

    int i;
    uint64_t total = 0;
    for (i = init; i < f_end; i += 8) {
        total += i;
        total += i + 1;
        total += i + 2;
        total += i + 3;
        total += i + 4;
        total += i + 5;
        total += i + 6;
        total += i + 7;
    }

    for (; i < end; i++)
        total += i;

    return total;
}

This has a property that some might find rather odd: it's actually faster when compiled with -O2 than with -O3. When compiled with -O2, it's also about five times faster than either of the other two are when compiled with -O3.

The primary reason for the ~5x speed gain compared to the compiler's loop unrolling is that we've unrolled the loop somewhat differently (and more intelligently, IMO) than the compiler does. We compute f_end to tell us how many times the unrolled loop should execute. We execute those iterations, then we execute a separate loop to "clean up" any odd iterations at the end.

The compiler instead generates code that's roughly equivalent to something like this:

for (i = init; i < end; i += 8) {
    total += i;
    if (i + 1 >= end) break;
    total += i + 1;
    if (i + 2 >= end) break;
    total += i + 2;
    // ...
}

Although this is quite a bit faster than when the loop hasn't been unrolled at all, it's quite a bit faster still to eliminate those extra checks from the main loop, and execute a separate loop for any odd iterations.

Given such a trivial loop body being executed such a large number of times, you can also improve speed (when compiled with -O2) still further by unrolling more iterations of the loop. With 16 iterations unrolled, it was about twice as fast as the code above with 8 iterations unrolled:

uint64_t superCalculationC(int init, int end)
{
    int first_end = end - ((end - init) & 0xf);

    int i;
    uint64_t total = 0;
    for (i = init; i < first_end; i += 16) {
        total += i + 0;
        total += i + 1;
        total += i + 2;

        // code for `i+3` through `i+13` goes here

        total += i + 14;
        total += i + 15;
    }

    for (; i < end; i++)
        total += i;

    return total;
}

I haven't tried to explore the limit of gains from unrolling this particular loop, but unrolling 32 iterations nearly doubles the speed again. Depending on the processor you're using, you might get some small gains by unrolling 64 iterations, but I'd guess we're starting to approach the limits--at some point, performance gains will probably level off, then (if you unroll still more iterations) probably drop off, quite possibly dramatically.

Summary: with -O3 the compiler unrolls a number of iterations of the loop. This is extremely effective in this case, primarily because we have many executions of nearly the most trivial possible loop body. Unrolling the loop by hand is even more effective than letting the compiler do it--we can unroll more intelligently, and we can simply unroll more iterations than the compiler does. The extra intelligence can give us an improvement of around 5:1, and the extra iterations another 4:1 or so¹ (at the expense of somewhat longer, slightly less readable code).

Final caveat: as always with optimization, your mileage may vary. Differences in compilers and/or processors mean you're likely to get at least somewhat different results than I did. I'd expect my hand-unrolled loop to be substantially faster than the other two in most cases, but exactly how much faster is likely to vary.

^{1. But note that this is comparing the hand-unrolled loop with -O2 to the original loop with -O3. When compiled with -O3, the hand-unrolled loop runs much more slowly.}

Best Answer

Related Solutions

C++ – Why GCC Does Not Auto-Vectorize This Loop

C++ – For Loop Performance Difference and Compiler Optimization

Related Question