C++ – Why is Iterating Through std::vector Faster Than std::array?

benchmarkingc++performance

I recently asked this question:
Why is iterating an std::array much faster than iterating an std::vector?

As people quickly pointed out, my benchmark had many flaws. So as I was trying to fix my benchmark, I noticed that std::vector wasn't slower than std::array and, in fact, it was quite the opposite.

#include <vector>
#include <array>
#include <stdio.h>
#include <chrono>

using namespace std;

constexpr int n = 100'000'000;
vector<int> v(n);
//array<int, n> v;

int main()
{
    int res = 0;
    auto start = chrono::steady_clock::now();
    for(int x : v)
        res += x;
    auto end = chrono::steady_clock::now();
    auto diff = end - start;
    double elapsed =
        std::chrono::duration_cast<
            std::chrono::duration<double, std::milli>
        >(end - start).count();
    printf("result: %d\ntime: %f\n", res, elapsed);
}

Things I've tried to improve from my previous benchmark:

  • Made sure I'm using the result, so the whole loop is not optimized away
  • Using -O3 flag for speed
  • Use std::chrono instead of the time command. That's so we can isolate the part we want to measure (just the for loop). Static initialization of variables and things like that won't be measured.

The measured times:

array:

$ g++ arrVsVec.cpp -O3
$ ./a.out
result: 0
time: 99.554109

vector:

$ g++ arrVsVec.cpp -O3
$ ./a.out
result: 0
time: 30.734491

I'm just wondering what I'm doing wrong this time.

Watch the disassembly in godbolt

Best Answer

The difference is due to memory pages of array not being resident in process address space (global scope array is stored in .bss section of the executable that hasn't been paged in, it is zero-initialized). Whereas vector has just been allocated and zero-filled, so its memory pages are already present.

If you add

std::fill_n(v.data(), n, 1); // included in <algorithm>

as the first line of main to bring the pages in (pre-fault), that makes array time the same as that of vector.


On Linux, instead of that, you can do mlock(v.data(), v.size() * sizeof(v[0])); to bring the pages into the address space. See man mlock for full details.

Related Question