Our C++ library currently uses time_t for storing time values. I'm beginning to need sub-second precision in some places, so a larger data type will be necessary there anyway. Also, it might be useful to get around the Year-2038 problem in some places. So I'm thinking about completely switching to a single Time class with an underlying int64_t value, to replace the time_t value in all places.
Now I'm wondering about the performance impact of such a change when running this code on a 32-bit operating system or 32-bit CPU. IIUC the compiler will generate code to perform 64-bit arithmetic using 32-bit registers. But if this is too slow, I might have to use a more differentiated way for dealing with time values, which might make the software more difficult to maintain.
What I'm interested in:
- which factors influence performance of these operations? Probably the compiler and compiler version; but does the operating system or the CPU make/model influence this as well? Will a normal 32-bit system use the 64-bit registers of modern CPUs?
- which operations will be especially slow when emulated on 32-bit? Or which will have nearly no slowdown?
- are there any existing benchmark results for using int64_t/uint64_t on 32-bit systems?
- does anyone have own experience about this performance impact?
I'm mostly interested in g++ 4.1 and 4.4 on Linux 2.6 (RHEL5, RHEL6) on Intel Core 2 systems; but it would also be nice to know about the situation for other systems (like Sparc Solaris + Solaris CC, Windows + MSVC).
Best Answer
Mostly the processor architecture (and model - please read model where I mention processor architecture in this section). The compiler may have some influence, but most compilers do pretty well on this, so the processor architecture will have a bigger influence than the compiler.
The operating system will have no influence whatsoever (other than "if you change OS, you need to use a different type of compiler which changes what the compiler does" in some cases - but that's probably a small effect).
This is not possible. If the system is in 32-bit mode, it will act as a 32-bit system, the extra 32-bits of the registers is completely invisible, just as it would be if the system was actually a "true 32-bit system".
Addition and subtraction, is worse as these have to be done in sequence of two operations, and the second operation requires the first to have completed - this is not the case if the compiler is just producing two add operations on independent data.
Mulitplication will get a lot worse if the input parameters are actually 64-bits - so 2^35 * 83 is worse than 2^31 * 2^31, for example. This is due to the fact that the processor can produce a 32 x 32 bit multiply into a 64-bit result pretty well - some 5-10 clockcycles. But a 64 x 64 bit multiply requires a fair bit of extra code, so will take longer.
Division is a similar problem to multiplication - but here it's OK to take a 64-bit input on the one side, divide it by a 32-bit value and get a 32-bit value out. Since it's hard to predict when this will work, the 64-bit divide is probably nearly always slow.
The data will also take twice as much cache-space, which may impact the results. And as a similar consequence, general assignment and passing data around will take twice as long as a minimum, since there is twice as much data to operate on.
The compiler will also need to use more registers.
Probably, but I'm not aware of any. And even if there are, it would only be somewhat meaningful to you, since the mix of operations is HIGHLY critical to the speed of operations.
If performance is an important part of your application, then benchmark YOUR code (or some representative part of it). It doesn't really matter if Benchmark X gives 5%, 25% or 103% slower results, if your code is some completely different amount slower or faster under the same circumstances.
I've recompiled some code that uses 64-bit integers for 64-bit architecture, and found the performance improve by some substantial amount - as much as 25% on some bits of code.
Changing your OS to a 64-bit version of the same OS, would help, perhaps?
Edit:
Because I like to find out what the difference is in these sort of things, I have written a bit of code, and with some primitive template (still learning that bit - templates isn't exactly my hottest topic, I must say - give me bitfiddling and pointer arithmetics, and I'll (usually) get it right... )
Here's the code I wrote, trying to replicate a few common functons:
Compiled with:
And the results are: Note: See 2016 results below - these results are slightly optimistic due to the difference in usage of SSE instructions in 64-bit mode, but no SSE usage in 32-bit mode.
As you can see, addition, and multiplication isn't that much worse. Division gets really bad. Interestingly, the matrix addition is not much difference at all.
And is it faster on 64-bit I hear some of you ask: Using the same compiler options, just -m64 instead of -m32 - yupp, a lot faster:
Edit, update for 2016: four variants, with and without SSE, in 32- and 64-bit mode of the compiler.
I'm typically using clang++ as my usual compiler these days. I tried compiling with g++ (but it would still be a different version than above, as I've updated my machine - and I have a different CPU too). Since g++ failed to compile the no-sse version in 64-bit, I didn't see the point in that. (g++ gives similar results anyway)
As a short table:
Full results with compile options.