C++ – Choosing Between Float, Double, and Long Double

c++doublefloating-pointlong-double

I'm still a beginner at programming and I always have more questions than our book or internet searches can answer (unless I missed something). So I apologize in advance if this was answered but I couldn't find it.

I understand that float has a smaller range than double making it less precise, and from what I understand, long double is even more precise(?). So my question is why would you want to use a variable that is less precise in the first place? Does it have something to do with different platforms, different OS versions, different compilers? Or are there specific moments in programming where its strategically more advantageous to use a float over a double/long double?

Thanks everyone!

Best Answer

In nearly all processors, "smaller" floating point numbers take the same or less clock-cycles in execution. Sometimes the difference isn't very big (or nothing), other times it can be literally twice the number of cycles for double vs. float.

Of course, memory foot-print, which is affecting cache-usage, will also be a factor. float takes half the size of double, and long double is bigger yet.

Edit: Another side-effect of smaller size is that the processor's SIMD extensions (3DNow!, SSE, AVX in x86, and similar extensions are available in several other architectures) may either only work with float, or can take twice as many float vs. double (and as far as I know, no SIMD instructions are available for long double in any processor). So this may improve performance if float is used vs. double, by processing twice as much data in one go. End edit.

So, assuming 6-7 digits of precision is good enough for what you need, and the range of +/-10^+/-38 is sufficient, then float should be used. If you need either more digits in the number, or a bigger range, move to double, and if that's not good enough, use long double. But for most things, double should be perfectly adequate.

Obviously, the importance of using "the right size" becomes more important when you have either lots of calculations, or lots of data to work with - if there are 5 variables, and you just use each a couple of times in a program that does a million other things, who cares? If you are doing fluid dynamics calculations for how well a Formula 1 car is doing at 200 mph, then you probably have several tens of million datapoints to calculate, and every data point needs to be calculated dozens of times per second of the cars travel, then using up just a few clockcycles extra in each calculation will make the whole simulation take noticeably longer.

Related Solutions

Float vs Double – Differences Between Float and Double in C++

Huge difference.

As the name implies, a double has 2x the precision of float^[1]. In general a double has 15 decimal digits of precision, while float has 7.

Here's how the number of digits are calculated:

double has 52 mantissa bits + 1 hidden bit: log(2⁵³)÷log(10) = 15.95 digits

float has 23 mantissa bits + 1 hidden bit: log(2²⁴)÷log(10) = 7.22 digits

This precision loss could lead to greater truncation errors being accumulated when repeated calculations are done, e.g.

float a = 1.f / 81;
float b = 0;
for (int i = 0; i < 729; ++ i)
    b += a;
printf("%.7g\n", b); // prints 9.000023

while

double a = 1.0 / 81;
double b = 0;
for (int i = 0; i < 729; ++ i)
    b += a;
printf("%.15g\n", b); // prints 8.99999999999996

Also, the maximum value of float is about 3e38, but double is about 1.7e308, so using float can hit "infinity" (i.e. a special floating-point number) much more easily than double for something simple, e.g. computing the factorial of 60.

During testing, maybe a few test cases contain these huge numbers, which may cause your programs to fail if you use floats.

Of course, sometimes, even double isn't accurate enough, hence we sometimes have long double^[1] (the above example gives 9.000000000000000066 on Mac), but all floating point types suffer from round-off errors, so if precision is very important (e.g. money processing) you should use int or a fraction class.

Furthermore, don't use += to sum lots of floating point numbers, as the errors accumulate quickly. If you're using Python, use fsum. Otherwise, try to implement the Kahan summation algorithm.

^{[1]: The C and C++ standards do not specify the representation of float, double and long double. It is possible that all three are implemented as IEEE double-precision. Nevertheless, for most architectures (gcc, MSVC; x86, x64, ARM) float is indeed a IEEE single-precision floating point number (binary32), and double is a IEEE double-precision floating point number (binary64).}

C++ – Long Double vs Double: Key Differences

Quoting from Wikipedia:

On the x86 architecture, most compilers implement long double as the 80-bit extended precision type supported by that hardware (sometimes stored as 12 or 16 bytes to maintain data structure .

and

Compilers may also use long double for a 128-bit quadruple precision format, which is currently implemented in software.

In other words, yes, a long double may be able to store a larger range of values than a double. But it's completely up to the compiler.

Best Answer

Related Solutions

Float vs Double – Differences Between Float and Double in C++

C++ – Long Double vs Double: Key Differences

Related Question