You should mark the functions static so that the compiler knows they are local to that translation unit.
Without static, the compiler cannot assume (barring LTO / WPA) that the function is only called once, so it is less likely to inline it.
Demonstration using the LLVM Try Out page.
That said, code for readability first. Micro-optimizations (and such tweaking is a micro-optimization) should only come after performance measurements.
Example:
#include <cstdio>

// Remove 'static' here to give foo external linkage and compare the generated IR.
static void foo(int i) {
    int m = i % 3;
    printf("%d %d", i, m);
}

int main(int argc, char* argv[]) {
    for (int i = 0; i != argc; ++i) {
        foo(i);
    }
}
With static, this produces:
; ModuleID = '/tmp/webcompile/_27689_0.bc'
target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v64:64:64-v128:128:128-a0:0:64-s0:64:64-f80:128:128-n8:16:32:64"
target triple = "x86_64-unknown-linux-gnu"
@.str = private constant [6 x i8] c"%d %d\00" ; <[6 x i8]*> [#uses=1]
define i32 @main(i32 %argc, i8** nocapture %argv) nounwind {
entry:
%cmp4 = icmp eq i32 %argc, 0 ; <i1> [#uses=1]
br i1 %cmp4, label %for.end, label %for.body
for.body: ; preds = %for.body, %entry
%0 = phi i32 [ %inc, %for.body ], [ 0, %entry ] ; <i32> [#uses=3]
%rem.i = srem i32 %0, 3 ; <i32> [#uses=1]
%call.i = tail call i32 (i8*, ...)* @printf(i8* getelementptr inbounds ([6 x i8]* @.str, i64 0, i64 0), i32 %0, i32 %rem.i) nounwind ; <i32> [#uses=0]
%inc = add nsw i32 %0, 1 ; <i32> [#uses=2]
%exitcond = icmp eq i32 %inc, %argc ; <i1> [#uses=1]
br i1 %exitcond, label %for.end, label %for.body
for.end: ; preds = %for.body, %entry
ret i32 0
}
declare i32 @printf(i8* nocapture, ...) nounwind
Without static:
; ModuleID = '/tmp/webcompile/_27859_0.bc'
target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v64:64:64-v128:128:128-a0:0:64-s0:64:64-f80:128:128-n8:16:32:64"
target triple = "x86_64-unknown-linux-gnu"
@.str = private constant [6 x i8] c"%d %d\00" ; <[6 x i8]*> [#uses=1]
define void @foo(int)(i32 %i) nounwind {
entry:
%rem = srem i32 %i, 3 ; <i32> [#uses=1]
%call = tail call i32 (i8*, ...)* @printf(i8* getelementptr inbounds ([6 x i8]* @.str, i64 0, i64 0), i32 %i, i32 %rem) ; <i32> [#uses=0]
ret void
}
declare i32 @printf(i8* nocapture, ...) nounwind
define i32 @main(i32 %argc, i8** nocapture %argv) nounwind {
entry:
%cmp4 = icmp eq i32 %argc, 0 ; <i1> [#uses=1]
br i1 %cmp4, label %for.end, label %for.body
for.body: ; preds = %for.body, %entry
%0 = phi i32 [ %inc, %for.body ], [ 0, %entry ] ; <i32> [#uses=3]
%rem.i = srem i32 %0, 3 ; <i32> [#uses=1]
%call.i = tail call i32 (i8*, ...)* @printf(i8* getelementptr inbounds ([6 x i8]* @.str, i64 0, i64 0), i32 %0, i32 %rem.i) nounwind ; <i32> [#uses=0]
%inc = add nsw i32 %0, 1 ; <i32> [#uses=2]
%exitcond = icmp eq i32 %inc, %argc ; <i1> [#uses=1]
br i1 %exitcond, label %for.end, label %for.body
for.end: ; preds = %for.body, %entry
ret i32 0
}
Built with g++ using the default optimization flags:
float f = rand();
40117e: e8 75 01 00 00 call 4012f8 <_rand>
401183: 89 44 24 1c mov %eax,0x1c(%esp)
401187: db 44 24 1c fildl 0x1c(%esp)
40118b: d9 5c 24 2c fstps 0x2c(%esp)
std::cout << sin(f) << " " << sin(f);
40118f: d9 44 24 2c flds 0x2c(%esp)
401193: dd 1c 24 fstpl (%esp)
401196: e8 65 01 00 00 call 401300 <_sin> <----- 1st call
40119b: dd 5c 24 10 fstpl 0x10(%esp)
40119f: d9 44 24 2c flds 0x2c(%esp)
4011a3: dd 1c 24 fstpl (%esp)
4011a6: e8 55 01 00 00 call 401300 <_sin> <----- 2nd call
4011ab: dd 5c 24 04 fstpl 0x4(%esp)
4011af: c7 04 24 e8 60 40 00 movl $0x4060e8,(%esp)
Built with -O2:
float f = rand();
4011af: e8 24 01 00 00 call 4012d8 <_rand>
4011b4: 89 44 24 1c mov %eax,0x1c(%esp)
4011b8: db 44 24 1c fildl 0x1c(%esp)
std::cout << sin(f) << " " << sin(f);
4011bc: dd 1c 24 fstpl (%esp)
4011bf: e8 1c 01 00 00 call 4012e0 <_sin> <----- 1 call
From this we can see that without optimizations the compiler emits two calls, and with optimizations just one. Empirically, then, we can say the compiler does optimize away the repeated call.
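To repeat this kind of check, the statement can be isolated in a small function and its disassembly compared at different optimization levels. The following is only a minimal sketch: the function name format_sin_twice is an assumption for illustration and does not appear in the listings above.

```cpp
#include <cmath>
#include <sstream>
#include <string>

// Hypothetical helper (name is illustrative): formats sin(f) twice,
// mirroring the "sin(f) << " " << sin(f)" statement from the listings.
std::string format_sin_twice(float f) {
    std::ostringstream out;
    // Both calls take the same argument, so an optimizing compiler
    // may evaluate sin only once and reuse the result.
    out << std::sin(f) << " " << std::sin(f);
    return out.str();
}
```

Compiling this once without optimization and once with -O2, then inspecting each result with objdump -d, should show whether the second call survives on a given compiler version.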
Best Answer
GCC absolutely optimizes across compilation units if you have Link Time Optimization on and the optimization level is high enough (see https://gcc.gnu.org/wiki/LinkTimeOptimization). There is really no reason besides compilation time not to enable both of these.
Additionally, you can always help the compiler along by marking the function with the appropriate attributes. You probably want to mark the function with the attribute const as follows:
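A minimal sketch of such a declaration, using GCC's __attribute__((const)) syntax. The functions square and sum_of_squares are hypothetical and not taken from the question; the attribute is only appropriate when the result depends on nothing but the arguments.

```cpp
// Hypothetical function, declared with GCC's const attribute:
// the result depends only on x and the function has no side effects.
int square(int x) __attribute__((const));

int square(int x) {
    return x * x;
}

// Because square is declared const, GCC is allowed to evaluate
// square(x) once here and reuse the value for the second call.
int sum_of_squares(int x) {
    return square(x) + square(x);
}
```

The attribute is a promise to the compiler, not something it verifies, so applying it to a function that reads global state or has side effects can produce wrong code.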
Take a look at GCC's documentation here to see which attribute is appropriate: https://gcc.gnu.org/onlinedocs/gcc/Common-Function-Attributes.html
In a more general sense, this is very easy for a compiler to detect. It actually performs transformations that are much less obvious. The reason why Link Time Optimization is important, though, is that once GCC has generated actual machine code, it no longer really knows what is safe to do at that point. Your function could, for example, modify data (outside your class) or access a volatile variable.
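To illustrate why the compiler has to be careful, here is a small sketch of a function whose repeated calls cannot be merged. The names call_count, next_id and sum_two_ids are hypothetical stand-ins for "data outside your class".

```cpp
// Hypothetical global state that the function modifies on every call.
static int call_count = 0;

// Not a pure function: each call increments call_count,
// so two calls with the same argument return different values.
int next_id(int base) {
    ++call_count;
    return base + call_count;
}

// Sums two calls with the same argument. Folding them into a single
// call would change both the result and the final value of call_count.
int sum_two_ids(int base) {
    int first = next_id(base);
    int second = next_id(base);
    return first + second;
}
```

If next_id lived in another compilation unit and only its declaration were visible, the compiler would have to assume this worst case unless LTO or an attribute told it otherwise.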
EDIT:
GCC most definitely can do this optimization. With this code and the flags -O3 -fno-inline:
C++ code:
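A minimal sketch of the kind of code this describes, assuming a hypothetical side-effect-free function compute that is called five times with the same argument. The names are illustrative and this is not the original listing.

```cpp
// Hypothetical stand-in for the function under test.
// It has no side effects and depends only on its argument.
int compute(int x) {
    return x * x + 1;
}

// Calls compute five times with the same argument and adds the results.
// If the compiler can prove compute has no side effects, it may emit a
// single call and scale that result by five instead.
int total(int x) {
    return compute(x) + compute(x) + compute(x) + compute(x) + compute(x);
}
```

Building this with -O3 -fno-inline keeps compute as a real call, which makes it easier to count the call instructions in the generated assembly.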
Assembly Output:
It does, however, fail to do this when the function is in a separate compilation unit and the -flto option is not specified. Just to clarify, this line calls the function:
And this line multiplies the result by 5 (adding together five copies):