I was trying to reproduce some of the processor cache effects described here. I understand that Java is a managed environment and these examples won't translate exactly, but I came across a strange case that I have tried to distill into a simple example illustrating the effect:
public static void main(String[] args) {
    final int runs = 10;
    final int steps = 1024 * 1024 * 1024;
    for (int run = 0; run < runs; run++) {
        final int[] a = new int[1];
        long start = System.nanoTime();
        for (int i = 0; i < steps; i++) {
            a[0]++;
        }
        long stop = System.nanoTime();
        long time = TimeUnit.MILLISECONDS.convert(stop - start, TimeUnit.NANOSECONDS);
        System.out.printf("Time for loop# %2d: %5d ms\n", run, time);
    }
}
Output:
Time for loop# 0: 24 ms
Time for loop# 1: 106 ms
Time for loop# 2: 104 ms
Time for loop# 3: 103 ms
Time for loop# 4: 102 ms
Time for loop# 5: 103 ms
Time for loop# 6: 104 ms
Time for loop# 7: 102 ms
Time for loop# 8: 105 ms
Time for loop# 9: 102 ms
The first iteration of the inner loop is about 4 times as fast as subsequent iterations. This is the opposite of what I would normally expect, as performance usually goes up once the JIT kicks in.
Of course, one would do several warm-up loops in any serious micro-benchmark, but I'm curious what could be causing this behaviour, especially since if we know the loop can be performed in 24 ms, it's not very satisfying that the steady-state time is over 100 ms.
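The warm-up pattern mentioned above could be sketched roughly as follows (WarmupBench is a hypothetical harness, not the original benchmark; it uses a smaller step count for brevity): run the loop a few times untimed first, store the timings in an array, and print them only after all timed runs are done.

```java
import java.util.concurrent.TimeUnit;

public class WarmupBench {
    static final int STEPS = 1 << 20;   // smaller than the original for brevity

    // One timed run of the inner loop; returns elapsed nanoseconds.
    static long timedRun() {
        final int[] a = new int[1];
        long start = System.nanoTime();
        for (int i = 0; i < STEPS; i++) {
            a[0]++;
        }
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        final int warmups = 5, runs = 10;
        for (int w = 0; w < warmups; w++) {
            timedRun();                  // discard warm-up timings
        }
        long[] nanos = new long[runs];
        for (int run = 0; run < runs; run++) {
            nanos[run] = timedRun();     // no I/O inside the timed region
        }
        for (int run = 0; run < runs; run++) {
            System.out.printf("Time for loop# %2d: %5d ms%n", run,
                    TimeUnit.MILLISECONDS.convert(nanos[run], TimeUnit.NANOSECONDS));
        }
    }
}
```

As the updates below note, this kind of restructuring changes the numbers only modestly; for serious measurement a dedicated harness such as JMH is the usual answer.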
For reference the JDK I am using (on linux):
openjdk version "1.8.0_40"
OpenJDK Runtime Environment (build 1.8.0_40-b20)
OpenJDK 64-Bit Server VM (build 25.40-b23, mixed mode)
UPDATE:
Here's some more information, based on the comments and some experimenting:
1) moving the System.out I/O out of the loop (by storing the timings in an array of size 'runs') makes no significant difference in the timings.
2) the output displayed above is from running within Eclipse. When I compile and run from the command line (with the same JDK/JVM) I get a more modest but still significant difference (2x instead of 4x faster). This seems interesting, since running in Eclipse usually slows things down, if anything.
3) moving a up and out of the loop, so that it is reused on each iteration, has no effect.
4) if int[] a is changed to long[] a, the first iteration runs even faster (about 20%), while the other iterations stay at the same (slower) speed.
UPDATE 2:
I think the answer by apangin explains it. I tried this with Oracle's JDK 9 early-access JVM, and the timings went from:
openjdk version "1.8.0_40"
OpenJDK Runtime Environment (build 1.8.0_40-b20)
OpenJDK 64-Bit Server VM (build 25.40-b23, mixed mode)
Time for loop# 0: 48 ms
Time for loop# 1: 116 ms
Time for loop# 2: 112 ms
Time for loop# 3: 113 ms
Time for loop# 4: 112 ms
Time for loop# 5: 112 ms
Time for loop# 6: 111 ms
Time for loop# 7: 111 ms
Time for loop# 8: 113 ms
Time for loop# 9: 113 ms
to:
java version "1.9.0-ea"
Java(TM) SE Runtime Environment (build 1.9.0-ea-b73)
Java HotSpot(TM) 64-Bit Server VM (build 1.9.0-ea-b73, mixed mode)
Time for loop# 0: 48 ms
Time for loop# 1: 26 ms
Time for loop# 2: 22 ms
Time for loop# 3: 22 ms
Time for loop# 4: 22 ms
Time for loop# 5: 22 ms
Time for loop# 6: 22 ms
Time for loop# 7: 22 ms
Time for loop# 8: 22 ms
Time for loop# 9: 23 ms
That's quite the improvement!
Best Answer
This is a suboptimal recompilation of the method.
The JIT compiler relies on run-time statistics gathered during interpretation. When the main method is compiled for the first time, the outer loop has not yet completed its first iteration => the run-time statistics say that the code after the inner loop is never executed, so the JIT does not bother compiling it. Instead it generates an uncommon trap. When the inner loop ends for the first time, the uncommon trap is hit, causing the method to be deoptimized.
On the second iteration of the outer loop, the main method is recompiled with the new knowledge. Now the JIT has more statistics and more context to compile. For some reason it no longer caches the value of a[0] in a register (probably because the JIT is fooled by the wider context). So it generates an addl instruction to update the array in memory, which is effectively a combination of a memory load and a store.
By contrast, during the first compilation the JIT caches the value of a[0] in a register, so there is only a mov instruction to store the value to memory (without a load).
Fast loop (first iteration):
Slow loop (after recompilation):
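In Java terms, the memory-access difference described above looks roughly like the following (an illustrative analogue only, not actual JIT output; LoopForms and STEPS are made up for the sketch):

```java
public class LoopForms {
    static final int STEPS = 1_000_000;

    // First compilation: the counter is cached in a register;
    // each iteration is a register increment plus a plain store (no load).
    static int fastForm(int[] a) {
        int r = a[0];
        for (int i = 0; i < STEPS; i++) {
            r++;
            a[0] = r;         // store only
        }
        return a[0];
    }

    // After recompilation: each iteration loads a[0] from memory,
    // adds, and stores it back (a memory read-modify-write, like addl).
    static int slowForm(int[] a) {
        for (int i = 0; i < STEPS; i++) {
            a[0] = a[0] + 1;  // load + add + store
        }
        return a[0];
    }
}
```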
However, this problem seems to be fixed in JDK 9. I've checked this test against a recent JDK 9 Early Access build and verified that it works as expected:
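If you want to watch the compile/deoptimize/recompile cycle yourself, HotSpot's -XX:+PrintCompilation flag logs each compilation event (the exact log format varies between JVM builds; the sample lines below are illustrative, not captured output):

```
java -XX:+PrintCompilation Main
# expect something like:
#   ... Main::main (NN bytes)                     <- first compilation
#   ... Main::main (NN bytes)   made not entrant  <- deoptimized at the uncommon trap
#   ... Main::main (NN bytes)                     <- recompiled with full profile
```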