I was trying to reproduce some of the processor cache effects described here. I understand that Java is a managed environment and these examples won't translate exactly, but I came across a strange case that I have tried to distill into a simple example illustrating the effect:
public static void main(String[] args) {
    final int runs = 10;
    final int steps = 1024 * 1024 * 1024;
    for (int run = 0; run < runs; run++) {
        final int[] a = new int[1];
        long start = System.nanoTime();
        for (int i = 0; i < steps; i++) {
            a[0]++;
        }
        long stop = System.nanoTime();
        long time = TimeUnit.MILLISECONDS.convert(stop - start, TimeUnit.NANOSECONDS);
        System.out.printf("Time for loop# %2d: %5d ms\n", run, time);
    }
}
Output:
Time for loop# 0: 24 ms
Time for loop# 1: 106 ms
Time for loop# 2: 104 ms
Time for loop# 3: 103 ms
Time for loop# 4: 102 ms
Time for loop# 5: 103 ms
Time for loop# 6: 104 ms
Time for loop# 7: 102 ms
Time for loop# 8: 105 ms
Time for loop# 9: 102 ms
The first iteration of the inner loop is about 4 times as fast as subsequent iterations. This is the opposite of what I would normally expect, as performance usually goes up once the JIT kicks in.
Of course, one would do several warm-up loops in any serious micro-benchmark, but I'm curious what could be causing this behaviour, especially since if we know the loop can be performed in 24 ms, it's not very satisfying that the steady-state time is over 100 ms.
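The warm-up pattern mentioned above could be sketched roughly as follows (WarmupBench is a hypothetical harness, not the original benchmark; it uses a smaller step count for brevity): run the loop a few times untimed first, store the timings in an array, and print them only after all timed runs are done.

```java
import java.util.concurrent.TimeUnit;

public class WarmupBench {
    static final int STEPS = 1 << 20;   // smaller than the original for brevity

    // One timed run of the inner loop; returns elapsed nanoseconds.
    static long timedRun() {
        final int[] a = new int[1];
        long start = System.nanoTime();
        for (int i = 0; i < STEPS; i++) {
            a[0]++;
        }
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        final int warmups = 5, runs = 10;
        for (int w = 0; w < warmups; w++) {
            timedRun();                  // discard warm-up timings
        }
        long[] nanos = new long[runs];
        for (int run = 0; run < runs; run++) {
            nanos[run] = timedRun();     // no I/O inside the timed region
        }
        for (int run = 0; run < runs; run++) {
            System.out.printf("Time for loop# %2d: %5d ms%n", run,
                    TimeUnit.MILLISECONDS.convert(nanos[run], TimeUnit.NANOSECONDS));
        }
    }
}
```

As the updates below note, this kind of restructuring changes the numbers only modestly; for serious measurement a dedicated harness such as JMH is the usual answer.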
For reference the JDK I am using (on linux):
openjdk version "1.8.0_40"
OpenJDK Runtime Environment (build 1.8.0_40-b20)
OpenJDK 64-Bit Server VM (build 25.40-b23, mixed mode)
UPDATE:
Here's some more information, based on the comments and some experimenting:
1) moving the System.out I/O out of the loop (by storing the timings in an array of size 'runs') makes no significant difference in the timings.
2) the output displayed above is from running within Eclipse. When I compile and run from the command line (with the same JDK/JVM) I get a more modest but still significant difference (2x instead of 4x faster). This seems interesting, since running in Eclipse usually slows things down, if anything.
3) moving a up and out of the loop, so that it is reused on each iteration, has no effect.
4) if int[] a is changed to long[] a, the first iteration runs even faster (about 20%), while the other iterations stay at the same (slower) speed.
UPDATE 2:
I think the answer by apangin explains it. I tried this with Oracle's JDK 9 early-access JVM, and the timings went from:
openjdk version "1.8.0_40"
OpenJDK Runtime Environment (build 1.8.0_40-b20)
OpenJDK 64-Bit Server VM (build 25.40-b23, mixed mode)
Time for loop# 0: 48 ms
Time for loop# 1: 116 ms
Time for loop# 2: 112 ms
Time for loop# 3: 113 ms
Time for loop# 4: 112 ms
Time for loop# 5: 112 ms
Time for loop# 6: 111 ms
Time for loop# 7: 111 ms
Time for loop# 8: 113 ms
Time for loop# 9: 113 ms
to:
java version "1.9.0-ea"
Java(TM) SE Runtime Environment (build 1.9.0-ea-b73)
Java HotSpot(TM) 64-Bit Server VM (build 1.9.0-ea-b73, mixed mode)
Time for loop# 0: 48 ms
Time for loop# 1: 26 ms
Time for loop# 2: 22 ms
Time for loop# 3: 22 ms
Time for loop# 4: 22 ms
Time for loop# 5: 22 ms
Time for loop# 6: 22 ms
Time for loop# 7: 22 ms
Time for loop# 8: 22 ms
Time for loop# 9: 23 ms
That's quite the improvement!
Best Answer
This is a suboptimal recompilation of the method.
The JIT compiler relies on run-time statistics gathered during interpretation. When the main method is compiled for the first time, the outer loop has not yet completed its first iteration => the run-time statistics say that the code after the inner loop is never executed, so the JIT does not bother compiling it. Instead it generates an uncommon trap. When the inner loop ends for the first time, the uncommon trap is hit, causing the method to be deoptimized.
On the second iteration of the outer loop, the main method is recompiled with the new knowledge. Now the JIT has more statistics and more context to compile. For some reason it no longer caches the value of a[0] in a register (probably because the JIT is fooled by the wider context). So it generates an addl instruction to update the array in memory, which is effectively a combination of a memory load and a store.
By contrast, during the first compilation the JIT caches the value of a[0] in a register, so there is only a mov instruction to store the value to memory (without a load).
Fast loop (first iteration):
Slow loop (after recompilation):
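In Java terms, the memory-access difference described above looks roughly like the following (an illustrative analogue only, not actual JIT output; LoopForms and STEPS are made up for the sketch):

```java
public class LoopForms {
    static final int STEPS = 1_000_000;

    // First compilation: the counter is cached in a register;
    // each iteration is a register increment plus a plain store (no load).
    static int fastForm(int[] a) {
        int r = a[0];
        for (int i = 0; i < STEPS; i++) {
            r++;
            a[0] = r;         // store only
        }
        return a[0];
    }

    // After recompilation: each iteration loads a[0] from memory,
    // adds, and stores it back (a memory read-modify-write, like addl).
    static int slowForm(int[] a) {
        for (int i = 0; i < STEPS; i++) {
            a[0] = a[0] + 1;  // load + add + store
        }
        return a[0];
    }
}
```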
However, this problem seems to be fixed in JDK 9. I've checked this test against a recent JDK 9 Early Access build and verified that it works as expected:
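If you want to watch the compile/deoptimize/recompile cycle yourself, HotSpot's -XX:+PrintCompilation flag logs each compilation event (the exact log format varies between JVM builds; the sample lines below are illustrative, not captured output):

```
java -XX:+PrintCompilation Main
# expect something like:
#   ... Main::main (NN bytes)                     <- first compilation
#   ... Main::main (NN bytes)   made not entrant  <- deoptimized at the uncommon trap
#   ... Main::main (NN bytes)                     <- recompiled with full profile
```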