Why Parallel Stream is Slower in Java 8 – Performance Analysis

javajava-8java-stream

I was playing around with infinite streams and made this program for benchmarking. Basically the bigger the number you provide, the faster it will finish. However, I was amazed to find that using a parellel stream resulted in exponentially worse performance compared to a sequential stream. Intuitively, one would expect an infinite stream of random numbers to be generated and evaluated much faster in a multi-threaded environment, but this appears not to be the case. Why is this?

    final int target = Integer.parseInt(args[0]);
    if (target <= 0) {
        System.err.println("Target must be between 1 and 2147483647");
        return;
    }

    final long startTime, endTime;
    startTime = System.currentTimeMillis();

    System.out.println(
        IntStream.generate(() -> new Double(Math.random()*2147483647).intValue())
        //.parallel()
        .filter(i -> i <= target)
        .findFirst()
        .getAsInt()
    );

    endTime = System.currentTimeMillis();
    System.out.println("Execution time: "+(endTime-startTime)+" ms");

Best Answer

I totally agree with the other comments and answers but indeed your test behaves strange in case that the target is very low. On my modest laptop the parallel version is on average about 60x slower when very low targets are given. This extreme difference cannot be explained by the overhead of the parallelization in the stream APIs so I was also amazed :-). IMO the culprit lies here:

Math.random()

Internally this call relies on a global instance of java.util.Random. In the documentation of Random it is written:

Instances of java.util.Random are threadsafe. However, the concurrent use of the same java.util.Random instance across threads may encounter contention and consequent poor performance. Consider instead using ThreadLocalRandom in multithreaded designs.

So I think that the really poor performance of the parallel execution compared to the sequential one is explained by the thread contention in random rather than any other overheads. If you use ThreadLocalRandom instead (as recommended in the documentation) then the performance difference will not be so dramatic. Another option would be to implement a more advanced number supplier.