The problem is that the current implementation of the Stream API, together with the current implementation of IteratorSpliterator for sources of unknown size, splits such sources into parallel tasks very unevenly. You were lucky to have more than 1024 files; otherwise you would have had no parallelization benefit at all. The Stream API implementation takes into account the estimateSize() value returned from the Spliterator. An IteratorSpliterator of unknown size returns Long.MAX_VALUE before splitting, and its suffix always returns Long.MAX_VALUE as well. Its splitting strategy is the following:
- Define the current batch size. The current formula starts with 1024 elements and increases arithmetically (2048, 3072, 4096, 5120 and so on) until the MAX_BATCH size is reached (which is 33554432 elements).
- Consume input elements (in your case, Paths) into an array until the batch size is reached or the input is exhausted.
- Return an ArraySpliterator iterating over the created array as the prefix, leaving itself as the suffix.
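This batching behavior can be observed directly with Spliterators.spliteratorUnknownSize(), which produces the same kind of unknown-size IteratorSpliterator. A minimal sketch (the boxed-integer iterator stands in for a real element source, and the class name is made up):

```java
import java.util.Spliterator;
import java.util.Spliterators;
import java.util.stream.IntStream;

public class BatchSplitDemo {
    public static void main(String[] args) {
        // An unknown-size spliterator over a plain iterator, the same kind
        // that backs streams created from iterators of unknown size
        Spliterator<Integer> s = Spliterators.spliteratorUnknownSize(
                IntStream.range(0, 100_000).boxed().iterator(), 0);

        // Before any split, the estimate is Long.MAX_VALUE
        System.out.println(s.estimateSize());   // 9223372036854775807

        // Each trySplit() returns an ArraySpliterator prefix whose
        // size grows arithmetically: 1024, 2048, 3072, 4096, ...
        for (int i = 0; i < 4; i++) {
            Spliterator<Integer> prefix = s.trySplit();
            System.out.println(prefix.estimateSize());
        }
    }
}
```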
Suppose you have 7000 files. The Stream API asks for the estimated size; the IteratorSpliterator returns Long.MAX_VALUE. Next, the Stream API asks the IteratorSpliterator to split; it collects 1024 elements from the underlying DirectoryStream into an array and splits into an ArraySpliterator (with estimated size 1024) and itself (with an estimated size which is still Long.MAX_VALUE). As Long.MAX_VALUE is much, much more than 1024, the Stream API decides to keep splitting the bigger part without even trying to split the smaller part. So the overall splitting tree goes like this:
IteratorSpliterator (est. MAX_VALUE elements)
| |
ArraySpliterator (est. 1024 elements) IteratorSpliterator (est. MAX_VALUE elements)
| |
/---------------/ |
| |
ArraySpliterator (est. 2048 elements) IteratorSpliterator (est. MAX_VALUE elements)
| |
/---------------/ |
| |
ArraySpliterator (est. 3072 elements) IteratorSpliterator (est. MAX_VALUE elements)
| |
/---------------/ |
| |
ArraySpliterator (est. 856 elements) IteratorSpliterator (est. MAX_VALUE elements)
|
(split returns null: refuses to split anymore)
So after that you have five parallel tasks to be executed, actually containing 1024, 2048, 3072, 856 and 0 elements. Note that even though the last chunk has 0 elements, it still reports an estimated size of Long.MAX_VALUE, so the Stream API will send it to the ForkJoinPool as well. The bad thing is that the Stream API considers further splitting of the first four tasks useless, as their estimated size is much smaller. So what you get is a very uneven split of the input which utilizes four CPU cores at most (even if you have many more). If your per-element processing takes roughly the same time for any element, then the whole process waits for the biggest part (3072 elements) to complete. So the maximum speedup you can get is 7000/3072 = 2.28x. Thus if sequential processing takes 41 seconds, the parallel stream will take around 41/2.28 = 18 seconds (which is close to your actual numbers).
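The splitting tree above can be reproduced with a short sketch; integers stand in for the 7000 Path elements, and the class name is made up:

```java
import java.util.Spliterator;
import java.util.Spliterators;
import java.util.stream.IntStream;

public class UnevenSplitDemo {
    public static void main(String[] args) {
        // 7000 elements behind an unknown-size spliterator,
        // like the one Files.list() builds over a DirectoryStream
        Spliterator<Integer> suffix = Spliterators.spliteratorUnknownSize(
                IntStream.range(0, 7000).boxed().iterator(), 0);

        Spliterator<Integer> prefix;
        while ((prefix = suffix.trySplit()) != null) {
            System.out.println("prefix: " + prefix.estimateSize()
                    + ", suffix estimate: " + suffix.estimateSize());
        }
        // prefix: 1024, suffix estimate: 9223372036854775807
        // prefix: 2048, suffix estimate: 9223372036854775807
        // prefix: 3072, suffix estimate: 9223372036854775807
        // prefix: 856, suffix estimate: 9223372036854775807
    }
}
```

Note that the suffix keeps reporting Long.MAX_VALUE even after it has been fully drained, which is why the empty fifth chunk is still submitted as a task.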
Your work-around solution is completely fine. Note that with Files.list().parallel() you also have all the input Path elements stored in memory (in ArraySpliterator objects), so you will not waste more memory by manually dumping them into a List. Array-backed list implementations like ArrayList (which is currently created by Collectors.toList()) can split evenly without any problems, which results in additional speed-up.
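A minimal sketch of why the collected list splits better (this assumes, as noted above, that Collectors.toList() currently yields an array-backed ArrayList; integers again stand in for Paths):

```java
import java.util.List;
import java.util.Spliterator;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class EvenSplitDemo {
    public static void main(String[] args) {
        // The work-around: collect everything into a list first
        List<Integer> list = IntStream.range(0, 7000).boxed()
                .collect(Collectors.toList());

        // A known-size, array-backed spliterator splits in half
        Spliterator<Integer> right = list.spliterator();
        Spliterator<Integer> left = right.trySplit();
        System.out.println(left.estimateSize() + " / " + right.estimateSize());
        // 3500 / 3500
    }
}
```

Each further trySplit() keeps halving the remaining range, so the work can be distributed evenly across all available cores.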
Why is such a case not optimized? Of course it's not an impossible problem (though the implementation could be quite tricky). It seems that it's not a high-priority problem for the JDK developers. There were several discussions on this topic in the mailing lists. You may read Paul Sandoz's message here, where he comments on my optimization effort.
Best Answer
The Javadocs for Collection.(parallelS|s)tream() and Stream itself don't answer the question, so it's off to the mailing lists for the rationale. I went through the lambda-libs-spec-observers archives and found one thread specifically about Collection.parallelStream() and another thread that touched on whether java.util.Arrays should provide parallelStream() to match (or actually, whether it should be removed). There was no once-and-for-all conclusion, so perhaps I've missed something from another list or the matter was settled in private discussion. (Perhaps Brian Goetz, one of the principals of this discussion, can fill in anything missing.) The participants made their points well, so this answer is mostly just an organization of the relevant quotes, with a few clarifications in [brackets], presented in order of importance (as I interpret it).
parallelStream() covers a very common case
Brian Goetz in the first thread, explaining why Collection.parallelStream() is valuable enough to keep even after other parallel stream factory methods have been removed:

Brian Goetz stands by this position in the later discussion about Arrays.parallelStream():

parallelStream() is more performant
Brian Goetz:
In response to Kevin Bourrillion's skepticism about whether the effect is significant, Brian again:
Doug Lea follows up, but hedges his position:
Indeed, the later discussion about Arrays.parallelStream() takes notice of the lower Stream.parallel() cost.

stream().parallel() statefulness complicates the future
At the time of the discussion, switching a stream from sequential to parallel and back could be interleaved with other stream operations. Brian Goetz, on behalf of Doug Lea, explains why sequential/parallel mode switching may complicate future development of the Java platform:
This mode switching was removed after further discussion. In the current version of the library, a stream pipeline is either sequential or parallel; the last call to sequential()/parallel() wins. Besides side-stepping the statefulness problem, this change also improved the performance of using parallel() to set up a parallel pipeline from a sequential stream factory.

exposing parallelStream() as a first-class citizen improves programmer perception of the library, leading them to write better code
Brian Goetz again, in response to Tim Peierls's argument that Stream.parallel() allows programmers to understand streams sequentially before going parallel: