The problem is that the current implementation of the Stream API, together with the current implementation of IteratorSpliterator for sources of unknown size, splits such sources into parallel tasks very unevenly. You were lucky to have more than 1024 files; otherwise you would have had no parallelization benefit at all. The Stream API implementation takes into account the estimateSize() value returned from the Spliterator. An IteratorSpliterator of unknown size returns Long.MAX_VALUE before splitting, and its suffix always returns Long.MAX_VALUE as well. Its splitting strategy is the following:
- Define the current batch size. The current formula starts with 1024 elements and increases arithmetically (2048, 3072, 4096, 5120 and so on) until the MAX_BATCH size is reached (which is 33554432 elements).
- Consume input elements (in your case, Paths) into an array until the batch size is reached or the input is exhausted.
- Return an ArraySpliterator iterating over the created array as the prefix, leaving itself as the suffix.
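This batching behavior can be observed directly with Spliterators.spliteratorUnknownSize(), which produces the same kind of unknown-size IteratorSpliterator. A minimal sketch (the boxed-integer iterator stands in for a real element source, and the class name is made up):

```java
import java.util.Spliterator;
import java.util.Spliterators;
import java.util.stream.IntStream;

public class BatchSplitDemo {
    public static void main(String[] args) {
        // An unknown-size spliterator over a plain iterator, the same kind
        // that backs streams created from iterators of unknown size
        Spliterator<Integer> s = Spliterators.spliteratorUnknownSize(
                IntStream.range(0, 100_000).boxed().iterator(), 0);

        // Before any split, the estimate is Long.MAX_VALUE
        System.out.println(s.estimateSize());   // 9223372036854775807

        // Each trySplit() returns an ArraySpliterator prefix whose
        // size grows arithmetically: 1024, 2048, 3072, 4096, ...
        for (int i = 0; i < 4; i++) {
            Spliterator<Integer> prefix = s.trySplit();
            System.out.println(prefix.estimateSize());
        }
    }
}
```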
Suppose you have 7000 files. The Stream API asks for the estimated size; the IteratorSpliterator returns Long.MAX_VALUE. Next, the Stream API asks the IteratorSpliterator to split; it collects 1024 elements from the underlying DirectoryStream into an array and splits into an ArraySpliterator (with estimated size 1024) and itself (with an estimated size which is still Long.MAX_VALUE). As Long.MAX_VALUE is much, much more than 1024, the Stream API decides to keep splitting the bigger part without even trying to split the smaller part. So the overall splitting tree goes like this:
IteratorSpliterator (est. MAX_VALUE elements)
| |
ArraySpliterator (est. 1024 elements) IteratorSpliterator (est. MAX_VALUE elements)
| |
/---------------/ |
| |
ArraySpliterator (est. 2048 elements) IteratorSpliterator (est. MAX_VALUE elements)
| |
/---------------/ |
| |
ArraySpliterator (est. 3072 elements) IteratorSpliterator (est. MAX_VALUE elements)
| |
/---------------/ |
| |
ArraySpliterator (est. 856 elements) IteratorSpliterator (est. MAX_VALUE elements)
|
(split returns null: refuses to split anymore)
So after that you have five parallel tasks to be executed, actually containing 1024, 2048, 3072, 856 and 0 elements. Note that even though the last chunk has 0 elements, it still reports an estimated size of Long.MAX_VALUE, so the Stream API will send it to the ForkJoinPool as well. The bad thing is that the Stream API considers further splitting of the first four tasks useless, as their estimated size is much smaller. So what you get is a very uneven split of the input which utilizes four CPU cores at most (even if you have many more). If your per-element processing takes roughly the same time for any element, then the whole process waits for the biggest part (3072 elements) to complete. So the maximum speedup you can get is 7000/3072 = 2.28x. Thus if sequential processing takes 41 seconds, the parallel stream will take around 41/2.28 = 18 seconds (which is close to your actual numbers).
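The splitting tree above can be reproduced with a short sketch; integers stand in for the 7000 Path elements, and the class name is made up:

```java
import java.util.Spliterator;
import java.util.Spliterators;
import java.util.stream.IntStream;

public class UnevenSplitDemo {
    public static void main(String[] args) {
        // 7000 elements behind an unknown-size spliterator,
        // like the one Files.list() builds over a DirectoryStream
        Spliterator<Integer> suffix = Spliterators.spliteratorUnknownSize(
                IntStream.range(0, 7000).boxed().iterator(), 0);

        Spliterator<Integer> prefix;
        while ((prefix = suffix.trySplit()) != null) {
            System.out.println("prefix: " + prefix.estimateSize()
                    + ", suffix estimate: " + suffix.estimateSize());
        }
        // prefix: 1024, suffix estimate: 9223372036854775807
        // prefix: 2048, suffix estimate: 9223372036854775807
        // prefix: 3072, suffix estimate: 9223372036854775807
        // prefix: 856, suffix estimate: 9223372036854775807
    }
}
```

Note that the suffix keeps reporting Long.MAX_VALUE even after it has been fully drained, which is why the empty fifth chunk is still submitted as a task.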
Your work-around solution is completely fine. Note that with Files.list().parallel() you also have all the input Path elements stored in memory (in ArraySpliterator objects), so you will not waste more memory by manually dumping them into a List. Array-backed list implementations like ArrayList (which is currently created by Collectors.toList()) can split evenly without any problems, which results in additional speed-up.
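A minimal sketch of why the collected list splits better (this assumes, as noted above, that Collectors.toList() currently yields an array-backed ArrayList; integers again stand in for Paths):

```java
import java.util.List;
import java.util.Spliterator;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class EvenSplitDemo {
    public static void main(String[] args) {
        // The work-around: collect everything into a list first
        List<Integer> list = IntStream.range(0, 7000).boxed()
                .collect(Collectors.toList());

        // A known-size, array-backed spliterator splits in half
        Spliterator<Integer> right = list.spliterator();
        Spliterator<Integer> left = right.trySplit();
        System.out.println(left.estimateSize() + " / " + right.estimateSize());
        // 3500 / 3500
    }
}
```

Each further trySplit() keeps halving the remaining range, so the work can be distributed evenly across all available cores.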
Why is such a case not optimized? Of course it's not an impossible problem (though the implementation could be quite tricky). It seems that it's not a high-priority problem for the JDK developers. There were several discussions on this topic in the mailing lists. You may read Paul Sandoz's message here, where he comments on my optimization effort.
Best Answer
The Javadocs for Collection.(parallelS|s)tream() and Stream itself don't answer the question, so it's off to the mailing lists for the rationale. I went through the lambda-libs-spec-observers archives and found one thread specifically about Collection.parallelStream() and another thread that touched on whether java.util.Arrays should provide parallelStream() to match (or actually, whether it should be removed). There was no once-and-for-all conclusion, so perhaps I've missed something from another list or the matter was settled in private discussion. (Perhaps Brian Goetz, one of the principals of this discussion, can fill in anything missing.) The participants made their points well, so this answer is mostly just an organization of the relevant quotes, with a few clarifications in [brackets], presented in order of importance (as I interpret it).
parallelStream() covers a very common case
Brian Goetz in the first thread, explaining why Collection.parallelStream() is valuable enough to keep even after other parallel stream factory methods have been removed:

Brian Goetz stands by this position in the later discussion about Arrays.parallelStream():

parallelStream() is more performant
Brian Goetz:
In response to Kevin Bourrillion's skepticism about whether the effect is significant, Brian again:
Doug Lea follows up, but hedges his position:
Indeed, the later discussion about Arrays.parallelStream() takes notice of the lower Stream.parallel() cost.

stream().parallel() statefulness complicates the future
At the time of the discussion, switching a stream from sequential to parallel and back could be interleaved with other stream operations. Brian Goetz, on behalf of Doug Lea, explains why sequential/parallel mode switching may complicate future development of the Java platform:
This mode switching was removed after further discussion. In the current version of the library, a stream pipeline is either sequential or parallel; the last call to sequential()/parallel() wins. Besides side-stepping the statefulness problem, this change also improved the performance of using parallel() to set up a parallel pipeline from a sequential stream factory.

exposing parallelStream() as a first-class citizen improves programmer perception of the library, leading them to write better code
Brian Goetz again, in response to Tim Peierls's argument that Stream.parallel() allows programmers to understand streams sequentially before going parallel: