Java Syntax – Why is n+++n Valid While n++++n is Not?

java, syntax, tokenize

In Java, the expression:

n+++n

appears to be evaluated as:

n++ + n

This is despite the fact that the + in +n is a valid unary operator with higher precedence than the binary + in n + n. So the compiler appears to assume that the last + cannot be the unary operator, and resolves the expression accordingly.

However, the expression:

n++++n

does not compile, even though there is exactly one valid way to resolve it:

n++ + +n

++n and +n are specified as having the same precedence, so why does the compiler resolve the seeming ambiguity in n+++n in favour of the arithmetic + but does not do so with n++++n?

Best Answer

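The source file is first tokenized (transformed into a sequence of tokens) using the maximal munch rule: the lexer always takes the longest possible valid token. This happens before any parsing, so operator precedence plays no part in it. By that rule n+++n becomes n ++ + n, which is a valid expression, while n++++n is transformed into the following sequence: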

n ++ ++ n

This is not a valid expression: the result of n++ is a value, not a variable, so the second ++ has nothing to increment.
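As a quick check of the forms that do compile, here is a minimal sketch. The class and method names are made up for illustration. It shows that n+++n behaves like n++ + n, and that adding whitespace to separate the tokens makes the four-plus version legal.

```java
class PlusPlusDemo {
    // Tokenized as n ++ + n, i.e. (n++) + n.
    static int triplePlus(int n) {
        return n+++n;
    }

    // The same expression with the token boundaries written out.
    static int spacedTriplePlus(int n) {
        return n++ + n;
    }

    // Whitespace ends a token, so the lexer produces n ++ + + n here.
    // Writing this as n++++n would be rejected by the compiler.
    static int spacedQuadPlus(int n) {
        return n++ + +n;
    }

    public static void main(String[] args) {
        System.out.println(triplePlus(5));       // 11: 5 + 6
        System.out.println(spacedTriplePlus(5)); // 11: 5 + 6
        System.out.println(spacedQuadPlus(5));   // 11: 5 + (+6)
    }
}
```

In each case n++ yields the old value 5 and leaves n at 6 before the right-hand operand is evaluated.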

From JLS §3.2:

3.2. Lexical Translations

A raw Unicode character stream is translated into a sequence of tokens, using the following three lexical translation steps, which are applied in turn:

  1. A translation of Unicode escapes (§3.3) in the raw stream of Unicode characters to the corresponding Unicode character. A Unicode escape of the form \uxxxx, where xxxx is a hexadecimal value, represents the UTF-16 code unit whose encoding is xxxx. This translation step allows any program to be expressed using only ASCII characters.

  2. A translation of the Unicode stream resulting from step 1 into a stream of input characters and line terminators (§3.4).

  3. A translation of the stream of input characters and line terminators resulting from step 2 into a sequence of input elements (§3.5) which, after white space (§3.6) and comments (§3.7) are discarded, comprise the tokens (§3.5) that are the terminal symbols of the syntactic grammar (§2.3).

The longest possible translation is used at each step, even if the result does not ultimately make a correct program while another lexical translation would.

(Thus, the input characters a--b are tokenized (§3.5) as a, --, b, which is not part of any grammatically correct program, even though the tokenization a, -, -, b could be part of a grammatically correct program.)
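To make the longest-match rule concrete, here is a toy tokenizer. It is only a sketch under simplifying assumptions: it knows identifiers, whitespace, and the four operators ++, --, +, and -, and it is not how javac is actually implemented. The names MaximalMunchDemo and tokenize are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class MaximalMunchDemo {
    // Toy operator set, listed longest first so the first match is the longest match.
    private static final String[] OPERATORS = {"++", "--", "+", "-"};

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (Character.isWhitespace(c)) {
                i++; // whitespace separates tokens and is then discarded
                continue;
            }
            if (Character.isJavaIdentifierStart(c)) {
                int start = i;
                while (i < input.length()
                        && Character.isJavaIdentifierPart(input.charAt(i))) {
                    i++;
                }
                tokens.add(input.substring(start, i));
                continue;
            }
            // Greedy step: take the longest operator that matches at position i,
            // without looking at whether the result will parse.
            String matched = null;
            for (String op : OPERATORS) {
                if (input.startsWith(op, i)) {
                    matched = op;
                    break;
                }
            }
            if (matched == null) {
                throw new IllegalArgumentException("Unexpected character: " + c);
            }
            tokens.add(matched);
            i += matched.length();
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("n+++n"));    // [n, ++, +, n]
        System.out.println(tokenize("n++++n"));   // [n, ++, ++, n]
        System.out.println(tokenize("n++ + +n")); // [n, ++, +, +, n]
        System.out.println(tokenize("a--b"));     // [a, --, b]
    }
}
```

The tokenizer never backtracks to try a shorter token, which is why n++++n ends up as n ++ ++ n even though n ++ + + n would have parsed.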