Are optional non-capturing groups redundant?
Is the following regex:
(?:wo)?men
semantically equivalent to the following regex?
(wo)?men
Tags: capture-group, java, regex
The documentation of the Matcher class from the reference implementation doesn't specify the behavior of the appendReplacement method when a capturing group that doesn't capture anything (null) is specified in the replacement string. While the behavior of the group method is clear, nothing is said about the appendReplacement method.
Below are 3 exhibits of the difference in implementation for the case above: the reference implementation appends nothing, while GNU Classpath and Android append null. Some code has been omitted for the sake of brevity; omissions are indicated by "...".
For the reference implementation (Sun/Oracle JDK and OpenJDK), the code for appendReplacement doesn't seem to have changed since Java 6, and it does not append anything when a capturing group doesn't capture anything:
} else if (nextChar == '$') {
    // Skip past $
    cursor++;
    // The first number is always a group
    int refNum = (int)replacement.charAt(cursor) - '0';
    if ((refNum < 0)||(refNum > 9))
        throw new IllegalArgumentException(
            "Illegal group reference");
    cursor++;
    // Capture the largest legal group string
    ...
    // Append group
    if (start(refNum) != -1 && end(refNum) != -1)
        result.append(text, start(refNum), end(refNum));
} else {
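The behavior is easy to observe by referring to a group that did not participate in the match. The following is a minimal sketch, assuming a reference implementation (Oracle JDK or OpenJDK); the class name UnmatchedGroupDemo and the method bracketGroupOne are made up for illustration, and only Pattern and Matcher are real JDK APIs.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not taken from any of the quoted sources.
class UnmatchedGroupDemo {
    // Replaces every match with "[" + contents of group 1 + "]".
    static String bracketGroupOne(String input) {
        Matcher m = Pattern.compile("(wo)?men").matcher(input);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            // "$1" refers to group 1, which may not have captured anything.
            m.appendReplacement(sb, "[$1]");
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(bracketGroupOne("women")); // group 1 captured "wo"
        System.out.println(bracketGroupOne("men"));   // group 1 did not participate
    }
}
```

On the reference implementation this prints [wo] and then []. An implementation that appends the result of group(int) directly would be expected to print [null] for the second call.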
GNU Classpath, which is a complete reimplementation of the Java Class Library, has a different implementation of appendReplacement for the case above. In Classpath, the classes in the java.util.regex package are just wrappers for classes in gnu.java.util.regex.
Matcher.appendReplacement calls RE.getReplacement to process the replacement for the matched portion:
public Matcher appendReplacement (StringBuffer sb, String replacement)
    throws IllegalStateException
{
    assertMatchOp();
    sb.append(input.subSequence(appendPosition,
                                match.getStartIndex()).toString());
    sb.append(RE.getReplacement(replacement, match,
                                RE.REG_REPLACE_USE_BACKSLASHESCAPE));
    appendPosition = match.getEndIndex();
    return this;
}
RE.getReplacement calls REMatch.substituteInto to get the content of the capturing group and appends its result directly:
case '$':
    int i1 = i + 1;
    while (i1 < replace.length () &&
           Character.isDigit (replace.charAt (i1)))
        i1++;
    sb.append (m.substituteInto (replace.substring (i, i1)));
    i = i1 - 1;
    break;
REMatch.substituteInto appends the result of REMatch.toString(int) directly, without checking whether the capturing group has captured anything:
if ((input.charAt (pos) == '$')
    && (Character.isDigit (input.charAt (pos + 1))))
{
    // Omitted code parses the group number into val
    ...
    if (val < start.length)
    {
        output.append (toString (val));
    }
}
And REMatch.toString(int) returns null when the capturing group doesn't capture (irrelevant code has been omitted):
public String toString (int sub)
{
    if ((sub >= start.length) || sub < 0)
        throw new IndexOutOfBoundsException ("No group " + sub);
    if (start[sub] == -1)
        return null;
    ...
}
So in GNU Classpath's case, null will be appended to the string when a capturing group that fails to capture anything is specified in the replacement string.
In Android, Matcher.appendReplacement calls the private method appendEvaluated, which in turn directly appends the result of group(int) to the replacement string.
public Matcher appendReplacement(StringBuffer buffer, String replacement) {
    buffer.append(input.substring(appendPos, start()));
    appendEvaluated(buffer, replacement);
    appendPos = end();
    return this;
}

private void appendEvaluated(StringBuffer buffer, String s) {
    boolean escape = false;
    boolean dollar = false;
    for (int i = 0; i < s.length(); i++) {
        char c = s.charAt(i);
        if (c == '\\' && !escape) {
            escape = true;
        } else if (c == '$' && !escape) {
            dollar = true;
        } else if (c >= '0' && c <= '9' && dollar) {
            buffer.append(group(c - '0'));
            dollar = false;
        } else {
            buffer.append(c);
            dollar = false;
            escape = false;
        }
    }
    // This seemingly stupid piece of code reproduces a JDK bug.
    if (escape) {
        throw new ArrayIndexOutOfBoundsException(s.length());
    }
}
Since Matcher.group(int) returns null for a capturing group that fails to capture, Matcher.appendReplacement appends null when the capturing group is referred to in the replacement string.
It is most likely that the 2 people complaining to you are running their code on Android.
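One way to sidestep the difference is to avoid group references in the replacement string altogether and assemble the output by hand, checking group(int) for null explicitly. The following is a rough sketch under that assumption; PortableReplace and bracketGroupOne are hypothetical names, and the code relies only on Matcher.find, start, end and group, whose behavior for non-participating groups is documented.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library discussed above.
class PortableReplace {
    // Replaces every match with "[" + group 1 + "]", treating a
    // non-participating group as the empty string.
    static String bracketGroupOne(String input) {
        Matcher m = Pattern.compile("(wo)?men").matcher(input);
        StringBuilder out = new StringBuilder();
        int last = 0; // end of the previous match
        while (m.find()) {
            out.append(input, last, m.start()); // text between matches
            String g = m.group(1);              // null if group 1 did not capture
            out.append('[').append(g == null ? "" : g).append(']');
            last = m.end();
        }
        out.append(input, last, input.length()); // trailing text
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(bracketGroupOne("x men y women"));
    }
}
```

Because the null check is done in user code, the result should not depend on how a particular library's appendReplacement treats an unset group.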
Typically, non-capturing groups perform better than capturing groups, because they require less allocation of memory, and do not make a copy of the group match. However, there are three important caveats:
grep itself takes a significant amount of time and memory, and may overwhelm any small improvement gained by using non-capturing group(s).
Best Answer
Your (?:wo)?men and (wo)?men are semantically equivalent, but technically different: the first uses a non-capturing group and the second a capturing group. Thus, the question is why use non-capturing groups when we have capturing ones? Non-capturing groups are helpful sometimes.
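The distinction between the two patterns can be checked directly in Java. This is a small sketch; the class GroupKindDemo and its methods are invented for illustration, and it uses only standard java.util.regex calls. Both patterns accept and reject the same inputs, but only the capturing variant reports a group.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class for comparing the two patterns from the question.
class GroupKindDemo {
    static final Pattern NON_CAPTURING = Pattern.compile("(?:wo)?men");
    static final Pattern CAPTURING = Pattern.compile("(wo)?men");

    // True if both patterns agree on whether the whole input matches.
    static boolean sameVerdict(String s) {
        return NON_CAPTURING.matcher(s).matches() == CAPTURING.matcher(s).matches();
    }

    // Number of capturing groups declared in the pattern.
    static int groupCount(Pattern p) {
        return p.matcher("").groupCount();
    }

    // Contents of group 1 of the capturing pattern, or null if the group
    // did not participate (or the input did not match at all).
    static String capturedPrefix(String s) {
        Matcher m = CAPTURING.matcher(s);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        for (String s : new String[] {"men", "women", "woman", "wowomen"}) {
            System.out.println(s + " agree=" + sameVerdict(s));
        }
        System.out.println(groupCount(NON_CAPTURING) + " vs " + groupCount(CAPTURING));
        System.out.println(capturedPrefix("women") + " / " + capturedPrefix("men"));
    }
}
```

Here the non-capturing pattern has a group count of 0 and the capturing one a group count of 1, and group 1 is null for the input men because the optional group did not participate.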
NOTE: this does not pertain to the Java regex engine, nor to the PHP or .NET regex engines.
Also, it just makes our matches cleaner:
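As an illustration of the cleaner result, consider a pattern where a group is needed only for alternation. The sketch below is a made-up example, not from the original answer: the title-and-surname pattern and the names CleanerGroupsDemo and groupOf are assumptions chosen for the demonstration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical example; the title/surname pattern is invented for illustration.
class CleanerGroupsDemo {
    // With a capturing alternation, the surname ends up in group 2.
    static final Pattern ALL_CAPTURING = Pattern.compile("(Mr|Ms)\\. (\\w+)");
    // With a non-capturing alternation, the surname is the only group.
    static final Pattern ONLY_NAME = Pattern.compile("(?:Mr|Ms)\\. (\\w+)");

    // Returns the text of the given group, or null if the input does not match.
    static String groupOf(Pattern p, String input, int group) {
        Matcher m = p.matcher(input);
        return m.matches() ? m.group(group) : null;
    }

    public static void main(String[] args) {
        String input = "Ms. Lovelace";
        System.out.println(groupOf(ALL_CAPTURING, input, 1)); // the title
        System.out.println(groupOf(ALL_CAPTURING, input, 2)); // the surname
        System.out.println(groupOf(ONLY_NAME, input, 1));     // the surname
    }
}
```

With the non-capturing version, the only numbered group is the one the caller actually wants, so the group index does not shift if more grouping-only parentheses are added later.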
It does not seem like a good idea to refactor existing regular expressions just to convert capturing groups to non-capturing ones, since doing so may break code that depends on the existing group numbers or require too much effort.