Python – re.findall() Not Working as Expected

python regex

I have the code:

import re
sequence="aabbaa"
rexp=re.compile("(aa|bb)+")
rexp.findall(sequence)

This returns ['aa']

If we have

import re
sequence="aabbaa"
rexp=re.compile("(aa|cc)+")
rexp.findall(sequence)

we get ['aa','aa']

Why is there a difference and why (for the first) do we not get ['aa','bb','aa']?

Thanks!

Best Answer

The unexpected behaviour comes down to how you wrote the regular expression:

rexp=re.compile("(aa|bb)+")

The parentheses in (aa|bb) form a capturing group.

And if we look at the docs of findall we will see this:

Return all non-overlapping matches of pattern in string, as a list of strings. The string is scanned left-to-right, and matches are returned in the order found. If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group. Empty matches are included in the result unless they touch the beginning of another match.

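Because you formed a group, it matched first aa, then bb, then aa again (because of the + quantifier). A repeated group keeps only the text of its last iteration, so in the end the group holds aa. findall returns that value in the list ['aa']: the whole expression matches only once (aabbaa), so the list contains a single element, the aa saved in the group. With (aa|cc)+ the bb in the middle cannot be matched, so the pattern matches twice (once for each aa), and each match contributes its group value, giving ['aa', 'aa'].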
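To see this directly, here is a small sketch that uses re.finditer to list both the whole match and the captured group for each of the two patterns from the question. The helper name show_matches is made up for illustration and is not part of the re module.

```python
import re

sequence = "aabbaa"

def show_matches(pattern, text):
    """Hypothetical helper: (whole match, final value of group 1) per match."""
    return [(m.group(0), m.group(1)) for m in re.finditer(pattern, text)]

# One match spanning the whole string; the group keeps only its last iteration.
first = show_matches(r"(aa|bb)+", sequence)

# Two separate matches, because "bb" cannot be consumed by (aa|cc).
second = show_matches(r"(aa|cc)+", sequence)

print(first)   # [('aabbaa', 'aa')]
print(second)  # [('aa', 'aa'), ('aa', 'aa')]
```

findall reports only the second element of each pair, which is why the two patterns give ['aa'] and ['aa', 'aa'].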

From the code you gave, you seemed to want to do this:

>>> rexp=re.compile("(?:aa|bb)+")
>>> rexp.findall(sequence)
['aabbaa']

(?: ...) does not create a capturing group, so findall returns the match of the whole expression.
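As a rough side-by-side, the sketch below runs findall with a capturing group, a non-capturing group, and two nested capturing groups on the same string. The variable names are only illustrative, and the nested pattern is an extra variation that does not appear in the question.

```python
import re

sequence = "aabbaa"

# Capturing group: findall returns the group's final value, not the whole match.
capturing = re.findall(r"(aa|bb)+", sequence)

# Non-capturing group: no groups in the pattern, so the whole match is returned.
non_capturing = re.findall(r"(?:aa|bb)+", sequence)

# Two capturing groups: findall returns one tuple per match.
nested = re.findall(r"((aa|bb)+)", sequence)

print(capturing)      # ['aa']
print(non_capturing)  # ['aabbaa']
print(nested)         # [('aabbaa', 'aa')]
```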

At the end of your question you show the desired output. You get it by simply searching for aa or bb; no quantifiers (+ or *) are needed. Do it the way Inbar Rose's answer does:

>>> rexp=re.compile("aa|bb")
>>> rexp.findall(sequence)
['aa', 'bb', 'aa']