The unwanted behaviour comes down to the way you formulated the regular expression:
rexp=re.compile("(aa|bb)+")
The parentheses in (aa|bb) form a capturing group.
If we look at the documentation of findall, we see this:
Return all non-overlapping matches of pattern in string, as a list of
strings. The string is scanned left-to-right, and matches are returned
in the order found. If one or more groups are present in the pattern,
return a list of groups; this will be a list of tuples if the pattern
has more than one group. Empty matches are included in the result
unless they touch the beginning of another match.
Because you formed a group, it matched aa first, then bb, then aa again (because of the + quantifier), so the group holds aa at the end. findall returns that value in the list ['aa']: the whole expression matches only once (aabbaa), so the list contains a single element, the aa saved in the group.
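To make the "last repetition wins" behaviour visible, here is a small sketch that compares the whole match with the group's content. The value of sequence is an assumption (the question's actual string is not repeated in this answer); it is chosen to be consistent with the outputs shown here.

```python
import re

# Assumed input; the original question's string is not shown in this answer.
sequence = "aabbaa"

match = re.search(r"(aa|bb)+", sequence)

whole = match.group(0)  # text matched by the entire pattern
last = match.group(1)   # a repeated group keeps only its final repetition

# findall reports the group, not the whole match, because the pattern has a group.
found = re.findall(r"(aa|bb)+", sequence)

print(whole)  # aabbaa
print(last)   # aa
print(found)  # ['aa']
```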
From the code you gave, you seemed to want to do this:
>>> rexp=re.compile("(?:aa|bb)+")
>>> rexp.findall(sequence)
['aabbaa']
(?:...) does not create a group, so findall returns the match of the whole expression.
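If the capturing group has to stay in the pattern for some other reason, one alternative is re.finditer, which yields match objects, so the whole match is always available through group(0). This is only a sketch of that alternative, not what the question asked for, and the input string below is made up for illustration.

```python
import re

# Hypothetical input with two separate runs of aa/bb pairs.
sequence = "aabbaa-bbaa"

rexp = re.compile(r"(aa|bb)+")  # capturing group left in place on purpose

# group(0) is the full match regardless of how many groups the pattern has.
whole_matches = [m.group(0) for m in rexp.finditer(sequence)]

print(whole_matches)  # ['aabbaa', 'bbaa']
```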
At the end of your question you show the desired output. You get it by simply looking for aa or bb; no quantifiers (+ or *) are needed. Just do it the way Inbar Rose's answer does:
>>> rexp=re.compile("aa|bb")
>>> rexp.findall(sequence)
['aa', 'bb', 'aa']
The problem you have is that if the regex passed to re.findall contains capturing groups (i.e. portions of the regex enclosed in parentheses), then it is the groups that are returned, rather than the whole matched string.
One way to solve this issue is to use non-capturing groups (prefixed with ?:).
>>> import re
>>> s = 'size=50;size=51;'
>>> re.findall('size=(?:50|51);', s)
['size=50;', 'size=51;']
If the regex passed to re.findall does not capture anything, it returns the whole of the matched string.
Although using character classes might be the simplest option in this particular case, non-capturing groups provide a more general solution.
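For comparison, here is a short sketch of the three variants side by side: a character class, a non-capturing group, and a plain capturing group. The character-class pattern 5[01] is my guess at what "character classes" means for this input; it is not taken from the question.

```python
import re

s = 'size=50;size=51;'

# Character class: no parentheses at all, so findall returns whole matches.
with_class = re.findall(r'size=5[01];', s)

# Non-capturing group: parentheses for alternation only, whole matches again.
with_noncapturing = re.findall(r'size=(?:50|51);', s)

# Capturing group: findall returns only what the group captured.
with_capturing = re.findall(r'size=(50|51);', s)

print(with_class)         # ['size=50;', 'size=51;']
print(with_noncapturing)  # ['size=50;', 'size=51;']
print(with_capturing)     # ['50', '51']
```

The character class only works because the alternatives differ in a single character; the non-capturing group handles arbitrary alternatives.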
Best Answer
When there are capture groups, findall returns only the captured parts. Use ?: to prevent the parentheses from being a capture group.