I have the code:
import re
sequence="aabbaa"
rexp=re.compile("(aa|bb)+")
rexp.findall(sequence)
This returns ['aa']
If we have
import re
sequence="aabbaa"
rexp=re.compile("(aa|cc)+")
rexp.findall(sequence)
we get ['aa','aa']
Why is there a difference, and why (for the first) do we not get ['aa','bb','aa']?
Thanks!
Best Answer
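The unwanted behaviour comes down to the way you formulate the regular expression:
Parentheses (aa|bb) form a group. And if we look at the docs of findall, we will see that when the pattern contains one or more groups, findall returns a list of the groups rather than the whole matches.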
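To make that rule concrete, here is a small sketch of how the return value of findall changes with the number of groups. The sample string and the variable names are made up for illustration and are not from the question.

```python
import re

# Illustrative input only; not taken from the question.
text = "10-20 30-40"

# No group: findall returns the whole matches.
no_group = re.findall(r"\d+-\d+", text)

# One group: findall returns only what that group captured.
one_group = re.findall(r"(\d+)-\d+", text)

# Several groups: findall returns one tuple of captures per match.
two_groups = re.findall(r"(\d+)-(\d+)", text)

print(no_group)    # ['10-20', '30-40']
print(one_group)   # ['10', '30']
print(two_groups)  # [('10', '20'), ('30', '40')]
```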
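As you formed a group, it first matched aa, then bb, then aa again (because of the + quantifier). A repeated group keeps only its last match, so this group holds aa in the end. And findall returns this value in the list ['aa'] (as there is only one match, aabbaa, of the whole expression, the list contains only one element, the aa which is saved in the group). With (aa|cc)+ the bb in the middle cannot be matched, so the expression matches twice, once for each aa, and you get ['aa','aa'].
From the code you gave, you seemed to want the match of the whole expression, which takes a non-capturing group.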
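A quick way to check this is to look at each match with finditer and compare the whole match with what the group holds. This is only a sketch: describe is a hypothetical helper name, and the non-capturing pattern at the end is presumably what the original code was aiming for.

```python
import re

sequence = "aabbaa"

def describe(pattern):
    """Return (span, whole match, final value of group 1) for every match."""
    return [(m.span(), m.group(0), m.group(1))
            for m in re.finditer(pattern, sequence)]

# One match covering the whole string; the group keeps only its last repetition.
print(describe("(aa|bb)+"))  # [((0, 6), 'aabbaa', 'aa')]

# bb cannot be matched, so there are two separate matches.
print(describe("(aa|cc)+"))  # [((0, 2), 'aa', 'aa'), ((4, 6), 'aa', 'aa')]

# Non-capturing group: no group exists, so findall returns the whole match.
print(re.findall("(?:aa|bb)+", sequence))  # ['aabbaa']
```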
(?: ...) doesn't create any group, so findall returns the match of the whole expression.
At the end of your question you show the desired output. This is achieved by just looking for aa or bb. No quantifiers (+ or *) are needed. Just do it the way shown in Inbar Rose's answer.
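The referenced answer is not reproduced here, so the following is only a minimal sketch of the idea, assuming it amounts to a plain alternation with no group and no quantifier:

```python
import re

sequence = "aabbaa"

# Plain alternation: every aa or bb is its own match, and with no group
# in the pattern findall returns the matches themselves.
matches = re.findall("aa|bb", sequence)
print(matches)  # ['aa', 'bb', 'aa']
```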