Different number of matches from re.findall and re.split

Jeremy

Hello all,

I am using re.split to separate some text into logical structures.
The trouble is that re.split doesn't find everything while re.findall
does; i.e.:
>>> found = re.findall('^ 1', line, re.MULTILINE)
>>> len(found)
6439
>>> tables = re.split('^ 1', line, re.MULTILINE)
>>> len(tables)
1

Can someone explain why these two commands are giving different
results? I thought I should have the same number of matches (or maybe
different by 1, but not 6000!)

Thanks,
Jeremy
 
Iain King

re.split doesn't take re.MULTILINE as its third argument: that slot is
the maxsplit parameter, so you are passing it the value of re.MULTILINE
(which happens to be 8 in my implementation). No flag is actually set,
so ^ only matches at the very start of the string. Your text presumably
doesn't begin with ' 1', so re.split finds nothing to split on and
returns a single piece.
'split(pattern, string, maxsplit=0)\n    Split the source string by the occurrences of the pattern,\n    returning a list containing the resulting substrings.\n'

['split(pattern,', 'string,', 'maxsplit=0)\n', '', '', '', 'Split',
 'the', 'source string by the occurrences of the pattern,\n    returning a list containing the resulting substrings.\n']

['split(pattern,', 'string,', 'maxsplit=0)\n', '', '', '', 'Split',
 'the', 'source', 'string', 'by', 'the', 'occurrences', 'of', 'the',
 'pattern,\n', '', '', '', 'returning', 'a', 'list', 'containing',
 'the', 'resulting', 'substrings.\n']
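The effect described above can be reproduced with a short sketch. The sample text and variable names here are made up for illustration (the original poster's data is not shown), and maxsplit is passed by keyword so that it is obvious which slot the value lands in:

```python
import re

# Hypothetical sample text; it deliberately does not start with ' 1'.
text = "header\n 1 first\nrow\n 1 second\nrow\n 1 third\n"

# re.MULTILINE is just an integer-valued constant.
as_int = int(re.MULTILINE)

# Putting that value in the maxsplit slot only caps the number of splits.
# No flag is set, so ^ still matches only at the very start of the string.
pieces = re.split(r"^ 1", text, maxsplit=as_int)

# With a pattern that does match, the cap becomes visible:
# at most `as_int` splits are made and the rest stays in one piece.
words = re.split(r" ", "a b c d e f g h i j k l", maxsplit=as_int)

print(as_int)
print(len(pieces))
print(words)
```

The last list stops splitting after the eighth space, which is the same pattern as in the docstring output above.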


Iain
 
Jeremy

Yep. Thanks for pointing that out. I guess I just assumed that
re.split was similar to re.search/match/findall in what it accepted as
function parameters. I guess I'll have to use a \n instead of a ^ for
split.
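One caveat with swapping ^ for \n, sketched on made-up sample text: a pattern that begins with \n cannot match on the first line, and it swallows the newline before each later match. The (?m) inline flag is used only to give a comparison point for what the ^ version would return with MULTILINE in effect.

```python
import re

# Hypothetical sample text whose first line is also a table start.
text = " 1 first\nrow\n 1 second\nrow\n"

# Splitting on a literal newline followed by ' 1'.
by_newline = re.split(r"\n 1", text)

# Splitting on ' 1' at the start of any line.
by_anchor = re.split(r"(?m)^ 1", text)

print(by_newline)
print(by_anchor)
```

The newline version leaves the first table's ' 1' marker in place and drops the trailing newline of the preceding piece, so the two results are not interchangeable.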

Thanks,
Jeremy
 
Arnaud Delobelle

When in doubt, the documentation is a good place to start :)

http://docs.python.org/library/re.html#re.split

re.split(pattern, string[, maxsplit=0])

http://docs.python.org/library/re.html#re.findall

re.findall(pattern, string[, flags])

Notice that re.split's optional third argument is not for passing
flags.
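For readers on later versions: Python 2.7 and 3.1 added a flags argument to re.split, placed after maxsplit. A minimal sketch, assuming such a version and using made-up sample text, passes it by keyword so it cannot land in the maxsplit slot:

```python
import re

# Hypothetical sample text; the original data is not shown in the thread.
text = "header\n 1 first\nrow\n 1 second\nrow\n"

# Same pattern and same flag for both calls, passed by keyword.
found = re.findall(r"^ 1", text, flags=re.MULTILINE)
tables = re.split(r"^ 1", text, flags=re.MULTILINE)

print(len(found))
print(tables)
```

With the flag actually in effect, the split result has one more element than the number of matches, which is the "different by 1" the original question expected.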

HTH
 
Peter Otten

Jeremy said:
Yep. Thanks for pointing that out. I guess I just assumed that
re.split was similar to re.search/match/findall in what it accepted as
function parameters. I guess I'll have to use a \n instead of a ^ for
split.

You can precompile the pattern and then invoke the split() method:

>>> re.compile("^X", re.MULTILINE).split("""X alpha
... beta
... X gamma
... delta X
... X
... zeta
... """)
['', ' alpha\nbeta\n', ' gamma\ndelta X\n', '\nzeta\n']

Peter
 
MRAB

Jeremy said:
Yep. Thanks for pointing that out. I guess I just assumed that
re.split was similar to re.search/match/findall in what it accepted as
function parameters. I guess I'll have to use a \n instead of a ^ for
split.

You could use the .split method of a pattern object instead:

tables = re.compile('^ 1', re.MULTILINE).split(line)
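A slightly fuller sketch of the same idea, on made-up sample text (the names table_start and text are illustrative only). The flags are fixed at compile time, so the pattern object's split() has no flags argument to mix up with maxsplit:

```python
import re

# Hypothetical sample text; the original data is not shown in the thread.
text = "header\n 1 first\nrow\n 1 second\nrow\n"

# Flags are given once, when the pattern is compiled.
table_start = re.compile(r"^ 1", re.MULTILINE)

matches = table_start.findall(text)
tables = table_start.split(text)

# maxsplit is still available on the compiled pattern if it is wanted.
first_only = table_start.split(text, maxsplit=1)

print(len(matches), len(tables))
print(first_only)
```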
 
Steve Holden

Jeremy said:
Yep. Thanks for pointing that out. I guess I just assumed that
re.split was similar to re.search/match/findall in what it accepted as
function parameters. I guess I'll have to use a \n instead of a ^ for
split.

Thanks,
Jeremy

Remember you can specify flags inside the pattern itself.
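As a small sketch of that suggestion, the inline form of re.MULTILINE is (?m), placed at the very start of the pattern. The sample text is made up for illustration:

```python
import re

# Hypothetical sample text; the original data is not shown in the thread.
text = "header\n 1 first\nrow\n 1 second\nrow\n"

# (?m) turns on MULTILINE from inside the pattern, so no flags argument is needed.
tables = re.split(r"(?m)^ 1", text)

print(tables)
```

This works with the plain re.split(pattern, string) call, so nothing has to be passed in the third position at all.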

regards
Steve
 
Steve Holden

Jeremy said:
Hello all,

I am using re.split to separate some text into logical structures.
The trouble is that re.split doesn't find everything while re.findall
does; i.e.:


Can someone explain why these two commands are giving different
results? I thought I should have the same number of matches (or maybe
different by 1, but not 6000!)

re.MULTILINE is apparently 1, and you are providing it as the "maxsplit"
argument. Check the API in the documentation.

regards
Steve
 
Steve Holden

Steve Holden wrote:
[...]
re.MULTILINE is apparently 1, and you are providing it as the "maxsplit"
argument. Check the API in the documentation.

Sorry, I presume re.MULTILINE must actually be zero for the result of
re.split() to be of length 1 ...

regards
Steve
 
Tim Chase

Steve said:
Steve Holden wrote:
[...]
re.MULTILINE is apparently 1, and you are providing it as the "maxsplit"
argument. Check the API in the documentation.

Sorry, I presume re.MULTILINE must actually be zero for the result of
re.split() to be of length 1 ...

Because it's not doing a multiline split and it's anchored at the
beginning of the line, it only returns one result (there's
nothing before the start-of-line to return as the left-side of
the split):
>>> import re
>>> s = """
... abc
... def
... abc
... def"""
>>> re.split('^', s, re.MULTILINE)
['\nabc\ndef\nabc\ndef']
>>> re.split('b', s, re.MULTILINE)
['\na', 'c\ndef\na', 'c\ndef']
>>> re.split('b', 'ab'*10, re.MULTILINE)
['a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'abab']
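To make the anchoring point concrete, here is a small sketch that lists where ^ is able to match in that same string, with and without the flag. It uses re.finditer, whose third parameter really is flags; the variable names are illustrative only.

```python
import re

# Same string as in the session above.
s = "\nabc\ndef\nabc\ndef"

# Without MULTILINE, ^ matches only at the very start of the string.
plain = [m.start() for m in re.finditer(r"^", s)]

# With MULTILINE, ^ also matches right after every newline.
multi = [m.start() for m in re.finditer(r"^", s, flags=re.MULTILINE)]

print(plain)
print(multi)
```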


But your original logic is sound...the 3rd argument to re.split() is
"maxsplit" not "flags", and if you want to use flags with
.split() you have to either specify them within the regexp or
compile the regexp and use the resulting compiled object, as
detailed elsewhere in the thread by MRAB and Duncan.

-tkc
 
