split on blank lines

Jan Burgy

Hi everyone,

can somebody tell me why (using Python 2.3.2)
import re
re.compile(r"^$", re.MULTILINE).split("foo\n\nbar\n\nbaz")
['foo\n\nbar\n\nbaz']

? Being used to Perl semantics, I expect

['foo\n', 'bar\n', 'baz']

or something equivalent without the '\n' characters in the result
strings. I have found that
re.compile(r"^\n", re.MULTILINE).split("foo\n\nbar\n\nbaz")
['foo\n', 'bar\n', 'baz']

I prefer the first version however because my intent is stated more
clearly. Could this be a bug in sre.py (I looked at the code for a
good two minutes but then my head started hurting)

Thanks for your help,

Jan
 
Duncan Booth

Jan Burgy wrote:
can somebody tell me why (using Python 2.3.2)
import re
re.compile(r"^$", re.MULTILINE).split("foo\n\nbar\n\nbaz")
['foo\n\nbar\n\nbaz']

? Being used to Perl semantics, I expect

['foo\n', 'bar\n', 'baz']

or something equivalent without the '\n' characters in the result
strings. I have found that
re.compile(r"^\n", re.MULTILINE).split("foo\n\nbar\n\nbaz")
['foo\n', 'bar\n', 'baz']

I prefer the first version however because my intent is stated more
clearly. Could this be a bug in sre.py (I looked at the code for a
good two minutes but then my head started hurting)

Given that re.compile("^$", re.MULTILINE).findall("foo\n\nbar\n\nbaz")
returns ['', ''] I would agree this looks like a bug. You could submit a
bug report on Sourceforge.

Of course, if you really want to state your intentions, you could just use:
"foo\n\nbar\n\nbaz".split("\n\n")
['foo', 'bar', 'baz']

as you aren't doing anything here that obviously benefits from regex
obfuscation.
 
Hans Nowak

Duncan said:
Given that re.compile("^$", re.MULTILINE).findall("foo\n\nbar\n\nbaz")
returns ['', ''] I would agree this looks like a bug. You could submit a
bug report on Sourceforge.

I may be wrong, but I would think that the behavior is correct. "^$" matches an
empty line. This is exactly what findall returns... two empty lines.
 
Duncan Booth

Hans Nowak said:
Duncan said:
Given that re.compile("^$", re.MULTILINE).findall("foo\n\nbar\n\nbaz")
returns ['', ''] I would agree this looks like a bug. You could submit a
bug report on Sourceforge.

I may be wrong, but I would think that the behavior is correct. "^$"
matches an empty line. This is exactly what findall returns... two
empty lines.

Perhaps you trimmed too much of the original context, but you have
misunderstood the original poster's intent.

The original post said:
can somebody tell me why (using Python 2.3.2)
import re
re.compile(r"^$", re.MULTILINE).split("foo\n\nbar\n\nbaz")
['foo\n\nbar\n\nbaz']

Notice that the string they are splitting contains two empty lines. I
pointed out that re.findall correctly spots the two empty lines, and
therefore you would expect that the split should correctly split the string
there, but it doesn't.

For the avoidance of doubt: there is an inconsistency of behaviour between
re.findall and re.split. It looks to me like a bug in the re.split method.
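
To make the inconsistency concrete, here is a small sketch that lists where the empty matches fall in the original poster's string. It only uses findall and finditer, whose results are described above; what split does with these zero-width matches depends on the Python version, so it is deliberately left out. The variable names are illustrative only.

```python
import re

text = "foo\n\nbar\n\nbaz"
blank = re.compile(r"^$", re.MULTILINE)

# findall reports one empty string per blank line.
empty_matches = blank.findall(text)

# finditer shows where those zero-width matches sit in the string.
positions = [m.start() for m in blank.finditer(text)]

print(empty_matches)  # ['', '']
print(positions)      # [4, 9]
```

Both matches are zero characters wide, which is exactly the case where split on 2.3.2 returns the string unsplit.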
 
Jan Burgy

Duncan Booth said:
Jan Burgy wrote:
can somebody tell me why (using Python 2.3.2)
import re
re.compile(r"^$", re.MULTILINE).split("foo\n\nbar\n\nbaz")
['foo\n\nbar\n\nbaz']

? Being used to Perl semantics, I expect

['foo\n', 'bar\n', 'baz']

or something equivalent without the '\n' characters in the result
strings. I have found that
re.compile(r"^\n", re.MULTILINE).split("foo\n\nbar\n\nbaz")
['foo\n', 'bar\n', 'baz']

I prefer the first version however because my intent is stated more
clearly. Could this be a bug in sre.py (I looked at the code for a
good two minutes but then my head started hurting)

Given that re.compile("^$", re.MULTILINE).findall("foo\n\nbar\n\nbaz")
returns ['', ''] I would agree this looks like a bug. You could submit a
bug report on Sourceforge.

Of course, if you really want to state your intentions, you could just use:
"foo\n\nbar\n\nbaz".split("\n\n")
['foo', 'bar', 'baz']

as you aren't doing anything here that obviously benefits from regex
obfuscation.

Thank you Duncan for your input. You're right, I will post a bug
report on SourceForge. Why, you ask, do I split on "^$" and not simply
"\n\n"? Simply because I'm dealing with an idiotic file format (not my
own, mind you) and I really want to split on "^\t*$" (I agree with
you that it's a rather arbitrary definition of a blank line; once
again, not mine). When the above didn't work, I spent a long time
questioning my understanding of regular expressions until I could
simplify my code to the minimal amount that still yielded the error.
Sometimes I wish that Python contained more elements from AWK (in
particular "RS", for instance).

Cheers,

Jan
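
One possible workaround for the tab-only blank lines described above, sketched under the assumption that a "blank line" is a line containing nothing but tabs: let the pattern consume the surrounding newlines so that every match is non-empty. The function name split_on_blank_lines is hypothetical and not part of the re module.

```python
import re

# A newline, followed by one or more lines that hold only tabs (or nothing).
# Every match consumes at least two characters, so it is never zero-width.
BLANK_LINES = re.compile(r"\n(?:\t*\n)+")

def split_on_blank_lines(text):
    """Split text into records separated by empty or tab-only lines."""
    return BLANK_LINES.split(text)

print(split_on_blank_lines("foo\n\nbar\n\nbaz"))
print(split_on_blank_lines("foo\n\t\t\nbar\n\n\nbaz"))
```

This roughly imitates AWK's paragraph-style record splitting for this one format. It does not treat blank lines at the very start or end of the input specially, so those may need trimming first.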
 
