Splitting on '^' ?


kj

Sometimes I want to split a string into lines, preserving the
end-of-line markers. In Perl this is really easy to do, by splitting
on the beginning-of-line anchor:

@lines = split /^/, $string;

But I can't figure out how to do the same thing with Python. E.g.:
>>> import re
>>> re.split('^', 'spam\nham\neggs\n')
['spam\nham\neggs\n']
>>> re.split('(?m)^', 'spam\nham\neggs\n')
['spam\nham\neggs\n']
>>> bol_re = re.compile('^', re.M)
>>> bol_re.split('spam\nham\neggs\n')
['spam\nham\neggs\n']

Am I doing something wrong?

kynn
 

Ethan Furman

kj said:
Am I doing something wrong?

kynn

As you probably noticed from the other responses: No, you can't split
on _and_ keep the splitby text.

Looks like you'll have to roll your own.

def splitat(text, sep):
    result = [line + sep for line in text.split(sep)]
    if result[-1] == sep:  # either remove extra element
        result.pop()
    else:  # or extra sep from last element
        result[-1] = result[-1][:-len(sep)]
    return result
 

MRAB

Gary said:
Just split on the EOL character: the "\n":
>>> re.split('\n', 'spam\nham\neggs\n')
['spam', 'ham', 'eggs', '']

The "^" and "$" characters do not match END-OF-LINE, but rather the
END-OF-STRING, which was doing you no good.

With the MULTILINE flag, "^" matches START-OF-LINE and "$" matches
END-OF-LINE or END-OF-STRING.

The current re module won't split on a zero-width match.
 

Piet van Oostrum

kj said:
k> Am I doing something wrong?

It says that in the doc of 're':
Note that split will never split a string on an empty pattern match. For
example:
>>> re.split('x*', 'foo')
['foo']
>>> re.split("(?m)^$", "foo\n\nbar\n")
['foo\n\nbar\n']
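
The note quoted above describes the re module as it was documented at the time. Since Python 3.7, re.split does split on patterns that can match an empty string, so the result of the original attempt depends on the interpreter version. A minimal sketch, assuming Python 3.7 or later; the filtering step is an addition here, not part of the thread:

```python
import re

text = 'spam\nham\neggs\n'

# On Python 3.7+, re.split accepts a pattern that matches the empty string.
# On older versions this call returns [text] unchanged, as shown in the thread.
pieces = re.split(r'(?m)^', text)

# Drop the empty strings that zero-width matches at the string
# boundaries can leave behind.
lines = [piece for piece in pieces if piece]
print(lines)
```

On older interpreters the other approaches in this thread still apply.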
 

rurpy


Why not this?
>>> 'spam\nham\neggs\n'.splitlines(True)
['spam\n', 'ham\n', 'eggs\n']
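
Output of this shape is what str.splitlines returns when its keepends argument is true. A short sketch of the difference the argument makes; the variable names are only for illustration:

```python
text = 'spam\nham\neggs\n'

# A true keepends argument keeps each line's terminator attached to the line.
kept = text.splitlines(True)

# Without the argument, the terminators are dropped.
stripped = text.splitlines()

# With the terminators kept, joining the pieces restores the input exactly.
assert ''.join(kept) == text

print(kept)
print(stripped)
```

Because the terminators survive, the split is lossless, which is the property the original Perl idiom provides.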
 

MRAB

Ethan said:
As you probably noticed from the other responses: No, you can't split
on _and_ keep the splitby text.

['ab', 'x', 'cd']

You _can't_ split on a zero-width match:
['ab', 'x', 'cd']

but you can use re.sub to replace zero-width matches with something
that's not zero-width and then split on that (best with str.split):
['', 'a', 'b', 'c', 'd', '']
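
The outputs above appear without the calls that produced them. The following is a sketch of the two techniques MRAB describes, not a reconstruction of the original calls: a capturing group makes re.split keep the separator, and re.sub can turn a zero-width match into a real marker that str.split can then split on. The marker character and the lookbehind pattern are assumptions chosen for the line-splitting case.

```python
import re

# A capturing group in the pattern makes re.split return the separator too.
parts = re.split("(x)", "abxcd")
print(parts)

text = 'spam\nham\neggs\n'

# Assumption: this marker character never occurs in the text being split.
MARK = '\0'

# The lookbehind matches the zero-width position right after each newline;
# re.sub inserts the marker there, giving str.split something to split on.
marked = re.sub(r'(?<=\n)', MARK, text)

# The marker after the final newline leaves a trailing empty piece; drop it.
lines = [piece for piece in marked.split(MARK) if piece]
print(lines)
```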
 

Ethan Furman

MRAB said:
but you can use re.sub to replace zero-width matches with something
that's not zero-width and then split on that (best with str.split):
['', 'a', 'b', 'c', 'd', '']

Wow! I stand corrected, although I'm in danger of falling over from the
dizziness! :)

As impressive as that is, I don't think it does what the OP is looking
for. rurpy reminded us (or at least me ;) of .splitlines(), which seems
to do exactly what the OP is looking for. I do take some comfort that
my little snippet works for more than newlines alone, although I'm not
aware of any other use-cases. :(

~Ethan~

Oh, hey, how about this?

re.compile('(^[^\n]*\n?)', re.M).findall('text\ntext\ntext')

Although this does give me an extra blank segment at the end... oh well.
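
The extra blank segment shows up when the text ends with a newline, because every part of that pattern is optional and it can match the empty string at the very end. One possible variation, offered as a sketch rather than as what Ethan intended, uses two alternatives that each require at least one character (the name LINE_RE is hypothetical):

```python
import re

# Each match is either a run of non-newline characters followed by a
# newline, or a final non-empty run with no newline. Neither alternative
# can match the empty string, so no blank segment is produced.
LINE_RE = re.compile(r'[^\n]*\n|[^\n]+')

with_trailing_newline = LINE_RE.findall('text\ntext\ntext\n')
with_blank_line = LINE_RE.findall('text\n\ntext')

print(with_trailing_newline)
print(with_blank_line)
```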
 

kj

rurpy said:
Why not this?
>>> 'spam\nham\neggs\n'.splitlines(True)
['spam\n', 'ham\n', 'eggs\n']

That's perfect.

And .splitlines seems to be able to handle all "standard" end-of-line
markers without any special direction (which, ironically, strikes
me as a *little* Perlish, somehow):
>>> 'spam\r\nham\reggs\n'.splitlines(True)
['spam\r\n', 'ham\r', 'eggs\n']

Amazing. I'm not sure this is the *best* way to do this in general
(I would have preferred it, and IMHO it would have been more
Pythonic, if .splitlines accepted an additional optional argument
where one could specify the end-of-line sequence to be used for
the splitting, defaulting to the OS's conventional sequence, and
then it split *strictly* on that sequence).
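
For illustration, the behaviour wished for here can be sketched as a small helper. The name splitlines_strict and its eol parameter are hypothetical; str.splitlines itself takes no such argument.

```python
import os


def splitlines_strict(text, eol=os.linesep):
    """Hypothetical helper: split text after each occurrence of eol,
    keeping eol attached to the line it terminates."""
    if not eol:
        raise ValueError("eol must be a non-empty string")
    lines = []
    start = 0
    while True:
        end = text.find(eol, start)
        if end == -1:
            break
        end += len(eol)  # include the terminator in the line
        lines.append(text[start:end])
        start = end
    if start < len(text):
        lines.append(text[start:])  # trailing text without a terminator
    return lines


# Only '\n' counts as a line end here, so the lone '\r' stays inside a line.
strict = splitlines_strict('spam\r\nham\reggs\n', '\n')
print(strict)
```

Unlike .splitlines, this treats every other end-of-line sequence as ordinary text.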

But for now this .splitlines will do nicely.

Thanks!

kynn
 

John Yeung

kj said:
And .splitlines seems to be able to handle all
"standard" end-of-line markers without any special
direction (which, ironically, strikes
me as a *little* Perlish, somehow):
['spam\r\n', 'ham\r', 'eggs\n']

It's Pythonic. Universal newline handling for text has been a staple
of Python for as long as I can remember (very possibly since the very
beginning).

kj said:
Amazing.  I'm not sure this is the *best* way to do
this in general (I would have preferred it, and IMHO
it would have been more Pythonic, if .splitlines
accepted an additional optional argument [...]).

I believe it's the best way. When you can use a string method instead
of a regex, it's definitely most Pythonic to use the string method.

I would argue that this particular string method is Pythonic in
design. Remember, Python strives not only for explicitness, but
simplicity and ease of use. When dealing with text, universal
newlines are much more often than not simpler and easier for the
programmer.

John
 
