Splitting on '^' ?

kj · Aug 14, 2009

Sometimes I want to split a string into lines, preserving the
end-of-line markers. In Perl this is really easy to do, by splitting
on the beginning-of-line anchor:

@lines = split /^/, $string;

But I can't figure out how to do the same thing with Python. E.g.:

import re
re.split('^', 'spam\nham\neggs\n')
['spam\nham\neggs\n']

re.split('(?m)^', 'spam\nham\neggs\n')
['spam\nham\neggs\n']

bol_re = re.compile('^', re.M)
bol_re.split('spam\nham\neggs\n')
['spam\nham\neggs\n']

Am I doing something wrong?

kynn

Tycho Andersen · Aug 14, 2009

[snip]

import re
re.split('^', 'spam\nham\neggs\n')
['spam\nham\neggs\n']

re.split('(?m)^', 'spam\nham\neggs\n')
['spam\nham\neggs\n']

bol_re = re.compile('^', re.M)
bol_re.split('spam\nham\neggs\n')
['spam\nham\neggs\n']

Am I doing something wrong?

Why not just:


Gabriel · Aug 14, 2009

import re
re.split('^', 'spam\nham\neggs\n')
['spam\nham\neggs\n']

re.split('(?m)^', 'spam\nham\neggs\n')
['spam\nham\neggs\n']

bol_re = re.compile('^', re.M)
bol_re.split('spam\nham\neggs\n')
['spam\nham\neggs\n']

Am I doing something wrong?

Maybe this:
['spam', 'ham', 'eggs', '']

Ethan Furman · Aug 14, 2009

kj said:
Sometimes I want to split a string into lines, preserving the
end-of-line markers. In Perl this is really easy to do, by splitting
on the beginning-of-line anchor:

@lines = split /^/, $string;

But I can't figure out how to do the same thing with Python. E.g.:

import re
re.split('^', 'spam\nham\neggs\n')
['spam\nham\neggs\n']

re.split('(?m)^', 'spam\nham\neggs\n')
['spam\nham\neggs\n']

bol_re = re.compile('^', re.M)
bol_re.split('spam\nham\neggs\n')
['spam\nham\neggs\n']

Am I doing something wrong?

kynn

As you probably noticed from the other responses: No, you can't split
on _and_ keep the splitby text.

Looks like you'll have to roll your own.

def splitat(text, sep):
    result = [line + sep for line in text.split(sep)]
    if result[-1] == sep:   # either remove extra element
        result.pop()
    else:                   # or extra sep from last element
        result[-1] = result[-1][:-len(sep)]
    return result
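
A quick way to see what this helper does, as a minimal sketch. The function is restated here so the example runs on its own, and the multi-character input is a hypothetical one, not taken from the thread:

```python
def splitat(text, sep):
    """Split text on sep, keeping sep at the end of each piece."""
    result = [line + sep for line in text.split(sep)]
    if result[-1] == sep:
        # text ended with sep, so the last piece is a spurious bare sep
        result.pop()
    else:
        # text did not end with sep, so strip the one that was appended
        result[-1] = result[-1][:-len(sep)]
    return result


lines = splitat('spam\nham\neggs\n', '\n')
print(lines)

# A multi-character separator (hypothetical input) works the same way.
records = splitat('a--b--c', '--')
print(records)
```

Joining the pieces back together reproduces the input in both cases, which is the property the original question was after.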

MRAB · Aug 14, 2009

Gary said:
kj said:

Sometimes I want to split a string into lines, preserving the
end-of-line markers. In Perl this is really easy to do, by splitting
on the beginning-of-line anchor:

@lines = split /^/, $string;

But I can't figure out how to do the same thing with Python. E.g.:

import re
re.split('^', 'spam\nham\neggs\n')
['spam\nham\neggs\n']

re.split('(?m)^', 'spam\nham\neggs\n')
['spam\nham\neggs\n']

bol_re = re.compile('^', re.M)
bol_re.split('spam\nham\neggs\n')

['spam\nham\neggs\n']

Am I doing something wrong?

Just split on the EOL character: the "\n":
re.split('\n', 'spam\nham\neggs\n')
['spam', 'ham', 'eggs', '']

The "^" and "$" characters do not match END-OF-LINE, but rather the
END-OF-STRING, which was doing you no good.

With the MULTILINE flag "^" matches START-OF-LINE and "$" matches
END-OF-LINE or END-OF-STRING.

The current re module won't split on a zero-width match.
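
That limitation describes the re module at the time of this thread (2009). From Python 3.7 onward, re.split does split on empty matches, so the multiline anchor can be used directly. A minimal sketch, assuming Python 3.7 or later; the filtering step is an addition here, meant to drop any empty strings produced by anchor matches at the edges of the string:

```python
import re

text = 'spam\nham\neggs\n'

# On Python 3.7+, re.split accepts patterns that match the empty string.
pieces = re.split(r'(?m)^', text)

# Drop empty pieces (for example, the one before the match at position 0).
lines = [piece for piece in pieces if piece]
print(lines)
```

On older interpreters the same call returns the whole string as a single element, which is the behaviour reported in the original question.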

Piet van Oostrum · Aug 14, 2009

kj said:
k> Sometimes I want to split a string into lines, preserving the
k> end-of-line markers. In Perl this is really easy to do, by splitting
k> on the beginning-of-line anchor:

k> @lines = split /^/, $string;

k> But I can't figure out how to do the same thing with Python. E.g.:

k> import re
k> re.split('^', 'spam\nham\neggs\n')
k> ['spam\nham\neggs\n']
k> re.split('(?m)^', 'spam\nham\neggs\n')
k> ['spam\nham\neggs\n']
k> bol_re = re.compile('^', re.M)
k> bol_re.split('spam\nham\neggs\n')
k> ['spam\nham\neggs\n']

k> Am I doing something wrong?

It says that in the doc of 're':
Note that split will never split a string on an empty pattern match. For
example:

re.split('x*', 'foo')
['foo']
re.split("(?m)^$", "foo\n\nbar\n")
['foo\n\nbar\n']

rurpy · Aug 14, 2009

Sometimes I want to split a string into lines, preserving the
end-of-line markers. In Perl this is really easy to do, by splitting
on the beginning-of-line anchor:

@lines = split /^/, $string;

But I can't figure out how to do the same thing with Python. E.g.:

Why not this?
['spam\n', 'ham\n', 'eggs\n']
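
The output above is what str.splitlines returns when told to keep the line endings, and a later reply in the thread confirms that .splitlines() was the suggestion. A minimal sketch, assuming the sample string from the original question:

```python
text = 'spam\nham\neggs\n'

# Passing True (the keepends argument) keeps each line's terminator attached.
lines = text.splitlines(True)
print(lines)  # ['spam\n', 'ham\n', 'eggs\n']

# Without the argument, the terminators are dropped.
bare = text.splitlines()
print(bare)  # ['spam', 'ham', 'eggs']
```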

MRAB · Aug 14, 2009

Ethan said:
kj said:

Sometimes I want to split a string into lines, preserving the
end-of-line markers. In Perl this is really easy to do, by splitting
on the beginning-of-line anchor:

@lines = split /^/, $string;

But I can't figure out how to do the same thing with Python. E.g.:

import re
re.split('^', 'spam\nham\neggs\n')
['spam\nham\neggs\n']

re.split('(?m)^', 'spam\nham\neggs\n')
['spam\nham\neggs\n']

bol_re = re.compile('^', re.M)
bol_re.split('spam\nham\neggs\n')

['spam\nham\neggs\n']

Am I doing something wrong?

kynn

As you probably noticed from the other responses: No, you can't split
on _and_ keep the splitby text.

['ab', 'x', 'cd']

You _can't_ split on a zero-width match:
['ab', 'x', 'cd']

but you can use re.sub to replace zero-width matches with something
that's not zero-width and then split on that (best with str.split):
['', 'a', 'b', 'c', 'd', '']
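
A minimal sketch of the two techniques described here. The inputs ('abxcd', 'abcd') and the '@' and NUL markers are assumptions chosen to reproduce outputs like those shown, not necessarily the calls from the original post:

```python
import re

# A capturing group in the pattern makes re.split keep the separator.
kept = re.split('(x)', 'abxcd')
print(kept)  # ['ab', 'x', 'cd']

# Replace each zero-width match with a marker that does not occur in the
# text, then split on the marker with str.split.
marked = re.sub('', '@', 'abcd')
parts = marked.split('@')
print(parts)  # ['', 'a', 'b', 'c', 'd', '']

# The same idea applied to the beginning-of-line anchor; empty pieces
# at the edges are filtered out.
text = 'spam\nham\neggs\n'
lines = [p for p in re.sub('(?m)^', '\0', text).split('\0') if p]
print(lines)  # ['spam\n', 'ham\n', 'eggs\n']
```

The marker has to be something that cannot appear in the input, which is why a NUL character is used for the line-splitting case.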

Ethan Furman · Aug 15, 2009

MRAB said:
Ethan said:

kj said:

Sometimes I want to split a string into lines, preserving the
end-of-line markers. In Perl this is really easy to do, by splitting
on the beginning-of-line anchor:

@lines = split /^/, $string;

But I can't figure out how to do the same thing with Python. E.g.:

import re
re.split('^', 'spam\nham\neggs\n')

['spam\nham\neggs\n']

re.split('(?m)^', 'spam\nham\neggs\n')

['spam\nham\neggs\n']

bol_re = re.compile('^', re.M)
bol_re.split('spam\nham\neggs\n')

['spam\nham\neggs\n']

Am I doing something wrong?

kynn

As you probably noticed from the other responses: No, you can't split
on _and_ keep the splitby text.

['ab', 'x', 'cd']

You _can't_ split on a zero-width match:
['ab', 'x', 'cd']

but you can use re.sub to replace zero-width matches with something
that's not zero-width and then split on that (best with str.split):
['', 'a', 'b', 'c', 'd', '']

Wow! I stand corrected, although I'm in danger of falling over from the
dizziness!

As impressive as that is, I don't think it does what the OP is looking
for. rurpy reminded us (or at least me) of .splitlines(), which seems
to do exactly what the OP is looking for. I do take some comfort that
my little snippet works for more than newlines alone, although I'm not
aware of any other use-cases.

~Ethan~

Oh, hey, how about this?

re.compile('(^[^\n]*\n?)', re.M).findall('text\ntext\ntext')

Although this does give me an extra blank segment at the end... oh well.

kj · Aug 16, 2009

Why not this? ['spam\n', 'ham\n', 'eggs\n']

That's perfect.

And .splitlines seems to be able to handle all "standard" end-of-line
markers without any special direction (which, ironically, strikes
me as a *little* Perlish, somehow):
['spam\r\n', 'ham\r', 'eggs\n']
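
For reference, a minimal sketch of the behaviour described here, assuming an input string with mixed line endings that matches the output shown:

```python
text = 'spam\r\nham\reggs\n'

# splitlines recognises \r\n, \r and \n without being told which to expect.
lines = text.splitlines(True)
print(lines)  # ['spam\r\n', 'ham\r', 'eggs\n']

# Splitting strictly on one sequence is still possible with str.split,
# at the cost of losing the separators.
strict = text.split('\n')
print(strict)  # ['spam\r', 'ham\reggs', '']
```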

Amazing. I'm not sure this is the *best* way to do this in general
(I would have preferred it, and IMHO it would have been more
Pythonic, if .splitlines accepted an additional optional argument
where one could specify the end-of-line sequence to be used for
the splitting, defaulting to the OS's conventional sequence, and
then it split *strictly* on that sequence).

But for now this .splitlines will do nicely.

Thanks!

kynn

John Yeung · Aug 16, 2009

And .splitlines seems to be able to handle all
"standard" end-of-line markers without any special
direction (which, ironically, strikes
me as a *little* Perlish, somehow):

It's Pythonic. Universal newline-handling for text has been a staple
of Python for as long as I can remember (very possibly since the very
beginning).

['spam\r\n', 'ham\r', 'eggs\n']

Amazing. I'm not sure this is the *best* way to do
this in general (I would have preferred it, and IMHO
it would have been more Pythonic, if .splitlines
accepted an additional optional argument [...]).

I believe it's the best way. When you can use a string method instead
of a regex, it's definitely most Pythonic to use the string method.

I would argue that this particular string method is Pythonic in
design. Remember, Python strives not only for explicitness, but
simplicity and ease of use. When dealing with text, universal
newlines are much more often than not simpler and easier for the
programmer.

John
