Behavior of re.split on empty strings is unexpected

John Nagle · Aug 2, 2010

The regular expression "split" behaves slightly differently than string
split:

>>> import re
>>> kresplit = re.compile(r'[^\w\&]+',re.UNICODE)
>>> kresplit.split(" HELLO THERE ")
['', 'HELLO', 'THERE', '']
['VERISIGN', 'INC', '']

I'd thought that "split" would never produce an empty string, but
it will.

The regular string split operation doesn't yield empty strings:
['HELLO', 'THERE']

If I try to get the functionality of string split with re:
>>> s2 = " HELLO THERE "
>>> kresplit4 = re.compile(r'\W+', re.UNICODE)
>>> kresplit4.split(s2)
['', 'HELLO', 'THERE', '']

I still get empty strings.

The documentation just describes re.split as "Split string by the
occurrences of pattern", which is not too helpful.
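A minimal sketch of why the empty strings appear, assuming the padded test string from the post and the `\W+` pattern. The separator matches at the very start and the very end of the input, and re.split returns the text on both sides of every match, including the empty text on the outer side.

```python
import re

text = " HELLO THERE "
pattern = re.compile(r"\W+")

# Where does the separator pattern match?
spans = [m.span() for m in pattern.finditer(text)]
print(spans)  # [(0, 1), (6, 7), (12, 13)]

# re.split returns the text between matches, including the (empty)
# text before a match at index 0 and after a match at the end.
parts = pattern.split(text)
print(parts)  # ['', 'HELLO', 'THERE', '']
```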

John Nagle

Peter Otten · Aug 2, 2010

John said:
The regular string split operation doesn't yield empty strings:
['HELLO', 'THERE']

Note that invocation without separator argument (or None as the separator)
is special in that respect:
>>> " hello there ".split(" ")
['', 'hello', 'there', '']
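The distinction can be sketched as follows, assuming a lowercase sample string like the one in the output above. Omitting the separator and passing None select the same whitespace mode; any explicit separator, even a single space, keeps empty fields.

```python
text = " hello there "

# No separator and None are the same call: whitespace mode.
whitespace_mode = text.split()
assert whitespace_mode == text.split(None)
print(whitespace_mode)  # ['hello', 'there']

# An explicit separator switches to the general rule: every
# occurrence delimits a field, so empty fields are kept.
explicit = text.split(" ")
print(explicit)  # ['', 'hello', 'there', '']
```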

Peter

MRAB · Aug 2, 2010

John said:
The regular expression "split" behaves slightly differently than string
split:

>>> import re
>>> kresplit = re.compile(r'[^\w\&]+',re.UNICODE)
>>> kresplit.split(" HELLO THERE ")
['', 'HELLO', 'THERE', '']
['VERISIGN', 'INC', '']

I'd thought that "split" would never produce an empty string, but
it will.

The regular string split operation doesn't yield empty strings:
['HELLO', 'THERE']

Yes it does.
>>> "   HELLO    THERE   ".split(" ")
['', '', '', 'HELLO', '', '', '', 'THERE', '', '', '']

If I try to get the functionality of string split with re:
['', 'HELLO', 'THERE', '']

I still get empty strings.

The documentation just describes re.split as "Split string by the
occurrences of pattern", which is not too helpful.

It's the plain str.split() which is unusual in that:

1. it splits on sequences of whitespace instead of one per occurrence;

2. it discards leading and trailing sequences of whitespace.

Compare:
>>> "  A  B  ".split(" ")
['', '', 'A', '', 'B', '', '']

with:
>>> "  A  B  ".split()
['A', 'B']

It just happens that the unusual one is the most commonly used one, if
you see what I mean!
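The two rules can be illustrated with a short sketch. The input string here is a hypothetical one with mixed whitespace, chosen to show that the argument-less form treats tabs and newlines like spaces.

```python
text = "  A \t\n B  "  # hypothetical input with mixed whitespace

# Rule 1: a run of any whitespace counts as one separator.
# Rule 2: leading and trailing whitespace produce no fields.
collapsed = text.split()
print(collapsed)  # ['A', 'B']

# With an explicit separator neither rule applies: each single space
# delimits a field, and the tab/newline pair stays inside a field.
explicit = text.split(" ")
print(explicit)  # ['', '', 'A', '\t\n', 'B', '', '']
```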

John Nagle · Aug 2, 2010

It's the plain str.split() which is unusual in that:

1. it splits on sequences of whitespace instead of one per occurrence;

That can be emulated with the obvious regular expression:

re.compile(r'\W+')

2. it discards leading and trailing sequences of whitespace.

But that can't, or at least I can't figure out how to do it.

It just happens that the unusual one is the most commonly used one, if
you see what I mean!

The no-argument form of "split" shouldn't be that much of a special
case.
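One possible way to get both behaviours from the re module is to match the fields instead of the separators. This is only a sketch: `split_like_str` is a hypothetical helper name, and it assumes whitespace is the intended separator (the `\W+` pattern above would also split on punctuation).

```python
import re

def split_like_str(text):
    """Return the maximal runs of non-whitespace characters in text."""
    return re.findall(r"\S+", text)

# Compare against str.split() on a few hypothetical inputs.
samples = [" HELLO THERE ", "", "   ", "a\tb\nc"]
for sample in samples:
    assert split_like_str(sample) == sample.split()

result = split_like_str(" HELLO THERE ")
print(result)  # ['HELLO', 'THERE']
```

Because findall never sees the separators as matches, there is no outer side of a match to produce an empty string.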

John Nagle

Thomas Jollans · Aug 2, 2010

That can be emulated with the obvious regular expression:

re.compile(r'\W+')

But that can't, or at least I can't figure out how to do it.

[ s in rexp.split(long_s) if s ]

John Nagle · Aug 2, 2010

That can be emulated with the obvious regular expression:

re.compile(r'\W+')

But that can't, or at least I can't figure out how to do it.

[ s in rexp.split(long_s) if s ]

Of course I can discard the blank strings afterward, but
is there some way to do it in the "split" operation? If
not, then the default case for "split()" is too non-standard.

(Also, "if s" won't work; if s != '' might)

John Nagle

Thomas Jollans · Aug 2, 2010

[ s in rexp.split(long_s) if s ]

Of course I can discard the blank strings afterward, but
is there some way to do it in the "split" operation? If
not, then the default case for "split()" is too non-standard.

(Also, "if s" won't work; if s != '' might)

Of course it will work. Empty sequences are considered false in Python.

Python 3.1.2 (release31-maint, Jul 8 2010, 09:18:08)
[GCC 4.4.4] on linux2
Type "help", "copyright", "credits" or "license" for more information.

>>> import re
>>> sprexp = re.compile(r'\s+')
>>> [s for s in sprexp.split(' spaces every where ! ') if s]
['spaces', 'every', 'where', '!']
>>> list(filter(bool, sprexp.split(' more spaces \r\n\t\t ')))
['more', 'spaces']

(of course, the list comprehension I posted earlier was missing a couple
of words, which was very careless of me)

samwyse · Aug 3, 2010

The regular expression "split" behaves slightly differently than string
split:

I'm going to argue that it's the string split that's behaving oddly.
To see why, let's first look at some simple CSV values:
cat,dog
,missing,,values,

How many fields are on each line and what are they? Here's what
re.split(',') says:

>>> re.split(',', 'cat,dog')
['cat', 'dog']
>>> re.split(',', ',missing,,values,')
['', 'missing', '', 'values', '']

Note that the presence of missing values is clearly flagged via the
presence of empty strings in the results. Now let's look at string
split:

>>> 'cat,dog'.split(',')
['cat', 'dog']
>>> ',missing,,values,'.split(',')
['', 'missing', '', 'values', '']

It's the same results. Let's try it again, but replacing the commas
with spaces.

>>> re.split(' ', 'cat dog')
['cat', 'dog']
>>> re.split(' ', ' missing  values ')
['', 'missing', '', 'values', '']
>>> 'cat dog'.split(' ')
['cat', 'dog']
>>> ' missing  values '.split(' ')
['', 'missing', '', 'values', '']

It's the same results; however many people don't like these results
because they feel that whitespace occupies a privileged role. People
generally agree that a string of consecutive commas means missing
values, but a string of consecutive spaces just means someone held the
space-bar down too long. To accommodate this viewpoint, the string
split is special-cased to behave differently when None is passed as a
separator. First, it splits on any number of whitespace characters,
like this:

>>> re.split(r'\s+', ' missing  values ')
['', 'missing', 'values', '']
>>> re.split(r'\s+', 'cat dog')
['cat', 'dog']

But it also eliminates any empty strings from the head and tail of the
list, because that's what people generally expect when splitting on
whitespace:

>>> 'cat dog'.split(None)
['cat', 'dog']
>>> ' missing  values '.split(None)
['missing', 'values']
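The two steps described above can be sketched literally: split on runs of whitespace, then drop an empty string from the head and from the tail. `split_none` is a hypothetical helper written for illustration, not how str.split is actually implemented.

```python
import re

def split_none(text):
    # Hypothetical helper, not part of the standard library.
    # Step 1: split on any run of whitespace characters.
    parts = re.split(r"\s+", text)
    # Step 2: drop the empty string left by leading whitespace...
    if parts and parts[0] == "":
        parts = parts[1:]
    # ...and the one left by trailing whitespace.
    if parts and parts[-1] == "":
        parts = parts[:-1]
    return parts

result = split_none(" missing  values ")
print(result)  # ['missing', 'values']
```

Checking `parts` before each index access keeps the helper safe for empty and all-whitespace input, where the first step returns only empty strings.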

rantingrick · Aug 3, 2010

It's the same results; however many people don't like these results
because they feel that whitespace occupies a privileged role. People
generally agree that a string of consecutive commas means missing
values, but a string of consecutive spaces just means someone held the
space-bar down too long. To accommodate this viewpoint, the string
split is special-cased to behave differently when None is passed as a
separator. First, it splits on any number of whitespace characters,
like this:

Well we could have created another method like "splitstrip()". However
then folks would complain that they must remember two methods that are
almost identical. Uggh, you just can't win. There are always
naysayers no matter what you do!

PS: Great post by the way. Highly informative for the pynoobs.

John Nagle · Aug 3, 2010

I'm going to argue that it's the string split that's behaving oddly.

I tend to agree.

It doesn't seem to be possible to get the same semantics with
any regular expression split. The default "split" has a special
case for head and tail whitespace, and there's no way to express
that with a regular expression split. Applying "strip" first
will work, of course. The documentation should reflect
that.
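A sketch of the strip-first approach, with `strip_then_split` as a hypothetical helper name. It matches the argument-less str.split() for ordinary input, but one edge case differs: for an empty or all-whitespace string re.split still returns a single empty string.

```python
import re

def strip_then_split(text):
    # Remove leading/trailing whitespace, then split on whitespace runs.
    return re.split(r"\s+", text.strip())

normal = strip_then_split(" HELLO THERE ")
print(normal)  # ['HELLO', 'THERE']

# Edge case: nothing is left after strip(), and splitting an empty
# string with re.split yields [''] rather than [].
blank = strip_then_split("   ")
print(blank)          # ['']
print("   ".split())  # []
```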

John Nagle

jhermann · Aug 5, 2010

>>> s2 = " HELLO THERE "
>>> kresplit4 = re.compile(r'\W+', re.UNICODE)
>>> kresplit4.split(s2)
['', 'HELLO', 'THERE', '']

I still get empty strings.

['a', 'b', 'c']
