s.split() on multiple separators

mrkafk · Sep 30, 2007

Hello everyone,

OK, so I want to split a string c into words using several different
separators from a list (dels).

I can do this the following C-like way:
c=' abcde abc cba fdsa bcd '.split()
dels='ce '
for j in dels:
    cp=[]
    for i in xrange(0,len(c)-1):
        cp.extend(c[i].split(j))
    c=cp

['ab', 'd', '', 'ab', '', '']

But. Surely there is a more Pythonic way to do this?

I cannot do this:
c=[x.split(i) for x in c]

because x.split(i) is a list.
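
A possible way around the nested-list problem, sketched in Python 3 (the thread itself uses Python 2): a comprehension with two `for` clauses flattens the pieces as it builds them. The name `split_all` is made up for this illustration.

```python
def split_all(words, separators):
    """Split every word on each separator in turn, keeping the list flat."""
    for sep in separators:
        # The second "for" clause unpacks each list returned by split(),
        # so the result is a flat list of strings, not a list of lists.
        words = [piece for word in words for piece in word.split(sep)]
    return words


c = ' abcde abc cba fdsa bcd '.split()
result = split_all(c, 'ce ')
print(result)
# ['ab', 'd', '', 'ab', '', '', 'ba', 'fdsa', 'b', 'd']
```

Empty strings remain wherever two separators were adjacent; filter them out afterwards if they are not wanted.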

Francesco Guerrieri · Sep 30, 2007

Hello everyone,

OK, so I want to split a string c into words using several different
separators from a list (dels).

Have a look at this recipe:

http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/303342

which contains several ways to solve the problem. You could either
translate all your separators to a single one and then split on it,
or (maybe the simpler solution) go for the list comprehension
solution.

francesco
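
The "translate all separators to a single one" idea might look roughly like this in Python 3. It is a sketch that assumes every separator is a single character and that at least one separator is given; `split_via_translate` is a hypothetical name, not something taken from the recipe.

```python
def split_via_translate(text, separators):
    """Map every separator onto the first one, then split once on it."""
    target = separators[0]
    # str.maketrans accepts a dict of single characters to replacements.
    table = str.maketrans({sep: target for sep in separators[1:]})
    return text.translate(table).split(target)


parts = split_via_translate(' abcde abc cba fdsa bcd ', 'ce ')
print(parts)
# ['', 'ab', 'd', '', 'ab', '', '', 'ba', 'fdsa', 'b', 'd', '']
```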

Tim Chase · Sep 30, 2007

OK, so I want to split a string c into words using several different
separators from a list (dels).

I can do this the following C-like way:
c=' abcde abc cba fdsa bcd '.split()
dels='ce '
for j in dels:
    cp=[]
    for i in xrange(0,len(c)-1):
        cp.extend(c[i].split(j))
    c=cp

['ab', 'd', '', 'ab', '', '']

Given your original string, I'm not sure how that would be the
expected result of "split c on the characters in dels".

While there's a certain faction of pythonistas that don't esteem
regular expressions (or at least find them overused/misused,
which I'd certainly agree to), they may be able to serve your
purposes well:

>>> c=' abcde abc cba fdsa bcd '
>>> import re
>>> r = re.compile('[ce ]')
>>> r.split(c)
['', 'ab', 'd', '', 'ab', '', '', 'ba', 'fdsa', 'b', 'd', '']

given that a regexp object has a split() method.

-tkc
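
Since the original question keeps the separators in a list (dels), the pattern can also be built from that list instead of being typed by hand. The following Python 3 sketch assumes a non-empty list of separator strings; `split_on_any` is a made-up helper name.

```python
import re


def split_on_any(text, separators):
    """Split text wherever any of the given separator strings occurs."""
    # re.escape keeps characters such as '.' or '|' literal.
    # Longer separators go first so they win over their own prefixes.
    ordered = sorted(separators, key=len, reverse=True)
    pattern = '|'.join(re.escape(sep) for sep in ordered)
    return re.split(pattern, text)


dels = ['c', 'e', ' ']
parts = split_on_any(' abcde abc cba fdsa bcd ', dels)
print(parts)
# ['', 'ab', 'd', '', 'ab', '', '', 'ba', 'fdsa', 'b', 'd', '']
```

Using alternation rather than a character class means multi-character separators would also work, at the cost of a slightly longer pattern.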

Bryan Olson · Sep 30, 2007

Hello everyone,

OK, so I want to split a string c into words using several different
separators from a list (dels).

I can do this the following C-like way:

c=' abcde abc cba fdsa bcd '.split()
dels='ce '
for j in dels:
    cp=[]
    for i in xrange(0,len(c)-1):

The "-1" looks like a bug; remember in Python 'stop' bounds
are exclusive. The indexes of c are simply xrange(len(c)).

Python 2.3 and up offers: for (i, word) in enumerate(c):

        cp.extend(c[i].split(j))
    c=cp

c
['ab', 'd', '', 'ab', '', '']

The bug lost some words, such as 'fdsa'.

But. Surely there is a more Pythonic way to do this?

When string.split() doesn't quite cut it, try re.split(), or
maybe re.findall(). Is one of these what you want?

import re

c = ' abcde abc cba fdsa bcd '

print re.split('[ce ]', c)
print re.split('[ce ]+', c)
print re.findall('[^ce ]+', c)
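
For comparison, here are the same three calls written for Python 3, with the result of each noted in a comment. The variable names are only for illustration.

```python
import re

c = ' abcde abc cba fdsa bcd '

# Split at every single separator: adjacent separators leave empty strings.
every_separator = re.split('[ce ]', c)
# ['', 'ab', 'd', '', 'ab', '', '', 'ba', 'fdsa', 'b', 'd', '']

# Split at runs of separators: only the leading and trailing '' remain.
separator_runs = re.split('[ce ]+', c)
# ['', 'ab', 'd', 'ab', 'ba', 'fdsa', 'b', 'd', '']

# Collect runs of non-separators instead: no empty strings at all.
words_only = re.findall('[^ce ]+', c)
# ['ab', 'd', 'ab', 'ba', 'fdsa', 'b', 'd']

print(every_separator)
print(separator_runs)
print(words_only)
```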

William James · Sep 30, 2007

Hello everyone,

OK, so I want to split a string c into words using several different
separators from a list (dels).

I can do this the following C-like way:

c=' abcde abc cba fdsa bcd '.split()
dels='ce '
for j in dels:
    cp=[]
    for i in xrange(0,len(c)-1):
        cp.extend(c[i].split(j))
    c=cp

['ab', 'd', '', 'ab', '', '']

But. Surely there is a more Pythonic way to do this?

I cannot do this:

c=[x.split(i) for x in c]

because x.split(i) is a list.

E:\Ruby>irb
irb(main):001:0> ' abcde abc cba fdsa bcd '.split(/[ce ]/)
=> ["", "ab", "d", "", "ab", "", "", "ba", "fdsa", "b", "d"]

mrkafk · Sep 30, 2007

['ab', 'd', '', 'ab', '', '']

Given your original string, I'm not sure how that would be the
expected result of "split c on the characters in dels".

Oops, the inner loop should be:

for i in xrange(0,len(c)):

Now it works.

>>> c=' abcde abc cba fdsa bcd '
>>> import re
>>> r = re.compile('[ce ]')
>>> r.split(c)
['', 'ab', 'd', '', 'ab', '', '', 'ba', 'fdsa', 'b', 'd', '']

given that a regexp object has a split() method.

That's probably the optimal solution. Thanks!

Regards,
Marcin

mrkafk · Sep 30, 2007

On Sep 30, 8:53 am, William James wrote:

E:\Ruby>irb
irb(main):001:0> ' abcde abc cba fdsa bcd '.split(/[ce ]/)
=> ["", "ab", "d", "", "ab", "", "", "ba", "fdsa", "b", "d"]

That's acceptable only if you write a perfect Ruby-to-Python
translator. ;-P

Regards,
Marcin

mrkafk · Sep 30, 2007

c=' abcde abc cba fdsa bcd '.split()
dels='ce '
for j in dels:
    cp=[]
    for i in xrange(0,len(c)-1):

The "-1" looks like a bug; remember in Python 'stop' bounds
are exclusive. The indexes of c are simply xrange(len(c)).

Yep. Just found it out, though this seems a bit counterintuitive to
me, even if it makes for more elegant code: I forgot about the high
stop bound.

From my POV, if I want a sequence from here to there, it should include
both here and there.

I do understand the consequence of making the high bound exclusive, which
is more elegant code: xrange(len(c)). But it does seem a bit
illogical...

print re.split('[ce ]', c)

Yes, that does the job. Thanks.

Regards,
Marcin

Paul Hankin · Oct 1, 2007

c=' abcde abc cba fdsa bcd '.split()
dels='ce '
for j in dels:
    cp=[]
    for i in xrange(0,len(c)-1):

The "-1" looks like a bug; remember in Python 'stop' bounds
are exclusive. The indexes of c are simply xrange(len(c)).

Yep. Just found it out, though this seems a bit counterintuitive to
me, even if it makes for more elegant code: I forgot about the high
stop bound.

You made the common mistake of using a loop index instead of iterating
directly.
Instead of:
for i in xrange(len(c)):
    cp.extend(c[i].split(j))

Just write:
for word in c:
    cp.extend(word.split(j))

Then you won't make a bounds mistake, and this snippet becomes a LOT
more readable.

(Of course, you're better off using re.split here, but the
principle is good.)
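
To make the difference concrete, here is a Python 3 sketch that runs the original index-based loop (with its `len(c)-1` bound) next to the direct-iteration version. Both function names are invented for this comparison.

```python
def split_by_index(words, separators):
    """The original loop, including its len(words) - 1 stop bound."""
    for sep in separators:
        pieces = []
        for i in range(0, len(words) - 1):  # stops one element too early
            pieces.extend(words[i].split(sep))
        words = pieces
    return words


def split_directly(words, separators):
    """Same algorithm, iterating over the words themselves."""
    for sep in separators:
        pieces = []
        for word in words:  # no index, so no bound to get wrong
            pieces.extend(word.split(sep))
        words = pieces
    return words


c = ' abcde abc cba fdsa bcd '.split()
with_bug = split_by_index(c, 'ce ')
without_bug = split_directly(c, 'ce ')
print(with_bug)     # ['ab', 'd', '', 'ab', '', '']
print(without_bug)  # ['ab', 'd', '', 'ab', '', '', 'ba', 'fdsa', 'b', 'd']
```

The index-based version drops the last element on every pass, which is how 'fdsa' and the other trailing words disappear.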

Gabriel Genellina · Oct 1, 2007

mrkafk said:
From my POV, if I want sequence from here to there, it should include
both here and there.

I do understand the consequences of making high bound exclusive, which
is more elegant code: xrange(len(c)). But it does seem a bit
illogical...

See this note from E.W.Dijkstra in 1982 where he says that the Python
convention is the best choice.
http://www.cs.utexas.edu/users/EWD/transcriptions/EWD08xx/EWD831.html

Antoon Pardon · Oct 2, 2007

See this note from E.W.Dijkstra in 1982 where he says that the Python
convention is the best choice.
http://www.cs.utexas.edu/users/EWD/transcriptions/EWD08xx/EWD831.html

It may be convincing if you only consider natural numbers in ascending
order. Suppose you have the sequence a .. b and you want the reverse.
If you work with included bounds the reverse is just b .. a. If you use
the python convention, things become more complicated.

Another problem is if you are working with floats. Suppose you have a
set of floats. Now you want the subset of numbers that are between a and
b included. If you want to follow the convention that means you have to
find the smallest float that is bigger than b, not a trivial task.

Hrvoje Niksic · Oct 2, 2007

Antoon Pardon said:
It may be convincing if you only consider natural numbers in
ascending order. Suppose you have the sequence a .. b and you want
the reverse. If you work with included bounds the reverse is just b
.. a. If you use the python convention, things become more
complicated.

It's a tradeoff. The convention used by Python (and Lisp, Java and
others) is more convenient for other things. Length of the sequence
x[a:b] is simply b-a. Empty sequence is denoted simply with x[a:a],
where you would need to use the weird x[a:a-1] with inclusive bounds.
Subsequences such as x[a:b] and x[b:c] merge smoothly into x[a:c],
making it natural to iterate over subsequences without visiting an
element twice.
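
These properties of half-open bounds can be checked directly. The Python 3 sketch below uses an arbitrary example list and bounds; `chunks` is a hypothetical helper added only to show the "no element visited twice" point.

```python
x = list('abcdefgh')
a, b, c = 2, 5, 7

# Length is just the difference of the bounds.
length = len(x[a:b])          # 3, which is b - a

# An empty slice needs no special notation.
empty = x[a:a]                # []

# Adjacent slices join with no overlap and no gap.
joined = x[a:b] + x[b:c]      # same elements as x[a:c]


def chunks(seq, size):
    """Yield consecutive slices of seq; each element is visited once."""
    for start in range(0, len(seq), size):
        # The stop of one chunk is the start of the next.
        yield seq[start:start + size]


pieces = list(chunks(x, 3))
print(pieces)
# [['a', 'b', 'c'], ['d', 'e', 'f'], ['g', 'h']]
```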

Another problem is if you are working with floats. Suppose you have
a set of floats. Now you want the subset of numbers that are between
a and b included. If you want to follow the convention that means
you have to find the smallest float that is bigger than b, not a
trivial task.

The exact same argument can be used against the other convention: if
you are working with inclusive bounds, and you need to represent the
subset [a, b), you need to find the largest float that is smaller than
b.

Antoon Pardon · Oct 2, 2007

Hrvoje Niksic said:
Antoon Pardon said:

It may be convincing if you only consider natural numbers in
ascending order. Suppose you have the sequence a .. b and you want
the reverse. If you work with included bounds the reverse is just b
.. a. If you use the python convention, things become more
complicated.

It's a tradeoff. The convention used by Python (and Lisp, Java and
others) is more convenient for other things. Length of the sequence
x[a:b] is simply b-a. Empty sequence is denoted simply with x[a:a],
where you would need to use the weird x[a:a-1] with inclusive bounds.
Subsequences such as x[a:b] and x[b:c] merge smoothly into x[a:c],
making it natural to iterate over subsequences without visiting an
element twice.

Sure, it is a tradeoff, and the Python choice may in the end still turn
out to be the best. But that doesn't change the fact that a number of
considerations were simply not mentioned in the article referred to.

Another problem is if you are working with floats. Suppose you have
a set of floats. Now you want the subset of numbers that are between
a and b included. If you want to follow the convention that means
you have to find the smallest float that is bigger than b, not a
trivial task.

The exact same argument can be used against the other convention: if
you are working with inclusive bounds, and you need to represent the
subset [a, b), you need to find the largest float that is smaller than
b.

Which I think is a good argument against using any convention, and in
favor of having explicit conditions for the boundaries to include or
exclude them.

So instead of writing xrange(2,6) you would write something like
xrange(2 <= x < 6), which explicitly states that 2 is included and 6 is
excluded. If someone wants both boundaries included, he can write
xrange(2 <= x <= 5).

A slice notation that would somehow indicate which boundaries are included
and which are excluded would be useful IMO.
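
Python has no such notation for range or slices, but for filtering an existing collection of floats the explicit-bounds idea can be approximated with ordinary comparisons. The sketch below is Python 3; `select_between` and its keyword arguments are hypothetical names chosen for this example.

```python
def select_between(values, low, high, include_low=True, include_high=False):
    """Keep the values between low and high, with explicit bound handling."""
    def above_low(v):
        return v >= low if include_low else v > low

    def below_high(v):
        return v <= high if include_high else v < high

    return [v for v in values if above_low(v) and below_high(v)]


data = [0.5, 1.0, 1.5, 2.0, 2.5]
half_open = select_between(data, 1.0, 2.0)                  # [1.0, 1.5]
closed = select_between(data, 1.0, 2.0, include_high=True)  # [1.0, 1.5, 2.0]
opened = select_between(data, 1.0, 2.0, include_low=False)  # [1.5]
print(half_open, closed, opened)
```

Because each bound states whether it is included, there is no need to look for the next representable float on either side.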

[david] · Oct 3, 2007

Gabriel said:
See this note from E.W.Dijkstra in 1982 where he says that the Python
convention is the best choice.
http://www.cs.utexas.edu/users/EWD/transcriptions/EWD08xx/EWD831.html

The only thing I agreed with was his conclusion. Clever man.

[david]

Lawrence D'Oliveiro · Oct 4, 2007

It may be convincing if you only consider natural numbers in ascending
order. Suppose you have the sequence a .. b and you want the reverse.
If you work with included bounds the reverse is just b .. a. If you use
the python convention, things become more complicated.

Nothing complicated about it:

>>> a = 1
>>> b = 5
>>> range(a, b)
[1, 2, 3, 4]
>>> range(b - 1, a - 1, -1)
[4, 3, 2, 1]
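
In Python 3, where range returns a lazy sequence instead of a list, the same reversal can be spelled without adjusting the bounds by hand. A short sketch, with variable names chosen only for illustration:

```python
a, b = 1, 5

forward = list(range(a, b))                 # [1, 2, 3, 4]

# Adjusting both bounds by hand, as in the session above.
manual = list(range(b - 1, a - 1, -1))      # [4, 3, 2, 1]

# Letting Python do the bookkeeping.
via_reversed = list(reversed(range(a, b)))  # [4, 3, 2, 1]
via_slice = list(range(a, b)[::-1])         # [4, 3, 2, 1]

print(forward, manual, via_reversed, via_slice)
```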

Another problem is if you are working with floats. Suppose you have a
set of floats. Now you want the subset of numbers that are between a and
b included. If you want to follow the convention that means you have to
find the smallest float that is bigger than b, not a trivial task.

Due to precision limitations, the set will probably not be what you think it
is, anyway.
