s.split() on multiple separators

mrkafk

Hello everyone,

OK, so I want to split a string c into words using several different
separators from a list (dels).

I can do this in the following C-like way:

c=' abcde abc cba fdsa bcd '.split()
dels='ce '
for j in dels:
    cp=[]
    for i in xrange(0,len(c)-1):
        cp.extend(c[i].split(j))
    c=cp

c
['ab', 'd', '', 'ab', '', '']

But. Surely there is a more Pythonic way to do this?

I cannot do this:
c=[x.split(i) for x in c]

because x.split(i) is a list.
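
For what it's worth, a comprehension can still do this if it carries a second "for" clause that unpacks each list returned by split(). A minimal sketch, written so that it also runs on Python 3; the names pieces, word and part are only illustrative, and it starts from the whitespace-split list:

```python
c = ' abcde abc cba fdsa bcd '.split()
dels = 'ce '

pieces = c
for d in dels:
    # The second "for" unpacks each list returned by split(),
    # so the result stays a flat list of strings.
    pieces = [part for word in pieces for part in word.split(d)]

print(pieces)
# ['ab', 'd', '', 'ab', '', '', 'ba', 'fdsa', 'b', 'd']
```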
 
Tim Chase

OK, so I want to split a string c into words using several different
separators from a list (dels).

I can do this in the following C-like way:

c=' abcde abc cba fdsa bcd '.split()
dels='ce '
for j in dels:
    cp=[]
    for i in xrange(0,len(c)-1):
        cp.extend(c[i].split(j))
    c=cp

c
['ab', 'd', '', 'ab', '', '']

Given your original string, I'm not sure how that would be the
expected result of "split c on the characters in dels".

While there's a certain faction of pythonistas that don't esteem
regular expressions (or at least find them overused/misused,
which I'd certainly agree to), they may be able to serve your
purposes well:
>>> c=' abcde abc cba fdsa bcd '
>>> import re
>>> r = re.compile('[ce ]')
>>> r.split(c)
['', 'ab', 'd', '', 'ab', '', '', 'ba', 'fdsa', 'b', 'd', '']

given that a regexp object has a split() method.

-tkc
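
If the separators really do arrive as a list, the pattern can be built from it rather than typed by hand. The sketch below is only one way to do that; split_on_any is a hypothetical helper name, not something from the re module, and it assumes every separator is a plain string that should be matched literally:

```python
import re

def split_on_any(text, dels):
    """Split text on every separator in dels (hypothetical helper)."""
    if not dels:
        return [text]  # nothing to split on
    # Longer separators go first so that they win over their own prefixes.
    ordered = sorted(dels, key=len, reverse=True)
    # re.escape keeps characters such as '.' or '|' literal.
    pattern = '|'.join(re.escape(d) for d in ordered)
    return re.split(pattern, text)

print(split_on_any(' abcde abc cba fdsa bcd ', ['c', 'e', ' ']))
# ['', 'ab', 'd', '', 'ab', '', '', 'ba', 'fdsa', 'b', 'd', '']
```

An alternation is used instead of a character class so that multi-character separators would work too.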
 
Bryan Olson

Hello everyone,

OK, so I want to split a string c into words using several different
separators from a list (dels).

I can do this the following C-like way:

c=' abcde abc cba fdsa bcd '.split()
dels='ce '
for j in dels:
    cp=[]
    for i in xrange(0,len(c)-1):

The "-1" looks like a bug; remember that in Python 'stop' bounds
are exclusive. The indexes of c are simply xrange(len(c)).

Python 2.3 and up offers: for (i, word) in enumerate(c):

        cp.extend(c[i].split(j))
    c=cp

c
['ab', 'd', '', 'ab', '', '']


The bug lost some words, such as 'fdsa'.

But. Surely there is a more Pythonic way to do this?

When string.split() doesn't quite cut it, try re.split(), or
maybe re.findall(). Is one of these what you want?

import re

c = ' abcde abc cba fdsa bcd '

print re.split('[ce ]', c)

print re.split('[ce ]+', c)

print re.findall('[^ce ]+', c)
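
For comparison, here is roughly what the three calls return. This is a sketch written with the print() function so that it runs on Python 3; the variable names every, runs and words are only there for illustration:

```python
import re

c = ' abcde abc cba fdsa bcd '

every = re.split('[ce ]', c)      # one split per separator character
runs = re.split('[ce ]+', c)      # a run of separators counts once
words = re.findall('[^ce ]+', c)  # collect the non-separator runs instead

print(every)  # ['', 'ab', 'd', '', 'ab', '', '', 'ba', 'fdsa', 'b', 'd', '']
print(runs)   # ['', 'ab', 'd', 'ab', 'ba', 'fdsa', 'b', 'd', '']
print(words)  # ['ab', 'd', 'ab', 'ba', 'fdsa', 'b', 'd']
```

Only findall() avoids the empty strings at the ends, because it never looks at the separators at all.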
 
William James

Hello everyone,

OK, so I want to split a string c into words using several different
separators from a list (dels).

I can do this in the following C-like way:

c=' abcde abc cba fdsa bcd '.split()
dels='ce '
for j in dels:
    cp=[]
    for i in xrange(0,len(c)-1):
        cp.extend(c[i].split(j))
    c=cp

c
['ab', 'd', '', 'ab', '', '']

But. Surely there is a more Pythonic way to do this?

I cannot do this:

c=[x.split(i) for x in c]

because x.split(i) is a list.


E:\Ruby>irb
irb(main):001:0> ' abcde abc cba fdsa bcd '.split(/[ce ]/)
=> ["", "ab", "d", "", "ab", "", "", "ba", "fdsa", "b", "d"]
 
mrkafk

['ab', 'd', '', 'ab', '', '']

Given your original string, I'm not sure how that would be the
expected result of "split c on the characters in dels".

Oops, the inner loop should be:

for i in xrange(0,len(c)):

Now it works.

c=' abcde abc cba fdsa bcd '
import re
r = re.compile('[ce ]')
r.split(c)
['', 'ab', 'd', '', 'ab', '', '', 'ba', 'fdsa', 'b', 'd', '']

given that a regexp object has a split() method.

That's probably the optimal solution. Thanks!

Regards,
Marcin
 
mrkafk

On Sep 30, 8:53 am, William James wrote:
E:\Ruby>irb
irb(main):001:0> ' abcde abc cba fdsa bcd '.split(/[ce ]/)
=> ["", "ab", "d", "", "ab", "", "", "ba", "fdsa", "b", "d"]

That's acceptable only if you write a perfect Ruby-to-Python
translator. ;-P

Regards,
Marcin
 
mrkafk

c=' abcde abc cba fdsa bcd '.split()
dels='ce '
for j in dels:
    cp=[]
    for i in xrange(0,len(c)-1):

The "-1" looks like a bug; remember in Python 'stop' bounds
are exclusive. The indexes of c are simply xrange(len(c)).

Yep. Just found it out, though this seems a bit counterintuitive to
me, even if it makes for more elegant code: I forgot about the high
stop bound.
From my POV, if I want a sequence from here to there, it should include
both here and there.

I do understand the consequence of making the high bound exclusive, which
is more elegant code: xrange(len(c)). But it does seem a bit
illogical...

print re.split('[ce ]', c)

Yes, that does the job. Thanks.

Regards,
Marcin
 
Paul Hankin

c=' abcde abc cba fdsa bcd '.split()
dels='ce '
for j in dels:
    cp=[]
    for i in xrange(0,len(c)-1):

The "-1" looks like a bug; remember in Python 'stop' bounds
are exclusive. The indexes of c are simply xrange(len(c)).

Yep. Just found it out, though this seems a bit counterintuitive to
me, even if it makes for more elegant code: I forgot about the high
stop bound.

You made a common mistake of using a loop index instead of iterating
directly.
Instead of:
for i in xrange(len(c)):
    cp.extend(c[i].split(j))

Just write:
for words in c:
    cp.extend(words.split(j))

Then you won't make a bounds mistake, and this snippet becomes a LOT
more readable.

(Of course, you're better off using re.split here, but the
principle is good).
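
To make the difference concrete, here is a sketch that runs both loops side by side. The function names split_by_index and split_directly are made up for this illustration, and range() is used in place of xrange() so that it runs on Python 3:

```python
def split_by_index(words, dels):
    """Index-based loop with the off-by-one bound from the thread."""
    for d in dels:
        cp = []
        for i in range(0, len(words) - 1):  # bug: skips the last element
            cp.extend(words[i].split(d))
        words = cp
    return words

def split_directly(words, dels):
    """Same loop, iterating over the words themselves."""
    for d in dels:
        cp = []
        for word in words:  # no bounds to get wrong
            cp.extend(word.split(d))
        words = cp
    return words

start = ' abcde abc cba fdsa bcd '.split()
print(split_by_index(start, 'ce '))
# ['ab', 'd', '', 'ab', '', '']
print(split_directly(start, 'ce '))
# ['ab', 'd', '', 'ab', '', '', 'ba', 'fdsa', 'b', 'd']
```

The index-based version drops one element on every pass, which is how 'ba', 'fdsa', 'b' and 'd' go missing.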
 
Antoon Pardon

See this note from E. W. Dijkstra, written in 1982, in which he argues
that the convention Python uses (inclusive lower bound, exclusive upper
bound) is the best choice.
http://www.cs.utexas.edu/users/EWD/transcriptions/EWD08xx/EWD831.html

It may be convincing if you only consider natural numbers in ascending
order. Suppose you have the sequence a .. b and you want the reverse.
If you work with included bounds the reverse is just b .. a. If you use
the python convention, things become more complicated.

Another problem is if you are working with floats. Suppose you have a
set of floats. Now you want the subset of numbers that are between a and
b included. If you want to follow the convention that means you have to
find the smallest float that is bigger than b, not a trivial task.
 
Hrvoje Niksic

Antoon Pardon said:
It may be convincing if you only consider natural numbers in
ascending order. Suppose you have the sequence a .. b and you want
the reverse. If you work with included bounds the reverse is just b
.. a. If you use the python convention, things become more
complicated.

It's a tradeoff. The convention used by Python (and Lisp, Java and
others) is more convenient for other things. Length of the sequence
x[a:b] is simply b-a. Empty sequence is denoted simply with x[a:a],
where you would need to use the weird x[a:a-1] with inclusive bounds.
Subsequences such as x[a:b] and x[b:c] merge smoothly into x[a:c],
making it natural to iterate over subsequences without visiting an
element twice.
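
These properties are easy to check in the interpreter. A small sketch, where the list x and the indexes a, b and c are arbitrary values picked for the example:

```python
x = list(range(10))
a, b, c = 2, 5, 8

# The length of a half-open slice is just the difference of its bounds.
assert len(x[a:b]) == b - a

# An empty slice needs no special "a - 1" upper bound.
assert x[a:a] == []

# Adjacent slices join without repeating or skipping an element.
assert x[a:b] + x[b:c] == x[a:c]

print(x[a:b], x[b:c], x[a:c])
# [2, 3, 4] [5, 6, 7] [2, 3, 4, 5, 6, 7]
```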

Another problem is if you are working with floats. Suppose you have
a set of floats. Now you want the subset of numbers that are between
a and b included. If you want to follow the convention that means
you have to find the smallest float that is bigger than b, not a
trivial task.

The exact same argument can be used against the other convention: if
you are working with inclusive bounds, and you need to represent the
subset [a, b), you need to find the largest float that is smaller than
b.
 
Antoon Pardon

Antoon Pardon said:
It may be convincing if you only consider natural numbers in
ascending order. Suppose you have the sequence a .. b and you want
the reverse. If you work with included bounds the reverse is just b
.. a. If you use the python convention, things become more
complicated.

It's a tradeoff. The convention used by Python (and Lisp, Java and
others) is more convenient for other things. Length of the sequence
x[a:b] is simply b-a. Empty sequence is denoted simply with x[a:a],
where you would need to use the weird x[a:a-1] with inclusive bounds.
Subsequences such as x[a:b] and x[b:c] merge smoothly into x[a:c],
making it natural to iterate over subsequences without visiting an
element twice.

Sure, it is a tradeoff, and the Python choice may in the end still turn
out to be the best. But that doesn't contradict that a number of
considerations were simply not mentioned in the article referred to.

Another problem is if you are working with floats. Suppose you have
a set of floats. Now you want the subset of numbers that are between
a and b included. If you want to follow the convention that means
you have to find the smallest float that is bigger than b, not a
trivial task.

The exact same argument can be used against the other convention: if
you are working with inclusive bounds, and you need to represent the
subset [a, b), you need to find the largest float that is smaller than
b.

Which I think is a good argument against using any convention and
for having explicit conditions for the boundaries to include or exclude
them.

So instead of writing xrange(2,6) you would have to write something like
xrange(2 <= x < 6), which explicitly states that 2 is included and 6 is
excluded. If someone wants both boundaries included, he can write
xrange(2 <= x <= 5).

A slice notation that would somehow indicate which boundaries are included
and which are excluded would be useful IMO.
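
No such syntax exists in Python, but the idea can be approximated with ordinary functions. The sketch below is purely hypothetical: bounded_range, between and the include_lo / include_hi flags are names invented for this illustration, not part of any library.

```python
def bounded_range(lo, hi, include_lo=True, include_hi=False):
    """Integer range whose ends are stated explicitly (hypothetical helper)."""
    start = lo if include_lo else lo + 1
    stop = hi + 1 if include_hi else hi
    return range(start, stop)

def between(values, lo, hi, include_lo=True, include_hi=False):
    """Keep the values inside an interval; works for floats as well."""
    kept = []
    for v in values:
        ok_lo = (v >= lo) if include_lo else (v > lo)
        ok_hi = (v <= hi) if include_hi else (v < hi)
        if ok_lo and ok_hi:
            kept.append(v)
    return kept

print(list(bounded_range(2, 6)))                   # [2, 3, 4, 5]
print(list(bounded_range(2, 5, include_hi=True)))  # [2, 3, 4, 5]
print(between([0.5, 1.0, 1.5, 2.0], 1.0, 2.0, include_hi=True))
# [1.0, 1.5, 2.0]
```

With explicit comparisons there is no need to hunt for the next representable float on either side of a bound.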
 
Lawrence D'Oliveiro

It may be convincing if you only consider natural numbers in ascending
order. Suppose you have the sequence a .. b and you want the reverse.
If you work with included bounds the reverse is just b .. a. If you use
the python convention, things become more complicated.

Nothing complicated about it:
>>> a = 1
>>> b = 5
>>> range(a, b)
[1, 2, 3, 4]
>>> range(b - 1, a - 1, -1)
[4, 3, 2, 1]
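
The same reversal can also be had without touching the bounds at all. A short sketch for Python 3, where range() returns a lazy object and so is wrapped in list() for display; the variable names are only illustrative:

```python
a, b = 1, 5

forward = list(range(a, b))               # half-open: b itself is excluded
by_hand = list(range(b - 1, a - 1, -1))   # shift both bounds down by one
built_in = list(reversed(range(a, b)))    # let reversed() do the arithmetic
sliced = forward[::-1]                    # or reverse the finished list

print(forward)   # [1, 2, 3, 4]
print(by_hand)   # [4, 3, 2, 1]
print(built_in)  # [4, 3, 2, 1]
print(sliced)    # [4, 3, 2, 1]
```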
Another problem is if you are working with floats. Suppose you have a
set of floats. Now you want the subset of numbers that are between a and
b included. If you want to follow the convention that means you have to
find the smallest float that is bigger than b, not a trivial task.

Due to precision limitations, the set will probably not be what you think it
is, anyway.
 
