Problem splitting a string

Anthony Liu

I have this simple string:

mystr = 'this_NP is_VL funny_JJ'

I want to split it to get this list:

['this', 'NP', 'is', 'VL', 'funny', 'JJ']

1. I tried mystr.split('_| '), but this gave me:

['this_NP is_VL funny_JJ']

It is not split at all.

2. I tried mystr.split('_'), and this gave me:

['this', 'NP is', 'VL funny', 'JJ']

Here the space is not used as a delimiter.

3. I tried mystr.split(' '), and this gave me:

['this_NP', 'is_VL', 'funny_JJ']

Here '_' is not used as a delimiter.

I think the documentation does say that the
separator/delimiter can be a string representing all
delimiters we want to use.

How do I split the string using both ' ' and '_' as
delimiters at once?

Thanks.






 
Erik Max Francis

Anthony said:
I have this simple string:

mystr = 'this_NP is_VL funny_JJ'

I want to split it and give me a list as

['this', 'NP', 'is', 'VL', 'funny', 'JJ']

1. I tried mystr.split('_| '), but this gave me:

['this_NP is_VL funny_JJ']

It is not splitted at all.

Use re.split:

>>> import re
>>> re.split('_| ', mystr)
['this', 'NP', 'is', 'VL', 'funny', 'JJ']
 
Steven D'Aprano

I have this simple string:

mystr = 'this_NP is_VL funny_JJ'

I want to split it and give me a list as

['this', 'NP', 'is', 'VL', 'funny', 'JJ']
I think the documentation does say that the
separator/delimiter can be a string representing all
delimiters we want to use.

No, the delimiter is the delimiter, not a list of delimiters.

The only exception is delimiter=None, which splits on any whitespace.

[Aside: I think a split-on-any-delimiter function would be useful.]
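
As a small illustration of the point above (a sketch only; the extra sample strings are made up for the example), str.split treats its argument as one literal separator, and splits on runs of whitespace when no argument is given:

```python
mystr = 'this_NP is_VL funny_JJ'

# The argument is one literal separator, not a set of characters.
# The three-character string '_| ' never occurs in mystr, so nothing is split.
no_match = mystr.split('_| ')

# A multi-character separator only matches as a whole.
whole = 'a_|b_c'.split('_|')

# With no argument (or None), any run of whitespace is the separator.
on_space = 'this_NP   is_VL\tfunny_JJ'.split()

print(no_match)   # ['this_NP is_VL funny_JJ']
print(whole)      # ['a', 'b_c']
print(on_space)   # ['this_NP', 'is_VL', 'funny_JJ']
```
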
I do I split the string by using both ' ' and '_' as
the delimiters at once?

Something like this:

mystr = 'this_NP is_VL funny_JJ'
L1 = mystr.split()  # splits on whitespace
L2 = []
for item in L1:
    L2.extend(item.split('_'))  # split each word on '_' and append the pieces

You can *almost* do that as a one-liner:

L2 = [item.split('_') for item in mystr.split()]

except that gives a list like this:

[['this', 'NP'], ['is', 'VL'], ['funny', 'JJ']]

which needs flattening.
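
One possible way to do the flattening, assuming a Python version that provides itertools.chain.from_iterable (this is a sketch, not necessarily what the poster had in mind):

```python
from itertools import chain

mystr = 'this_NP is_VL funny_JJ'

# Nested result: one inner list per whitespace-separated word.
nested = [item.split('_') for item in mystr.split()]

# chain.from_iterable concatenates the inner lists lazily;
# list() materialises the flat result.
flat = list(chain.from_iterable(nested))

print(nested)  # [['this', 'NP'], ['is', 'VL'], ['funny', 'JJ']]
print(flat)    # ['this', 'NP', 'is', 'VL', 'funny', 'JJ']
```
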
 
Alex Martelli

Steven D'Aprano said:
You can *almost* do that as a one-liner:

No 'almost' about it...
L2 = [item.split('_') for item in mystr.split()]

except that gives a list like this:

[['this', 'NP'], ['is', 'VL'], ['funny', 'JJ']]

which needs flattening.

.....because the flattening is easy:

[ x for x in y.split('_') for y in z.split(' ') ]


Alex
 
SPE - Stani's Python Editor

Use re.split, as this is the fastest and cleanest way.
However, if you have to split a lot of strings, it is best to compile the pattern once:

import re
delimiters = re.compile('_| ')

def split(x):
    return delimiters.split(x)

>>> split('this_NP is_VL funny_JJ')
['this', 'NP', 'is', 'VL', 'funny', 'JJ']

Stani
 
Steven D'Aprano

Steven D'Aprano said:
You can *almost* do that as a one-liner:

No 'almost' about it...
L2 = [item.split('_') for item in mystr.split()]

except that gives a list like this:

[['this', 'NP'], ['is', 'VL'], ['funny', 'JJ']]

which needs flattening.

....because the flattening is easy:

[ x for x in y.split('_') for y in z.split(' ') ]


py> mystr = 'this_NP is_VL funny_JJ'
py> [x for x in y.split('_') for y in mystr.split(' ')]
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
NameError: name 'y' is not defined


This works, but isn't flattened:

py> [x for x in [y.split('_') for y in mystr.split(' ')]]
[['this', 'NP'], ['is', 'VL'], ['funny', 'JJ']]
 
Fredrik Lundh

SPE - Stani's Python Editor said:
Use re.split, as this is the fastest and cleanest way.
However, iff you have to split a lot of strings, the best is:

import re
delimiters = re.compile('_| ')

def split(x):
    return delimiters.split(x)

or, shorter:

import re
split = re.compile('_| ').split

to quickly build a splitter for an arbitrary set of separator characters, use

separators = "_ :+"

split = re.compile("[" + re.escape(separators) + "]").split
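
A quick check of the character-class splitter above; the sample input strings are invented for illustration. Note that adjacent separators produce empty strings in the result:

```python
import re

separators = "_ :+"

# re.escape protects characters that are special inside [...],
# so the class matches exactly one of the listed characters.
split = re.compile("[" + re.escape(separators) + "]").split

pieces = split("this_NP is_VL:funny+JJ")
with_gap = split("a_ b")  # '_' and ' ' are adjacent here

print(pieces)    # ['this', 'NP', 'is', 'VL', 'funny', 'JJ']
print(with_gap)  # ['a', '', 'b']
```

If the empty strings are unwanted, appending "+" after the closing bracket of the class makes a run of separators count as one.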

to deal with arbitrary separators, you need to be a little bit more careful
when you prepare the pattern:

separators = sep1, sep2, sep3, sep4, ...

pattern = "|".join(re.escape(p) for p in reversed(sorted(separators)))
split = re.compile(pattern).split
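
To see why the ordering matters, here is a sketch with hypothetical separators where one (":") is a prefix of another ("::"). Regex alternation takes the first alternative that matches, so the longer separator has to come first:

```python
import re

# Hypothetical separators, chosen so that ":" is a prefix of "::".
separators = [":", "::", "_"]

# Reverse-sorted: "::" is tried before ":".
pattern = "|".join(re.escape(p) for p in reversed(sorted(separators)))
split = re.compile(pattern).split

# Plain sorted order for comparison: ":" is tried first and wins.
naive_pattern = "|".join(re.escape(p) for p in sorted(separators))
naive_split = re.compile(naive_pattern).split

good = split("a::b:c_d")
bad = naive_split("a::b:c_d")

print(good)  # ['a', 'b', 'c', 'd']
print(bad)   # ['a', '', 'b', 'c', 'd']
```
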

</F>
 
Kent Johnson

Steven said:
[ x for x in y.split('_') for y in z.split(' ') ]

py> mystr = 'this_NP is_VL funny_JJ'
py> [x for x in y.split('_') for y in mystr.split(' ')]
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
NameError: name 'y' is not defined

The order of the 'for' clauses is backwards:
>>> [x for y in mystr.split(' ') for x in y.split('_')]
['this', 'NP', 'is', 'VL', 'funny', 'JJ']
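
One way to remember the order: the 'for' clauses in a list comprehension read left to right in the same order as the equivalent nested loops, outermost first. A minimal sketch of that equivalence:

```python
mystr = 'this_NP is_VL funny_JJ'

# Comprehension: the outer loop (y) comes first, the inner loop (x) second.
flat = [x for y in mystr.split(' ') for x in y.split('_')]

# The same thing written as explicit nested loops.
result = []
for y in mystr.split(' '):      # outer loop: words
    for x in y.split('_'):      # inner loop: pieces of each word
        result.append(x)

print(flat)            # ['this', 'NP', 'is', 'VL', 'funny', 'JJ']
print(flat == result)  # True
```
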

Kent
 
