split() and string.whitespace

Chaim Krause

I am unable to figure out why the first two statements work as I
expect them to and the next two do not. Namely, the first two split the
sentence into its component words, while the latter two return the
whole sentence intact.

import string
from string import whitespace
mytext = "The quick brown fox jumped over the lazy dog.\n"

print mytext.split()
print mytext.split(' ')
print mytext.split(whitespace)
print string.split(mytext, sep=whitespace)
 
Marc 'BlackJack' Rintsch

> I am unable to figure out why the first two statements work as I expect
> them to and the next two do not. Namely, the first two split the sentence
> into its component words, while the latter two return the whole sentence
> intact.
>
> import string
> from string import whitespace
> mytext = "The quick brown fox jumped over the lazy dog.\n"
>
> print mytext.split()
> print mytext.split(' ')

This splits at the string ' '.

> print mytext.split(whitespace)

This splits at the string '\t\n\x0b\x0c\r ', which doesn't occur in
`mytext`. The argument is a string, not a set of characters.

> print string.split(mytext, sep=whitespace)

Same here.
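
A small sketch may make this concrete. It is written to run on both Python 2 and 3, and the `combined` string is made up purely for illustration: the separator only matches where the entire `whitespace` string occurs in one piece.

```python
from string import whitespace

mytext = "The quick brown fox jumped over the lazy dog.\n"

# The full separator string never occurs in mytext, so nothing is split
# and the result is a one-element list holding the whole sentence.
unsplit = mytext.split(whitespace)

# Hypothetical text that contains the complete separator string once.
combined = "left" + whitespace + "right"
parts = combined.split(whitespace)   # ['left', 'right']
```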

Ciao,
Marc 'BlackJack' Rintsch
 
Tim Chase

> I am unable to figure out why the first two statements work as I
> expect them to and the next two do not. Namely, the first two split the
> sentence into its component words, while the latter two return the
> whole sentence intact.
>
> import string
> from string import whitespace
> mytext = "The quick brown fox jumped over the lazy dog.\n"
>
> print mytext.split()
> print mytext.split(' ')
> print mytext.split(whitespace)
> print string.split(mytext, sep=whitespace)


split() does its work on a literal separator string or, if no
separator is specified, splits on runs of arbitrary whitespace.

For an example, try

s = "abcdefgbcdefgh"
s.split("c") # ['ab', 'defgb', 'defgh']
s.split("fgb") # ['abcde', 'cdefgh']


string.whitespace is a string, so split() tries to split on that
whole literal string, not on any one of the characters in it.

-tkc
 
Chris Rebert

> I am unable to figure out why the first two statements work as I
> expect them to and the next two do not. Namely, the first two split the
> sentence into its component words, while the latter two return the
> whole sentence intact.
>
> import string
> from string import whitespace
> mytext = "The quick brown fox jumped over the lazy dog.\n"
>
> print mytext.split()
> print mytext.split(' ')
> print mytext.split(whitespace)
> print string.split(mytext, sep=whitespace)

Also note that a plain 'mytext.split()' with no arguments will split
on any whitespace character like you're trying to do here.

Cheers,
Chris
 
Chaim Krause

The documentation I am referencing states...

The sep argument may consist of multiple characters (for example, "'1,
2, 3'.split(', ')" returns "['1', '2', '3']").

So why don't the latter two split on *any* whitespace character,
instead of looking for the sep string as a whole?
 
Chaim Krause

I have arrived here while attempting to break down a larger problem. I
got to this question when attempting to split a line on any whitespace
character so that I could then add several other characters like ';'
and ':'. Ultimately splitting a line on any char in a union of
string.whitespace and some pre-designated chars.

I am now beginning to think that I have outgrown split() and must move
up to regular expressions. If that is the case, I will go off and RTFM
on RegEx.
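
For readers with the same goal, here is a minimal sketch of the regular-expression route, assuming the extra delimiters are ';' and ':' as mentioned above. The names `EXTRA_DELIMITERS` and `split_on_any` are hypothetical, chosen only for this example.

```python
import re
import string

# Assumed extra delimiters, taken from the post; adjust as needed.
EXTRA_DELIMITERS = ";:"

# Build a character class from the union of both strings. re.escape
# guards against characters that are special inside [...]; the trailing
# "+" makes a run of delimiters count as a single separator.
SPLITTER = re.compile(
    "[" + re.escape(string.whitespace + EXTRA_DELIMITERS) + "]+"
)

def split_on_any(line):
    """Split line on any delimiter character, dropping empty pieces."""
    return [piece for piece in SPLITTER.split(line) if piece]

words = split_on_any("alpha;beta: gamma\tdelta\n")
# ['alpha', 'beta', 'gamma', 'delta']
```

The list comprehension filters out the empty string that `re.split` would otherwise return when the line begins or ends with a delimiter.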
 
Chaim Krause

> The documentation I am referencing states...
>
> The sep argument may consist of multiple characters (for example, "'1,
> 2, 3'.split(', ')" returns "['1', '2', '3']").
>
> So why don't the latter two split on *any* whitespace character,
> instead of looking for the sep string as a whole?

Now, rereading the documentation in light of the replies to my
original posting, I see that I misinterpreted the example as using
"comma OR space" when it was actually "commaspace". I am now properly
enlightened.

Thank you all for your help.
 
MRAB

> This splits at the string ' '.
>
> This splits at the string '\t\n\x0b\x0c\r ', which doesn't occur in
> `mytext`. The argument is a string, not a set of characters.
>
> Same here.
<muse>
It's interesting, if you think about it, that here we have someone who
wants to split on a set of characters but 'split' splits on a string,
and others sometimes want to strip off a string but 'strip' strips on
a set of characters (passed as a string). You could imagine that if
Python had had (character) sets from the start then 'split' and
'strip' could have accepted a string or a set depending on whether you
wanted to split on or strip off a string or a set.
</muse>
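
The asymmetry described above can be seen in a short sketch (the sample strings are made up for illustration):

```python
# split() treats its argument as one literal separator string:
# only the exact two-character sequence ", " is a separator.
split_result = "a, b,c".split(", ")            # ['a', 'b,c']

# strip() treats its argument as a set of characters to remove
# from both ends, in any order and any number of times.
strip_result = "xyhelloyx".strip("xy")         # 'hello'

# This is why strip() cannot be used to remove a literal prefix or
# suffix: every leading or trailing 'w', '.', 'c', 'o', 'm' is removed.
surprise = "www.example.com".strip("w.com")    # 'example'
```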
 
Steven D'Aprano

> I have arrived here while attempting to break down a larger problem. I
> got to this question when attempting to split a line on any whitespace
> character so that I could then add several other characters like ';' and
> ':'. Ultimately splitting a line on any char in a union of
> string.whitespace and some pre-designated chars.
>
> I am now beginning to think that I have outgrown split() and must move
> up to regular expressions. If that is the case, I will go off and RTFM
> on RegEx.

Or just do this:

s = "the quick brown\tdog\njumps over\r\n\t the lazy dog"
s = s.replace('\t', ' ').replace('\n', ' ').replace('\r', ' ')
s.split(' ')


or even simpler:

s.split()
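
Note that the two variants are not quite equivalent. A short sketch using the same sample string shows the difference: the names `replaced`, `with_empties`, and `without_empties` are introduced here only for illustration.

```python
s = "the quick brown\tdog\njumps over\r\n\t the lazy dog"

# Replace-then-split: every single space is a separator, so the run
# "\r\n\t " (four spaces after replacement) leaves three empty strings.
replaced = s.replace('\t', ' ').replace('\n', ' ').replace('\r', ' ')
with_empties = replaced.split(' ')

# No-argument split: a run of whitespace counts as one separator,
# so no empty strings appear in the result.
without_empties = s.split()
```

Filtering the empty strings out of `with_empties` gives the same list as `without_empties`.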
 
bearophileHUGS

MRAB:
> It's interesting, if you think about it, that here we have someone who
> wants to split on a set of characters but 'split' splits on a string,
> and others sometimes want to strip off a string but 'strip' strips on
> a set of characters (passed as a string).

That can be seen as a little inconsistency in the language. But with
some practice you learn it.

> You could imagine that if
> Python had had (character) sets from the start then 'split' and
> 'strip' could have accepted a string or a set depending on whether you
> wanted to split on or strip off a string or a set.

Too bad you didn't suggest this when they were designing Python
3 :)
This may be suggested for Python 3.1.

Bye,
bearophile
 
MRAB

> MRAB:
>
> That can be seen as a little inconsistency in the language. But with
> some practice you learn it.
>
> Too bad you didn't suggest this when they were designing Python
> 3 :)
> This may be suggested for Python 3.1.

I might also add that str.startswith can accept a tuple of strings;
shouldn't that have been a set? :)
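
For reference, a minimal sketch of that behaviour (the file name and prefixes are made up for illustration). A tuple is accepted, while a set is rejected with a TypeError and has to be converted first:

```python
name = "report.txt"

# A tuple of candidate prefixes is accepted directly.
matches_tuple = name.startswith(("report", "summary"))

# A set is not accepted; startswith raises TypeError.
try:
    name.startswith({"report", "summary"})
    set_rejected = False
except TypeError:
    set_rejected = True

# Converting the set to a tuple works.
matches_from_set = name.startswith(tuple({"report", "summary"}))
```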

I also had the thought that the backtick (`), which is not used in
Python 3, could be used to form character set literals (`aeiou` =>
set("aeiou")), although that might only be worthwhile if character
sets were introduced as a specialised form of set.
 
bearophileHUGS

MRAB:
> I also had the thought that the backtick (`), which is not used in
> Python 3, could be used to form character set literals (`aeiou` =>
> set("aeiou")), although that might only be worthwhile if character
> sets were introduced as a specialised form of set.

Python developers have removed it from the syntax mostly because a lot
of keyboards (probably most in the world) don't have "`" on them.

Bye,
bearophile
 
