split on NO-BREAK SPACE

Carsten Haese

> Is this a bug or a feature?
>
> Python 2.4.4 (#1, Oct 19 2006, 11:55:22)
> [GCC 2.95.3 20010315 (SuSE)] on linux2
> >>> a = 'a b c\240d e'
> >>> a
> 'a b c\xa0d e'
> >>> a.split()
> ['a', 'b', 'c\xa0d', 'e']
> >>> a = a.decode('latin-1')
> >>> a
> u'a b c\xa0d e'
> >>> a.split()
> [u'a', u'b', u'c', u'd', u'e']

It's a feature. See help(str.split): "If sep is not specified or is
None, any whitespace string is a separator."
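
The same experiment can be sketched in Python 3. This is an assumption, since the thread uses Python 2.4: here `bytes` stands in for the old byte-string `str` and `str` for `unicode`. The variable names are illustrative only.

```python
# Python 3 sketch: bytes stands in for Python 2's str, str for unicode.
raw = b'a b c\xa0d e'          # byte 0xA0 has no fixed meaning without an encoding
text = raw.decode('latin-1')   # in Latin-1, 0xA0 is U+00A0 NO-BREAK SPACE

bytes_parts = raw.split()      # no sep: splits on ASCII whitespace only
text_parts = text.split()      # no sep: splits on any Unicode whitespace

print(bytes_parts)  # [b'a', b'b', b'c\xa0d', b'e']
print(text_parts)   # ['a', 'b', 'c', 'd', 'e']
```

If the no-break space should survive, passing an explicit separator such as `text.split(' ')` splits on the ordinary space only.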
 
Peter Kleiweg

Carsten Haese wrote on 22 July 2007:
> It's a feature. See help(str.split): "If sep is not specified or is
> None, any whitespace string is a separator."

Define "any whitespace".
Why is it different in <type 'str'> and <type 'unicode'>?
Why does split() split when it says NO-BREAK?
 
Carsten Haese

> Define "any whitespace".

Any string for which isspace returns True.

> Why is it different in <type 'str'> and <type 'unicode'>?
True

For byte strings, Python doesn't know whether 0xA0 is a whitespace
because it depends on the encoding whether the number 160 corresponds to
a whitespace character. For unicode strings, code point 160 is
unquestionably a whitespace, because it is a no-break SPACE.

> Why does split() split when it says NO-BREAK?

Precisely. It says NO-BREAK. It doesn't say NO-SPLIT.
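
The encoding dependence can be illustrated with a short sketch, assuming Python 3 and two codecs chosen only as examples (`latin-1` and `cp437`). The same byte decodes to a space-like character under one and to a letter under the other.

```python
# One byte, two encodings, two different characters.
byte = b'\xa0'

as_latin1 = byte.decode('latin-1')  # U+00A0 NO-BREAK SPACE
as_cp437 = byte.decode('cp437')     # U+00E1 LATIN SMALL LETTER A WITH ACUTE

print(as_latin1.isspace())  # True
print(as_cp437.isspace())   # False
print(byte.isspace())       # False: bytes methods only know ASCII whitespace
```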
 
Peter Kleiweg

Carsten Haese wrote on 22 July 2007:
> Any string for which isspace returns True.

Define white space to isspace()

Here is another "space":
False

isspace() is inconsistent

> For byte strings, Python doesn't know whether 0xA0 is a whitespace
> because it depends on the encoding whether the number 160 corresponds to
> a whitespace character. For unicode strings, code point 160 is
> unquestionably a whitespace, because it is a no-break SPACE.

I question it. And so does the sre module:

\s Matches any whitespace character; equivalent to [ \t\n\r\f\v]

Where is the NO-BREAK SPACE in there?
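
For comparison, a sketch assuming Python 3's `re` module rather than the 2.4 documentation quoted above. There, `\s` in a `str` pattern matches Unicode whitespace by default, and the `re.ASCII` flag restricts it to the set `[ \t\n\r\f\v]`.

```python
import re

text = 'a b c\xa0d e'

# Default for str patterns: \s covers Unicode whitespace, including U+00A0.
unicode_split = re.split(r'\s+', text)

# With re.ASCII, \s is limited to [ \t\n\r\f\v], so the no-break space stays.
ascii_split = re.split(r'\s+', text, flags=re.ASCII)

print(unicode_split)  # ['a', 'b', 'c', 'd', 'e']
print(ascii_split)    # ['a', 'b', 'c\xa0d', 'e']
```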

> Precisely. It says NO-BREAK. It doesn't say NO-SPLIT.

That is a stupid answer.
 
Wildemar Wildenburger

Peter said:
> Define white space to isspace()

Explain that phrase.

> Here is another "space":
>
> False
>
> isspace() is inconsistent

I don't really know much about unicode, but google tells me that \uFEFF
is a byte order mark. I thought we were implicitly in unison that
"whitespace" (whatever the formal definition) means "the stuff we put
into text to visually separate words".
So what is *your* definition of whitespace?

> That is a stupid answer.

I fail to see why you deem it a good idea to become insulting at this point.
It is a very valid answer: NO-BREAK means "when wrapping characters into
paragraphs do not break at this space".
split() however does not wrap text, it /splits/ it (at whitespace
characters, as it happens). The NO-BREAK semantic has no meaning here.
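
The difference between wrapping and splitting can be sketched as follows, assuming Python 3's `textwrap` module, which only treats ASCII whitespace as a place to break a line. The sample text and width are made up for illustration.

```python
import textwrap

text = 'one two\xa0three four'

# Line wrapping honours NO-BREAK: 'two' and 'three' stay on the same line.
wrapped = textwrap.wrap(text, width=9)

# Splitting is not wrapping: U+00A0 is whitespace like any other.
pieces = text.split()

print(wrapped)  # ['one', 'two\xa0three', 'four']
print(pieces)   # ['one', 'two', 'three', 'four']
```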


/W
 
Steve Holden

Jean-Paul Calderone wrote:
> On Sun, 22 Jul 2007 21:13:02 +0200, Peter Kleiweg
> It's only inconsistent if you think it should behave based on the
> name of a unicode code point. It doesn't use the name, though. It
> uses the category. NO-BREAK SPACE is in the Zs category (Separator, Space).
> ZERO WIDTH NO-BREAK SPACE is in the Cf category (Other, Format).
>
> Maybe that makes unicode inconsistent (I won't try to argue either way),
> but it's pretty clear that isspace is being consistent based on the data
> it has to work with.

Well, if you're going to start answering questions with FACTS, how can
questioners rely on their prejudices to guide them any more?

regards
Steve
--
Steve Holden +1 571 484 6266 +1 800 494 3119
Holden Web LLC/Ltd http://www.holdenweb.com
Skype: holdenweb http://del.icio.us/steve.holden
 
I V

> Here is another "space":
>
> False
>
> isspace() is inconsistent

Well, U+00A0 is in the category "Separator, Space" while U+FEFF is in the
category "Other, Format", so it doesn't seem unreasonable that one is
treated as a space and the other isn't.
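
The two categories can be checked with the standard `unicodedata` module. This is a sketch assuming Python 3; the `report` list is just an illustrative way to collect the results.

```python
import unicodedata

# Collect (name, general category, isspace result) for both code points.
report = [
    (unicodedata.name(ch), unicodedata.category(ch), ch.isspace())
    for ch in ('\u00a0', '\ufeff')
]

for name, category, is_space in report:
    print(name, category, is_space)
# NO-BREAK SPACE Zs True
# ZERO WIDTH NO-BREAK SPACE Cf False
```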
 
Ben Finney

Steve Holden said:
> Well, if you're going to start answering questions with FACTS, how
> can questioners rely on their prejudices to guide them any more?

You clearly underestimate the capacity for such people to choose only
the particular facts that support those prejudices.
 
