split on NO-BREAK SPACE

Peter Kleiweg · Jul 22, 2007

Is this a bug or a feature?

Python 2.4.4 (#1, Oct 19 2006, 11:55:22)
[GCC 2.95.3 20010315 (SuSE)] on linux2

>>> a = 'a b c\240d e'
>>> a
'a b c\xa0d e'
>>> a.split()
['a', 'b', 'c\xa0d', 'e']
>>> a = a.decode('latin-1')
>>> a
u'a b c\xa0d e'
>>> a.split()
[u'a', u'b', u'c', u'd', u'e']

Carsten Haese · Jul 22, 2007

It's a feature. See help(str.split): "If sep is not specified or is
None, any whitespace string is a separator."

Peter Kleiweg · Jul 22, 2007

Carsten Haese wrote on the 22nd day of the hay month (July) of the year 2007:

It's a feature. See help(str.split): "If sep is not specified or is
None, any whitespace string is a separator."

Define "any whitespace".
Why is it different in <type 'str'> and <type 'unicode'>?
Why does split() split when it says NO-BREAK?

Carsten Haese · Jul 22, 2007

Define "any whitespace".

Any string for which isspace returns True.

Why is it different in <type 'str'> and <type 'unicode'>?

>>> u'\xa0'.isspace()
True

For byte strings, Python doesn't know whether 0xA0 is a whitespace
because it depends on the encoding whether the number 160 corresponds to
a whitespace character. For unicode strings, code point 160 is
unquestionably a whitespace, because it is a no-break SPACE.
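The distinction can be reproduced in a short sketch. It assumes Python 3 (the thread itself uses Python 2.4), where bytes plays the role of the old <type 'str'> and str plays the role of <type 'unicode'>. The cp437 decoding is added here only to illustrate that byte 0xA0 is not a space in every encoding; it does not come from the thread.

```python
# Python 3 sketch: bytes ~ old <type 'str'>, str ~ old <type 'unicode'>.
raw = b'a b c\xa0d e'          # bytes: 0xA0 has no meaning without an encoding
text = raw.decode('latin-1')   # str: 0xA0 becomes U+00A0 NO-BREAK SPACE

print(raw.split())             # bytes split on ASCII whitespace only
print(text.split())            # str split on every character where isspace() is true

print(b'\xa0'.isspace())       # False: not ASCII whitespace
print('\xa0'.isspace())        # True: U+00A0 is a space separator

# The same byte decoded with another codec (cp437) is a letter, not a space.
other = raw.decode('cp437')
print(other.split())
```

Under these assumptions the bytes object keeps c\xa0d together, while the latin-1 decoded string is split into five pieces.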

Why does split() split when it says NO-BREAK?

Precisely. It says NO-BREAK. It doesn't say NO-SPLIT.

Peter Kleiweg · Jul 22, 2007

Carsten Haese wrote on the 22nd day of the hay month (July) of the year 2007:

Any string for which isspace returns True.

Define white space to isspace()

>>> u'\xa0'.isspace()
True

Here is another "space":

>>> u'\uFEFF'.isspace()
False

isspace() is inconsistent

For byte strings, Python doesn't know whether 0xA0 is a whitespace
because it depends on the encoding whether the number 160 corresponds to
a whitespace character. For unicode strings, code point 160 is
unquestionably a whitespace, because it is a no-break SPACE.

I question it. And so does the sre module:

\s Matches any whitespace character; equivalent to [ \t\n\r\f\v]

Where is the NO-BREAK SPACE in there?
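For comparison, here is how \s behaves in a minimal sketch that assumes Python 3 rather than the Python 2.4 of this thread. There, str patterns are Unicode-aware by default and the re.ASCII flag restores the narrower [ \t\n\r\f\v] set. Python 2's re module documents a re.UNICODE flag that widens \s in a similar way.

```python
import re

text = 'c\xa0d'  # U+00A0 NO-BREAK SPACE between 'c' and 'd'

# Default for str patterns in Python 3: \s follows the Unicode database.
unicode_matches = re.findall(r'\s', text)

# With re.ASCII, \s is limited to [ \t\n\r\f\v].
ascii_matches = re.findall(r'\s', text, re.ASCII)

print(unicode_matches)
print(ascii_matches)
```

So whether the NO-BREAK SPACE is "in there" depends on which mode the pattern is compiled in.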

Precisely. It says NO-BREAK. It doesn't say NO-SPLIT.

That is a stupid answer.

Wildemar Wildenburger · Jul 22, 2007

Peter said:
Define white space to isspace()

Explain that phrase.

Here is another "space":

>>> u'\uFEFF'.isspace()
False

isspace() is inconsistent

I don't really know much about Unicode, but Google tells me that \uFEFF
is a byte order mark. I thought we were implicitly in unison that
"whitespace" (whatever the formal definition) means "the stuff we put
into text to visually separate words".
So what is *your* definition of whitespace?

That is a stupid answer.

I fail to see why you deem it a good idea to become insulting at this point.
It is a very valid answer: NO-BREAK means "when wrapping characters into
paragraphs do not break at this space".
split() however does not wrap text, it /splits/ it (at whitespace
characters, as it happens). The NO-BREAK semantic has no meaning here.
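If the goal is to keep words joined by a NO-BREAK SPACE together, one option is to name the separators explicitly instead of relying on the default. The following is only a sketch, assuming Python 3; split_ascii_whitespace is a hypothetical helper name, not something from the standard library or from this thread.

```python
import re

def split_ascii_whitespace(text):
    """Split on runs of ASCII whitespace only (hypothetical helper).

    U+00A0 is not in the character class, so it stays inside its word.
    """
    return re.findall(r'[^ \t\n\r\f\v]+', text)

print(split_ascii_whitespace('a b c\xa0d e'))
print(split_ascii_whitespace('  a\t b\n'))
```

Collecting the runs of non-separator characters avoids the empty strings that re.split would produce for leading or trailing whitespace.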

/W

Steve Holden · Jul 22, 2007

Jean-Paul Calderone wrote:
> On Sun, 22 Jul 2007 21:13:02 +0200, Peter Kleiweg

It's only inconsistent if you think it should behave based on the
name of a unicode code point. It doesn't use the name, though. It
uses the category. NO-BREAK SPACE is in the Zs category (Separator, Space).
ZERO WIDTH NO-BREAK SPACE is in the Cf category (Other, Format).

Maybe that makes unicode inconsistent (I won't try to argue either way),
but it's pretty clear that isspace is being consistent based on the data
it has to work with.

Well, if you're going to start answering questions with FACTS, how can
questioners rely on their prejudices to guide them any more?

regards
Steve

I V · Jul 22, 2007

Here is another "space":

>>> u'\uFEFF'.isspace()
False

isspace() is inconsistent

Well, U+00A0 is in the category "Separator, Space" while U+FEFF is in the
category "Other, Format", so it doesn't seem unreasonable that one is
treated as a space and the other isn't.
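The categories can be checked with the standard unicodedata module. This is a small sketch assuming Python 3; describe is a hypothetical helper name introduced only for illustration.

```python
import unicodedata

def describe(ch):
    """Return (name, general category, isspace result) for one character."""
    return (unicodedata.name(ch), unicodedata.category(ch), ch.isspace())

for ch in ('\u00a0', '\ufeff'):
    name, category, is_space = describe(ch)
    print('U+%04X' % ord(ch), name, category, is_space)
```

The output shows the two characters side by side: both names contain "NO-BREAK SPACE", but only the one in category Zs is reported as whitespace by isspace().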

Ben Finney · Jul 23, 2007

Steve Holden said:
Well, if you're going to start answering questions with FACTS, how
can questioners rely on their prejudices to guide them any more?

You clearly underestimate the capacity for such people to choose only
the particular facts that support those prejudices.
