split on NO-BREAK SPACE

Discussion in 'Python' started by Peter Kleiweg, Jul 22, 2007.

  1. Is this a bug or a feature?


    Python 2.4.4 (#1, Oct 19 2006, 11:55:22)
    [GCC 2.95.3 20010315 (SuSE)] on linux2

    >>> a = 'a b c\240d e'
    >>> a
    'a b c\xa0d e'
    >>> a.split()
    ['a', 'b', 'c\xa0d', 'e']
    >>> a = a.decode('latin-1')
    >>> a
    u'a b c\xa0d e'
    >>> a.split()
    [u'a', u'b', u'c', u'd', u'e']



    --
    Peter Kleiweg L:NL,af,da,de,en,ia,nds,no,sv,(fr,it) S:NL,de,en,(da,ia)
    info: http://www.let.rug.nl/kleiweg/ls.html
     
    Peter Kleiweg, Jul 22, 2007
    #1

  2. On Sun, 2007-07-22 at 17:15 +0200, Peter Kleiweg wrote:
    > Is this a bug or a feature?
    >
    >
    > Python 2.4.4 (#1, Oct 19 2006, 11:55:22)
    > [GCC 2.95.3 20010315 (SuSE)] on linux2
    >
    > >>> a = 'a b c\240d e'
    > >>> a
    > 'a b c\xa0d e'
    > >>> a.split()
    > ['a', 'b', 'c\xa0d', 'e']
    > >>> a = a.decode('latin-1')
    > >>> a
    > u'a b c\xa0d e'
    > >>> a.split()
    > [u'a', u'b', u'c', u'd', u'e']


    It's a feature. See help(str.split): "If sep is not specified or is
    None, any whitespace string is a separator."

    --
    Carsten Haese
    http://informixdb.sourceforge.net
     
    Carsten Haese, Jul 22, 2007
    #2

  3. Carsten Haese wrote on the 22nd day of the hay month (July) of the year 2007:

    > On Sun, 2007-07-22 at 17:15 +0200, Peter Kleiweg wrote:
    > > Is this a bug or a feature?
    > >
    > >
    > > Python 2.4.4 (#1, Oct 19 2006, 11:55:22)
    > > [GCC 2.95.3 20010315 (SuSE)] on linux2
    > >
    > > >>> a = 'a b c\240d e'
    > > >>> a
    > > 'a b c\xa0d e'
    > > >>> a.split()
    > > ['a', 'b', 'c\xa0d', 'e']
    > > >>> a = a.decode('latin-1')
    > > >>> a
    > > u'a b c\xa0d e'
    > > >>> a.split()
    > > [u'a', u'b', u'c', u'd', u'e']

    >
    > It's a feature. See help(str.split): "If sep is not specified or is
    > None, any whitespace string is a separator."


    Define "any whitespace".
    Why is it different in <type 'str'> and <type 'unicode'>?
    Why does split() split when it says NO-BREAK?

    --
    Peter Kleiweg L:NL,af,da,de,en,ia,nds,no,sv,(fr,it) S:NL,de,en,(da,ia)
    info: http://www.let.rug.nl/kleiweg/ls.html
     
    Peter Kleiweg, Jul 22, 2007
    #3
  4. On Sun, 2007-07-22 at 17:44 +0200, Peter Kleiweg wrote:
    > > It's a feature. See help(str.split): "If sep is not specified or is
    > > None, any whitespace string is a separator."

    >
    > Define "any whitespace".


    Any string for which isspace returns True.

    > Why is it different in <type 'str'> and <type 'unicode'>?


    >>> '\xa0'.isspace()
    False
    >>> u'\xa0'.isspace()
    True

    For byte strings, Python doesn't know whether 0xA0 is a whitespace
    because it depends on the encoding whether the number 160 corresponds to
    a whitespace character. For unicode strings, code point 160 is
    unquestionably a whitespace, because it is a no-break SPACE.
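
For readers on Python 3, where `str` is always Unicode and the old byte string has become `bytes`, the same difference can be reproduced roughly as follows. This is only a sketch of the behaviour described above, not the 2.4 session from the original post; the variable names are made up for illustration.

```python
# Python 3 sketch: bytes play the role of the old <type 'str'>,
# str plays the role of the old <type 'unicode'>.
raw = b'a b c\xa0d e'          # byte 0xA0, its meaning depends on the encoding
text = raw.decode('latin-1')   # in Latin-1, 0xA0 is U+00A0 NO-BREAK SPACE

# bytes methods only recognise ASCII whitespace
bytes_is_space = b'\xa0'.isspace()
bytes_parts = raw.split()

# str methods consult the Unicode character database
text_is_space = '\xa0'.isspace()
text_parts = text.split()

print(bytes_is_space, bytes_parts)
print(text_is_space, text_parts)
```

The byte string keeps `c\xa0d` together, while the decoded text splits into five pieces, matching the two results in the original question.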

    > Why does split() split when it says NO-BREAK?


    Precisely. It says NO-BREAK. It doesn't say NO-SPLIT.

    --
    Carsten Haese
    http://informixdb.sourceforge.net
     
    Carsten Haese, Jul 22, 2007
    #4
  5. Carsten Haese wrote on the 22nd day of the hay month (July) of the year 2007:

    > On Sun, 2007-07-22 at 17:44 +0200, Peter Kleiweg wrote:
    > > > It's a feature. See help(str.split): "If sep is not specified or is
    > > > None, any whitespace string is a separator."

    > >
    > > Define "any whitespace".

    >
    > Any string for which isspace returns True.


    Define white space to isspace()

    > > Why is it different in <type 'str'> and <type 'unicode'>?

    >
    > >>> '\xa0'.isspace()
    > False
    > >>> u'\xa0'.isspace()
    > True


    Here is another "space":

    >>> u'\uFEFF'.isspace()
    False

    isspace() is inconsistent

    > For byte strings, Python doesn't know whether 0xA0 is a whitespace
    > because it depends on the encoding whether the number 160 corresponds to
    > a whitespace character. For unicode strings, code point 160 is
    > unquestionably a whitespace, because it is a no-break SPACE.


    I question it. And so does the sre module:

    \s Matches any whitespace character; equivalent to [ \t\n\r\f\v]

    Where is the NO-BREAK SPACE in there?
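
A small check of how `\s` treats the NO-BREAK SPACE may help here. The sketch below assumes Python 3, where `str` patterns are Unicode-aware by default; in the Python 2 of this thread the quoted ASCII-only definition was the default and the `re.UNICODE` flag had to be passed to widen it. The variable names are illustrative only.

```python
import re

nbsp = '\xa0'  # U+00A0 NO-BREAK SPACE

# Default for str patterns in Python 3: \s follows Unicode whitespace
unicode_match = re.fullmatch(r'\s', nbsp) is not None

# With re.ASCII, \s is limited to [ \t\n\r\f\v], as in the quoted docs
ascii_match = re.fullmatch(r'\s', nbsp, flags=re.ASCII) is not None

print(unicode_match, ascii_match)
```

So whether the NO-BREAK SPACE is "in there" depends on which of the two whitespace definitions the pattern is compiled with.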


    > > Why does split() split when it says NO-BREAK?

    >
    > Precisely. It says NO-BREAK. It doesn't say NO-SPLIT.


    That is a stupid answer.


    --
    Peter Kleiweg L:NL,af,da,de,en,ia,nds,no,sv,(fr,it) S:NL,de,en,(da,ia)
    info: http://www.let.rug.nl/kleiweg/ls.html
     
    Peter Kleiweg, Jul 22, 2007
    #5
  6. Peter Kleiweg wrote:
    >
    > Define white space to isspace()
    >
    >

    Explain that phrase.

    >
    > Here is another "space":
    >
    > >>> u'\uFEFF'.isspace()
    > False
    >
    > isspace() is inconsistent
    >

    I don't really know much about unicode, but google tells me that \uFEFF
    is a byte order mark. I thought we were implicitly in unison that
    "whitespace" (whatever the formal definition) means "the stuff we put
    into text to visually separate words".
    So what is *your* definition of whitespace?


    >>> Why does split() split when it says NO-BREAK?
    >>>

    >> Precisely. It says NO-BREAK. It doesn't say NO-SPLIT.
    >>

    >
    > That is a stupid answer.
    >
    >

    I fail to see why you deem it a good idea to become insulting at this point.
    It is a very valid answer: NO-BREAK means "when wrapping characters into
    paragraphs do not break at this space".
    split() however does not wrap text, it /splits/ it (at whitespace
    characters, as it happens). The NO-BREAK semantic has no meaning here.
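
To make the distinction between splitting and wrapping concrete, here is a rough Python 3 sketch. It assumes the standard `textwrap` module, which breaks lines only at ASCII whitespace, and uses an explicit character class for anyone who wants a split that leaves the NO-BREAK SPACE alone. None of this is from the original posts; the names are chosen for illustration.

```python
import re
import textwrap

s = 'a b c\xa0d e'

# Splitting: NO-BREAK SPACE counts as whitespace for str.split()
split_all = s.split()

# Splitting on ASCII whitespace only keeps 'c\xa0d' together
split_ascii = re.split(r'[ \t\n\r\f\v]+', s)

# Wrapping: textwrap does not break a line at the NO-BREAK SPACE
wrapped = textwrap.wrap(s, width=3)

print(split_all)
print(split_ascii)
print(wrapped)
```

The wrapper honours the NO-BREAK semantics, while the plain `split()` does not need to, which is the point made above.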


    /W
     
    Wildemar Wildenburger, Jul 22, 2007
    #6
  7. Steve Holden wrote:

    Jean-Paul Calderone wrote:
    > On Sun, 22 Jul 2007 21:13:02 +0200, Peter Kleiweg wrote:
    >> Carsten Haese wrote on the 22nd day of the hay month (July) of the year 2007:
    >>
    >>> On Sun, 2007-07-22 at 17:44 +0200, Peter Kleiweg wrote:
    >>>>> It's a feature. See help(str.split): "If sep is not specified or is
    >>>>> None, any whitespace string is a separator."
    >>>> Define "any whitespace".
    >>> Any string for which isspace returns True.

    >> Define white space to isspace()
    >>
    >>>> Why is it different in <type 'str'> and <type 'unicode'>?
    >>> >>> '\xa0'.isspace()
    >>> False
    >>> >>> u'\xa0'.isspace()
    >>> True

    >> Here is another "space":
    >>
    >> >>> u'\uFEFF'.isspace()
    >> False
    >>
    >> isspace() is inconsistent

    >
    > It's only inconsistent if you think it should behave based on the
    > name of a unicode code point. It doesn't use the name, though. It
    > uses the category. NO-BREAK SPACE is in the Zs category (Separator, Space).
    > ZERO WIDTH NO-BREAK SPACE is in the Cf category (Other, Format).
    >
    > Maybe that makes unicode inconsistent (I won't try to argue either way),
    > but it's pretty clear that isspace is being consistent based on the data
    > it has to work with.
    >

    Well, if you're going to start answering questions with FACTS, how can
    questioners rely on their prejudices to guide them any more?

    regards
    Steve
    --
    Steve Holden +1 571 484 6266 +1 800 494 3119
    Holden Web LLC/Ltd http://www.holdenweb.com
    Skype: holdenweb http://del.icio.us/steve.holden
    --------------- Asciimercial ------------------
    Get on the web: Blog, lens and tag the Internet
    Many services currently offer free registration
    ----------- Thank You for Reading -------------
     
    Steve Holden, Jul 22, 2007
    #7
  8. I V wrote:

    On Sun, 22 Jul 2007 21:13:02 +0200, Peter Kleiweg wrote:
    > Here is another "space":
    >
    > >>> u'\uFEFF'.isspace()
    > False
    >
    > isspace() is inconsistent


    Well, U+00A0 is in the category "Separator, Space" while U+FEFF is in the
    category "Other, Format", so it doesn't seem unreasonable that one is
    treated as a space and the other isn't.
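
The categories mentioned here can be looked up with the standard `unicodedata` module. The following is a minimal Python 3 sketch; note that, as far as the Python documentation describes it, `str.isspace()` also takes the bidirectional class into account, so the category alone is not the whole rule. The `categories` dictionary is just a helper for this example.

```python
import unicodedata

chars = ('\u00a0', '\ufeff')

# Print code point, official name, general category and isspace() result
for ch in chars:
    print(
        'U+%04X' % ord(ch),
        unicodedata.name(ch),
        unicodedata.category(ch),
        ch.isspace(),
    )

# Collect the general categories for a quick comparison
categories = {ch: unicodedata.category(ch) for ch in chars}
```

U+00A0 reports category `Zs` and U+FEFF reports `Cf`, which lines up with the two `isspace()` results quoted in the thread.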
     
    I V, Jul 22, 2007
    #8
  9. Ben Finney wrote:

    Steve Holden writes:

    > Well, if you're going to start answering questions with FACTS, how
    > can questioners rely on their prejudices to guide them any more?


    You clearly underestimate the capacity for such people to choose only
    the particular facts that support those prejudices.

    --
    \ "Are you pondering what I'm pondering?" "I think so, Brain, but |
    `\ I don't think Kay Ballard's in the union." -- _Pinky and The |
    _o__) Brain_ |
    Ben Finney
     
    Ben Finney, Jul 23, 2007
    #9
