Is Unicode support so hard...

Discussion in 'Python' started by jmfauth, Apr 20, 2013.

  1. jmfauth

    jmfauth Guest

    In a previous post,

    http://groups.google.com/group/comp.lang.python/browse_thread/thread/6aec70817705c226#
    ,

    Chris “Kwpolska” Warrick wrote:

    “Is Unicode support so hard, especially in the 21st century?”

    --

    Unicode is not really complicate and it works very well (more
    than two decades of development if you take into account
    iso-14****).

    But, - I can say, "as usual" - people prefer to spend their
    time to make a "better Unicode than Unicode" and it usually
    fails. Python does not escape to this rule.

    -----

    I'm "busy" with TeX (unicode engine variant), fonts and typography.
    This gives me plenty of ideas to test the "flexible string
    representation" (FSR). I should recognize this FSR is failing
    particulary very well...

    I can almost say, a delight.

    jmf
    Unicode lover
     
    jmfauth, Apr 20, 2013
    #1

  2. On 4/20/2013 1:12 PM, jmfauth wrote:
    > In a previous post,
    >
    > http://groups.google.com/group/comp.lang.python/browse_thread/thread/6aec70817705c226#
    > ,
    >
    > Chris “Kwpolska” Warrick wrote:
    >
    > “Is Unicode support so hard, especially in the 21st century?”
    >
    > --
    >
    > Unicode is not really complicate and it works very well (more
    > than two decades of development if you take into account
    > iso-14****).
    >
    > But, - I can say, "as usual" - people prefer to spend their
    > time to make a "better Unicode than Unicode" and it usually
    > fails. Python does not escape to this rule.
    >
    > -----
    >
    > I'm "busy" with TeX (unicode engine variant), fonts and typography.
    > This gives me plenty of ideas to test the "flexible string
    > representation" (FSR). I should recognize this FSR is failing
    > particulary very well...
    >
    > I can almost say, a delight.
    >
    > jmf
    > Unicode lover

    I'm totally confused about what you are saying. What does "make a
    better Unicode than Unicode" mean? Are you saying that Python is guilty
    of this? In what way? Can you provide specifics? Or are you saying
    that you like how Python has implemented it? "FSR is failing ... a
    delight"? I don't know what you mean.

    --Ned.
     
    Ned Batchelder, Apr 20, 2013
    #2

  3. On Sat, Apr 20, 2013 at 10:22 AM, Ned Batchelder <> wrote:
    > On 4/20/2013 1:12 PM, jmfauth wrote:
    >>
    >> In a previous post,
    >>
    >>
    >> http://groups.google.com/group/comp.lang.python/browse_thread/thread/6aec70817705c226#
    >> ,
    >>
    >> Chris “Kwpolska” Warrick wrote:
    >>
    >> “Is Unicode support so hard, especially in the 21st century?”
    >>
    >> --
    >>
    >> Unicode is not really complicate and it works very well (more
    >> than two decades of development if you take into account
    >> iso-14****).
    >>
    >> But, - I can say, "as usual" - people prefer to spend their
    >> time to make a "better Unicode than Unicode" and it usually
    >> fails. Python does not escape to this rule.
    >>
    >> -----
    >>
    >> I'm "busy" with TeX (unicode engine variant), fonts and typography.
    >> This gives me plenty of ideas to test the "flexible string
    >> representation" (FSR). I should recognize this FSR is failing
    >> particulary very well...
    >>
    >> I can almost say, a delight.
    >>
    >> jmf
    >> Unicode lover

    >
    > I'm totally confused about what you are saying. What does "make a better
    > Unicode than Unicode" mean? Are you saying that Python is guilty of this?
    > In what way? Can you provide specifics? Or are you saying that you like
    > how Python has implemented it? "FSR is failing ... a delight"? I don't
    > know what you mean.
    >
    > --Ned.


    Don't bother trying to figure this out. jmfauth has been hijacking
    every thread that mentions Unicode to complain about the flexible
    string representation introduced in Python 3.3. Apparently, having
    proper Unicode semantics (indexing is based on characters, not code
    points) at the expense of performance when calling .replace on the
    only non-ASCII or BMP character in the string is a horrible bug.
     
    Benjamin Kaplan, Apr 20, 2013
    #3
  4. On Sun, Apr 21, 2013 at 3:22 AM, Ned Batchelder <> wrote:
    > I'm totally confused about what you are saying. What does "make a better
    > Unicode than Unicode" mean? Are you saying that Python is guilty of this?
    > In what way? Can you provide specifics? Or are you saying that you like
    > how Python has implemented it? "FSR is failing ... a delight"? I don't
    > know what you mean.


    You're not familiar with jmf? He's one of our resident trolls. Allow
    me to summarize Python 3's Unicode support...

    From 3.0 up to and including 3.2.x, Python could be built as either
    "narrow" or "wide". A wide build consumes four bytes per character in
    every string, which is rather wasteful (given that very few strings
    actually NEED that); a narrow build gets some things wrong. (I'm using
    a 2.7 here as I don't have a narrow-build 3.x handy; the same
    considerations apply, though.)

    Python 2.7.4 (default, Apr 6 2013, 19:54:46) [MSC v.1500 32 bit
    (Intel)] on win32
    Type "copyright", "credits" or "license()" for more information.
    >>> len(u"asdf\U00012345qwer")
    10
    >>> u"asdf\U00012345qwer"[8]
    u'e'

    In a narrow build, strings are stored in UTF-16, so astral characters
    count as two. This means that a program will behave unexpectedly
    differently on different platforms (other languages, such as
    ECMAScript, actually *mandate* UTF-16; at least this means you can
    depend on this otherwise-bizarre behaviour regardless of what platform
    you're on), and I have to say this is counter-intuitive.
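    For contrast, here is a sketch of the same test on a PEP 393 build
    (Python 3.3 or later), where indexing and length are character-based
    regardless of platform:

```python
# Same string as in the 2.7 narrow-build session above, run on
# Python 3.3+. Under PEP 393 every index addresses a whole
# character, so the astral character U+12345 counts as one, not two.
s = "asdf\U00012345qwer"
print(len(s))    # 9, not 10
print(s[4])      # the astral character itself
print(s[8])      # 'r' -- the last character, as you'd expect
```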

    Enter Python 3.3 and PEP 393 strings. Now *EVERY* Python build is,
    conceptually, wide. (I'm not sure how PEP 393 applies to other Pythons
    - Jython, PyPy, etc - so assume that whenever I refer to Python, I'm
    restricting this to CPython.) The underlying representation might be
    more efficient, but to the script, it's exactly the same as a wide
    build. If a string has no characters that demand more width, it'll be
    stored nice and narrow. (It's the same technique that Pike has been
    using for a while, so it's a proven system; in any case, we know that
    this is going to work, it's just a question of performance - it adds a
    fixed overhead.) Great! We save memory in Python programs. Wonderful!
    Right?
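    To see the per-string narrowing concretely, here is a small sketch
    using sys.getsizeof. The exact byte counts vary by CPython version
    and build, so only the ordering matters:

```python
import sys

# Under PEP 393, each string picks the narrowest storage that fits
# all of its characters: 1, 2, or 4 bytes per character.
ascii_s  = "a" * 100                # ASCII only: 1 byte per char
bmp_s    = "a" * 99 + "\u1234"      # one BMP char: 2 bytes per char
astral_s = "a" * 99 + "\U00012345"  # one astral char: 4 bytes per char

# All three have length 100, but their footprints differ.
print(sys.getsizeof(ascii_s))   # smallest
print(sys.getsizeof(bmp_s))     # larger
print(sys.getsizeof(astral_s))  # largest
```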

    Enter jmf. No, it's not wonderful, because OBVIOUSLY Python is now
    America-centric, because now the full Unicode range is divided into
    "these ones get stored in 1 byte per char, these in 2, these in 4".
    Clearly that's making life way worse for everyone else. Also, compared
    to the narrow build that jmf was previously using, this uses heaps
    MORE space in the stupid micro-benchmarks that he keeps on trotting
    out, because he has just one astral character in a sea of ASCII. And
    that's totally what programs are doing all the time, too. Never mind
    that basic operations like length, slicing, etc are no longer buggy,
    no, Python has taken a terrible step backwards here.

    Oh, and check this out:

    >>> def munge(s):
    ...     """Move characters around in a string."""
    ...     l = len(s)//4
    ...     return s[:l] + s[l*2:l*3] + s[l:l*2] + s[l*3:]
    ...
    >>> munge("asdfqwerzxcv1234")
    'asdfzxcvqwer1234'

    Looks fine.

    >>> munge(u"asd\U00012345we\U00034567xc\U00023456bla")
    u'asd\U00012167xc\U00023745we\U00034456bla'

    Where'd those characters come from? I was just moving stuff around,
    right? I can't get new characters out of it... can I?
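    For the curious, here is where those characters come from. On a
    narrow build, slicing can split an astral character between its two
    UTF-16 surrogates, and concatenation then glues the high surrogate
    of one character to the low surrogate of another, yielding a third
    character. A sketch of the arithmetic (the helper names are mine):

```python
def surrogate_pair(cp):
    """Split an astral code point into its UTF-16 surrogate pair."""
    cp -= 0x10000
    return 0xD800 + (cp >> 10), 0xDC00 + (cp & 0x3FF)

def combine(hi, lo):
    """Recombine a high/low surrogate pair into a code point."""
    return 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)

hi1, lo1 = surrogate_pair(0x12345)   # (0xD808, 0xDF45)
hi2, lo2 = surrogate_pair(0x34567)   # (0xD891, 0xDD67)

# The "new" U+12167 in the output above is the high surrogate of
# U+12345 glued to the low surrogate of U+34567:
print(hex(combine(hi1, lo2)))        # 0x12167
```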

    Flash forward to current date, and jmf has hijacked so many threads to
    moan about PEP 393 that I'm actually happy about this one, simply
    because he gave it a new subject line and one appropriate to a
    discussion about Unicode.

    ChrisA
     
    Chris Angelico, Apr 20, 2013
    #4
  5. On Sat, Apr 20, 2013 at 8:02 PM, Benjamin Kaplan
    <> wrote:
    > On Sat, Apr 20, 2013 at 10:22 AM, Ned Batchelder <> wrote:
    >> On 4/20/2013 1:12 PM, jmfauth wrote:
    >>>
    >>> In a previous post,
    >>>
    >>>
    >>> http://groups.google.com/group/comp.lang.python/browse_thread/thread/6aec70817705c226#
    >>> ,
    >>>
    >>> Chris “Kwpolska” Warrick wrote:
    >>>
    >>> “Is Unicode support so hard, especially in the 21st century?”
    >>>
    >>> --
    >>>
    >>> Unicode is not really complicate and it works very well (more
    >>> than two decades of development if you take into account
    >>> iso-14****).
    >>>
    >>> But, - I can say, "as usual" - people prefer to spend their
    >>> time to make a "better Unicode than Unicode" and it usually
    >>> fails. Python does not escape to this rule.
    >>>
    >>> -----
    >>>
    >>> I'm "busy" with TeX (unicode engine variant), fonts and typography.
    >>> This gives me plenty of ideas to test the "flexible string
    >>> representation" (FSR). I should recognize this FSR is failing
    >>> particulary very well...
    >>>
    >>> I can almost say, a delight.
    >>>
    >>> jmf
    >>> Unicode lover

    >>
    >> I'm totally confused about what you are saying. What does "make a better
    >> Unicode than Unicode" mean? Are you saying that Python is guilty of this?
    >> In what way? Can you provide specifics? Or are you saying that you like
    >> how Python has implemented it? "FSR is failing ... a delight"? I don't
    >> know what you mean.
    >>
    >> --Ned.

    >
    > Don't bother trying to figure this out. jmfauth has been hijacking
    > every thread that mentions Unicode to complain about the flexible
    > string representation introduced in Python 3.3. Apparently, having
    > proper Unicode semantics (indexing is based on characters, not code
    > points) at the expense of performance when calling .replace on the
    > only non-ASCII or BMP character in the string is a horrible bug.
    > --
    > http://mail.python.org/mailman/listinfo/python-list


    Don’t forget the original context: this was a short remark to a guy I
    was responding to. His newsgroups software (slrn according to the
    headers) mangled the encoding of U+201C and U+201D in my From field,
    turning them into three question marks each. And jmf started a rant,
    as usual…

    PS. There are two fancy Unicode characters around. Can you find both
    of them, jmf?

    --
    Kwpolska <http://kwpolska.tk> | GPG KEY: 5EAAEA16
    stop html mail | always bottom-post
    http://asciiribbon.org | http://caliburn.nl/topposting.html
     
    Chris “Kwpolska” Warrick, Apr 20, 2013
    #5
  6. On 20/04/2013 19:02, Benjamin Kaplan wrote:
    > On Sat, Apr 20, 2013 at 10:22 AM, Ned Batchelder <> wrote:
    >> On 4/20/2013 1:12 PM, jmfauth wrote:
    >>>
    >>> In a previous post,
    >>>
    >>>
    >>> http://groups.google.com/group/comp.lang.python/browse_thread/thread/6aec70817705c226#
    >>> ,
    >>>
    >>> Chris “Kwpolska” Warrick wrote:
    >>>
    >>> “Is Unicode support so hard, especially in the 21st century?”
    >>>
    >>> --
    >>>
    >>> Unicode is not really complicate and it works very well (more
    >>> than two decades of development if you take into account
    >>> iso-14****).
    >>>
    >>> But, - I can say, "as usual" - people prefer to spend their
    >>> time to make a "better Unicode than Unicode" and it usually
    >>> fails. Python does not escape to this rule.
    >>>
    >>> -----
    >>>
    >>> I'm "busy" with TeX (unicode engine variant), fonts and typography.
    >>> This gives me plenty of ideas to test the "flexible string
    >>> representation" (FSR). I should recognize this FSR is failing
    >>> particulary very well...
    >>>
    >>> I can almost say, a delight.
    >>>
    >>> jmf
    >>> Unicode lover

    >>
    >> I'm totally confused about what you are saying. What does "make a better
    >> Unicode than Unicode" mean? Are you saying that Python is guilty of this?
    >> In what way? Can you provide specifics? Or are you saying that you like
    >> how Python has implemented it? "FSR is failing ... a delight"? I don't
    >> know what you mean.
    >>
    >> --Ned.

    >
    > Don't bother trying to figure this out. jmfauth has been hijacking
    > every thread that mentions Unicode to complain about the flexible
    > string representation introduced in Python 3.3. Apparently, having
    > proper Unicode semantics (indexing is based on characters, not code
    > points) at the expense of performance when calling .replace on the
    > only non-ASCII or BMP character in the string is a horrible bug.
    >


    He can't complain about performance for the .replace issue any more as
    it's been fixed http://bugs.python.org/issue16061

    Sadly he'll almost certainly have more edge cases up his sleeve while
    continuing to ignore minor issues like memory saving and correctness.

    --
    If you're using GoogleCrap™ please read this
    http://wiki.python.org/moin/GoogleGroupsPython.

    Mark Lawrence
     
    Mark Lawrence, Apr 20, 2013
    #6
  7. jmfauth

    Ethan Furman Guest

    On 04/20/2013 11:14 AM, Chris Angelico wrote:
    > Flash forward to current date, and jmf has hijacked so many threads to
    > moan about PEP 393 that I'm actually happy about this one, simply
    > because he gave it a new subject line and one appropriate to a
    > discussion about Unicode.


    +1000
     
    Ethan Furman, Apr 21, 2013
    #7
  8. jmfauth

    rusi Guest

    On Apr 21, 4:03 am, Neil Hodgson <> wrote:
    >     Hi jmf,
    >
    > > This gives me plenty of ideas to test the "flexible string
    > > representation" (FSR). I should recognize this FSR is failing
    > > particulary very well...

    >
    >     This is too vague for me.
    >
    >     Which string representation should Python use?
    > 1) UTF-32
    > 2) UTF-8
    > 3) Python 3.3 -- 1, 2, or 4 bytes per character decided at runtime
    > 4) Python 3.2 -- 2 or 4 bytes per character decided at Python build time
    > 5) Something else


    jmf recommends UTF-8.

    Apart from the fact the UTF-8 would be less (time) performant in all
    cases and more extremely so in cases like indexing, the fact that jmf
    says so makes it more ridiculous.
    According to jmf python sucks up to ASCII (those big bad Americans… of
    whom Steven is the first…) whereas unicode is the true international/
    universal standard.
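    To make the indexing cost concrete, here is a rough sketch of why
    s[i] cannot be O(1) in a UTF-8 representation: the byte offset of
    the i-th character depends on the width of every character before
    it, so you must scan from the start. (The helper is mine, for
    illustration only.)

```python
def utf8_index(data: bytes, i: int) -> str:
    """Return the i-th character of UTF-8 bytes by linear scan."""
    def width(b):
        # Width of a UTF-8 sequence, from its leading byte.
        return 1 if b < 0x80 else 2 if b < 0xE0 else 3 if b < 0xF0 else 4

    pos = 0
    for _ in range(i):          # walk over i characters: O(n)
        pos += width(data[pos])
    w = width(data[pos])
    return data[pos:pos + w].decode("utf-8")

data = "asd\U00012345we".encode("utf-8")
print(utf8_index(data, 3))      # the astral character U+12345
print(utf8_index(data, 4))      # 'w'
```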

    I guess the irony is clear to all (except jmf) given that:
    - its unicode that sucks up to ASCII by carefully conforming in the
    first 127 positions including the completely useless control chars;
    python just implements the standard
    - UTF-8 is an ASCII-biased unicode-compression method viz UTF-8 is
    most space-efficient on ASCII at the cost of being generally time-
    inefficient
    - All jmf's beefs (as far as I remember) are variations on the theme:
    "time-inefficiency is equivalent to non-unicode-compliant"

    In short he manifests a dog-in-the-manger mindset:
    "Since the whole world will never speak french (grief, mope, grumble,
    thrash…) everyone should pay for the Chinese character set's size even
    if they are monolingually English"

    All that said…

    I believe that the recent correction in unicode performance followed
    jmf's grumbles
    (Mark please correct me if I am wrong)
    So python community can be thankful to jmf even if he insists on
    laboring under bizarre political hallucinations.

    [Written from India where a monolingual person is as rare as a
    palmtree on a polecap]
     
    rusi, Apr 21, 2013
    #8
  9. On Sat, 20 Apr 2013 18:37:00 -0700, rusi wrote:

    > According to jmf python sucks up to ASCII (those big bad Americans… of
    > whom Steven is the first…)


    Watch who you're calling an American, mate.


    --
    Steven
     
    Steven D'Aprano, Apr 21, 2013
    #9
  10. On Sun, Apr 21, 2013 at 1:36 PM, Steven D'Aprano
    <> wrote:
    > On Sat, 20 Apr 2013 18:37:00 -0700, rusi wrote:
    >
    >> According to jmf python sucks up to ASCII (those big bad Americans… of
    >> whom Steven is the first…)

    >
    > Watch who you're calling an American, mate.


    I think he knows, and that's why he said it. You and I are foremost
    among Americans who are destroying Python.

    ChrisA
     
    Chris Angelico, Apr 21, 2013
    #10
  11. On Sunday, April 21, 2013 at 1:12:43 AM UTC+8, jmfauth wrote:
    > In a previous post,
    >
    > http://groups.google.com/group/comp.lang.python/browse_thread/thread/6aec70817705c226#
    > ,
    >
    > Chris “Kwpolska” Warrick wrote:
    >
    > “Is Unicode support so hard, especially in the 21st century?”
    >
    > --
    >
    > Unicode is not really complicate and it works very well (more
    > than two decades of development if you take into account
    > iso-14****).
    >
    > But, - I can say, "as usual" - people prefer to spend their
    > time to make a "better Unicode than Unicode" and it usually
    > fails. Python does not escape to this rule.
    >
    > -----
    >
    > I'm "busy" with TeX (unicode engine variant), fonts and typography.
    > This gives me plenty of ideas to test the "flexible string
    > representation" (FSR). I should recognize this FSR is failing
    > particulary very well...
    >
    > I can almost say, a delight.
    >
    > jmf
    > Unicode lover


    Supporting Unicode in the language itself is the easy part.
    Supporting it on a platform also involves the OS and the display
    and input hardware devices, which are not free most of the time.
     
    88888 Dihedral, Apr 21, 2013
    #11
  12. On 4/20/2013 9:37 PM, rusi wrote:

    > I believe that the recent correction in unicode performance followed
    > jmf's grumbles


    No, the correction followed upon his accurate report of a regression,
    last August, which was unfortunately mixed in with grumbles and
    inaccurate claims. Others separated out and verified the accurate
    report. I reported it to pydev and enquired as to its necessity, I
    believe Mark opened the tracker issue, and the two people who worked on
    optimizing 3.3 a year ago fairly quickly came up with two different
    patches. The several month delay after was a matter of testing and
    picking the best approach.
     
    Terry Jan Reedy, Apr 21, 2013
    #12
  13. On 21/04/2013 10:02, Terry Jan Reedy wrote:
    > On 4/20/2013 9:37 PM, rusi wrote:
    >
    >> I believe that the recent correction in unicode performance followed
    >> jmf's grumbles

    >
    > No, the correction followed upon his accurate report of a regression,
    > last August, which was unfortunately mixed in with grumbles and
    > inaccurate claims. Others separated out and verified the accurate
    > report. I reported it to pydev and enquired as to its necessity, I
    > believe Mark opened the tracker issue, and the two people who worked on
    > optimizing 3.3 a year ago fairly quickly came up with two different
    > patches. The several month delay after was a matter of testing and
    > picking the best approach.
    >
    >


    I'd again like to point out that all I did was raise the issue. It was
    based on data provided by Steven D'Aprano and confirmed by Serhiy Storchaka.

    --
    If you're using GoogleCrap™ please read this
    http://wiki.python.org/moin/GoogleGroupsPython.

    Mark Lawrence
     
    Mark Lawrence, Apr 21, 2013
    #13
