Everything you did not want to know about Unicode in Python 3

Discussion in 'Python' started by Mark Lawrence, May 12, 2014.

  1. Mark Lawrence, May 12, 2014

  2. Mark Lawrence

    Ian Kelly Guest

    The _is_binary_reader and _is_binary_writer functions look like they
    could be simplified by calling isinstance on the io object itself
    against io.TextIOBase, io.BufferedIOBase or io.RawIOBase, rather than
    doing those odd 0-length reads and writes. And then perhaps those
    exception-swallowing try-excepts wouldn't be necessary. But perhaps
    there's a non-obvious reason why it's written the way it is.
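
An isinstance-based check along those lines might look like the sketch below. The name is_binary_stream is hypothetical and not taken from the script under discussion. The check assumes the object inherits from the io base classes; a duck-typed file-like object that does not would be rejected, which may be one reason for probing with 0-length reads instead.

```python
import io


def is_binary_stream(stream):
    """Return True for byte streams and False for text streams."""
    if isinstance(stream, io.TextIOBase):
        return False
    if isinstance(stream, (io.BufferedIOBase, io.RawIOBase)):
        return True
    # Objects that only imitate a file (no io base class) end up here.
    raise TypeError("cannot classify %r" % (stream,))
```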

    And there appears to be a bug where everything *except* the filename
    '-' is treated as stdin, so the script probably hasn't been tested at
    all.

    This is an ad hominem. Just because his code sucks doesn't mean he's
    wrong about the state of Unicode and UNIX in Python 3.
    Ian Kelly, May 12, 2014

  3. Mark Lawrence

    MRAB Guest

    How about checking sys.stdin.mode and sys.stdout.mode?
    MRAB, May 12, 2014
  4. Mark Lawrence

    Ian Kelly Guest

    Seems to work, but I notice that the docs only define the mode
    attribute for the FileIO class, which sys.stdin and sys.stdout are not
    instances of.
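
A mode-based check could be sketched as follows. The helper name is_binary_by_mode is hypothetical. Because the mode attribute is only documented for FileIO, the sketch treats it as optional and falls back to an isinstance test when it is missing or is not a string.

```python
import io


def is_binary_by_mode(stream):
    """Guess whether a stream is binary from its mode attribute."""
    mode = getattr(stream, "mode", None)
    if isinstance(mode, str):
        # Files opened with open() report modes such as 'r' or 'rb'.
        return "b" in mode
    # No usable mode attribute: fall back to the io class hierarchy.
    return not isinstance(stream, io.TextIOBase)
```

With this helper, is_binary_by_mode(sys.stdout) and is_binary_by_mode(sys.stdout.buffer) would be the two calls to compare.
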
    Ian Kelly, May 12, 2014
  5. Uhm... I think wrongness of code is generally fairly indicative of
    wrongness of thinking :) If I write a rant about how Python's list
    type sucks and it turns out my code is using it like a cons cell and
    never putting more than two elements into a list, then you would
    accurately conclude that I'm wrong about the state of data type
    support in Python.

    I don't have a problem with someone coming to the list here with
    misconceptions. That's what discussions are for. But rants like that,
    on blogs, I quickly get weary of reading. The tone is always "Look
    what's so wrong", not inviting dialogue, and I can't be bothered
    digging into the details to compose a full response. Chances are the
    author's (a) not looking at 3.4 and what's happened to improve
    things (and certainly not 3.5 and what's going to happen), and (b) not
    listening to responses anyway.

    Chris Angelico, May 13, 2014
  6. Feel free to show us your version of "cat" for Python then. Feel free to
    target any version you like. Don't forget to test it against files with
    names and content that:

    - aren't valid UTF-8;

    - are valid UTF-8, but not valid in the local encoding.

    Armin Ronacher is an extremely experienced and knowledgeable Python
    developer, and a Python core developer. He might be wrong, but he's not
    *obviously* wrong.

    Unicode is hard, not because Unicode is hard, but because of legacy
    problems. I can create a file on a machine that uses ISO-8859-7 for the
    file name, put Shift-JIS encoded text inside it, transfer it to a
    machine that uses Windows-1251 as the file system encoding, then SSH into
    that machine from a system using Big5, and try to make sense of it. If
    everybody used UTF-8 any time data touched a disk or network, we'd be
    laughing. It would all be so simple.
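
The legacy problem can be illustrated with a few bytes. The byte string below is an arbitrary example chosen for this sketch: the same three bytes are Greek letters under ISO-8859-7, Cyrillic letters under Windows-1251, and not valid UTF-8 at all.

```python
raw = b"\xe1\xe2\xe3"  # arbitrary sample bytes, e.g. part of a file name

as_greek = raw.decode("iso8859_7")   # '\u03b1\u03b2\u03b3' (alpha beta gamma)
as_cyrillic = raw.decode("cp1251")   # '\u0431\u0432\u0433' (be ve ge)

try:
    raw.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    # 0xE1 announces a three-byte sequence, but no continuation bytes follow.
    valid_utf8 = False
```

Nothing in the bytes themselves says which reading is the intended one.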

    Reading Armin's post, I think that all that is needed to simplify his
    Python 3 version is:

    - have a bytes version of sys.argv (bargv? argvb?) and read
    the file names from that;

    - have a simple way to write bytes to stdout and stderr.

    Most programs won't need either of those, but file system utilities will.
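
As a rough sketch of that idea with what Python 3 already provides: os.fsencode turns a sys.argv entry back into the bytes the shell passed (on POSIX this relies on the surrogateescape handling of undecodable bytes), open accepts bytes file names, and sys.stdout.buffer takes raw bytes. The function name cat and its signature are assumptions made for this example, not Armin's code.

```python
import os
import shutil
import sys


def cat(names, out):
    """Copy each named file (or stdin for '-') to the binary stream out."""
    for name in names:
        raw_name = os.fsencode(name)  # str from sys.argv -> original bytes
        if raw_name == b"-":
            shutil.copyfileobj(sys.stdin.buffer, out)
        else:
            # Open by bytes name in binary mode: nothing is ever decoded.
            with open(raw_name, "rb") as f:
                shutil.copyfileobj(f, out)
    out.flush()
```

A script would call it as cat(sys.argv[1:], sys.stdout.buffer). Error reporting and option parsing are left out.
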
    Steven D'Aprano, May 13, 2014
  7. argb? :)
    I'm not sure how that goes with I/O redirection, but sure.

    Chris Angelico, May 13, 2014
  8. Yes. To put a finer point on that, Unicode (which is only a
    specification constantly being improved upon) is harder to implement
    when it hasn't been on the design board from the ground up; Python in
    this case.

    Julia has Unicode support from the ground up, and it was easier for
    those guys to implement (in beta release) than for the Python crew when
    they undertook the Unicode work that had to be done for Python3.x (just
    an observation).

    Anytime there are legacy code issues, regression testing problems, and a
    host of domain issues that weren't thought through from the get-go there
    are going to be more problematic hurdles; not to mention bugs.

    Having said that, I still think Unicode is somewhat harder than you're

    Mark H Harris, May 13, 2014
  9. I think http://bugs.python.org/issue8776 and
    http://bugs.python.org/issue8775 are relevant but both were placed in
    the small round filing cabinet.
    Mark Lawrence, May 13, 2014
  10. Mark Lawrence

    Rustom Mody Guest

    Thanks for a non-defensive appraisal!
    I think the most helpful way forward is to accept two things:
    a. Unicode is a headache
    b. No-unicode is a non-option
    About the technical merits of Armin's post and your suggestions, I've
    nothing to say, since I am an ignoramus on (the mechanics of) unicode.

    [Consider me an eager, early, ignorant adopter :) ]

    It's however good to note that unicode is rather unique in the history
    not just of IT/CS but of humanity, in the sense that no one (to the best
    of my knowledge) has ever tried to come up with an all-encompassing umbrella
    for all humanity's scripts/writing systems etc.

    So hiccups and mistakes are only to be expected. The absence of these would
    be much more surprising!
    Rustom Mody, May 13, 2014
  11. QOTW (so far...)
    Mark H Harris, May 13, 2014
  12. Mark Lawrence

    Gene Heskett Guest

    But it's early yet, only Tuesday & it's just barely started... :)

    Cheers, Gene
    "There are four boxes to be used in defense of liberty:
    soap, ballot, jury, and ammo. Please use in that order."
    -Ed Howdershelt (Author)
    Genes Web page <http://geneslinuxbox.net:6309/gene>
    US V Castleman, SCOTUS, Mar 2014 is grounds for Impeaching SCOTUS
    Gene Heskett, May 13, 2014
  13. Mark Lawrence

    Rustom Mody Guest

    I said that getting unicode right straight off is unrealistic.

    I should have added this:
    Armin makes a (sarcastic?) dig about the fact that python (3) goofs because
    it's mismatched with the assumptions of unix.

    | UNIX is bytes, has been defined that way and will always be that way. To

    | Unicode on UNIX is only madness if you force it on everything. But that's not
    | how Unicode on UNIX works. UNIX does not have a distinction between unicode
    | and byte APIs. They are one and the same which makes them easy to deal with.

    | Python 3 takes a very different stance on Unicode than UNIX does. Python 3
    | says: everything is Unicode ...

    This may be right...
    Or it may be the other way round as I claim at

    At this point I don't believe that anyone is very clear what is the
    right way and the wrong way.
    Rustom Mody, May 13, 2014
  14. Ironic that this should come up in a discussion on Unicode, given that
    Unicode's fundamental purpose is to welcome that whole rest of the
    world instead of yelling "LALALALALA America is everything" and
    pretending that ASCII, or Latin-1, or something, is all you need.

    Currently enjoying "Monday Night Flagging" on Threshold RPG... at 4pm
    on Tuesday.
    Chris Angelico, May 13, 2014
  15. Mark Lawrence

    alex23 Guest

    I tried and failed to come up with an "argy bargy" joke here so decided
    to go for a meta-reference instead.
    alex23, May 13, 2014
  16. I'm just waiting for someone to have need for arguments in both
    network byte order and host byte order. The latter, of course, would
    be "argh".

    Chris Angelico, May 13, 2014
  17. .... it isn't?

    Mark H Harris, May 13, 2014
  19. Mark Lawrence

    gregor Guest

    gregor, May 13, 2014
  20. He's correct about file name encodings. Which can be fixed really easily
    without messing everything up (sys.argv binary variant, open accepting
    binary filenames). But his suggestion that Go would be superior
    is just a horrible idea. An obviously horrible idea, too.

    Having dealt with the UTF-8 problems on Python2 I can safely say that I
    never, never ever want to go back to that freaky hell. If I deal with
    strings, I want to be able to sanely manipulate them and I want to be
    sure that after manipulation they're still valid strings. Manipulating
    the bytes representation of unicode data just doesn't work.
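
A small sketch of the difference, using an arbitrary sample word: slicing a str keeps whole code points, while slicing its UTF-8 bytes at the same index can cut a character in half and leave something that no longer decodes.

```python
text = "na\u00efve"              # 'naive' with a diaeresis on the i
data = text.encode("utf-8")      # b'na\xc3\xafve'

text_prefix = text[:3]           # 'na\u00ef', still a valid string
data_prefix = data[:3]           # b'na\xc3', a truncated UTF-8 sequence

try:
    data_prefix.decode("utf-8")
    still_valid = True
except UnicodeDecodeError:
    still_valid = False          # the bytes slice is no longer valid text
```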

    And I'm very very glad that some people felt the same way and
    implemented a sane, consistent way of dealing with Unicode in Python3.
    It's one of the reasons why I switched to Py3 very early and I love it.


    Ah, the latest and to this day most brilliant stunt of our great
    cosmologists: the secret prediction.
    - Karl Kaos on Rüdiger Thomas in dsa <hidbv3$om2$>
    Johannes Bauer, May 13, 2014
