Everything you did not want to know about Unicode in Python 3

Mark Lawrence · May 12, 2014

This was *NOT* written by our resident Unicode expert:
http://lucumr.pocoo.org/2014/5/12/everything-about-unicode/

Posted as I thought it would make a rather pleasant change from
interminable threads about names vs values vs variables vs objects.

Ian Kelly · May 12, 2014

Surely those example programs are not the Pythonic way to do things, or
am I missing something?

The _is_binary_reader and _is_binary_writer functions look like they
could be simplified by calling isinstance on the io object itself
against io.TextIOBase, io.BufferedIOBase or io.RawIOBase, rather than
doing those odd 0-length reads and writes. And then perhaps those
exception-swallowing try-excepts wouldn't be necessary. But perhaps
there's a non-obvious reason why it's written the way it is.
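
A minimal sketch of the isinstance approach described above. The helper names are hypothetical and this is not the code from the blog post; it only shows what checking against the io base classes could look like.

```python
import io


def is_binary_stream(stream):
    """True if the object derives from one of the io binary base classes."""
    return isinstance(stream, (io.RawIOBase, io.BufferedIOBase))


def is_text_stream(stream):
    """True if the object derives from io.TextIOBase."""
    return isinstance(stream, io.TextIOBase)
```

One caveat: a duck-typed file-like object that does not inherit from the io base classes is reported as neither binary nor text. That may be one reason someone would probe with zero-length reads and writes instead, though that is only a guess.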

And there appears to be a bug where everything *except* the filename
'-' is treated as stdin, so the script probably hasn't been tested at
all.

If those code samples are anything to go by, this guy makes JMF look
sensible.

This is an ad hominem. Just because his code sucks doesn't mean he's
wrong about the state of Unicode and UNIX in Python 3.

MRAB · May 12, 2014

The _is_binary_reader and _is_binary_writer functions look like they
could be simplified by calling isinstance on the io object itself
against io.TextIOBase, io.BufferedIOBase or io.RawIOBase, rather than
doing those odd 0-length reads and writes. And then perhaps those
exception-swallowing try-excepts wouldn't be necessary. But perhaps
there's a non-obvious reason why it's written the way it is.

How about checking sys.stdin.mode and sys.stdout.mode?

Ian Kelly · May 12, 2014

How about checking sys.stdin.mode and sys.stdout.mode?

Seems to work, but I notice that the docs only define the mode
attribute for the FileIO class, which sys.stdin and sys.stdout are not
instances of.
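
A rough sketch of the mode check, written defensively because of the documentation gap noted above. The function name is hypothetical, and the three-way result (True, False, or None when no mode string is available) is an assumption made for illustration.

```python
def binary_according_to_mode(stream):
    """Return True/False from the stream's mode string, or None if it has none."""
    mode = getattr(stream, "mode", None)
    if not isinstance(mode, str):
        return None  # no usable mode attribute; the caller must fall back
    return "b" in mode
```

Returning None rather than guessing leaves room for a fallback such as an isinstance check when a stream carries no mode attribute at all.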

Chris Angelico · May 12, 2014

Just because his code sucks doesn't mean he's
wrong about the state of Unicode and UNIX in Python 3.

Uhm... I think wrongness of code is generally fairly indicative of
wrongness of thinking.

If I write a rant about how Python's list
type sucks and it turns out my code is using it like a cons cell and
never putting more than two elements into a list, then you would
accurately conclude that I'm wrong about the state of data type
support in Python.

I don't have a problem with someone coming to the list here with
misconceptions. That's what discussions are for. But rants like that,
on blogs, I quickly get weary of reading. The tone is always "Look
what's so wrong", not inviting dialogue, and I can't be bothered
digging into the details to compose a full response. Chances are the
author's (a) not looking at 3.4 and what's happened to improve
things (and certainly not 3.5 and what's going to happen), and (b) not
listening to responses anyway.

ChrisA

Steven D'Aprano · May 12, 2014

Surely those example programs are not the Pythonic way to do things, or
am I missing something?

Feel free to show us your version of "cat" for Python then. Feel free to
target any version you like. Don't forget to test it against files with
names and content that:

- aren't valid UTF-8;

- are valid UTF-8, but not valid in the local encoding.
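
To make the first test case concrete, here is a small sketch of a file name that is not valid UTF-8, and of the surrogateescape error handler that Python 3 uses to carry such bytes through str unchanged. The file name is made up for the example.

```python
raw_name = b"caf\xe9.txt"  # Latin-1 encoded name; 0xE9 is not valid UTF-8 here

# A strict decode rejects the name outright.
try:
    raw_name.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# surrogateescape smuggles the undecodable byte through as a lone surrogate,
# so encoding with the same handler restores the original bytes exactly.
as_text = raw_name.decode("utf-8", "surrogateescape")
round_tripped = as_text.encode("utf-8", "surrogateescape")
```

The resulting string round-trips, but it cannot be encoded strictly, so printing it to a UTF-8 terminal without an error handler raises UnicodeEncodeError.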

If those code samples are anything to go by, this guy makes JMF look
sensible.

Armin Ronacher is an extremely experienced and knowledgeable Python
developer, and a Python core developer. He might be wrong, but he's not
*obviously* wrong.

Unicode is hard, not because Unicode is hard, but because of legacy
problems. I can create a file on a machine that uses ISO-8859-7 for the
file name, put Shift-JIS encoded text inside it, transfer it to a
machine that uses Windows-1251 as the file system encoding, then SSH into
that machine from a system using Big5, and try to make sense of it. If
everybody used UTF-8 any time data touched a disk or network, we'd be
laughing. It would all be so simple.
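
A tiny illustration of the legacy problem, assuming a made-up Greek file name: the same bytes written under ISO-8859-7 are silently read as different characters under Windows-1251, and nothing in the bytes themselves says which reading is right.

```python
name = "\u03b1\u03b2\u03b3.txt"    # Greek alpha, beta, gamma, then ".txt"
raw = name.encode("iso-8859-7")    # what the first machine stores
misread = raw.decode("cp1251")     # what the second machine believes it sees

# No exception is raised: the decode succeeds, it just yields other letters.
same_bytes = misread.encode("cp1251") == raw
```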

Reading Armin's post, I think that all that is needed to simplify his
Python 3 version is:

- have a bytes version of sys.argv (bargv? argvb?) and read
the file names from that;

- have a simple way to write bytes to stdout and stderr.

Most programs won't need either of those, but file system utilities will.
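
Absent a bytes sys.argv, one possible workaround is to re-encode the decoded arguments with os.fsencode, which reverses the decoding Python applied on POSIX systems. The helper name argv_bytes is hypothetical; this is a sketch, not a proposal for the standard library.

```python
import os
import sys


def argv_bytes(argv=None):
    """Hypothetical helper: recover bytes arguments from the decoded argv."""
    if argv is None:
        argv = sys.argv
    # os.fsencode uses the filesystem encoding and its error handler, so
    # undecodable bytes that arrived as surrogates are restored on POSIX.
    return [os.fsencode(arg) for arg in argv]
```

Since open() accepts bytes paths, the results can be passed to it directly, for example open(argv_bytes()[1], "rb").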

Chris Angelico · May 12, 2014

Reading Armin's post, I think that all that is needed to simplify his
Python 3 version is:

- have a bytes version of sys.argv (bargv? argvb?) and read
the file names from that;

argb?

- have a simple way to write bytes to stdout and stderr.

I'm not sure how that goes with I/O redirection, but sure.

ChrisA

Mark H Harris · May 12, 2014

Unicode is hard, not because Unicode is hard, but because of legacy
problems.

Yes. To put a finer point on that, Unicode (which is only a
specification constantly being improved upon) is harder to implement
when it hasn't been on the design board from the ground up; Python in
this case.

Julia has Unicode support from the ground up, and it was easier for
those guys to implement (in beta release) than for the Python crew when
they undertook the Unicode work that had to be done for Python3.x (just
an observation).

Anytime there are legacy code issues, regression testing problems, and a
host of domain issues that weren't thought through from the get-go there
are going to be more problematic hurdles; not to mention bugs.

Having said that, I still think Unicode is somewhat harder than you're
admitting.

marcus

Mark Lawrence · May 12, 2014

Feel free to show us your version of "cat" for Python then. Feel free to
target any version you like. Don't forget to test it against files with
names and content that:

- aren't valid UTF-8;

- are valid UTF-8, but not valid in the local encoding.

Armin Ronacher is an extremely experienced and knowledgeable Python
developer, and a Python core developer. He might be wrong, but he's not
*obviously* wrong.

Unicode is hard, not because Unicode is hard, but because of legacy
problems. I can create a file on a machine that uses ISO-8859-7 for the
file name, put Shift-JIS encoded text inside it, transfer it to a
machine that uses Windows-1251 as the file system encoding, then SSH into
that machine from a system using Big5, and try to make sense of it. If
everybody used UTF-8 any time data touched a disk or network, we'd be
laughing. It would all be so simple.

Reading Armin's post, I think that all that is needed to simplify his
Python 3 version is:

- have a bytes version of sys.argv (bargv? argvb?) and read
the file names from that;

- have a simple way to write bytes to stdout and stderr.

Most programs won't need either of those, but file system utilities will.

I think http://bugs.python.org/issue8776 and
http://bugs.python.org/issue8775 are relevant but both were placed in
the small round filing cabinet.

Rustom Mody · May 13, 2014

Feel free to show us your version of "cat" for Python then. Feel free to
target any version you like. Don't forget to test it against files with
names and content that:

- aren't valid UTF-8;

- are valid UTF-8, but not valid in the local encoding.

Thanks for a non-defensive appraisal!

Armin Ronacher is an extremely experienced and knowledgeable Python
developer, and a Python core developer. He might be wrong, but he's not
*obviously* wrong.

Unicode is hard, not because Unicode is hard, but because of legacy
problems. I can create a file on a machine that uses ISO-8859-7 for the
file name, put Shift-JIS encoded text inside it, transfer it to a
machine that uses Windows-1251 as the file system encoding, then SSH into
that machine from a system using Big5, and try to make sense of it. If
everybody used UTF-8 any time data touched a disk or network, we'd be
laughing. It would all be so simple.

I think the most helpful way forward is to accept two things:
a. Unicode is a headache
b. No-unicode is a non-option

Reading Armin's post, I think that all that is needed to simplify his
Python 3 version is:

- have a bytes version of sys.argv (bargv? argvb?) and read
the file names from that;

- have a simple way to write bytes to stdout and stderr.

Most programs won't need either of those, but file system utilities will.

About the technical merits of Armin's post and your suggestions, I've
nothing to say, since I am an ignoramus on (the mechanics of) Unicode.

[Consider me an eager, early, ignorant adopter.]

It's however good to note that Unicode is rather unique in the history
not just of IT/CS but of humanity, in the sense that no one (to the best
of my knowledge) has ever tried to come up with an all-encompassing umbrella
for all humanity's scripts/writing systems etc.

So hiccups and mistakes are only to be expected. The absence of these would
be much more surprising!

Mark H Harris · May 13, 2014

I think the most helpful way forward is to accept two things:
a. Unicode is a headache
b. No-unicode is a non-option

QOTW (so far...)

Gene Heskett · May 13, 2014

QOTW (so far...)

But it's early yet, only Tuesday & it's just barely started...

Cheers, Gene
--
"There are four boxes to be used in defense of liberty:
soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author)
Genes Web page <http://geneslinuxbox.net:6309/gene>
US V Castleman, SCOTUS, Mar 2014 is grounds for Impeaching SCOTUS

Rustom Mody · May 13, 2014

QOTW (so far...)

I said that getting unicode right straight off is unrealistic.

I should have added this:
Armin makes a (sarcastic?) dig about the fact that Python (3) goofs because
it's mismatched with the assumptions of UNIX.

| UNIX is bytes, has been defined that way and will always be that way. To

| Unicode on UNIX is only madness if you force it on everything. But that's not
| how Unicode on UNIX works. UNIX does not have a distinction between unicode
| and byte APIs. They are one and the same which makes them easy to deal with.]

| Python 3 takes a very difference stance on Unicode than UNIX does. Python 3
| says: everything is Unicode ...

This may be right...
Or it may be the other way round as I claim at
http://blog.languager.org/2014/04/unicode-and-unix-assumption.html

At this point I don't believe that anyone is very clear what is the
right way and the wrong way.

Chris Angelico · May 13, 2014

(It's always a good day to remind people that the rest of the world
exists.)

Ironic that this should come up in a discussion on Unicode, given that
Unicode's fundamental purpose is to welcome that whole rest of the
world instead of yelling "LALALALALA America is everything" and
pretending that ASCII, or Latin-1, or something, is all you need.

ChrisA
Currently enjoying "Monday Night Flagging" on Threshold RPG... at 4pm
on Tuesday.

alex23 · May 13, 2014

argb?

I tried and failed to come up with an "argy bargy" joke here so decided
to go for a meta-reference instead.

Chris Angelico · May 13, 2014

I tried and failed to come up with an "argy bargy" joke here so decided to
go for a meta-reference instead.

I'm just waiting for someone to have need for arguments in both
network byte order and host byte order. The latter, of course, would
be "argh".

ChrisA

Mark H Harris · May 13, 2014

instead of yelling "LALALALALA America is everything" and
pretending that ASCII, or Latin-1, or something, is all you need.

.... it isn't?

LALALALALALALALALA

gregor · May 13, 2014

On 13 May 2014 01:18:35 GMT, Steven D'Aprano wrote:
- have a simple way to write bytes to stdout and stderr.

there is the underlying binary buffer:

https://docs.python.org/3/library/sys.html#sys.stdin
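
A short sketch of using that buffer. The helper name write_bytes is hypothetical; the only API relied on is the documented .buffer attribute of the standard text streams.

```python
import io
import sys


def write_bytes(data, text_stream=None):
    """Write raw bytes through the binary buffer under a text stream."""
    if text_stream is None:
        text_stream = sys.stdout
    text_stream.flush()           # keep earlier text output in order
    binary = text_stream.buffer   # the underlying binary layer
    binary.write(data)
    binary.flush()
```

This works when stdout is redirected to a file or pipe. It fails with AttributeError if sys.stdout has been replaced by an object with no buffer attribute, such as io.StringIO.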

greg

Johannes Bauer · May 13, 2014

Armin Ronacher is an extremely experienced and knowledgeable Python
developer, and a Python core developer. He might be wrong, but he's not
*obviously* wrong.

He's correct about file name encodings. Which can be fixed really easily
without messing everything up (sys.argv binary variant, open accepting
binary filenames). But that he suggests that Go would be superior:

Which uses an even simpler model than Python 2: everything is a byte string. The assumed encoding is UTF-8. End of the story.

Is just a horrible idea. An obviously horrible idea, too.

Having dealt with the UTF-8 problems on Python2 I can safely say that I
never, never ever want to go back to that freaky hell. If I deal with
strings, I want to be able to sanely manipulate them and I want to be
sure that after manipulation they're still valid strings. Manipulating
the bytes representation of unicode data just doesn't work.
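
A small illustration of that point, using a made-up sample word: slicing or case-mapping the UTF-8 bytes gives a different, sometimes invalid, result compared with doing the same to the str.

```python
text = "na\u00efve"            # "naive" with a diaeresis on the i
data = text.encode("utf-8")    # b'na\xc3\xafve'

char_count = len(text)         # counts code points
byte_count = len(data)         # counts bytes; one more than char_count

text_prefix = text[:3]         # first three characters, still valid text
byte_prefix = data[:3]         # cuts the two-byte character in half

try:
    byte_prefix.decode("utf-8")
    prefix_is_valid = True
except UnicodeDecodeError:
    prefix_is_valid = False

upper_text = text.upper()                    # Unicode-aware case mapping
upper_bytes = data.upper().decode("utf-8")   # bytes.upper() only maps ASCII
```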

And I'm very very glad that some people felt the same way and
implemented a sane, consistent way of dealing with Unicode in Python3.
It's one of the reasons why I switched to Py3 very early and I love it.

Cheers,
Johannes

--

At least not publicly!

Ah, the latest and to this day most brilliant stroke by our great
cosmologists: the secret prediction.
- Karl Kaos on Rüdiger Thomas in dsa
