Everything you did not want to know about Unicode in Python 3

Ian Kelly

Surely those example programs are not the Pythonic way to do things, or
am I missing something?

The _is_binary_reader and _is_binary_writer functions look like they
could be simplified by calling isinstance on the io object itself
against io.TextIOBase, io.BufferedIOBase or io.RawIOBase, rather than
doing those odd 0-length reads and writes. And then perhaps those
exception-swallowing try-excepts wouldn't be necessary. But perhaps
there's a non-obvious reason why it's written the way it is.
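
A minimal sketch of the isinstance approach described above. The function name is_binary_stream is hypothetical and not taken from the original scripts. One caveat, which may or may not be the non-obvious reason mentioned: a duck-typed file-like object that does not inherit from the io base classes is reported as "not binary" here.

```python
import io


def is_binary_stream(stream):
    """Return True for byte streams, False for text streams.

    Hypothetical helper: relies only on the io class hierarchy, so
    objects that do not inherit from it are treated as non-binary.
    """
    if isinstance(stream, io.TextIOBase):
        return False
    return isinstance(stream, (io.RawIOBase, io.BufferedIOBase))
```

With this, io.BytesIO() and a file opened with "rb" count as binary, while io.StringIO() and a file opened in text mode do not.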

And there appears to be a bug where everything *except* the filename
'-' is treated as stdin, so the script probably hasn't been tested at
all.
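
For comparison, the conventional logic (only '-' means stdin) could be sketched as below. This is an assumption about the intended behaviour, not a copy of the script under discussion, and open_binary_input is a made-up name.

```python
import sys


def open_binary_input(filename):
    """Return a binary stream for `filename`; only '-' selects stdin."""
    if filename == "-":
        # sys.stdin.buffer is the underlying byte stream of the text-mode stdin
        return sys.stdin.buffer
    return open(filename, "rb")
```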

If those code samples are anything to go by, this guy makes JMF look
sensible.

This is an ad hominem. Just because his code sucks doesn't mean he's
wrong about the state of Unicode and UNIX in Python 3.
 
MRAB

The _is_binary_reader and _is_binary_writer functions look like they
could be simplified by calling isinstance on the io object itself
against io.TextIOBase, io.BufferedIOBase or io.RawIOBase, rather than
doing those odd 0-length reads and writes. And then perhaps those
exception-swallowing try-excepts wouldn't be necessary. But perhaps
there's a non-obvious reason why it's written the way it is.

How about checking sys.stdin.mode and sys.stdout.mode?
 
Ian Kelly

How about checking sys.stdin.mode and sys.stdout.mode?

Seems to work, but I notice that the docs only define the mode
attribute for the FileIO class, which sys.stdin and sys.stdout are not
instances of.
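
A small sketch of the mode check, written defensively because of the documentation gap noted above. The helper name mode_says_binary is hypothetical; it returns None when the object has no usable string mode attribute, so the caller can fall back to another test.

```python
def mode_says_binary(stream):
    """Return True/False based on stream.mode, or None if it is unavailable."""
    mode = getattr(stream, "mode", None)
    if not isinstance(mode, str):
        # No mode attribute, or a non-string one: nothing can be concluded.
        return None
    return "b" in mode
```

Calling mode_says_binary(sys.stdin) and mode_says_binary(sys.stdin.buffer) is one way to try this on the standard streams.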
 
Chris Angelico

Just because his code sucks doesn't mean he's
wrong about the state of Unicode and UNIX in Python 3.

Uhm... I think wrongness of code is generally fairly indicative of
wrongness of thinking :) If I write a rant about how Python's list
type sucks and it turns out my code is using it like a cons cell and
never putting more than two elements into a list, then you would
accurately conclude that I'm wrong about the state of data type
support in Python.

I don't have a problem with someone coming to the list here with
misconceptions. That's what discussions are for. But rants like that,
on blogs, I quickly get weary of reading. The tone is always "Look
what's so wrong", not inviting dialogue, and I can't be bothered
digging into the details to compose a full response. Chances are the
author's (a) not looking at 3.4 and what's happened to improve
things (and certainly not 3.5 and what's going to happen), and (b) not
listening to responses anyway.

ChrisA
 
Steven D'Aprano

Surely those example programs are not the Pythonic way to do things, or
am I missing something?

Feel free to show us your version of "cat" for Python then. Feel free to
target any version you like. Don't forget to test it against files with
names and content that:

- aren't valid UTF-8;

- are valid UTF-8, but not valid in the local encoding.


If those code samples are anything to go by, this guy makes JMF look
sensible.

Armin Ronacher is an extremely experienced and knowledgeable Python
developer, and a Python core developer. He might be wrong, but he's not
*obviously* wrong.

Unicode is hard, not because Unicode is hard, but because of legacy
problems. I can create a file on a machine that uses ISO-8859-7 for the
file name, put Shift-JIS encoded text inside it, transfer it to a
machine that uses Windows-1251 as the file system encoding, then SSH into
that machine from a system using Big5, and try to make sense of it. If
everybody used UTF-8 any time data touched a disk or network, we'd be
laughing. It would all be so simple.
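
The legacy-encoding problem can be shown in miniature. The sketch below uses two of the encodings named above; the sample text is an arbitrary choice for illustration.

```python
greek = "\u03b1\u03b2\u03b3"        # the Greek letters alpha, beta, gamma
raw = greek.encode("iso-8859-7")    # the bytes as the ISO-8859-7 machine stores them

# Decoding with the wrong single-byte codec does not fail; it silently
# produces different (Cyrillic) letters.
misread = raw.decode("cp1251")

# The same bytes are not valid UTF-8 at all.
try:
    raw.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False

# Only the original codec recovers the text.
recovered = raw.decode("iso-8859-7")
```

The bytes themselves carry no record of which codec produced them, which is why the receiving machine can only guess.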

Reading Armin's post, I think that all that is needed to simplify his
Python 3 version is:

- have a bytes version of sys.argv (bargv? argvb?) and read
the file names from that;

- have a simple way to write bytes to stdout and stderr.

Most programs won't need either of those, but file system utilities will.
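
As a rough sketch of what is already possible along these lines (assuming a POSIX system, where Python 3 decodes command-line arguments with the surrogateescape error handler): os.fsencode() recovers the original bytes of an argument, open() accepts a bytes file name, and sys.stdout.buffer accepts bytes. The function name cat is illustrative only.

```python
import os
import shutil


def cat(filenames, out):
    """Copy each named file to the binary stream `out`, byte for byte."""
    for name in filenames:
        # Turn the str from sys.argv back into the bytes the OS supplied.
        raw_name = os.fsencode(name)
        with open(raw_name, "rb") as src:
            shutil.copyfileobj(src, out)
```

A script would call it as cat(sys.argv[1:], sys.stdout.buffer). No decoding of file contents takes place, so the content's encoding never matters.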
 
Chris Angelico

Reading Armin's post, I think that all that is needed to simplify his
Python 3 version is:

- have a bytes version of sys.argv (bargv? argvb?) and read
the file names from that;

argb? :)

- have a simple way to write bytes to stdout and stderr.

I'm not sure how that goes with I/O redirection, but sure.

ChrisA
 
Mark H Harris

Unicode is hard, not because Unicode is hard, but because of legacy
problems.

Yes. To put a finer point on that, Unicode (which is only a
specification constantly being improved upon) is harder to implement
when it hasn't been on the design board from the ground up; Python in
this case.

Julia has Unicode support from the ground up, and it was easier for
those guys to implement (in beta release) than for the Python crew when
they undertook the Unicode work that had to be done for Python3.x (just
an observation).

Anytime there are legacy code issues, regression testing problems, and a
host of domain issues that weren't thought through from the get-go there
are going to be more problematic hurdles; not to mention bugs.

Having said that, I still think Unicode is somewhat harder than you're
admitting.

marcus
 
Mark Lawrence

Feel free to show us your version of "cat" for Python then. Feel free to
target any version you like. Don't forget to test it against files with
names and content that:

- aren't valid UTF-8;

- are valid UTF-8, but not valid in the local encoding.




Armin Ronacher is an extremely experienced and knowledgeable Python
developer, and a Python core developer. He might be wrong, but he's not
*obviously* wrong.

Unicode is hard, not because Unicode is hard, but because of legacy
problems. I can create a file on a machine that uses ISO-8859-7 for the
file name, put Shift-JIS encoded text inside it, transfer it to a
machine that uses Windows-1251 as the file system encoding, then SSH into
that machine from a system using Big5, and try to make sense of it. If
everybody used UTF-8 any time data touched a disk or network, we'd be
laughing. It would all be so simple.

Reading Armin's post, I think that all that is needed to simplify his
Python 3 version is:

- have a bytes version of sys.argv (bargv? argvb?) and read
the file names from that;

- have a simple way to write bytes to stdout and stderr.

Most programs won't need either of those, but file system utilities will.

I think http://bugs.python.org/issue8776 and
http://bugs.python.org/issue8775 are relevant but both were placed in
the small round filing cabinet.
 
Rustom Mody

Feel free to show us your version of "cat" for Python then. Feel free to
target any version you like. Don't forget to test it against files with
names and content that:


- aren't valid UTF-8;


- are valid UTF-8, but not valid in the local encoding.

Thanks for a non-defensive appraisal!

Armin Ronacher is an extremely experienced and knowledgeable Python
developer, and a Python core developer. He might be wrong, but he's not
*obviously* wrong.



Unicode is hard, not because Unicode is hard, but because of legacy
problems. I can create a file on a machine that uses ISO-8859-7 for the
file name, put Shift-JIS encoded text inside it, transfer it to a
machine that uses Windows-1251 as the file system encoding, then SSH into
that machine from a system using Big5, and try to make sense of it. If
everybody used UTF-8 any time data touched a disk or network, we'd be
laughing. It would all be so simple.

I think the most helpful way forward is to accept two things:
a. Unicode is a headache
b. No-Unicode is a non-option

Reading Armin's post, I think that all that is needed to simplify his
Python 3 version is:



- have a bytes version of sys.argv (bargv? argvb?) and read
the file names from that;

- have a simple way to write bytes to stdout and stderr.


Most programs won't need either of those, but file system utilities will.

About the technical merits of Armin's post and your suggestions, I've
nothing to say, since I am an ignoramus on (the mechanics of) Unicode.

[Consider me an eager, early, ignorant adopter :) ]

It's good to note, however, that Unicode is rather unique in the history
not just of IT/CS but of humanity, in the sense that no one (to the best
of my knowledge) has ever tried to come up with an all-encompassing umbrella
for all humanity's scripts/writing systems etc.

So hiccups and mistakes are only to be expected. The absence of these would
be much more surprising!
 
Gene Heskett

QOTW (so far...)

But it's early yet, only Tuesday, and it's just barely started... :)

Cheers, Gene
--
"There are four boxes to be used in defense of liberty:
soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author)
Genes Web page <http://geneslinuxbox.net:6309/gene>
US V Castleman, SCOTUS, Mar 2014 is grounds for Impeaching SCOTUS
 
Rustom Mody

QOTW (so far...)

I said that getting Unicode right straight off is unrealistic.

I should have added this:
Armin makes a (sarcastic?) dig about the fact that Python 3 goofs because
it's mismatched with the assumptions of Unix.

| UNIX is bytes, has been defined that way and will always be that way. To

| Unicode on UNIX is only madness if you force it on everything. But that's not
| how Unicode on UNIX works. UNIX does not have a distinction between unicode
| and byte APIs. They are one and the same which makes them easy to deal with.]

| Python 3 takes a very different stance on Unicode than UNIX does. Python 3
| says: everything is Unicode ...

This may be right...
Or it may be the other way round as I claim at
http://blog.languager.org/2014/04/unicode-and-unix-assumption.html

At this point I don't believe that anyone is very clear about what is the
right way and what is the wrong way.
 
Chris Angelico

(It's always a good day to remind people that the rest of the world
exists.)

Ironic that this should come up in a discussion on Unicode, given that
Unicode's fundamental purpose is to welcome that whole rest of the
world instead of yelling "LALALALALA America is everything" and
pretending that ASCII, or Latin-1, or something, is all you need.

ChrisA
Currently enjoying "Monday Night Flagging" on Threshold RPG... at 4pm
on Tuesday.
 
Chris Angelico

I tried and failed to come up with an "argy bargy" joke here so decided to
go for a meta-reference instead.

I'm just waiting for someone to have need for arguments in both
network byte order and host byte order. The latter, of course, would
be "argh".

ChrisA
 
Mark H Harris

instead of yelling "LALALALALA America is everything" and
pretending that ASCII, or Latin-1, or something, is all you need.

.... it isn't?



LALALALALALALALALA :))
 
Johannes Bauer

Armin Ronacher is an extremely experienced and knowledgeable Python
developer, and a Python core developer. He might be wrong, but he's not
*obviously* wrong.

He's correct about file name encodings, which can be fixed really easily
without messing everything up (sys.argv binary variant, open accepting
binary filenames). But that he suggests that Go would be superior:

"Which uses an even simpler model than Python 2: everything is a byte string. The assumed encoding is UTF-8. End of the story."

is just a horrible idea. An obviously horrible idea, too.

Having dealt with the UTF-8 problems on Python2 I can safely say that I
never, never ever want to go back to that freaky hell. If I deal with
strings, I want to be able to sanely manipulate them and I want to be
sure that after manipulation they're still valid strings. Manipulating
the bytes representation of unicode data just doesn't work.
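
A small illustration of that last point, using an arbitrary sample string: slicing a str always yields a valid str, while slicing its UTF-8 bytes at the same index can cut a character in half.

```python
text = "na\u00efve caf\u00e9"      # "naïve café", with precomposed accented letters
data = text.encode("utf-8")

# Slicing the str works in whole characters.
prefix = text[:3]                  # "naï"

# Slicing the bytes at the same index splits the two-byte "ï".
chunk = data[:3]
try:
    chunk.decode("utf-8")
    chunk_is_valid = True
except UnicodeDecodeError:
    chunk_is_valid = False
```

Here len(text) and len(data) also differ, so any index computed on one representation is wrong for the other.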

And I'm very very glad that some people felt the same way and
implemented a sane, consistent way of dealing with Unicode in Python3.
It's one of the reasons why I switched to Py3 very early and I love it.

Cheers,
Johannes

--
At least not publicly!
Ah, the latest and, to this day, most brilliant stunt by our great
cosmologists: the secret prediction.
- Karl Kaos on Rüdiger Thomas in dsa <[email protected]>
 
