Everything you did not want to know about Unicode in Python 3

Marko Rauhamaa · May 13, 2014

Johannes Bauer said:
Having dealt with the UTF-8 problems on Python2 I can safely say that
I never, never ever want to go back to that freaky hell. If I deal
with strings, I want to be able to sanely manipulate them and I want
to be sure that after manipulation they're still valid strings.
Manipulating the bytes representation of unicode data just doesn't
work.

Based on my background (network and system programming), I'm a bit
suspicious of strings, that is, text. For example, is the stuff that
goes to syslog bytes or text? Does an XML file contain bytes or
(encoded) text? The answers are not obvious to me. Modern computing is
full of ASCII-esque binary communication standards and formats.

Python 2's ambiguity allows me not to answer the tough philosophical
questions. I'm not saying it's necessarily a good thing, but it has its
benefits.

Marko

Chris Angelico · May 13, 2014

Based on my background (network and system programming), I'm a bit
suspicious of strings, that is, text. For example, is the stuff that
goes to syslog bytes or text? Does an XML file contain bytes or
(encoded) text? The answers are not obvious to me. Modern computing is
full of ASCII-esque binary communication standards and formats.

These are problems that Unicode can't solve. In theory, XML should
contain text in a known encoding (defaulting to UTF-8). With syslog,
it's problematic - I don't remember what it's meant to be, but I know
there are issues. Same with other log files.
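
A minimal sketch of the XML case, using the standard library's `xml.etree.ElementTree`. The element name and contents are made up for illustration. It only shows that the parser decodes the bytes itself: it assumes UTF-8 when nothing is declared and follows the declaration otherwise.

```python
import xml.etree.ElementTree as ET

# No declaration: the parser assumes UTF-8.
default_doc = "<name>Rüdiger</name>".encode("utf-8")

# An explicit declaration tells the parser how to decode the bytes.
latin1_doc = (
    '<?xml version="1.0" encoding="ISO-8859-1"?><name>Rüdiger</name>'
).encode("latin-1")

# Both documents are different byte sequences but yield the same text.
names = [ET.fromstring(default_doc).text, ET.fromstring(latin1_doc).text]
```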

Python 2's ambiguity allows me not to answer the tough philosophical
questions. I'm not saying it's necessarily a good thing, but it has its
benefits.

It's not a good thing. It means that you have the convenience of
pretending there's no problem, which means you don't notice trouble
until something happens... and then, in all probability, your app is
in production and you have no idea why stuff went wrong.

ChrisA

Marko Rauhamaa · May 13, 2014

Chris Angelico said:
These are problems that Unicode can't solve.

I actually think the problem has little to do with Unicode. Text is an
abstract data type just like any class. If I have an object (say, a
subprocess or a dictionary) in memory, I don't expect the object to have
any existence independently of the Python virtual machine. I have the
same feeling about Py3 strings: they only exist inside the Python
virtual machine.

An abstract object like a subprocess or dictionary justifies its
existence through its behaviour (its quacking). Now, do strings quack or
are they silent? I guess if you are writing a word processor they might
quack to you. Otherwise, they are just an esoteric storage format.

What I'm saying is that strings definitely have an important application
in the human interface. However, I feel strings might be overused in the
Py3 API. Case in point: are pathnames bytes objects or strings? The
Linux position is that they are bytes objects. Py3 supports both
interpretations seemingly throughout:

open(b"/bin/ls") vs open("/bin/ls")
os.path.join(b"a", b"b") vs os.path.join("a", "b")
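
A small sketch of the two interpretations side by side. The path components are made up; `os.fsencode` and `os.fsdecode` are the standard-library helpers for crossing between the two using the filesystem encoding.

```python
import os

# The same path built in both interpretations; os.path.join returns
# the type it was given.
text_path = os.path.join("a", "b")
bytes_path = os.path.join(b"a", b"b")

# os.fsencode / os.fsdecode convert using the filesystem encoding.
encoded = os.fsencode(text_path)
decoded = os.fsdecode(bytes_path)

# Mixing the two interpretations in one call is rejected.
mixed_error = None
try:
    os.path.join("a", b"b")
except TypeError as exc:
    mixed_error = exc
```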

Marko

Chris Angelico · May 13, 2014

I actually think the problem has little to do with Unicode. Text is an
abstract data type just like any class. If I have an object (say, a
subprocess or a dictionary) in memory, I don't expect the object to have
any existence independently of the Python virtual machine. I have the
same feeling about Py3 strings: they only exist inside the Python
virtual machine.

That's true; the only difference is that text is extremely prevalent.
You can share a dict with another program, or store it in a file, or
whatever, simply by agreeing on an encoding - for instance, JSON. As
long as you and the other program know that this file is JSON encoded,
you can write it and he can read it, and you'll get the right data at
the far end. It's no different; there are encodings that are easy to
handle and have limitations, and there are encodings that are
elaborate and have lots of features (XML comes to mind, although
technically you can't encode a dict in XML).
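
A minimal sketch of that agreement, assuming both sides settle on "JSON, encoded as UTF-8". The dictionary contents are invented for the example.

```python
import json

record = {"band": "Motörhead", "city": "Zürich"}

# Sender: dict -> JSON text -> UTF-8 bytes for the wire or the disk.
payload = json.dumps(record, ensure_ascii=False).encode("utf-8")

# Receiver: reverse both steps, relying on the agreed encoding.
received = json.loads(payload.decode("utf-8"))
```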

Case in point: are pathnames bytes objects or strings? The
Linux position is that they are bytes objects. Py3 supports both
interpretations seemingly throughout:

open(b"/bin/ls") vs open("/bin/ls")
os.path.join(b"a", b"b") vs os.path.join("a", "b")

That's a problem that comes from the underlying file systems. If every
FS in the world worked with Unicode file names, it would be easy.
(Most would encode them onto the platters in UTF-8 or maybe UTF-16;
some might choose to use a PEP 393 or Pike string structure, with the
size_shift being a file mode just like the 'directory' bit; others
might use a limited encoding for legacy reasons, storing uppercased
CP437 on the disk, and returning an error if the desired name didn't
fit.) But since they don't, we have to cope with that. What happens if
you're running on Linux, and you have a mounted drive from an OS/2
share, and inside that, you access an aliased drive that represents a
Windows share, on which you've mounted a remote-backup share? A single
path name could have components parsed by each of those systems, so
what's its encoding? How do you handle that? There's no solution.
(Well, okay. There is a solution: don't do something so stupidly
convoluted. But there's no law against cackling admins making circular
mounts. In fact, I just mounted my own home directory as a
subdirectory under my home directory, via sshfs. I can now encrypt my
own file reads and writes exactly as many times as I choose to. I also
cackled.)

ChrisA

Johannes Bauer · May 13, 2014

It's not a good thing. It means that you have the convenience of
pretending there's no problem, which means you don't notice trouble
until something happens... and then, in all probability, your app is
in production and you have no idea why stuff went wrong.

Exactly. With Py2 "strings" you never know what encoding they are, if
they already have been converted or something like that. And it's very
well possible to mix already converted strings with other, not yet
encoded strings. What a mess!

All these issues are avoided by Py3. There is a very clear distinction
between strings and string representation (data bytes), which is
beautiful. Accidental mixing is not possible. And you have some things
*guaranteed* for the string type which aren't guaranteed for the bytes
type (for example when doing string manipulation).
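
A short sketch of what that distinction looks like in practice on Python 3. The sample word is arbitrary; the point is only that `str` and `bytes` refuse to mix, and that manipulating text keeps it valid while manipulating its encoded form may not.

```python
text = "na\u00efve"          # five characters, one of them non-ASCII
data = text.encode("utf-8")  # six bytes

# Mixing the two types is rejected rather than silently coerced.
try:
    text + data
    mixed = True
except TypeError:
    mixed = False

# str operations work on characters; bytes operations work on the encoding.
lengths = (len(text), len(data))
reversed_text = text[::-1]   # still a valid string

# Reversing the encoded bytes splits the two-byte sequence for the
# accented letter, so the result no longer decodes.
try:
    data[::-1].decode("utf-8")
    bytes_reversible = True
except UnicodeDecodeError:
    bytes_reversible = False
```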

Regards,
Johannes

--

Zumindest nicht öffentlich!

Ah, der neueste und bis heute genialste Streich unsere großen
Kosmologen: Die Geheim-Vorhersage.
- Karl Kaos über Rüdiger Thomas in dsa <[email protected]>

Steven D'Aprano · May 13, 2014

I actually think the problem has little to do with Unicode. Text is an
abstract data type just like any class. If I have an object (say, a
subprocess or a dictionary) in memory, I don't expect the object to have
any existence independently of the Python virtual machine. I have the
same feeling about Py3 strings: they only exist inside the Python
virtual machine.

And you would be correct. When you write them to a device (say, push them
over a network, or write them to a file) they need to be serialized. If
you're lucky, you have an API that takes a string and serializes it for
you, and then all you have to deal with is:

- am I happy with the default encoding?

- if not, what encoding do I want?

Otherwise you ought to have an API that requires bytes, not strings, and
you have to perform your own serialization by encoding it.

But abstractions leak, and this abstraction leaks because *right now*
there isn't a single serialization for text strings. There are HUNDREDS,
and sometimes you don't know which one is being used.
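
To make the leak concrete, here is a small sketch with an arbitrary sample word. It shows three of the many serializations of the same text, and what happens when the reader guesses the wrong one.

```python
text = "café"

# Three different serializations of the same four characters.
as_utf8 = text.encode("utf-8")        # b'caf\xc3\xa9'
as_latin1 = text.encode("latin-1")    # b'caf\xe9'
as_utf16 = text.encode("utf-16-le")   # two bytes per character here

# A wrong guess either produces garbage...
garbled = as_utf8.decode("latin-1")

# ...or fails outright.
try:
    as_latin1.decode("utf-8")
    failed = False
except UnicodeDecodeError:
    failed = True
```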

[...]

What I'm saying is that strings definitely have an important application
in the human interface. However, I feel strings might be overused in the
Py3 API. Case in point: are pathnames bytes objects or strings?

Yes. On POSIX systems, file names are sequences of bytes, with a very few
restrictions. On recent Windows file systems (NTFS I believe?), file
names are Unicode strings encoded to UTF-16, but with a whole lot of
other restrictions imposed by the OS.

The Linux position is that they are bytes objects. Py3 supports both
interpretations seemingly throughout:

open(b"/bin/ls") vs open("/bin/ls")
os.path.join(b"a", b"b") vs os.path.join("a", "b")

Because it has to, otherwise there will be files that are unreachable on
one platform or another.
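
As a sketch of how a byte-oriented file name can still be carried around as a string: Python 3 provides the `surrogateescape` error handler (PEP 383) for this purpose. The file name below is hypothetical, and UTF-8 is named explicitly so the example does not depend on the local filesystem encoding.

```python
# A POSIX-style file name that is not valid UTF-8 (hypothetical example).
raw_name = b"report-\xff.txt"

# Strict decoding fails, so a plain str could not represent this name...
try:
    raw_name.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# ...unless the undecodable byte is smuggled through as a lone surrogate.
as_text = raw_name.decode("utf-8", "surrogateescape")

# Encoding with the same handler restores the original bytes exactly.
round_trip = as_text.encode("utf-8", "surrogateescape")
```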

Johannes Bauer · May 13, 2014

Based on my background (network and system programming), I'm a bit
suspicious of strings, that is, text. For example, is the stuff that
goes to syslog bytes or text? Does an XML file contain bytes or
(encoded) text? The answers are not obvious to me. Modern computing is
full of ASCII-esque binary communication standards and formats.

Traditional Unix programs (syslog for example) are notorious for being
unclear, ambiguous and/or ignorant of character encodings altogether. And
this works, unfortunately, most of the time because many encodings
share a common subset. If they didn't, the problems would be VERY
apparent and people would be forced to handle the issues not so sloppily.
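
A quick sketch of that common subset, with invented log lines. Pure ASCII text comes out as the same bytes under several popular codecs, which is why sloppiness goes unnoticed until the first non-ASCII character shows up.

```python
ascii_line = "May 13 12:00:00 host sshd: session opened"
accented = "session opened for Rüdiger"

codecs = ("ascii", "latin-1", "utf-8", "cp437")

# Pure ASCII text serializes to identical bytes under all of these codecs.
ascii_forms = {ascii_line.encode(c) for c in codecs}

# One non-ASCII character and the remaining codecs all disagree
# ("ascii" is left out because it cannot encode the line at all).
accented_forms = {accented.encode(c) for c in codecs[1:]}
```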

Which is the route that Py3 chose. Don't be sloppy, make a great
distinction between "text" (which handles naturally as strings) and its
respective encoding.

The only people who are angered by this now are people who always treated
encodings sloppily and it "just worked". Well, there's a good chance it
has worked by pure chance so far. It's a good thing that Python does
this now more strictly as it gives developers *guarantees* about what
they can and cannot do with text datatypes without having to deal with
encoding issues in many places. Just one place: The interface where text
is read or written, just as it should be.

Regards,
Johannes

--

Zumindest nicht öffentlich!

Ah, der neueste und bis heute genialste Streich unsere großen
Kosmologen: Die Geheim-Vorhersage.
- Karl Kaos über Rüdiger Thomas in dsa <[email protected]>

Marko Rauhamaa · May 13, 2014

Johannes Bauer said:
The only people who are angered by this now are people who always
treated encodings sloppily and it "just worked". Well, there's a good
chance it has worked by pure chance so far. It's a good thing that
Python does this now more strictly as it gives developers *guarantees*
about what they can and cannot do with text datatypes without having
to deal with encoding issues in many places. Just one place: The
interface where text is read or written, just as it should be.

I'm not angered by text. I'm just wondering if it has any practical use
that is not misuse...

For example, Py3 should not make any pretense that there is a "default"
encoding for strings. Locales are an abhorrent invention from the early
8-bit days. IOW, you should never input or output text without explicit
serialization.
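
A sketch of what explicit serialization looks like with the built-in `open`, assuming UTF-8 is the encoding you want. The file name and message are made up. The second read shows that the file itself only ever holds bytes.

```python
import os
import tempfile

message = "Grüße"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "greeting.txt")

    # Name the serialization explicitly instead of inheriting the locale's.
    with open(path, "w", encoding="utf-8") as f:
        f.write(message)

    # The file holds bytes; the text exists only after decoding.
    with open(path, "rb") as f:
        on_disk = f.read()

decoded = on_disk.decode("utf-8")
```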

I get the feeling that Py3 would like to present a world where strings
are first-class I/O objects that can exist in files, in filenames,
inside pipes. You say, "text is read or written." I'm saying text is
never read or written. It only exists as an abstraction (not even
unicode) inside the virtual machine.

Marko

Roy Smith · May 13, 2014

Chris Angelico said:
Ironic that this should come up in a discussion on Unicode, given that
Unicode's fundamental purpose is to welcome that whole rest of the
world instead of yelling "LALALALALA America is everything" and
pretending that ASCII, or Latin-1, or something, is all you need.

ASCII *is* all I need. The problem is, it's not all that other people
need, and I need to interact with those other people.

Mark Lawrence · May 13, 2014

It's not a good thing. It means that you have the convenience of
pretending there's no problem, which means you don't notice trouble
until something happens... and then, in all probability, your app is
in production and you have no idea why stuff went wrong.

Unless you're (un)lucky enough to be working on IIRC the 1/3 of major IT
projects that deliver nothing

Chris Angelico · May 13, 2014

Unless you're (un)lucky enough to be working on IIRC the 1/3 of major IT
projects that deliver nothing

Been there, done that. At least, most likely so... there is a chance,
albeit slim, that the boss/owner will either discover someone who'll
finish the project for him, or find the time to finish it himself. I
gather he's looking at ripping all my code out and replacing it with
PHP of his own design, which should be fun. On the plus side, that
does mean he can get any idiot straight out of a uni course to do the
work; much easier than finding someone who knows Python, Pike, bash,
and C++. The White King told Alice that cynicism is a disease that can
be cured... but it can also be inflicted, and a promising-looking
N-year project that collapses because the boss starts getting stupid
with code formatting rules and then ends up firing his last remaining
competent employee is a pretty effective means of instilling cynicism.

ChrisA

Steven D'Aprano · May 13, 2014

ASCII *is* all I need.

You've never needed to copyright something? Copyright © Roy Smith 2014...
I know some people use (c) instead, but that actually has no legal
standing. (Not that any reasonable judge would invalidate a copyright
based on a technicality like that, not these days.)

Or price something in cents? I suppose the days of the 25¢ steak dinner
are long gone, but you might need to sell something for 99¢ a pound...

The problem is, it's not all that other people
need, and I need to interact with those other people.

True, true.

Chris Angelico · May 13, 2014

You've never needed to copyright something? Copyright © Roy Smith 2014...
I know some people use (c) instead, but that actually has no legal
standing. (Not that any reasonable judge would invalidate a copyright
based on a technicality like that, not these days.)

Copyright Chris Angelico 2014. The full word "copyright" has legal
standing. I tend to stick with that in my README files; staying ASCII
makes it that bit safer for random text editors
(*cough*Notepad*cough*) that might otherwise misinterpret it (only a
bit, though [1]).

Or price something in cents? I suppose the days of the 25¢ steak dinner
are long gone, but you might need to sell something for 99¢ a pound....

$0.99/lb?

ChrisA

[1] https://en.wikipedia.org/wiki/Bush_hid_the_facts

Grant Edwards · May 13, 2014

Ironic that this should come up in a discussion on Unicode, given that
Unicode's fundamental purpose is to welcome that whole rest of the
world instead of yelling "LALALALALA America is everything" and
pretending that ASCII, or Latin-1, or something, is all you need.

Well, strictly speaking, ASCII or Latin-1 _is_ all I need.

I will however admit to the existence of other people who might need
something else...

Grant Edwards · May 13, 2014

You've never needed to copyright something? Copyright © Roy Smith 2014...

Bah. You don't need the little copyright symbol at all. The
statement without the symbol has the exact same legal weight.

Skip Montanaro · May 13, 2014

It's not a good thing. It means that you have the convenience of
pretending there's no problem, which means you don't notice trouble
until something happens... and then, in all probability, your app is
in production and you have no idea why stuff went wrong.

BITD, when I still maintained and developed Musi-Cal (an early online
concert calendar, long since gone), I faced a challenge when I first
started encountering non-ASCII band names and cities. I resisted UTF-8.
After all, if I printed a string containing an "é", it came out looking like
"Ã©".

What kind of mess was that???

I tried to ignore it, or assume Latin-1 would cover all the bases (my first
non-ASCII inputs tended to come from Western Europe). If nothing else, at
least "é" was legible.

Needless to say, those approaches didn't work well. After perhaps six
months or a year, I broke down and started converting everything coming in
or going out to UTF-8 at the boundaries of my system (making educated
guesses at input encodings if necessary). My life got a whole lot easier
after that. The
distinction between bytes and text didn't really matter much, certainly not
compared to the mess I had before where strings of unknown data leaked into
my system and its database.
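
A rough sketch of that kind of boundary conversion. The helper names `to_text` and `to_wire` and the list of guessed encodings are assumptions for illustration, not a record of what Musi-Cal actually ran.

```python
def to_text(raw, guesses=("utf-8", "latin-1")):
    """Decode incoming bytes, trying each guessed encoding in turn."""
    for encoding in guesses:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Unreachable with the default guesses, since latin-1 accepts any bytes.
    raise ValueError("none of the guessed encodings fit")


def to_wire(text):
    """Encode outgoing text as UTF-8."""
    return text.encode("utf-8")


# The same band name arriving in two different encodings (made-up inputs).
from_utf8_client = to_text("Mötley Crüe".encode("utf-8"))
from_latin1_client = to_text("Mötley Crüe".encode("latin-1"))

# Whatever came in, everything leaves the system as UTF-8.
outgoing = to_wire(from_latin1_client)

# The classic symptom of skipping this step: UTF-8 bytes read as Latin-1.
mojibake = "é".encode("utf-8").decode("latin-1")
```

Trying UTF-8 first is a deliberate choice here: text in other encodings rarely happens to be valid UTF-8, whereas Latin-1 will accept any byte sequence and so only makes sense as the last guess.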

Skip

P.S. My apologies for the mess this message probably is. Amazing as it may
seem, Gmail in Chrome does a crappy job editing anything other than plain
text. Also, I'm surprised in this day and age that common tools like Gnome
Terminal have little or no encoding support. I wound up having to pop up
urxvt to get an encodings-flexible terminal emulator...

Rustom Mody · May 13, 2014

$0.99/lb?

Dollars Zeros Slashes Question marks Smileys...
Just alphabets is enough I think...

Come to think of it why have anything other than zeros and ones?

Chris Angelico · May 13, 2014

Come to think of it why have anything other than zeros and ones?

Obligatory: http://xkcd.com/257/

ChrisA

Grant Edwards · May 13, 2014

You do not need any statements at all, copyright is automatically assigned
to anything you create (at least that is the case in UK Law)
although proving the creation date may be difficult.

Yep, it's the same in the US.

Ian Kelly · May 13, 2014

I am only an amateur python coder which is why I asked if I am missing
something

I could not see any reason to be using the shutil module if all that the
program is doing is opening a file, reading it & then printing it.

is it python that causes the issue, the shutil module or just the OS not
liking the data it is being sent?

an explanation of why this approach is taken would be much appreciated.

No, that part is perfectly fine. This is exactly what the shutil
module is meant for: providing shell-like operations. Although in
this case the copyfileobj function is quite simple (have yourself a
look at the source -- it just reads from one file and writes to the
other in a loop), in general the Pythonic thing is to avoid
reinventing the wheel.

And since it's so simple, it shouldn't be hard to see that the use of
the shutil module has nothing to do with the Unicode woes here. The
crux of the issue is that a general-purpose command like cat typically
can't know the encoding of its input and can't assume anything about
it. In fact, there may not even be an encoding; cat can be used with
binary data. The only non-destructive approach then is to copy the
binary data straight from the source to the destination with no
decoding steps at all, and trust the user to ensure that the
destination will be able to accommodate the source encoding. Because
Python 3 presents stdin and stdout as text streams however, it makes
them more difficult to use with binary data, which is why Armin sets
up all that extra code to make sure his file objects are binary.
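
For reference, a stripped-down sketch of the binary approach. This is not Armin's code; the `cat` function here is a hypothetical stand-in that only shows the idea of copying bytes with no decoding step. In-memory streams are used so the example is self-contained.

```python
import io
import shutil


def cat(source, destination):
    """Copy raw bytes from source to destination without decoding."""
    shutil.copyfileobj(source, destination)


# The bytes below are not valid UTF-8, yet they pass through untouched.
src = io.BytesIO(b"caf\xe9 \xff\xfe\n")
dst = io.BytesIO()
cat(src, dst)
copied = dst.getvalue()

# In a real script the binary layers of the standard streams would be used:
#     cat(sys.stdin.buffer, sys.stdout.buffer)
```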
