why isn't Unicode the default encoding?

John Salerno

Forgive my newbieness, but I don't quite understand why Unicode is still
something that needs special treatment in Python (and perhaps
elsewhere). I'm reading Dive Into Python right now, and it constantly
refers to a 'regular string' versus a 'Unicode string' and how you need
to convert back and forth. But why isn't Unicode considered a regular
string by now? Is it for historical reasons that we still use ASCII and
Latin-1? Why can't Unicode replace them so we no longer need the 'u'
prefix or the encoding tricks?
 
Robert Kern

John said:
Forgive my newbieness, but I don't quite understand why Unicode is still
something that needs special treatment in Python (and perhaps
elsewhere). I'm reading Dive Into Python right now, and it constantly
refers to a 'regular string' versus a 'Unicode string' and how you need
to convert back and forth. But why isn't Unicode considered a regular
string by now? Is it for historical reasons that we still use ASCII and
Latin-1?

Well, *I* use UTF-8, but that's neither here nor there.
Why can't Unicode replace them so we no longer need the 'u'
prefix or the encoding tricks?

It would break a hell of a lot of code. Try using the -U command line argument
to the Python interpreter. That makes unicode strings default.

[~]$ python -U
Python 2.4.1 (#2, Mar 31 2005, 00:05:10)
[GCC 3.3 20030304 (Apple Computer, Inc. build 1666)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
Python tries very hard to remain backwards compatible. Python 3.0 is the
designated "break compatibility so we can remove all of the cruft that's built
up" release. It is still several years away although Guido is starting to work
on it now.
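(For reference, the str/bytes split that Python 3 ended up shipping can be
sketched in its eventual syntax, where the prefixes are reversed: a plain
literal is a unicode (character) string, and b'...' marks a byte string.)

```python
# Python 3 reversed the prefixes: a plain literal is a character
# (unicode) string, and b'...' denotes a byte string.
text = 'caf\u00e9'            # four characters, the last being é
data = text.encode('utf-8')   # explicit conversion to bytes

assert isinstance(data, bytes)
assert len(text) == 4         # counted in characters
assert len(data) == 5         # counted in bytes: é takes two in UTF-8
assert data.decode('utf-8') == text
```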

--
Robert Kern
(e-mail address removed)

"I have come to believe that the whole world is an enigma, a harmless enigma
that is made terrible by our own mad attempt to interpret it as though it had
an underlying truth."
-- Umberto Eco
 
John Salerno

Robert said:
Well, *I* use UTF-8, but that's neither here nor there.

I see UTF-8 a lot, but this particular book also mentions that UTF-16 is
the most common. Is that true?
It would break a hell of a lot of code. Try using the -U command line argument
to the Python interpreter. That makes unicode strings default.

I figured this might have something to do with it, but then again I
thought that Unicode was created as a subset of ASCII and Latin-1 so
that they would be compatible...but I guess it's never that easy. :)
 
Jan Niklas Fingerle

John Salerno said:
to convert back and forth. But why isn't Unicode considered a regular
string by now? Is it for historical reasons that we still use ASCII and
Latin-1?

The point is that, with a regular string, you don't know its encoding,
or whether it has an encoding at all; it might as well be just a byte
buffer. The best thing would be to have both a byte buffer type and a
unicode string type, but this can't happen as long as you don't want to
break existing code.
Why can't Unicode replace them so we no longer need the 'u'
prefix or the encoding tricks?

It's proposed for python 3000 (http://www.python.org/doc/peps/pep-3000/)
and I think it will make it into the language.

Cheers,
--Jan Niklas
 
Robert Kern

John said:
I see UTF-8 a lot, but this particular book also mentions that UTF-16 is
the most common. Is that true?

I think it unlikely, but I have no numbers to give. And I'll bet that that book
doesn't either.
I figured this might have something to do with it, but then again I
thought that Unicode was created as a subset of ASCII and Latin-1 so
that they would be compatible...but I guess it's never that easy. :)

No, it isn't. You seem to be somewhat confused about Unicode. At least you are
misusing terminology quite a bit. You may want to read the following articles:

http://www.joelonsoftware.com/articles/Unicode.html
http://effbot.org/zone/unicode-objects.htm

--
Robert Kern
(e-mail address removed)

 
Jan Niklas Fingerle

Robert Kern said:
I think it unlikely, but I have no numbers to give. And I'll bet that that book
doesn't either.

I haven't got any numbers either, but my guess would be that the many
Chinese users add their share to the UTF-16 numbers. I don't know about
other Asian languages, though.

Cheers,
--Jan Niklas
 
John Salerno

Robert said:
No, it isn't. You seem to be somewhat confused about Unicode. At least you are
misusing terminology quite a bit. You may want to read the following articles:

I meant to say 'superset'.
 
John Salerno

Robert said:

That was fascinating. Thank you. So as it turns out, Unicode and UTF-8
are not the same thing? Am I right to say that UTF-8 stores the first
128 Unicode code points in a single byte, and then stores higher code
points in however many bytes they may need? If so, I guess I had been
misled by the '8' in the name, thinking that UTF-8 was another way of
storing characters in one byte (which would make it no different from
Latin-1, I suppose).
 
Guest

John said:
That was fascinating. Thank you. So as it turns out, Unicode and UTF-8
are not the same thing? Am I right to say that UTF-8 stores the first
128 Unicode code points in a single byte, and then stores higher code
points in however many bytes they may need? If so, I guess I had been
misled by the '8' in the name, thinking that UTF-8 was another way of
storing characters in one byte (which would make it no different from
Latin-1, I suppose).

That's all correct, except for the last parenthetical remark: using
a single-byte character set isn't the same as using Latin-1. There
are various single-byte character sets; they have names like Latin-2,
Latin-5, Latin-15, KOI8-R, CP437, windows-1252, and so on.

Regards,
Martin
 
Guest

I figured this might have something to do with it, but then again I
thought that Unicode was created as a subset of ASCII and Latin-1 so
that they would be compatible...but I guess it's never that easy. :)

The real problem is that the Python string type is used to represent
two very different concepts: bytes, and characters. You can't just drop
the current Python string type, and use the Unicode type instead - then
you would have no good way to represent sequences of bytes anymore.
Byte sequences occur more often than you might think: a ZIP file, a
MS Word file, a PDF file, and even an HTTP conversation are represented
through byte sequences.

So for a byte sequence, internal representation is important; for a
character string, it is not. Now, for historical reasons, the Python
string literals create byte strings, not character strings. Since we
cannot know whether a certain string literal is meant to denote bytes
or characters, we can't just change the interpretation.

Unicode is a superset of ASCII and Latin-1, but not of byte sequences.
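(That asymmetry can be sketched in Python 3 terms, where the two types
are separate: the round trip through Latin-1 always works, while an
arbitrary byte sequence need not decode as UTF-8.)

```python
# Latin-1 maps each of the 256 byte values to the Unicode code point
# with the same number, so decoding bytes as Latin-1 can never fail.
all_bytes = bytes(range(256))
text = all_bytes.decode('latin-1')
assert [ord(c) for c in text] == list(range(256))

# An arbitrary byte sequence, however, is not necessarily a character
# string in a given encoding:
try:
    b'\xff\xfe\xfd'.decode('utf-8')
    decodable = True
except UnicodeDecodeError:
    decodable = False
assert not decodable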

Regards,
Martin
 
John Salerno

Martin said:
That's all correct, except for the last parenthetical remark: using
a single-byte character set isn't the same as using Latin-1. There
are various single-byte character sets; they have names like Latin-2,
Latin-5, Latin-15, KOI8-R, CP437, windows-1252, and so on.

Regards,
Martin

Oh, I just meant that Latin-1 is an example of a one-byte character
set. So UTF-8 would be identical to it if it worked the way I used to
think it did.
 
John Salerno

Martin said:
The real problem is that the Python string type is used to represent
two very different concepts: bytes, and characters. You can't just drop
the current Python string type, and use the Unicode type instead - then
you would have no good way to represent sequences of bytes anymore.
Byte sequences occur more often than you might think: a ZIP file, a
MS Word file, a PDF file, and even an HTTP conversation are represented
through byte sequences.

So for a byte sequence, internal representation is important; for a
character string, it is not. Now, for historical reasons, the Python
string literals create byte strings, not character strings. Since we
cannot know whether a certain string literal is meant to denote bytes
or characters, we can't just change the interpretation.

Interesting. So then the read() method, if given a numeric argument for
bytes to read, would act differently depending on if you were using
Unicode or not? As it is now, it seems to equate the bytes with number
of characters, but if the document was written using Unicode characters,
is it possible that read(2) might only pull out one character?
 
and-google

John said:
So as it turns out, Unicode and UTF-8 are not the same thing?

Well yes. UTF-8 is one scheme in which the whole Unicode character
repertoire can be represented as bytes.

Confusion arises because Windows uses the name 'Unicode' in character
encoding lists, to mean UTF-16_LE, which is another encoding that can
store the whole Unicode character repertoire as bytes. However
UTF-16_LE is not any more definitively 'Unicode' than UTF-8 is.

Further confusion arises because the encoding 'UTF-16' can actually
mean two things that are deceptively different:

- Unicode characters stored natively in 16-bit units (using a pair of
16-bit code units, a "surrogate pair", to represent characters outside
of the Basic Multilingual Plane)

- Either of the 8-bit encodings UTF-16_LE and UTF-16_BE, detected
automatically using a Byte Order Mark when loaded, or chosen
arbitrarily when saving

Yet more confusion arises because UTF-32 (which can reference any
Unicode character directly) has the same problem. And though
wide-unicode builds of Python understand the first meaning (unicode()
strings are stored natively as UTF-32), they don't support the 8-bit
encodings UTF-32_LE and UTF-32_BE. Phew!

To summarise: confusion.
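(The zoo of UTF-16 spellings can be sketched like this, in Python 3
syntax; the plain 'utf-16' codec writes a byte order mark, while the
-le/-be variants commit to a byte order and write none.)

```python
s = 'hi'
le = s.encode('utf-16-le')   # no BOM, little-endian: b'h\x00i\x00'
be = s.encode('utf-16-be')   # no BOM, big-endian:    b'\x00h\x00i'
both = s.encode('utf-16')    # byte order mark, then native byte order

assert le == b'h\x00i\x00'
assert be == b'\x00h\x00i'
assert both[:2] in (b'\xff\xfe', b'\xfe\xff')  # the BOM
assert both.decode('utf-16') == s              # BOM consumed on decode
```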
Am I right to say that UTF-8 stores the first 128 Unicode code points
in a single byte, and then stores higher code points in however many
bytes they may need?

That is correct.
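(A quick sketch of those variable widths:)

```python
# UTF-8 uses more bytes as the code point grows:
assert len('A'.encode('utf-8')) == 1           # U+0041, ASCII range
assert len('\u00e9'.encode('utf-8')) == 2      # U+00E9 é, Latin-1 range
assert len('\u20ac'.encode('utf-8')) == 3      # U+20AC, the euro sign
assert len('\U0001D11E'.encode('utf-8')) == 4  # U+1D11E, outside the BMP
```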

To answer the original question, we're always going to need byte
strings. They're a fundamental part of computing and the need to
process them isn't going to go away. However as Unicode text
manipulation becomes a more common event than byte string processing,
it makes sense to change the default kind of string you get when you
type a literal.

Personally I would like to see byte strings available under an easy
syntax like b'...' and UTF-32 strings available as w'...', or something
like that - currently having u'...' mean either UTF-16 or UTF-32
depending on compile-time options is very very annoying to the few
kinds of programs that really do need to know the difference. But
whatever is chosen, it's all tasty Python 3000 future-soup and not
worth worrying about for the moment.
 
Matt Goodall

John said:
Interesting. So then the read() method, if given a numeric argument for
bytes to read, would act differently depending on if you were using
Unicode or not? As it is now, it seems to equate the bytes with number
of characters, but if the document was written using Unicode characters,
is it possible that read(2) might only pull out one character?

Exactly. read(2) might pull out one character, or only half a character.
It all depends on the encoding of the data you're reading.

If you're reading or writing text to a file (or anywhere, for that
matter) you need to know the unicode encoding of the file's content to
read it correctly.

Fortunately, the codecs module makes the whole process relatively painless:
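(The code sample from the original post did not survive archiving; a
minimal sketch of the idea, assuming a UTF-8 file named 'example.txt':)

```python
import codecs

# Create a small UTF-8 file so the example is self-contained;
# 'café' is five bytes on disk but only four characters.
with open('example.txt', 'wb') as f:
    f.write('caf\u00e9'.encode('utf-8'))

# codecs.open wraps the file in a decoding stream: reads return
# unicode characters, never raw bytes.
stream = codecs.open('example.txt', 'r', encoding='utf-8')
c = stream.read(1)   # one whole character, however many bytes on disk
stream.close()
```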

The 'stream' works on unicode characters so 'c' is a unicode instance,
i.e. a whole textual character.

- Matt

--
Matt Goodall, Pollenation Internet Ltd
w: http://www.pollenation.net
e: (e-mail address removed)
t: +44 (0)113 2252500

Any views expressed are my own and do not necessarily
reflect the views of my employer.
 
Guest

John said:
Interesting. So then the read() method, if given a numeric argument for
bytes to read, would act differently depending on if you were using
Unicode or not?

The read method currently returns a byte string, not a Unicode string.
It's not clear to me how the numeric argument should be interpreted when
it returns characters some day; it might be best to take the number as
counting characters, then. However, not supporting a numeric argument
at all might also be reasonable.
As it is now, it seems to equate the bytes with number
of characters, but if the document was written using Unicode characters,
is it possible that read(2) might only pull out one character?

Unicode isn't a character encoding (*all* documents in the world are
"written in Unicode", including those encoded as ASCII or
Latin-1).

In any case, it doesn't matter what encoding the document is in:
read(2) always returns two bytes. How many characters that constitutes
depends on the encoding - but read() doesn't return a character
string.

It might be that these two bytes are only part of a character,
e.g. if you need three bytes to encode a character, or it might
be that they are parts of two characters, e.g. when you get the
second byte of the first character and the first byte of the
second one. In some encodings (e.g. ISO-2022), these bytes
may indicate *no* character, e.g. when the bytes just indicate
an in-stream change of character set.
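(A sketch of two bytes straddling a character, using Python 3's bytes
type and a hypothetical file named 'euro.txt':)

```python
# The euro sign is three bytes in UTF-8, so read(2) on the raw file
# returns two bytes and zero complete characters.
with open('euro.txt', 'wb') as f:
    f.write('\u20ac'.encode('utf-8'))       # b'\xe2\x82\xac'

with open('euro.txt', 'rb') as f:
    chunk = f.read(2)

assert chunk == b'\xe2\x82'                 # a partial character
try:
    chunk.decode('utf-8')
    whole = True
except UnicodeDecodeError:
    whole = False
assert not whole                            # not decodable on its own
```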

Regards,
Martin
 
Jon Ribbens

In any case, it doesn't matter what encoding the document is in:
read(2) always returns two bytes.

It returns *up to* two bytes. Sorry to be picky but I think it's
relevant to the topic because it illustrates how it's difficult
to change the definition of file.read() to return characters
instead of bytes (if the file is ready to read, there will always
be one or more bytes available (or EOF), but there won't always
be one or more characters available).
 
