To unicode or not to unicode


Martin v. Löwis

> Since when is "Google Groups" a newsreader? So far as I know, all
> the display/formatting is handled by my web browser and GG merely stuffs
> messages into an HTML wrapper...

It also transmits this HTML wrapper via HTTP, where it claims that the
charset of the HTML is UTF-8. To do that, it must have converted the
original message from Latin-1 to UTF-8, which must have required
interpreting it as Latin-1 in the first place.
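The round trip described here can be sketched in a few lines of (modern) Python; the sample bytes are illustrative:

```python
# Interpret raw message bytes as Latin-1, then re-encode as UTF-8,
# as a gateway serving UTF-8 HTML would have to do.
raw = b"Martin v. L\xf6wis"      # 0xF6 is "ö" in ISO 8859-1 (Latin-1)
text = raw.decode("latin-1")     # interpret the bytes as Latin-1
utf8 = text.encode("utf-8")      # re-encode for the UTF-8 HTML page
print(utf8)                      # b'Martin v. L\xc3\xb6wis'
```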

Regards,
Martin
 

dineshv

re: "You should never have to rely on the default encoding. You should
explicitly decode and encode data."

What is the best practice for 1) doing this in Python and 2) for
unicode support?

I want to standardize on unicode and want to put into place best
Python practice so that we don't have to worry. Thanks!

Dinesh
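One widely used answer to this question is the "unicode sandwich": decode bytes to text explicitly at every input boundary, work only with text internally, and encode explicitly at every output boundary. A minimal sketch in modern Python (the helper names are illustrative, not a standard API):

```python
def read_text(raw_bytes, encoding="utf-8"):
    # Decode explicitly at the input boundary; never rely on the
    # interpreter's default encoding.
    return raw_bytes.decode(encoding)

def write_text(text, encoding="utf-8"):
    # Encode explicitly at the output boundary.
    return text.encode(encoding)

data = read_text(b"caf\xc3\xa9 menu")   # UTF-8 bytes in
result = data.upper()                   # pure text processing inside
payload = write_text(result)            # UTF-8 bytes out
print(payload)                          # b'CAF\xc3\x89 MENU'
```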
 

Denis Kasak

Ross Ridge wrote (Sat, 21 Feb 2009 18:06:35 -0500):

> No, the original post demonstrates you don't have to include MIME headers
> for ISO 8859-1 text to be properly displayed by many newsreaders. The
> fact that your obscure newsreader didn't display it properly doesn't mean
> that the original poster's newsreader is broken.

And how is this kind of assumption better than clearly stating the
encoding used? Does the fact that the last official Usenet RFC doesn't
mandate content-type headers mean that all bets are off and that we
should rely on guesswork to determine the correct encoding of a
message? No, it means the RFC is outdated and no longer suitable for
current needs.
> HTTP requires the assumption of ISO 8859-1 in the absence of any
> specified encoding.

Which is, of course, completely irrelevant for this discussion. Or are
you saying that this fact should somehow obliterate the need for
specifying encodings?
> Newsreaders assuming ISO 8859-1 instead of ASCII doesn't make it a guess.
> It's just a different assumption, nor does making an assumption, ASCII
> or ISO 8859-1, give you any certainty.

Assuming is another way of saying "I don't know, so I'm using this
arbitrary default", which is not that different from a completely wild
guess. :)
> Which is reasonable given that Python is a programming language, where
> it's better to make a more conservative assumption about encodings so
> errors can be more quickly diagnosed. A newsreader, however, is a
> different beast, where it's better to make a less conservative assumption
> that's more likely to display messages correctly to the user. Assuming
> ISO 8859-1 in the absence of any specified encoding allows the message to
> be correctly displayed if the character set is either ISO 8859-1 or
> ASCII. Doing things the "pythonic" way and assuming ASCII only allows
> such messages to be displayed if ASCII is used.
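The asymmetry described here is easy to demonstrate: every byte value 0x00-0xFF maps to a character in Latin-1, so the Latin-1 assumption never rejects a message, while the stricter ASCII assumption fails on anything outside 7 bits:

```python
ascii_bytes = b"plain text"
latin1_bytes = b"caf\xe9"          # "café" in ISO 8859-1

# Latin-1 decodes anything, and agrees with ASCII on 7-bit text.
assert ascii_bytes.decode("latin-1") == "plain text"
assert latin1_bytes.decode("latin-1") == "café"

# The ASCII assumption rejects the same message outright.
try:
    latin1_bytes.decode("ascii")
except UnicodeDecodeError as e:
    print("non-ASCII byte at position", e.start)
```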

Reading this paragraph, I've begun thinking that we've misunderstood
each other. I agree that assuming ISO 8859-1 in the absence of a
specification is a better guess than most (since it's more likely to
display the message correctly). However, not specifying the encoding
of a message is just asking for trouble, and assuming anything is just
an attempt at cleaning up someone else's mess. Unfortunately, it is
impossible to reliably detect the encoding by heuristics alone, and
with hundreds of encodings in existence today, the only real solution
to the problem is clearly stating your content-type. Since MIME is the
most accepted way of doing this, it should be the preferred way,
RFC'ed or not.
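Python's standard library can already consume such a declaration; a sketch using `email.message_from_bytes` (the sample message is made up):

```python
from email import message_from_bytes

# A made-up message that declares its charset explicitly via MIME.
raw = (b"Content-Type: text/plain; charset=iso-8859-1\r\n"
       b"\r\n"
       b"Martin v. L\xf6wis\r\n")

msg = message_from_bytes(raw)
charset = msg.get_content_charset()        # declared, not guessed
body = msg.get_payload(decode=True).decode(charset)
print(charset, "->", body.strip())
```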
 

John Machin

> So, yeah--back on the subject of programming in Python and supporting
> character sets beyond ASCII:

> If you have to make an assumption, I'd really think that it'd be
> better to use whatever the host OS's default is, if the host OS has
> such a thing--using an assumption of ISO 8859-1 works only in select
> regions on unix systems, and may fail even in those select regions on
> Windows, Mac OS, and other systems; without the OS considerations,
> just the regional constraints are likely to make an ISO 8859-1
> assumption result in /incorrect/ results anywhere eastward of central
> Europe. Is a user in Russia (or China, or Japan) *really* most likely
> to be using ISO 8859-1?
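In Python, the host's preference can be queried instead of hard-coded; what comes back depends entirely on the platform and locale settings:

```python
import locale
import sys

# Ask the OS/locale for its preferred encoding rather than assuming
# ISO 8859-1; typical answers include 'UTF-8', 'cp1252', 'koi8-r'.
fallback = locale.getpreferredencoding()
print("locale preferred encoding:", fallback)
print("filesystem encoding:", sys.getfilesystemencoding())
```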

> As a point of reference, here's what's in the man-pages that I have
> installed (note the /complete/ and conspicuous lack of references to
> even some notable eastern languages or character-sets, such as Chinese
> and Japanese, in the /entire/ ISO-8859 spectrum):

1. As a point of reference for what?
2. The ISO 8859 character sets were deliberately restricted to scripts
that would fit in 8 bits. So Chinese, Japanese, Korean and Vietnamese
aren't included. Note that Chinese and Japanese already each had
*multiple* legacy (i.e. non-Unicode) character sets ... they (and the
rest of the world) don't want/need yet another character set for each
language and never did want/need one.
 
