To unicode or not to unicode

Ron Garret · Feb 20, 2009

I'm writing a little wiki that I call µWiki. That's a lowercase Greek
mu at the beginning (it's pronounced micro-wiki). It's working, except
that I can't actually enter the name of the wiki into the wiki itself
because the default unicode encoding on my Python installation is
"ascii". So I'm trying to decide on a course of action. There seem to
be three possibilities:

1. Change the code to properly support unicode. Preliminary
investigations indicate that this is going to be a colossal pain in the
ass.

2. Change the default encoding on my Python installation to be latin-1
or UTF8. The disadvantage to this is that no one else will be able to
run my code without making the same change to their installation, since
you can't change default encodings once Python has started.

3. Punt and spell it 'uwiki' instead.

I'm feeling indecisive so I thought I'd ask other people's opinion.
What should I do?

rg

Benjamin Peterson · Feb 20, 2009

Ron Garret said:
I'm writing a little wiki that I call ÂµWiki. That's a lowercase Greek
mu at the beginning (it's pronounced micro-wiki). It's working, except
that I can't actually enter the name of the wiki into the wiki itself
because the default unicode encoding on my Python installation is
"ascii". So I'm trying to decide on a course of action. There seem to
be three possibilities:

You should never have to rely on the default encoding. You should explicitly
decode and encode data.

1. Change the code to properly support unicode. Preliminary
investigations indicate that this is going to be a colossal pain in the
ass.

Properly handling unicode may be painful at first, but it will surely pay off in
the future.
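
As a rough illustration of "explicitly decode and encode", here is a minimal sketch in Python 3 syntax (the thread predates it; in Python 2 the same pattern applies to `str` and `unicode`). The function names `read_page` and `write_page` and the choice of UTF-8 are assumptions made for the example, not part of the poster's wiki code.

```python
def read_page(raw: bytes, encoding: str = "utf-8") -> str:
    """Decode bytes into text at the input boundary, naming the encoding."""
    return raw.decode(encoding)


def write_page(text: str, encoding: str = "utf-8") -> bytes:
    """Encode text into bytes at the output boundary, naming the encoding."""
    return text.encode(encoding)


# Bytes as they might arrive from a form or a file: MICRO SIGN in UTF-8, then "Wiki".
incoming = b"\xc2\xb5Wiki"

title = read_page(incoming)    # text inside the program
outgoing = write_page(title)   # bytes again on the way out
```

Because the encoding is named at both boundaries, the interpreter's default encoding never comes into play.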

Thorsten Kampe · Feb 20, 2009

* Ron Garret (Thu, 19 Feb 2009 18:57:13 -0800)

I'm writing a little wiki that I call ÂµWiki. That's a lowercase Greek
mu at the beginning (it's pronounced micro-wiki).

No, it's not. I suggest you start your Unicode adventure by configuring
your newsreader.

Thorsten

MRAB · Feb 20, 2009

Thorsten said:
* Ron Garret (Thu, 19 Feb 2009 18:57:13 -0800)

No, it's not. I suggest you start your Unicode adventure by configuring
your newsreader.

It looked like mu to me, but you're correct: it's "MICRO SIGN", not
"GREEK SMALL LETTER MU".

Ron Garret · Feb 20, 2009

MRAB said:
It looked like mu to me, but you're correct: it's "MICRO SIGN", not
"GREEK SMALL LETTER MU".

Heh, I didn't know that those two things were distinct. Learn something
new every day.

rg

Martin v. Löwis · Feb 20, 2009

MRAB said:
It looked like mu to me, but you're correct: it's "MICRO SIGN", not
"GREEK SMALL LETTER MU".

I don't think that was the complaint. Instead, the complaint was
that the OP's original message did not have a Content-type header,
and that it was thus impossible to tell what the byte in front of
"Wiki" meant. To properly post either MICRO SIGN or GREEK SMALL LETTER
MU in a usenet or email message, you really must use MIME. (As both
your article and Thorsten's did, by choosing UTF-8)

Regards,
Martin

P.S. The difference between MICRO SIGN and GREEK SMALL LETTER MU
is nit-picking, IMO:

py> import unicodedata
py> unicodedata.name(unicodedata.normalize("NFKC", u"\N{MICRO SIGN}"))
'GREEK SMALL LETTER MU'
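
The same point can be spelled out a little further. This is only a sketch using the standard `unicodedata` module; the variable names are made up for the example.

```python
import unicodedata

micro = "\N{MICRO SIGN}"              # U+00B5, present in Latin-1
mu = "\N{GREEK SMALL LETTER MU}"      # U+03BC, in the Greek block

# The two characters look alike but are different code points.
distinct = micro != mu

# Compatibility normalization (NFKC) folds the micro sign into Greek mu ...
folded = unicodedata.normalize("NFKC", micro)

# ... while canonical normalization (NFC) leaves it untouched.
canonical = unicodedata.normalize("NFC", micro)
```

So whether the two count as "the same" depends on which normalization form the program applies before comparing.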

Ron Garret · Feb 20, 2009

Martin v. Löwis said:
I don't think that was the complaint. Instead, the complaint was
that the OP's original message did not have a Content-type header,

I'm the OP. I'm using MT-Newswatcher 3.5.1. I thought I had it
configured properly, but I guess I didn't. Under
Preferences->Languages->Send Messages with Encoding I had selected
latin-1. I didn't know I also needed to have MIME turned on for that to
work. I've turned it on now. Is this better?

This should be a micro sign: Âµ

rg

Martin v. Löwis · Feb 20, 2009

Ron said:
I'm the OP. I'm using MT-Newswatcher 3.5.1. I thought I had it
configured properly, but I guess I didn't.

Probably you did. However, it then means that the newsreader is crap.

Under
Preferences->Languages->Send Messages with Encoding I had selected
latin-1.

That sounds like early nineties, before the invention of MIME.

I didn't know I also needed to have MIME turned on for that to
work. I've turned it on now. Is this better?

This should be a micro sign: Âµ

Not really (it's worse, from my point of view - but might be better
for others). You are now sending in UTF-8, but there is still no
MIME declaration in the news headers. As a consequence, my newsreader
continues to interpret it as Latin-1 (which it assumes as the default
encoding), and it comes out as moji-bake (in responding, my reader
should declare the encoding properly, so you should see what I see,
namely A-circumflex, micro sign)
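
The effect described above (UTF-8 bytes read as Latin-1) can be reproduced in a few lines. This is a sketch in Python 3 syntax with made-up variable names; it assumes the sender used UTF-8 and the reader assumed Latin-1, as the post describes.

```python
text = "\N{MICRO SIGN}"

# What goes over the wire when the sender encodes as UTF-8: two bytes.
wire = text.encode("utf-8")

# What a reader sees when it assumes Latin-1: one character per byte,
# i.e. A-circumflex followed by the micro sign.
misread = wire.decode("latin-1")

# The damage is reversible as long as the wrong guess is known.
repaired = misread.encode("latin-1").decode("utf-8")
```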

If you look at the message headers / message source as sent e.g.
by MRAB, you'll notice lines like

MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

These lines are missing from your posting.
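
For comparison, Python's standard `email` package produces these three headers when asked to. The sketch below only composes a message in memory; it is not how any particular newsreader works, and the subject and body text are placeholders.

```python
from email.message import EmailMessage

msg = EmailMessage()
msg["Subject"] = "To unicode or not to unicode"

# Declare the charset and an 8-bit transfer encoding explicitly.
msg.set_content(
    "This should be a micro sign: \N{MICRO SIGN}\n",
    charset="utf-8",
    cte="8bit",
)

# Collect the headers a receiving reader would use to interpret the body.
headers = {
    name: msg[name]
    for name in ("MIME-Version", "Content-Type", "Content-Transfer-Encoding")
}
```

With those headers present, a reader no longer has to assume anything about the bytes in the body.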

Assuming the newsreader is not crap, it might help to set the default
send encoding to ASCII. When sending micro sign, the newsreader might
infer that ASCII is not good enough, and use MIME - although it then
still needs to pick an encoding.

Regards,
Martin

Ross Ridge · Feb 21, 2009

Martin v. Löwis said:
I don't think that was the complaint. Instead, the complaint was
that the OP's original message did not have a Content-type header,
and that it was thus impossible to tell what the byte in front of
"Wiki" meant. To properly post either MICRO SIGN or GREEK SMALL LETTER
MU in a usenet or email message, you really must use MIME. (As both
your article and Thorsten's did, by choosing UTF-8)

MIME only applies to Internet e-mail messages. RFC 1036 doesn't require
nor give a meaning to a Content-Type header in a Usenet message, so
there's nothing wrong with the original poster's newsreader.

In any case what the original poster really should do is come up with
a better name for his program.

Ross Ridge

Thorsten Kampe · Feb 21, 2009

* Ross Ridge (Sat, 21 Feb 2009 12:22:36 -0500)

MIME only applies to Internet e-mail messages.

No, it doesn't: "MIME's use, however, has grown beyond describing the
content of e-mail to describing content type in general. [...]

The content types defined by MIME standards are also of importance
outside of e-mail, such as in communication protocols like HTTP [...]"

http://en.wikipedia.org/wiki/MIME

RFC 1036 doesn't require nor give a meaning to a Content-Type header
in a Usenet message

Well, /maybe/ the reason for that is that RFC 1036 was written in 1987
and the first MIME RFC in 1992...? The "Son of RFC 1036" mentions MIME
more often than you can count.

so there's nothing wrong with the original poster's newsreader.

If you follow RFC 1036 (which was written before anyone even thought of
MIME) then all content has to be ASCII. The OP used non-ASCII letters.

It's all about declaring your charset. In Python as well as in your
newsreader. If you don't declare your charset it's ASCII for you - in
Python as well as in your newsreader.

Thorsten

Ross Ridge · Feb 21, 2009

Thorsten Kampe said:
Well, /maybe/ the reason for that is that RFC 1036 was written in 1987
and the first MIME RFC in 1992...?

Obviously.

"Son of RFC 1036" mentions MIME more often than you can count.

Since it was never submitted and accepted, RFC 1036 remains current.

If you follow RFC 1036 (which was written before anyone even thought of
MIME) then all content has to be ASCII. The OP used non-ASCII letters.

RFC 1036 doesn't place any restrictions on the content on the body of
an article. On the other hand "Son of RFC 1036" does have restrictions
on characters used in the body of message:

Articles MUST not contain any octet with value exceeding 127,
i.e. any octet that is not an ASCII character

Which means that merely adding a Content-Encoding header wouldn't
be enough to conform to "Son of RFC 1036"; the original poster would
also have had to either switch to a 7-bit character set or use a 7-bit
compatible transfer encoding. If you're trying to claim that "Son of RFC
1036" is the new de facto standard, then that would mean your newsreader
is broken too.
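
As an illustration of what a "7-bit compatible transfer encoding" does, here is a sketch using quoted-printable from Python's standard `quopri` module. Quoted-printable is just one such encoding, chosen for the example; the post does not say which one a newsreader would have to use.

```python
import quopri

# "µWiki" in Latin-1: the first octet is 0xB5, which exceeds 127.
body = "\N{MICRO SIGN}Wiki".encode("latin-1")

# Quoted-printable rewrites every such octet as an ASCII escape sequence.
encoded = quopri.encodestring(body)

# Every octet of the encoded form is now a plain ASCII value.
seven_bit = all(octet < 128 for octet in encoded)

# The receiver reverses the transfer encoding to recover the original octets.
decoded = quopri.decodestring(encoded)
```

The receiver still needs a declared charset to turn the recovered octets into characters; the transfer encoding only makes them safe to transmit.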

It's all about declaring your charset. In Python as well as in your
newsreader. If you don't declare your charset it's ASCII for you - in
Python as well as in your newsreader.

Except in practice unlike Python, many newsreaders don't assume ASCII.
The original article displayed fine for me. Google Groups displays it
correctly too:

http://groups.google.com/group/comp.lang.python/msg/828fefd7040238bc

I could just as easily argue that assuming ISO 8859-1 is the de facto
standard, and that it's your newsreader that's broken. The reality however
is that RFC 1036 is the only standard for Usenet messages, de facto or
otherwise, and so there's nothing wrong with anyone's newsreader.

Ross Ridge

Thorsten Kampe · Feb 21, 2009

* Ross Ridge (Sat, 21 Feb 2009 14:52:09 -0500)

Except in practice unlike Python, many newsreaders don't assume ASCII.

They assume ASCII - unless you declare your charset (the exception being
Outlook Express and a few Windows newsreaders). Everything else is
"guessing".

The original article displayed fine for me. Google Groups displays it
correctly too:

http://groups.google.com/group/comp.lang.python/msg/828fefd7040238bc

Your understanding of the principles of Unicode is at least as
nonexistent as the OP's.

I could just as easily argue that assuming ISO 8859-1 is the de facto
standard, and that it's your newsreader that's broken.

There is no "standard" in regard to guessing (this is what you call
"assuming"). The need for explicit declaration of an encoding is exactly
the same in Python as in any Usenet article.

The reality however is that RFC 1036 is the only standard for Usenet
messages, de facto or otherwise, and so there's nothing wrong with
anyone's newsreader.

The reality is that all non-broken newsreaders use MIME headers to
declare and interpret the charset being used. I suggest you read at
least http://www.joelonsoftware.com/articles/Unicode.html to get an idea
of Unicode and associated topics.

Thorsten

Ross Ridge · Feb 21, 2009

Ross Ridge (Sat, 21 Feb 2009 14:52:09 -0500)

Except in practice unlike Python, many newsreaders don't assume ASCII.

Thorsten Kampe said:
They assume ASCII - unless you declare your charset (the exception being
Outlook Express and a few Windows newsreaders). Everything else is
"guessing".

No, it's an assumption like the way Python by default assumes ASCII.

Your understanding of the principles of Unicode is at least as
nonexistent as the OP's.

The link demonstrates that Google Groups doesn't assume ASCII like
Python does. Since popular newsreaders like Google Groups and Outlook
Express can display the message correctly without the MIME headers,
but your obscure one can't, there's a much stronger case to be made that
it's your newsreader that's broken.

There is no "standard" in regard to guessing (this is what you call
"assuming"). The need for explicit declaration of an encoding is exactly
the same in Python as in any Usenet article.

No, many newsreaders don't assume ASCII by default like Python.

The reality is that all non-broken newsreaders use MIME headers to
declare and interpret the charset being used.

Since RFC 1036 doesn't require MIME headers a reader that doesn't generate
them is by definition not broken.

Ross Ridge

Carl Banks · Feb 21, 2009

Ron Garret said:
I'm writing a little wiki that I call µWiki. That's a lowercase Greek
mu at the beginning (it's pronounced micro-wiki). It's working, except
that I can't actually enter the name of the wiki into the wiki itself
because the default unicode encoding on my Python installation is
"ascii". So I'm trying to decide on a course of action. There seem to
be three possibilities:

1. Change the code to properly support unicode. Preliminary
investigations indicate that this is going to be a colossal pain in the
ass.

2. Change the default encoding on my Python installation to be latin-1
or UTF8. The disadvantage to this is that no one else will be able to
run my code without making the same change to their installation, since
you can't change default encodings once Python has started.

3. Punt and spell it 'uwiki' instead.

I'm feeling indecisive so I thought I'd ask other people's opinion.
What should I do?

rg

Thorsten Kampe · Feb 21, 2009

* Ross Ridge (Sat, 21 Feb 2009 17:07:35 -0500)

The link demonstrates that Google Groups doesn't assume ASCII like
Python does. Since popular newsreaders like Google Groups and Outlook
Express can display the message correctly without the MIME headers,
but your obscure one can't, there's a much stronger case to be made that
it's your newsreader that's broken.

*sigh* I give up on you. You didn't even read the "Joel on Software"
article. The whole "why" and "what for" of Unicode and MIME will always
be a complete mystery to you.

T.

Ross Ridge · Feb 21, 2009

Ross Ridge (Sat, 21 Feb 2009 17:07:35 -0500)

The link demonstrates that Google Groups doesn't assume ASCII like
Python does. Since popular newsreaders like Google Groups and Outlook
Express can display the message correctly without the MIME headers,
but your obscure one can't, there's a much stronger case to be made that
it's your newsreader that's broken.

Thorsten Kampe said:
*sigh* I give up on you. You didn't even read the "Joel on Software"
article. The whole "why" and "what for" of Unicode and MIME will always
be a complete mystery to you.

I understand what Unicode and MIME are for and why they exist. Neither
their merits nor your insults change the fact that the only current
standard governing the content of Usenet posts doesn't require their use.

Ross Ridge

Thorsten Kampe · Feb 21, 2009

* Ross Ridge (Sat, 21 Feb 2009 18:06:35 -0500)

I understand what Unicode and MIME are for and why they exist. Neither
their merits nor your insults change the fact that the only current
standard governing the content of Usenet posts doesn't require their
use.

That's right. As long as you use pure ASCII you can skip this nasty step
of informing other people which charset you are using. If you do use non
ASCII then you have to do that. That's the way virtually all newsreaders
work. It has nothing to do with some 21+ year old RFC. Even your Google
Groups "newsreader" does that ('content="text/html; charset=UTF-8"').

Being explicit about your encoding is 99% of the whole Unicode magic in
Python and in any communication across the Internet (may it be NNTP,
SMTP or HTTP). Your Google Groups simply uses heuristics to guess the
encoding the OP probably used. Windows newsreaders simply use the locale
of the local host. That's guessing. You can call it assuming but it's
still guessing. There is no way you can be sure without any declaration.

And it's unpythonic. Python "assumes" ASCII and if the decoded/encoded
text doesn't fit that encoding it refuses to guess.
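
The "refuses to guess" behaviour is easy to see directly. The sketch below uses Python 3 syntax and asks for the ASCII codec by name; in the Python 2 of this thread the same error appears when the implicit default codec is applied. The variable names are made up for the example.

```python
# "µWiki" as Latin-1 bytes: the first byte, 0xB5, is outside ASCII.
raw = b"\xb5Wiki"

try:
    raw.decode("ascii")
    refused = False
except UnicodeDecodeError as exc:
    # Python raises instead of picking some other encoding on its own.
    refused = True
    bad_offset = exc.start   # index of the first byte that is not ASCII

# With the encoding declared, the same bytes decode without trouble.
declared = raw.decode("latin-1")
```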

T.

Ross Ridge · Feb 22, 2009

Ross Ridge (Sat, 21 Feb 2009 18:06:35 -0500)

I understand what Unicode and MIME are for and why they exist. Neither
their merits nor your insults change the fact that the only current
standard governing the content of Usenet posts doesn't require their
use.

Thorsten Kampe said:
That's right. As long as you use pure ASCII you can skip this nasty step
of informing other people which charset you are using. If you do use non
ASCII then you have to do that. That's the way virtually all newsreaders
work. It has nothing to do with some 21+ year old RFC. Even your Google
Groups "newsreader" does that ('content="text/html; charset=UTF-8"').

No, the original post demonstrates you don't have to include MIME headers for
ISO 8859-1 text to be properly displayed by many newsreaders. The fact
that your obscure newsreader didn't display it properly doesn't mean
that the original poster's newsreader is broken.

Being explicit about your encoding is 99% of the whole Unicode magic in
Python and in any communication across the Internet (may it be NNTP,
SMTP or HTTP).

HTTP requires the assumption of ISO 8859-1 in the absence of any
specified encoding.

Your Google Groups simply uses heuristics to guess the
encoding the OP probably used. Windows newsreaders simply use the locale
of the local host. That's guessing. You can call it assuming but it's
still guessing. There is no way you can be sure without any declaration.

Newsreaders assuming ISO 8859-1 instead of ASCII doesn't make it a guess.
It's just a different assumption, and neither assumption, ASCII or
ISO 8859-1, gives you any certainty.

And it's unpythonic. Python "assumes" ASCII and if the decoded/encoded
text doesn't fit that encoding it refuses to guess.

Which is reasonable given that Python is a programming language, where it's
better to have a more conservative assumption about encodings so errors
can be more quickly diagnosed. A newsreader however is a different
beast, where it's better to make a less conservative assumption that's
more likely to display messages correctly to the user. Assuming ISO
8859-1 in the absence of any specified encoding allows the message to be
correctly displayed if the character set is either ISO 8859-1 or ASCII.
Doing things the "pythonic" way and assuming ASCII only allows such
messages to be displayed if ASCII is used.
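
The trade-off between the two assumptions can be sketched as follows. `decode_body` is a hypothetical helper written for this example, not part of any newsreader; it uses the declared charset when there is one and otherwise falls back to a configurable assumption.

```python
def decode_body(raw, declared=None, fallback="latin-1"):
    """Decode an article body, preferring a declared charset over the fallback."""
    return raw.decode(declared if declared else fallback)


ascii_post = b"uWiki"            # pure ASCII
latin1_post = b"\xb5Wiki"        # micro sign as one Latin-1 byte
utf8_post = b"\xc2\xb5Wiki"      # micro sign as two UTF-8 bytes

# A Latin-1 fallback handles both undeclared ASCII and undeclared Latin-1.
from_ascii = decode_body(ascii_post)
from_latin1 = decode_body(latin1_post)

# It never raises, so undeclared UTF-8 is silently shown with an extra character.
from_utf8_undeclared = decode_body(utf8_post)

# With a declaration, no assumption is needed at all.
from_utf8_declared = decode_body(utf8_post, declared="utf-8")

# An ASCII fallback rejects the undeclared Latin-1 post instead of displaying it.
try:
    decode_body(latin1_post, fallback="ascii")
    ascii_fallback_failed = False
except UnicodeDecodeError:
    ascii_fallback_failed = True
```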

Ross Ridge

Thorsten Kampe · Feb 22, 2009

* Ross Ridge (Sat, 21 Feb 2009 19:39:42 -0500)

No, the original post demonstrates you don't have to include MIME headers for
ISO 8859-1 text to be properly displayed by many newsreaders.

*sigh* As you still refuse to read the article[1] I'm going to quote it
now here:

'The Single Most Important Fact About Encodings

If you completely forget everything I just explained, please remember
one extremely important fact. It does not make sense to have a string
without knowing what encoding it uses.
[...]
If you have a string [...] in an email message, you have to know what
encoding it is in or you cannot interpret it or display it to users
correctly.

Almost every [...] "she can't read my emails when I use accents" problem
comes down to one naive programmer who didn't understand the simple fact
that if you don't tell me whether a particular string is encoded using
UTF-8 or ASCII or ISO 8859-1 (Latin 1) or Windows 1252 (Western
European), you simply cannot display it correctly [...]. There are over
a hundred encodings and above code point 127, all bets are off.'

Enough said.

The fact that your obscure newsreader didn't display it properly
doesn't mean that the original poster's newsreader is broken.

You don't even know if my "obscure newsreader" displayed it properly.
Non ASCII text without a declared encoding is just a bunch of bytes.
It's not even text.
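
To make "just a bunch of bytes" concrete, the sketch below decodes the single byte 0xB5 under a handful of encodings. The list of encodings is an arbitrary selection for the example; all four are codecs shipped with Python.

```python
import unicodedata

raw = b"\xb5"

# The same byte, read under four different assumptions.
readings = {
    encoding: raw.decode(encoding)
    for encoding in ("latin-1", "cp1252", "iso8859-5", "cp437")
}

# Print each reading with its Unicode name for comparison.
for encoding, char in readings.items():
    print(encoding, repr(char), unicodedata.name(char))
```

Latin-1 and Windows-1252 agree on the micro sign here, while the Cyrillic and DOS code pages each give a different character, so without a declaration the reader cannot know which one was meant.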

T.

[1] http://www.joelonsoftware.com/articles/Unicode.html

Steve Holden · Feb 22, 2009

Thorsten said:
* Ross Ridge (Sat, 21 Feb 2009 14:52:09 -0500)

They assume ASCII - unless you declare your charset (the exception being
Outlook Express and a few Windows newsreaders). Everything else is
"guessing".

Your understanding of the principles of Unicode is at least as
nonexistent as the OP's.

There is no "standard" in regard to guessing (this is what you call
"assuming"). The need for explicit declaration of an encoding is exactly
the same in Python as in any Usenet article.

The reality is that all non-broken newsreaders use MIME headers to
declare and interpret the charset being used. I suggest you read at
least http://www.joelonsoftware.com/articles/Unicode.html to get an idea
of Unicode and associated topics.

And I suggest you try to phrase your remarks in a way more respectful of
those you are discussing these matters with. I understand that
exasperation can lead to offensiveness, but if a lack of understanding
does exist then it's better to simply try and remove it without
commenting on its existence.

regards
Steve
