Question about encoding, I need a clue ...

Geoff Wright

Hi,

I use Mac OSX for development but deploy on a Linux server. (Platform details provided below).

When the locale is set to FR_CA, I am not able to display a u circumflex consistently across the two machines even though the default encoding is set to "ascii" on both machines. Specifically, calendar.month_name[8] returns a ? (question mark) on the Linux server whereas it displays properly on the Mac OSX system. However, if I pass the result through unicode(calendar.month_name[8],"latin1"), the u circumflex displays correctly on the Linux server but not on my Mac.

Of course, I could work around this problem with a relatively simple if statement, but these issues are going to show up all over my application, so even a simple if statement will get cumbersome.

I guess what it boils down to is that I would like to get a better handle on what is going on so that I will know how best to work through future encoding issues. Thanks in advance for any advice.

Here are the specifics of my problem.

On my Mac:

Python 2.6.7 (r267:88850, Jul 30 2011, 23:46:53)
[GCC 4.2.1 (Apple Inc. build 5664)] on darwin
>>> import locale
>>> locale.setlocale(locale.LC_ALL,'fr_CA')
'fr_CA'
>>> import sys
>>> sys.getdefaultencoding()
'ascii'
>>> import calendar
>>> calendar.month_name[8]
'ao\xc3\xbbt'
>>> print calendar.month_name[8]
août
>>> print unicode(calendar.month_name[8],"latin1")
aoÃ»t

On the linux server:

uname -a
Linux alhena 2.6.32.8-grsec-2.1.14-modsign-xeon-64 #2 SMP Sat Mar 13 00:42:43 PST 2010 x86_64 GNU/Linux

Python 2.5.2 (r252:60911, Jan 24 2010, 17:44:40)
[GCC 4.3.2] on linux2
>>> import locale,sys,calendar
>>> locale.setlocale(locale.LC_ALL,'fr_CA')
'fr_CA'
>>> sys.getdefaultencoding()
'ascii'
>>> calendar.month_name[8]
'ao\xfbt'
>>> print calendar.month_name[8]
ao?t
>>> print unicode(calendar.month_name[8],"latin1")
août
 
Steven D'Aprano

Geoff said:
Hi,

I use Mac OSX for development but deploy on a Linux server. (Platform
details provided below).

When the locale is set to FR_CA, I am not able to display a u circumflex
consistently across the two machines even though the default encoding is
set to "ascii" on both machines.

As somebody else already pointed out, û (u circumflex) is not an ASCII
character, so why would you expect to be able to use it with the ASCII
encoding?

Essential reading:

http://www.joelonsoftware.com/articles/Unicode.html

Drop everything and go read that!

In Python 2.x, so-called strings are byte strings, which complicates
matters greatly. The month name you get:

'ao\xc3\xbbt'

is a string of five bytes with hex values:

x61 x6f xc3 xbb x74

Depending on how your terminal is set up, that MAY be interpreted as the
characters a o û t but you could end up with anything:
ao羶t

(In theory, even the a, o and t could change, but I haven't found any
terminal settings *that* wacky.)
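
To make the point concrete, here is a minimal sketch of how the same five bytes turn into different text depending on the encoding used to interpret them. It is written in Python 3 syntax (an assumption; the thread itself uses Python 2), and the list of encodings is only an illustrative selection, not a claim about what any particular terminal uses.

```python
# The five bytes that calendar.month_name[8] returned on the Mac in this thread.
raw = b"ao\xc3\xbbt"

# Illustrative selection of encodings a terminal might be configured for.
candidates = ["utf-8", "latin-1", "cp437", "mac_roman"]


def decode_each(data, encodings):
    """Return a dict mapping each encoding name to the text it produces."""
    return {name: data.decode(name) for name in encodings}


results = decode_each(raw, candidates)
for name, text in results.items():
    print("%-10s %r" % (name, text))
```

Only the UTF-8 interpretation yields the four characters a, o, û, t. The single-byte encodings keep the ASCII letters but turn the two middle bytes into two unrelated characters.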

Geoff said:
Specifically, calendar.month_name[8] returns a ? (question mark) on the
Linux server whereas it displays properly on the Mac OSX system.

That could mean either:

(1) the terminal on the Linux server is set to a different default encoding
from that on the Mac; or

(2) the two terminals have the same encoding, but the font used on the Linux
server doesn't include the right glyph to display û.

Of the two, I expect (1) is more likely.

The solution is to avoid relying on lucky accidents of the terminal
encoding, and deal with this the right way. The right way is nearly always
to use UTF-8 everywhere you can, not Latin 1. Make sure your terminal is
set to use UTF-8 as well (I believe this is the default for Mac OS's
terminal app, but I have no idea about the many different Linux terminals).
Then:
>>> bytes = 'ao\xc3\xbbt'  # From calendar.month_name[8]
>>> s = bytes.decode('utf-8')  # Like unicode(bytes, 'utf-8')
>>> s
u'ao\xfbt'
>>> print s
août


Provided your Linux server terminal also is set to use UTF-8, this should
just work.
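
The same decode-early, encode-late pattern can be sketched in Python 3 syntax (an assumption; the session above is Python 2, where the literal needs no b prefix). The variable names are illustrative only.

```python
# Bytes as they arrive from a UTF-8 locale (the Mac value in this thread).
raw = b"ao\xc3\xbbt"

# Decode once, at the boundary, into a Unicode string.
text = raw.decode("utf-8")

# Inside the program, work with characters rather than bytes.
byte_count = len(raw)    # number of bytes
char_count = len(text)   # number of characters

# Encode again only when writing to a destination whose encoding is known.
encoded = text.upper().encode("utf-8")

print(byte_count, char_count, repr(encoded))
```

Operations such as len() and upper() behave sensibly on the decoded string because it holds characters; on the raw bytes the û would count as two units.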
 
Nobody

Geoff said:
I guess what it boils down to is that I would like to get a better handle
on what is going on so that I will know how best to work through future
encoding issues. Thanks in advance for any advice.

Here are the specifics of my problem.

On my Mac:
>>> sys.getdefaultencoding()
'ascii'

sys.getdefaultencoding() is a red herring. It's almost always 'ascii',
and isn't affected by the locale (and cannot be changed outside of the
site.py file).
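
As a rough illustration, the two values can be printed side by side. This sketch assumes Python 3, where sys.getdefaultencoding() reports 'utf-8' rather than the 'ascii' seen in the Python 2 sessions above; in both versions it is locale.getpreferredencoding() that reflects the locale. The helper name encoding_report is made up for this example.

```python
import locale
import sys


def encoding_report():
    """Collect the interpreter default and the locale's preferred encoding."""
    return {
        # Fixed by the interpreter; does not follow the locale.
        "default": sys.getdefaultencoding(),
        # Derived from the locale settings; False means "do not call setlocale".
        "locale": locale.getpreferredencoding(False),
    }


report = encoding_report()
for key, value in sorted(report.items()):
    print("%-8s %s" % (key, value))
```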

>>> import calendar
>>> calendar.month_name[8]
'ao\xc3\xbbt'

This is the "repr()" of 'août' in UTF-8.

>>> print calendar.month_name[8]
août
>>> print unicode(calendar.month_name[8],"latin1")
aoÃ»t

This is what you get if you decode the UTF-8 representation of 'août'
using ISO-8859-1 (aka ISO-Latin-1).

>>> calendar.month_name[8]
'ao\xfbt'

This is the "repr()" of 'août' in ISO-8859-1.

Conclusion: the Mac's "fr_CA" locale uses UTF-8, the Linux system uses
ISO-8859-1 (there may or may not be a distinct "fr_CA.utf8" locale which
uses UTF-8). The difference between the two /isn't/ responsible for your
problem; your problem is almost certainly due to a mismatch between the
encoding used by the terminal and the locale's encoding.
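
If the bytes do need to become Unicode, one option is to decode them with whatever encoding the locale reports instead of hard-coding "latin1" or "utf-8". The sketch below assumes Python 3 syntax and uses a hypothetical helper, to_text; in Python 2 the equivalent call would be something like calendar.month_name[8].decode(locale.getpreferredencoding()).

```python
import locale


def to_text(raw, encoding=None):
    """Decode locale-encoded bytes, defaulting to the locale's own encoding."""
    if encoding is None:
        # Ask the locale which encoding its strings use.
        encoding = locale.getpreferredencoding(False)
    return raw.decode(encoding)


# The two byte strings from this thread, with their encodings passed explicitly.
mac_text = to_text(b"ao\xc3\xbbt", "UTF-8")
linux_text = to_text(b"ao\xfbt", "ISO-8859-1")
print(mac_text == linux_text)
```

Both machines end up with the same Unicode string even though the bytes they started from differ.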

If you get a "?" on the Linux system, it's likely that the terminal (or
emulator) is configured to use something other than ISO-8859-1 (e.g. UTF-8
or ASCII). For a GUI-based emulator (xterm, etc), you need to consult the
documentation for the specific program. For the Linux console, refer to
the setfont(8) manual page.
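
The effect can be imitated in Python 3 (an assumption; the thread uses Python 2) by decoding the Linux bytes the way a terminal with a different encoding would. This is only a model: a real terminal chooses its own placeholder, often "?", whereas Python's "replace" error handler uses U+FFFD. The helper name as_seen_by is invented for this sketch.

```python
# 'août' as produced by the ISO-8859-1 locale on the Linux server.
linux_bytes = b"ao\xfbt"


def as_seen_by(data, terminal_encoding):
    """Decode data as a terminal might, substituting U+FFFD for invalid bytes."""
    return data.decode(terminal_encoding, errors="replace")


latin1_view = as_seen_by(linux_bytes, "iso-8859-1")  # matching encoding
utf8_view = as_seen_by(linux_bytes, "utf-8")         # mismatched encoding
ascii_view = as_seen_by(linux_bytes, "ascii")        # mismatched encoding

for view in (latin1_view, utf8_view, ascii_view):
    print(repr(view))
```

The byte 0xfb is valid in ISO-8859-1 but is not a valid sequence in UTF-8 or ASCII, so only the first view contains û.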

In this situation, there probably isn't much point in converting to and
from Unicode. You can't perform the encoding step (Unicode -> whatever)
without knowing the terminal's encoding. It *should* be the same as the
locale's encoding, in which case converting to and from Unicode is an
identity transformation (i.e. you get out exactly what you put in). If it
isn't the same as the locale's encoding, well ... good luck trying to
figure out what it is.
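
A small sketch of the identity claim, in Python 3 syntax (an assumption). The encoding names are supplied by hand, since discovering the terminal's real encoding is exactly the hard part; same_encoding and reencode are hypothetical helper names.

```python
import codecs


def same_encoding(name_a, name_b):
    """Compare two encoding names after resolving aliases with codecs.lookup()."""
    return codecs.lookup(name_a).name == codecs.lookup(name_b).name


def reencode(data, locale_encoding, terminal_encoding):
    """Decode with the locale's encoding, then encode for the terminal."""
    return data.decode(locale_encoding).encode(terminal_encoding)


linux_bytes = b"ao\xfbt"

# Terminal matches the locale: the round trip returns the input unchanged.
unchanged = reencode(linux_bytes, "ISO-8859-1", "latin1")

# Terminal is UTF-8: the round trip produces different bytes.
converted = reencode(linux_bytes, "ISO-8859-1", "UTF-8")

print(same_encoding("ISO-8859-1", "latin1"), repr(unchanged), repr(converted))
```

When the two encodings differ, the conversion is what turns the Linux bytes into the ones the Mac produced; when they match, it does nothing.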
 
