How to get Python to default to UTF8

weheh

I'm developing a cgi-bin application that must be unicode sensitive. I'm
striving for a UTF8 implementation. I'm running python 2.3 on a development
machine (windows xp) and a server (windows xp server). Both environments are
running Apache 2.2 with the same configuration file.

The problem is this. On my development machine I get the following unicode
error:

UnicodeDecodeError: 'utf8' codec can't decode bytes in position 4-6: invalid
data
args = ('utf8', 'adem\xe3\xa1s', 4, 7, 'invalid data')
encoding = 'utf8'
end = 7
object = 'adem\xe3\xa1s'
reason = 'invalid data'
start = 4


On my server, running exactly the same python code, I see the following
unicode error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe3 in position 4:
ordinal not in range(128)
args = ('ascii', 'adem\xe3\xa1s', 4, 5, 'ordinal not in range(128)')
encoding = 'ascii'
end = 5
object = 'adem\xe3\xa1s'
reason = 'ordinal not in range(128)'
start = 4

Note the differences in the encoding -- on the development machine it's utf8
but on the server it's ascii.

I was under the impression that Python assumed ascii encoding by default.
I'm wondering how my development machine got to be utf8. And since my
python code is the same on both machines, what is it about my configuration
that could be causing a difference in default encoding? I checked site.py on
both machines and both files default to ASCII, so I assume it's something
else.

Thanks in advance.
 
Fredrik Lundh

weheh said:
I'm developing a cgi-bin application that must be unicode sensitive. I'm
striving for a UTF8 implementation. I'm running python 2.3 on a development
machine (windows xp) and a server (windows xp server). Both environments are
running Apache 2.2 with the same configuration file.

The problem is this. On my development machine I get the following unicode
error:

UnicodeDecodeError: 'utf8' codec can't decode bytes in position 4-6: invalid
data
args = ('utf8', 'adem\xe3\xa1s', 4, 7, 'invalid data')
encoding = 'utf8'
end = 7
object = 'adem\xe3\xa1s'
reason = 'invalid data'
start = 4

Could be that sys.stdin.encoding differs between the setups.

*Where* do you get this exception? In the database layer? When the
script is trying to read things from a file? When it's trying to output
things? Somewhere else?

</F>
 
weheh

Hi Fredrik, thanks for responding. After reading up some more on this, I
think my title should be changed to "How to get Python to default to ASCII".
In point of fact, I want my 2 environments to agree so that I can debug
things more easily. Right now it's a nightmare.

As to your questions, in this example, I believe the exception was caused by
trying to do a count of the number of times a string appears in an array.
One of the strings was unicode and the other was encoded by Python by
default.
 
Fredrik Lundh

weheh said:
Hi Fredrik, thanks for responding. After reading up some more on this, I
think my title should be changed to "How to get Python to default to ASCII".
In point of fact, I want my 2 environments to agree so that I can debug
things more easily. Right now it's a nightmare.

As to your questions, in this example, I believe the exception was caused by
trying to do a count of the number of times a string appears in an array.
One of the strings was unicode and the other was encoded by Python by
default.

to fix this, figure out from where you got the encoded (8-bit) string,
and make sure you decode it properly on the way in. only use Unicode
strings on the "inside".

(Python does have two encoding defaults; there's a default encoding that
*shouldn't* ever be changed from the "ascii" default, and there's also a
stdin/stdout encoding that's correctly set if you run the code in an
ordinary terminal window. if you get your data from anywhere else, you
cannot trust any of these, so you should do your own decoding on the way
in, and encoding things on the way out).
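
A minimal sketch of the decode-on-the-way-in, encode-on-the-way-out pattern described above. It is written for Python 3, where bytes and text are separate types; on Python 2.3, as used in this thread, the same idea would apply to 8-bit str and unicode objects. The helper names decode_input and encode_output and the choice of UTF-8 are assumptions for illustration, not part of the original application.

```python
def decode_input(raw, encoding="utf-8"):
    """Turn bytes from the outside world into text; pass text through unchanged."""
    if isinstance(raw, bytes):
        return raw.decode(encoding)  # raises UnicodeDecodeError on bad data
    return raw


def encode_output(text, encoding="utf-8"):
    """Turn text into bytes just before it leaves the program."""
    return text.encode(encoding)


# Mixed input: one value is already text, the others are UTF-8 encoded bytes.
incoming = ["adem\u00e1s", b"adem\xc3\xa1s", b"hola"]
words = [decode_input(item) for item in incoming]

# Inside the program everything is text, so counting is unambiguous.
matches = words.count("adem\u00e1s")

# The byte string from the traceback is not valid UTF-8, so decoding it fails
# at the boundary instead of somewhere deep inside the program.
try:
    decode_input(b"adem\xe3\xa1s")
    bad_input_rejected = False
except UnicodeDecodeError:
    bad_input_rejected = True

# Encode only at the point where the text leaves the program.
payload = encode_output(words[0])
```

Note that the byte string in the traceback, 'adem\xe3\xa1s', is rejected by the utf8 codec because \xe3\xa1 followed by 's' is not a complete UTF-8 sequence; the UTF-8 encoding of "á" is \xc3\xa1.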

</F>
 
weheh

Hi Fredrik,

Thanks again for your feedback. I am much obliged.

Indeed, I am forced to be extremely rigorous about decoding on the way in
and encoding on the way out everywhere in my program, just as you say. Your
advice is excellent and concurs with other sources of unicode expertise.
Following this approach is the only thing that has made it possible for me
to get my program to work.

However, the situation is still unacceptable to me because I often make
mistakes and it is easy for me to miss places where encoding is necessary. I
rely on testing to find my faults. On my development environment, I get no
error message and it seems that everything works perfectly. However, once
ported to the server, I see a crash. But this is too late a stage to catch
the error since the app is already live.

I assume that the default encoding that you mention shouldn't ever be
changed is stored in the site.py file. I've checked this file and it's set
to ascii in both machines (development and server). I haven't touched
site.py. However, a week or so ago, following the advice of someone I read
on the web, I did create a file in my cgi-bin directory called something
like site-config.py, wherein encoding was set to utf8. I ran my program a
few times, but then reading elsewhere that the site-config.py approach was
outmoded, I decided to remove this file. I'm wondering whether it made a
permanent change somewhere in the bowels of python while I wasn't looking?

Can you elaborate on where to look to see what stdin/stdout encodings are
set to? All inputs are coming at my app either via html forms or input
files. All output goes either to the browser via html or to an output file.
 
Martin v. Löwis

weheh said:
However, the situation is still unacceptable to me because I often make
mistakes and it is easy for me to miss places where encoding is necessary. I
rely on testing to find my faults. On my development environment, I get no
error message and it seems that everything works perfectly. However, once
ported to the server, I see a crash. But this is too late a stage to catch
the error since the app is already live.

If you want to check whether there is indeed no place where you forgot
to properly .encode, you can set the default encoding on your
development machine to "undefined" (see site.py). This will give you an
exception whenever the default encoding is invoked, even if the encoding
would have succeeded under the usual default encoding (i.e. "ascii").

Such a setting should not be applied to a production environment.
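
As a rough illustration of why the "undefined" setting catches these mistakes, the sketch below calls the "undefined" codec directly instead of changing the interpreter's default encoding. It assumes a Python 3 interpreter whose standard library ships that codec; the variable names are illustrative only.

```python
# The "ascii" codec accepts pure ASCII text without complaint.
ascii_ok = "hello".encode("ascii") == b"hello"

# The "undefined" codec refuses every conversion, even for the same
# pure ASCII text, which is what exposes a forgotten explicit conversion.
try:
    "hello".encode("undefined")
    undefined_raised = False
except UnicodeError:
    undefined_raised = True
```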

weheh said:
Can you elaborate on where to look to see what stdin/stdout encodings are
set to?

Just print out sys.stdin.encoding and sys.stdout.encoding. Or were you
asking for the precise source in the interpreter that sets them?

weheh said:
All inputs are coming at my app either via html forms or input
files. All output goes either to the browser via html or to an output file.

Then sys.stdout.encoding will not be set to anything.
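
A small diagnostic along these lines could be run on both machines to compare them. This is a sketch in Python 3 syntax; encoding_report is a hypothetical helper name, and getattr is used because a redirected or replaced stream may not have an encoding attribute at all.

```python
import sys


def encoding_report():
    """Collect the encodings the interpreter reports for this process."""
    return {
        "default": sys.getdefaultencoding(),
        "stdin": getattr(sys.stdin, "encoding", None),
        "stdout": getattr(sys.stdout, "encoding", None),
    }


report = encoding_report()
for name in sorted(report):
    # A value of None means the stream has no encoding set.
    print("%s: %r" % (name, report[name]))
```

Running it once from a terminal and once through the web server would show whether the two contexts differ.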

Regards,
Martin
 
John Nagle

weheh said:
Hi Fredrik,

Thanks again for your feedback. I am much obliged.

Bear in mind that in Python, ASCII currently means ASCII, values
0..127. Type "str" will accept values > 127. However, the default
conversion from "str" to "unicode" requires true ASCII values, in
0..127. So if you take in data from some source which might have
a byte value > 127, the default conversion to Unicode won't work.

There are conversion functions for specifying the meaning of
values 128..255, (the input might be "latin1" encoding, for
example), or ignoring unexpected characters, or converting them
to "?".
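
The options mentioned above correspond to the encoding name and the error-handler argument of the codec calls. A short sketch in Python 3 syntax, assuming the incoming bytes are Latin-1; the sample value is made up for illustration.

```python
raw = b"adem\xe1s"  # assumed sample input: Latin-1 encoded bytes

# State explicitly what the byte values 128..255 mean.
as_latin1 = raw.decode("latin-1")

# Silently drop bytes that are not valid ASCII.
ignored = raw.decode("ascii", "ignore")

# Substitute a placeholder: U+FFFD when decoding, "?" when encoding.
replaced = raw.decode("ascii", "replace")
question_marked = as_latin1.encode("ascii", "replace")
```

Both "ignore" and "replace" lose information, so they are best reserved for cases where the real encoding of the input cannot be determined.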

John Nagle
 
