UTF-8 in basic CGI mode

coldpizza · Jan 15, 2008

Hi,

I have a basic Python CGI web form that shows data from a SQLite3
database. It runs under the built-in CGIHTTPServer, started with
something like this:

Code:

from BaseHTTPServer import HTTPServer
from CGIHTTPServer import CGIHTTPRequestHandler
# The server address must be a (host, port) tuple with an integer port.
HTTPServer(("", 8000), CGIHTTPRequestHandler).serve_forever()

The script runs OK with ASCII characters, but when I try to process
non-ASCII data I get a UnicodeDecodeError exception ('ascii' codec can't
decode byte 0xd0 in position 0: ordinal not in range(128)).

I have added the 'u' prefix to all my string literals, and I
_have_ wrapped all my output statements in myString.encode('utf8',
'replace'), but apparently the UnicodeDecodeError exception occurs
because of a string that comes back to the script through
cgi.FieldStorage().

I.e. I have the lines:
form = cgi.FieldStorage()
word = form['word'].value
which retrieve the 'word' value from a GET request.

I am using this 'word' variable like this:

print u'''<input type="text" name="blabla" value="%s">''' % (word)

and apparently this causes exceptions with non-ASCII strings.

I've also tried this:
print u'''<input type="text" name="blabla" value="%s">''' % (word.encode('utf8'))
but I still get the same UnicodeDecodeError.

What is the generally accepted good practice for working with UTF-8?

The standard Python CGI documentation has nothing on character sets.

It looks insane to have to explicitly wrap every string with
.encode('utf8'), and even that does not work.

Could the problem be related to the encoding of the string returned by
cgi.FieldStorage()? My page uses UTF-8 encoding.

What would be the encoding of the data that comes from the browser after
the form is submitted?

Why does Python always try to use 'ascii'? I have checked all my
strings and they are prefixed with 'u'. I have also tried replacing
print statements with
sys.stdout.write(DATA.encode('utf8'))
but this did not help.

Any clues?

Sion Arrowsmith · Jan 16, 2008

coldpizza said:
I am using this 'word' variable like this:

print u'''<input type="text" name="blabla" value="%s">''' % (word)

and apparently this causes exceptions with non-ASCII strings.

I've also tried this:
print u'''<input type="text" name="blabla" value="%s">''' % (word.encode('utf8'))
but I still get the same UnicodeDecodeError.

Your 'word' is a byte string (presumably UTF-8 encoded). When Python
is asked to insert a byte string into a unicode string (as you are
doing with the % operator, but the same applies to concatenation
with the + operator), it attempts to convert the byte string to
unicode. The default encoding is 'ascii', and the ascii codec
takes a very strict view of what an ASCII character is: only
bytes below 128 qualify.
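
A minimal sketch of the failure described above. The bytes are a hypothetical stand-in for the form value (the UTF-8 encoding of a short Cyrillic word), and the explicit .decode('ascii') spells out the conversion that Python 2 attempts implicitly when a byte string is mixed into a unicode string:

```python
# Hypothetical form value: UTF-8 bytes for a six-letter Cyrillic word.
word = b'\xd0\xbf\xd1\x80\xd0\xb8\xd0\xb2\xd0\xb5\xd1\x82'

try:
    # Python 2 performs this conversion implicitly for u'...%s' % word.
    word.decode('ascii')
    error = None
except UnicodeDecodeError as exc:
    error = exc

# The first byte (0xd0) is already outside the ASCII range,
# so decoding fails at position 0.
print(repr(error))
```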

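To get it to work, you need to *decode* word. It is already UTF-8
(or something) encoded, which is also why word.encode('utf8') fails:
to encode a byte string, Python first has to decode it, and it does
so with the ascii codec. Under most circumstances, use encode() to
turn unicode strings into byte strings, and decode() to go in the
other direction.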
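
A sketch of that advice applied to the form from the question, assuming the field value really is UTF-8. The name raw_word and its bytes are made up for illustration; in the real script the value would come from cgi.FieldStorage():

```python
# Hypothetical raw form value: UTF-8 bytes, as the CGI layer hands them over.
raw_word = b'\xd0\xbf\xd1\x80\xd0\xb8\xd0\xb2\xd0\xb5\xd1\x82'

# Decode once, at the point where bytes enter the program.
word = raw_word.decode('utf-8')

# Inside the program, work with unicode strings only.
html = u'<input type="text" name="blabla" value="%s">' % word

# Encode once, at the point where text leaves the program.
output = html.encode('utf-8')
```

With this split, encode() is needed only where the page is written out, not around every individual string.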

coldpizza · Jan 17, 2008

Thanks, Sion, that makes sense!

Would it be correct to assume that the encoding of strings retrieved
by FieldStorage() would be the same as the encoding of the submitted
web form (in my case utf-8)?

Funny, but I have the same form implemented in PSP (Python Server
Pages), running under Apache with mod_python, and it works
transparently with no explicit charset translation required.
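
On the question of which encoding the submitted data uses: browsers normally send form fields in the encoding of the page that contains the form, unless the form's accept-charset attribute says otherwise, so for a UTF-8 page the bytes should be UTF-8. A rough sketch under that assumption, using a made-up percent-encoded value of the kind that appears in a GET query string:

```python
try:
    from urllib.parse import unquote_to_bytes  # Python 3
except ImportError:
    from urllib import unquote as unquote_to_bytes  # Python 2: returns a byte string

# Hypothetical value of the 'word' field as sent in the URL by a UTF-8 page.
encoded = '%D0%BF%D1%80%D0%B8%D0%B2%D0%B5%D1%82'

raw = unquote_to_bytes(encoded)  # raw bytes, still UTF-8 encoded
word = raw.decode('utf-8')       # unicode text, safe to mix with u'' literals
```

If the decode step raises UnicodeDecodeError with the utf-8 codec, the browser probably submitted the form in a different encoding.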
