Encoding and Norwegian (non-ASCII) characters

joakim.hove · Oct 7, 2006

Hello,

I am having great problems writing Norwegian characters æøå to a file
from a Python application. My (simplified) scenario is as follows:

1. I have a web form where the user can enter his name.

2. I use the cgi module to get the input from the user:
....
name = form["name"].value

3. The name is stored in a file

fileH = open(namefile , "a")
fileH.write("name:%s \n" % name)
fileH.close()

Now, this works very well indeed as long as the users have 'ascii' names,
however when someone enters a name with one of the Norwegian characters
æøå - it breaks at the write() statement.

UnicodeDecodeError: 'ascii' codec can't decode byte 0x8f in position
.....

Now - I understand that the ascii codec can't be used to decode the
particular characters, however my attempts at specifying an alternative
encoding have all failed.

I have tried variants along these lines:

fileH = codecs.open(namefile , "a" , "latin-1") / fileH = open(namefile , "a")
fileH.write(name) / fileH.write(name.encode("latin-1"))

It seems *whatever* I do the Python interpreter fails to see my plea
for an alternative encoding, and fails with the dreaded
UnicodeDecodeError.

Any tips on this would be *highly* appreciated.

Joakim

Peter Otten · Oct 7, 2006

The approach with codecs.open() should succeed, provided that you write only
unicode strings with characters in the range unichr(0)...unichr(255) and
normal strs in the range chr(0)...chr(127). A str containing bytes outside
that range fails:

Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "/usr/local/lib/python2.4/codecs.py", line 501, in write
    return self.writer.write(data)
  File "/usr/local/lib/python2.4/codecs.py", line 178, in write
    data, consumed = self.encode(object, self.errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0:
ordinal not in range(128)

You have to decode non-ascii strs with the appropriate encoding (that only
you know) before feeding them to write().
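
A minimal sketch of that decode-then-write step. It assumes the form value arrives as UTF-8 bytes (the 0xc3 in the traceback hints at that, but only the page's real encoding settles it). The byte literal, the temporary file and the `written` check are illustrative and not part of the original post.

```python
import codecs
import os
import tempfile

# Stand-in for form["name"].value: raw bytes, assumed here to be UTF-8.
raw_name = b"\xc3\xa6\xc3\xb8\xc3\xa5"  # the UTF-8 bytes for "æøå"

# Decode with the encoding the browser actually used (assumption: UTF-8).
name = raw_name.decode("utf-8")

# Hypothetical file location, used only for this sketch.
namefile = os.path.join(tempfile.mkdtemp(), "names.txt")

fileH = codecs.open(namefile, "a", "latin-1")
fileH.write(u"name:%s \n" % name)  # unicode in, latin-1 bytes out
fileH.close()

# Read the raw bytes back to see what ended up on disk.
with open(namefile, "rb") as check:
    written = check.read()
```

If the form data turns out to be latin-1 rather than UTF-8, only the argument to decode() changes.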

If there are unicode code points beyond unichr(255) you have to change the
encoding in codecs.open(), typically to UTF-8.

import codecs

# raises UnicodeEncodeError: U+1234 has no latin-1 representation
codecs.open("tmp.txt", "a", "latin1").write(u"\u1234")

# works: UTF-8 can encode any Unicode character
codecs.open("tmp.txt", "a", "utf8").write(u"\u1234")

Peter

Paul Boddie · Oct 8, 2006

> I am having great problems writing Norwegian characters æøå to a file
> from a Python application. My (simplified) scenario is as follows:
>
> 1. I have a web form where the user can enter his name.
>
> 2. I use the cgi module to get the input from the user:
> ....
> name = form["name"].value

The cgi module should produce plain strings, not Unicode objects, which
makes some of the later behaviour quite "interesting".

> 3. The name is stored in a file
>
> fileH = open(namefile , "a")
> fileH.write("name:%s \n" % name)
> fileH.close()
>
> Now, this works very well indeed as long as the users have 'ascii' names,
> however when someone enters a name with one of the Norwegian characters
> æøå - it breaks at the write() statement.
>
> UnicodeDecodeError: 'ascii' codec can't decode byte 0x8f in position

This is odd, since writing plain strings to files shouldn't involve any
Unicode conversions. If you received a plain string from the cgi
module, the text you write to the file should still be a plain string.
This is like obtaining a sequence of bytes and just passing them
around. Perhaps your Python configuration is different in some
non-standard way. I wouldn't want to point the finger at anything in
particular, but sys.getdefaultencoding() might suggest something.
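
Checking that setting takes only a couple of lines. This is just a diagnostic sketch with an illustrative variable name: a stock Python 2 installation reports 'ascii', while Python 3 reports 'utf-8'.

```python
import sys

# The encoding Python falls back on for implicit str/unicode conversions.
default_encoding = sys.getdefaultencoding()
print(default_encoding)
```

Anything other than 'ascii' on Python 2 would point to a customised installation.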

> Now - I understand that the ascii codec can't be used to decode the
> particular characters, however my attempts at specifying an alternative
> encoding have all failed.
>
> I have tried variants along these lines:
>
> fileH = codecs.open(namefile , "a" , "latin-1") / fileH = open(namefile , "a")
> fileH.write(name) / fileH.write(name.encode("latin-1"))
>
> It seems *whatever* I do the Python interpreter fails to see my plea
> for an alternative encoding, and fails with the dreaded
> UnicodeDecodeError.

To use a file opened through codecs.open, you really should present
Unicode objects to the write method. Otherwise, I imagine that the
method will try to convert the plain string (which the name object
supposedly is) to Unicode automatically. This conversion assumes that
the string only contains ASCII characters (as is Python's default
behaviour) and thus causes the error you are seeing. Only after getting
the text as a Unicode object will the method then try to encode the
text in the specified encoding in order to write it to the file.

In other words, you'll see this behaviour:

name (plain string) -> Unicode object -> encoded text (written to file)

Or rather, in the failure case:

name (plain string) -> error! (couldn't produce the Unicode object)

As Peter Otten suggests, you could first make the Unicode object
yourself, stating explicitly that the name object contains "latin-1"
characters. In other words:

name (plain string) -> Unicode object

Then, the write method has an easier time:

Unicode object -> encoded text (written to file)
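
The two paths can be spelled out in a short sketch. It performs by hand the ASCII decoding step that Python 2 does implicitly, and it assumes the name holds latin-1 bytes; the sample bytes and variable names are illustrative.

```python
# Assumed input: "æøå" as latin-1 bytes, standing in for the plain string.
name = b"\xe6\xf8\xe5"

# Failure case: the implicit conversion amounts to an ASCII decode.
try:
    name.decode("ascii")
    failed = False
except UnicodeDecodeError:
    failed = True

# Success case: state the encoding yourself, then encode for the file.
as_unicode = name.decode("latin-1")     # plain string -> Unicode object
encoded = as_unicode.encode("latin-1")  # Unicode object -> encoded text
```

Because the input and output encodings are the same here, the encoded result is byte-for-byte identical to the input.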

All this seems unnecessary for your application, I suppose, since you
know (or believe) that the form values only contain "latin-1"
characters. However, as is the standard advice on such matters, you may
wish to embrace Unicode more eagerly, converting plain strings to
Unicode as soon as possible and only converting them to text in various
encodings when writing them out.
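
One way to arrange that "decode early, encode late" pattern is sketched below. The helper names, both encoding constants and the sample name are assumptions made for illustration; the input encoding in particular has to match what the form really sends.

```python
import codecs
import os
import tempfile

INPUT_ENCODING = "latin-1"   # assumption: what the form sends
OUTPUT_ENCODING = "utf-8"    # assumption: what the file should hold


def read_name(raw_value):
    """Hypothetical helper: decode at the boundary where bytes come in."""
    return raw_value.decode(INPUT_ENCODING)


def store_name(path, name):
    """Hypothetical helper: encode only at the moment of writing."""
    fileH = codecs.open(path, "a", OUTPUT_ENCODING)
    try:
        fileH.write(u"name:%s \n" % name)
    finally:
        fileH.close()


name = read_name(b"J\xf8rgen")        # bytes in, Unicode from here on
greeting = u"Hei, %s" % name.upper()  # all processing happens on Unicode
path = os.path.join(tempfile.mkdtemp(), "names.txt")
store_name(path, name)

# Read the raw bytes back to see what ended up on disk.
with open(path, "rb") as check:
    stored = check.read()
```

With this split, changing the file's encoding later means touching one constant rather than every write() call.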

In some Web framework APIs, the values of form fields are immediately
available as Unicode without any (or much) additional work. WebStack
returns Unicode objects for form fields, as does the Java Servlet API,
but I'm not particularly aware of many other Python frameworks which
enforce or promote such semantics.

Paul
