UnicodeEncodeError in Windows

geoff_ness

Hello - and apologies in advance for the length of this post.

I am having a hard time understanding the errors being generated by a
program I've written. The code is intended to parse text files which
are copied and pasted from web pages from an online game. The encoding
of the pages is ISO-8859-1, but the text that gets copied contains
characters from character sets other than latin-1.
For instance, one of the lines I need to be able to read is:
196679 Daimyo 石 Druid 145 27 12/09/07 21:40:04 [ Expel ]

I start with the file 'citizen_list' and use this function to read it
and return a list of names (for instance, Daimyo 石 Druid) and ID
numbers:

import codecs

# builds the list of names from the citizens list
def getNames(f):
    """Builds a list from the town list of names

    Returns a list"""
    newlist = []
    for line in f:
        # strip the trailing "[ Expel ]", time, date and the two numeric columns
        namewords = line.rstrip('[Expel]\n\t ')\
            .rstrip(':/0123456789 ').rstrip('\t ').rstrip('0123456789 ')\
            .rstrip('\t ').rstrip('0123456789 ').rstrip('\t ').split()
        # first word is the ID number, the rest is the name
        entry = ";".join([namewords[0], " ".join(namewords[1:])])
        newlist.append(entry)
    return newlist

citizens = codecs.open('citizen_list', 'r', 'utf-8', 'strict')
listNames = getNames(citizens)
citizens.close()

I've specified 'utf-8' as the encoding as this seemed to be the best
candidate for picking up all the names in the list. I use the names in
other functions - for example:

def getdamage(warrior, rpt):
    """reads each line of war report

    returns damage and number of kills for citizen name"""
    for line in rpt:
        if (line.startswith(warrior.name) or
                line.startswith('A blue aura surrounds ' + warrior.name)) \
                and line.find('weapon') > 0:
            warrior.addDamage(int(line[line.find('caused ') + 7:
                                       line.find(' damage')]))
            if rpt.next().find('is dead') > 0:
                warrior.addKill()
        elif line.startswith(warrior.name + ' is dead'):
            warrior.dies()
            break
        elif line.startswith('Starting round'):
            warrior.addRound()

for cit in listNames:
    c = Warrior(cit.split(';')[0], cit.split(';')[1])
    totalnum += 1
    report = codecs.open('war_report', 'r', 'utf-8', 'strict')
    getdamage(c, report)
    report.close()
--[snip]--

def buildString(warrior):
    """Build a string from a warrior's stats

    Returns string for output to warStat."""
    return "!tr!!td!!id!"+str(warrior.ID)+"!/id!!/td!"+\
        "!td!"+str(warrior.damage)+"!/td!!td!"+str(warrior.kills)+\
        "!/td!!td!"+str(warrior.survived)+"!/td!!/tr!"

This code runs fine on my linux machine, but when I sent the code to a
friend with python running on windows, he got the following error:

Traceback (most recent call last):
  File "D:\Python25\Lib\SITE-P~1\PYTHON~1\pywin\framework\scriptutils.py", line 310, in RunScript
    exec codeObject in __main__.__dict__
  File "C:\Documents and Settings\Administrator\Desktop\reparser_014(2)\parser_1.0.py", line 63, in <module>
    "".join(["%s" % buildString(c) for c in citlistS[:100]])+"!/table!")
  File "C:\Documents and Settings\Administrator\Desktop\reparser_014(2)\iotp_alt2.py", line 169, in buildString
    "!/td!!td!"+str(warrior.survived)+"!/td!!/tr!"
UnicodeEncodeError: 'ascii' codec can't encode character u'\ufeff' in position 0: ordinal not in range(128)

As I understand it the error is related to the ascii codec being
unable to cope with the unicode string u'\ufeff'.
The issue I have is that this error doesn't show up for me - ascii is
the default encoding for me also. Any thoughts or assistance would be
welcomed.

Cheers

Gabriel Genellina

One of those `warrior` attributes is a Unicode object that contains
characters outside ASCII. str(x) tries to convert it to a byte string
using the default encoding, and fails. This can happen on Windows and on
Linux alike, depending on the data.
I see that you use codecs.open: you should write Unicode objects to
the file, not byte strings, and that would be fine.
Look for some recent posts about this same problem.
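
A minimal sketch of the advice above, under stated assumptions: the `Warrior` class here is a hypothetical stand-in (the thread never shows the real one), `ID` is assumed to be the attribute holding Unicode text read from the file, and the temporary output path is only for illustration. The explicit `.encode("ascii")` call reproduces what `str()` does implicitly to a Unicode object in Python 2; `build_string` instead formats with a Unicode template, so no ASCII conversion takes place, and the result is written through `codecs.open`.

```python
import codecs
import os
import tempfile


class Warrior(object):
    """Hypothetical stand-in for the thread's Warrior class."""

    def __init__(self, ID, name):
        self.ID = ID          # text read from the file (a Unicode object)
        self.name = name
        self.damage = 0       # the counters are plain ints
        self.kills = 0
        self.survived = 0


def build_string(warrior):
    """Format one row as Unicode text without calling str() on text."""
    return (u"!tr!!td!!id!%s!/id!!/td!!td!%d!/td!!td!%d!/td!!td!%d!/td!!/tr!"
            % (warrior.ID, warrior.damage, warrior.kills, warrior.survived))


# What Python 2's str() does to a Unicode object: encode it with ASCII.
try:
    u"\ufeff196679".encode("ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

row = build_string(Warrior(u"\ufeff196679", u"Daimyo \u77f3 Druid"))

# codecs.open encodes the Unicode row to UTF-8 as it is written.
path = os.path.join(tempfile.mkdtemp(), "warStat")
out = codecs.open(path, "w", "utf-8")
out.write(row + u"\n")
out.close()
```

The integer attributes go through `%d`, so only the text attribute needs care; nothing in the row is ever forced through the ASCII codec.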
 
geoff_ness

On Mon, 17 Sep 2007 07:38:16 -0300, geoff_ness wrote:
--[snip]--

Gabriel Genellina replied:
One of those `warrior` attributes is a Unicode object that contains
characters outside ASCII. str(x) tries to convert it to a byte string
using the default encoding, and fails. This can happen on Windows and on
Linux alike, depending on the data.
I see that you use codecs.open: you should write Unicode objects to
the file, not byte strings, and that would be fine.
Look for some recent posts about this same problem.

Thanks Gabriel, I hadn't thought about the str() function that way - I
had initially used it to coerce the attributes which are type int to
type str so that I could write them to the output file. I've rewritten
the buildString() function now so that the unicode objects don't get
fed to str(), and apparently windows copes ok with that. I'm still
puzzled as to why python at my end had no problem with it...
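
The thread never settles why only the Windows run failed. One plausible explanation, not confirmed by either poster, is that the character in the traceback, u'\ufeff', is the Unicode byte order mark (BOM): some Windows editors write it at the start of a file saved as UTF-8, and the 'utf-8' codec leaves it in the decoded text, attached to the first word of the first line. If the friend's copy of 'citizen_list' was saved that way, the first ID would carry the BOM and str() would fail on it. The sketch below simulates such a file (the sample line comes from the original post; the temporary path and the helper `first_word` are illustrative) and compares 'utf-8' with 'utf-8-sig', which strips a leading BOM.

```python
import codecs
import os
import tempfile

# Simulate a citizen list saved with a leading UTF-8 byte order mark.
path = os.path.join(tempfile.mkdtemp(), "citizen_list")
raw = codecs.BOM_UTF8 + u"196679 Daimyo \u77f3 Druid\n".encode("utf-8")
f = open(path, "wb")
f.write(raw)
f.close()


def first_word(encoding):
    """Return the first whitespace-separated word of the file's first line."""
    f = codecs.open(path, "r", encoding, "strict")
    try:
        return f.readline().split()[0]
    finally:
        f.close()


with_bom = first_word("utf-8")         # the BOM stays glued to the ID
without_bom = first_word("utf-8-sig")  # the BOM is removed while decoding
```

If this is indeed the cause, opening the input with 'utf-8-sig' should behave the same for files with and without a BOM, since the codec only skips the mark when it is present.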
 
