Text Encoding - Like Wrestling Oiled Pigs

apotheos

So I've got a problem.

I've got a database of information that is encoded in Windows/CP1252.
What I want to do is dump this to a UTF-8 encoded text file (an RSS
feed).

While the overall problem seems to be related to the conversion, the
only error I'm getting is a

"UnicodeDecodeError: 'ascii' codec can't decode byte 0x92 in position
163: ordinal not in range(128)"

So somewhere I'm missing an implicit conversion to ASCII which is
completely aggravating my brain.

So, what fundamental issue am I completely overlooking?

Code follows.

def GenerateNoticeRSS():
    output = codecs.open(FILEBASE + 'noticeboard.xml','w','utf-8')
    conn = psycopg.connect(DSN)
    curs = conn.cursor()

    sql_query = "select story.subject as subject, story.content as
content, story.summary as summary, story.sid as sid, posts.bid as
board, posts.date_to_publish as date from story$
    curs.execute(sql_query)
    rows = curs.fetchall()

    output.write('<?xml version="1.0" encoding="utf-8"?>\n')
    output.write('<rss version="2.0">\n')
    output.write('<channel>\n')
    output.write('<title>U of L Notice Board</title>\n')
    output.write('<link>http://www.uleth.ca/notice</link>\n')
    output.write('<description>University of Lethbridge News and Events</description>\n')

    for each in rows:
        output.write('<item>\n')
        output.write('<title>' + rssTitlePrefix(each[4]) +
                     unicode(each[0]) + '</title>\n')
        output.write('<link>http://www.uleth.ca/notice/display.html?b=' +
                     str(each[4]) + '&amp;s=' + str(each[3]) + '</link>\n')
        output.write('<guid>http://www.uleth.ca/notice/display.html?b=' +
                     str(each[4]) + '&amp;s=' + str(each[3]) + '</guid>\n')
        descript = each[2] + '<BR><BR>' + each[1]
        output.write(u'<description>' + unicode(descript) +
                     u'</description>\n') # this is the line that causes the error.
        output.write('</item>\n')

    output.write('</channel>\n')
    output.write('</rss>\n')
    output.close()

    return 0
 
John Machin

> So, what fundamental issue am I completely overlooking?

That nowhere in your *code* do you mention "I've got a database of
information that is encoded in Windows/CP1252". This is not recorded
anywhere in your database. Python is fantastic, but we don't expect a
readauthorsmind() function until Python 4000 :)
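
As a minimal illustration of that point (a sketch in Python 3 syntax, using a made-up sample byte string rather than anything from the actual database): a byte string carries no record of its own encoding, so the same byte decodes to different characters depending on which codec the caller names.

```python
# Hypothetical sample: a single byte, 0x92, the value named in the traceback.
raw = b"\x92"

# The bytes object has no attribute recording which encoding produced it;
# the caller has to supply that knowledge at decode time.
as_cp1252 = raw.decode("cp1252")   # U+2019, right single quotation mark
as_latin1 = raw.decode("latin-1")  # U+0092, a C1 control character

# Same input byte, two different characters.
print(repr(as_cp1252), repr(as_latin1))
print(as_cp1252 == as_latin1)
```

Neither result is "wrong" as far as Python can tell, which is why the encoding has to be stated in the code.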

> Code follows.

[snip]

> sql_query = "select story.subject as subject, story.content as
> content, story.summary as summary, story.sid as sid, posts.bid as
> board, posts.date_to_publish as date from story$

The above line has been mangled ... fortunately it doesn't affect the
diagnostic outcome.

[snip]

> output.write(u'<description>' + unicode(descript) +
> u'</description>\n') # this is the line that causes the error.

What is happening is that unicode(descript) has not been told what
encoding to use to decode your "Windows/CP1252" text, and it uses the
default encoding, "ascii". You need to put unicode(descript, 'cp1252').
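
A small self-contained sketch of the same fix, written in Python 3 syntax so it can be run today. The sample value of `descript` and the in-memory buffer are assumptions standing in for the real database row and output file; in the Python 2 code from the question, the decode step corresponds to `unicode(descript, 'cp1252')`.

```python
import codecs
import io

# Hypothetical stand-in for a value fetched from the database:
# CP1252 bytes containing 0x92 (a curly apostrophe).
descript = b"Today\x92s events"

# What the failing line did implicitly: decode with the ASCII default.
try:
    descript.decode("ascii")
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True

# The fix: name the source encoding explicitly when decoding.
text = descript.decode("cp1252")

# Write through a UTF-8 encoding writer, in the spirit of
# codecs.open(..., 'w', 'utf-8'); a BytesIO buffer stands in for the file.
buffer = io.BytesIO()
writer = codecs.getwriter("utf-8")(buffer)
writer.write("<description>" + text + "</description>\n")
utf8_bytes = buffer.getvalue()

print(ascii_failed)
print(utf8_bytes)
```

The pattern is decode once on the way in with the encoding the data really uses, work with Unicode text in between, and let the writer encode to UTF-8 on the way out.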

Cheers,
John
 
