'ascii' codec can't encode character u'\u2013'

thomas Armstrong

Hi

I'm using Python 2.3.4 with Feedparser 3.3 (a library for parsing XML feeds).

I'm trying to parse a UTF-8 document containing special characters such as
vowels with acute accents:
--------
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
....
-------

But I get this error message:
-------
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2013' in
position 122: ordinal not in range(128)
-------

when trying to execute a MySQL query:
----
query = "UPDATE blogs_news SET text = '" + text_extrated + "' WHERE id='" + id + "'"
cursor.execute(query)  # <--- error line
----

I tried with:
-------
text_extrated = text_extrated.encode('iso-8859-1')  # <--- error line
query = "UPDATE blogs_news SET text = '" + text_extrated + "' WHERE id='" + id + "'"
cursor.execute(query)
-------

But I get this error:
------
UnicodeEncodeError: 'latin-1' codec can't encode character u'\u2013'
in position 92: ordinal not in range(256)
-----

I also tried with:
----
text_extrated = re.sub(u'\u2013', '-', text_extrated)
query = "UPDATE blogs_news SET text = '" + text_extrated + "' WHERE id='" + id + "'"
cursor.execute(query)
-----

It works, but I don't want to substitute each special character by hand,
because there will always be ones I forget, and those can crash the program.

Any suggestion to fix it? Thank you very much.
 
deelan

thomas Armstrong wrote:
(...)
when trying to execute a MySQL query:
----
query = "UPDATE blogs_news SET text = '" + text_extrated + "' WHERE id='" + id + "'"
cursor.execute(query)  # <--- error line
----

Well, to start with, that's not the best way to do an update.
Try this instead:

query = "UPDATE blogs_news SET text = %s WHERE id=%s"
cursor.execute(query, (text_extrated, id))

That way MySQLdb takes care of quoting text_extrated automatically. This
may not be your problem, but it's considered "good style" when dealing
with databases.
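
Here is a minimal sketch of the same placeholder style that runs without a MySQL server. It is an illustration only: it assumes Python's built-in sqlite3 module as a stand-in for MySQLdb, so the placeholder is "?" rather than "%s", and the sample text is made up.

```python
import sqlite3

# In-memory SQLite stands in for the MySQL database (assumption for this
# sketch); sqlite3 uses "?" placeholders where MySQLdb uses "%s".
conn = sqlite3.connect(":memory:")
cursor = conn.cursor()
cursor.execute("CREATE TABLE blogs_news (id TEXT PRIMARY KEY, text TEXT)")
cursor.execute("INSERT INTO blogs_news (id, text) VALUES (?, ?)", ("1", ""))

# Hypothetical feed text containing an en dash and a single quote.
text_extrated = u"2004\u20132005: it's a test"
news_id = "1"

# The driver quotes both values; no manual escaping or concatenation.
cursor.execute("UPDATE blogs_news SET text = ? WHERE id = ?",
               (text_extrated, news_id))
conn.commit()

cursor.execute("SELECT text FROM blogs_news WHERE id = ?", (news_id,))
stored = cursor.fetchone()[0]
```

Note that the single quote inside the text would have broken the concatenated query, while here it is stored unchanged.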

Apart from this, IIRC feedparser returns text as unicode strings, and
you were right to try encoding those as latin-1 str objects before
passing them to MySQL. But not every character in the original UTF-8
feed can be translated to latin-1: latin-1 covers only 256 characters,
while UTF-8 can encode all of Unicode. u'\u2013' (an en dash) is one of
the characters latin-1 does not have.
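
To see which characters are the troublemakers, something like the following sketch may help. The sample string is made up; the only fact it relies on is that latin-1 maps code points 0 to 255 and nothing else.

```python
# Hypothetical sample: e-acute (U+00E9) fits in latin-1, the en dash
# (U+2013) does not.
text_extrated = u"caf\u00e9 2004\u20132005"

# Any character with a code point above 0xFF has no latin-1 byte.
not_latin1 = [c for c in text_extrated if ord(c) > 0xFF]

# A strict encode fails as soon as it meets one of those characters.
try:
    text_extrated.encode("latin-1")
    failed = False
except UnicodeEncodeError:
    failed = True
```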

You have to decide:

* switch your MySQL db to UTF-8 and encode the text as UTF-8
before insertion, or

* lose the characters that cannot be mapped into latin-1,
using:

text_extrated.encode('latin-1', 'replace')

so that unencodable characters are replaced by ?
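
The two options side by side, as a rough sketch on a made-up sample string. The error handler is passed positionally because older Python 2 releases do not accept it as a keyword argument.

```python
# Hypothetical sample text with an e-acute and an en dash.
text_extrated = u"caf\u00e9 2004\u20132005"

# Option 1: UTF-8 can represent every character, so nothing is lost.
as_utf8 = text_extrated.encode("utf-8")
round_trip = as_utf8.decode("utf-8")

# Option 2: latin-1 with the 'replace' handler; the en dash becomes "?".
as_latin1 = text_extrated.encode("latin-1", "replace")
lossy = as_latin1.decode("latin-1")
```

With UTF-8 the text round-trips unchanged; with latin-1 the accented vowel survives but the en dash is gone for good.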

Also, MySQLdb has some support for handling unicode objects directly, but
this changed somewhat over recent releases, so I can't be precise about
it.

HTH.
 
thomas Armstrong

Hi.

Thank you both for your answers.

Finally I changed my MySQL table to UTF-8 and changed the structure
of the query (with '%s').

It works. Thank you very much.
 
John J. Lee

deelan said:
query = "UPDATE blogs_news SET text = %s WHERE id=%s"
cursor.execute(query, (text_extrated, id))

so mysqldb will take care to quote text_extrated automatically. this
may not be your problem, but it's considered "good style" when dealing
with dbs.
[...]

More than just good style: it prevents SQL injection attacks that
could otherwise allow people to do bad things to your databases.
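
A small sketch of what such an attack looks like, again assuming the built-in sqlite3 module as a stand-in for MySQLdb (so "?" instead of "%s"). The table contents and the hostile id value are invented for the demonstration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cursor = conn.cursor()
cursor.execute("CREATE TABLE blogs_news (id TEXT PRIMARY KEY, text TEXT)")
cursor.executemany("INSERT INTO blogs_news (id, text) VALUES (?, ?)",
                   [("1", "first"), ("2", "second")])

# Hypothetical attacker-controlled id value.
hostile_id = "1' OR '1'='1"

# Unsafe: concatenation turns the value into part of the SQL itself,
# giving WHERE id = '1' OR '1'='1', which matches every row.
unsafe_query = ("UPDATE blogs_news SET text = 'hacked' WHERE id = '"
                + hostile_id + "'")
cursor.execute(unsafe_query)
rows_hit_unsafe = cursor.rowcount

# Safe: the placeholder treats the whole value as one literal id,
# and no row has that id.
cursor.execute("UPDATE blogs_news SET text = ? WHERE id = ?",
               ("hacked", hostile_id))
rows_hit_safe = cursor.rowcount
```

The concatenated query rewrites every row in the table; the parameterized one touches none.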


John
 
