unicode confusing

someone · May 25, 2009

Hi,

When reading the content of a web page (encoded in UTF-8) with urllib2, I
can't get the parsed data into the DB.

Exception:

  File "/usr/lib/python2.5/site-packages/pyPgSQL/PgSQL.py", line 3111, in execute
    raise OperationalError, msg
libpq.OperationalError: ERROR: invalid UTF-8 byte sequence detected near byte 0xe4

I've already checked several Python Unicode tutorials, but I have no
idea how to solve my problem.

I'd appreciate your help.
Pet

Paul Boddie · May 25, 2009

With pyPgSQL, there are a few tricks that you have to take into
account:

1. With PostgreSQL, it would appear advantageous to create databases
using the "-E unicode" option.

2. When connecting, use the client_encoding and unicode_results
arguments for the connect function call:

    connection = PgSQL.connect(client_encoding="utf-8",
                               unicode_results=1)

3. After connecting, it appears necessary to set the client encoding
explicitly:

    connection.cursor().execute("set client_encoding to unicode")

I'd appreciate any suggestions which improve on the above, but what
this should allow you to do is to present Unicode objects to the
database and to receive such objects from queries. Whether you can
relax this and pass UTF-8-encoded strings instead of Unicode objects
is not something I can guarantee, but it's usually recommended that
you manipulate Unicode objects in your program where possible, and
here you should be able to let pyPgSQL deal with the encodings
preferred by the database.
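
Steps 2 and 3 above can be folded into one small helper. This is only a sketch: the name connect_unicode and its pgsql_module parameter are made up for illustration, the connect arguments and the SET statement are taken verbatim from the list above, and cursor.close() is assumed to be available as in any DB-API driver. The driver module is passed in rather than imported so that the sketch does not depend on pyPgSQL being installed.

```python
def connect_unicode(pgsql_module, **connect_args):
    """Open a connection configured as in steps 2 and 3 of the list above.

    pgsql_module is expected to be pyPgSQL's PgSQL module (hypothetical
    parameter name); connect_args are whatever connection parameters you
    normally pass to connect().
    """
    # Step 2: request UTF-8 on the wire and Unicode objects in results.
    connection = pgsql_module.connect(client_encoding="utf-8",
                                      unicode_results=1,
                                      **connect_args)
    # Step 3: set the client encoding explicitly on the server side.
    cursor = connection.cursor()
    cursor.execute("set client_encoding to unicode")
    cursor.close()
    return connection
```

With pyPgSQL available, this would be called as connect_unicode(PgSQL, ...) with your usual connection arguments, after which Unicode objects can be passed as query parameters.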

Paul

Pet · May 26, 2009

> With pyPgSQL, there are a few tricks that you have to take into
> account:
>
> 1. With PostgreSQL, it would appear advantageous to create databases
> using the "-E unicode" option.

Hi,

The DB is in UTF8.

> 2. When connecting, use the client_encoding and unicode_results
> arguments for the connect function call:
>
>     connection = PgSQL.connect(client_encoding="utf-8",
>                                unicode_results=1)

If I set unicode_results=1, then there are exceptions in other places,
e.g. urllib.urlencode(values) can't encode the values.

> 3. After connecting, it appears necessary to set the client encoding
> explicitly:
>
>     connection.cursor().execute("set client_encoding to unicode")

I've tried this as well, but I still get exceptions.

> I'd appreciate any suggestions which improve on the above, but what
> this should allow you to do is to present Unicode objects to the
> database and to receive such objects from queries. Whether you can
> relax this and pass UTF-8-encoded strings instead of Unicode objects
> is not something I can guarantee, but it's usually recommended that
> you manipulate Unicode objects in your program where possible, and
> here you should be able to let pyPgSQL deal with the encodings
> preferred by the database.

Thanks for your suggestions! Sadly, I can't solve my problem...

Pet

Pet · May 26, 2009

After some time, I tried converting the result with unicode(result,
'ISO-8859-15'), and that fixed it.

I had assumed it was already UTF-8, because of the charset declared in
the <meta> tag of the web page I'm fetching.
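
The original error message fits this diagnosis. Byte 0xe4 is "ä" in ISO-8859-15, whereas in UTF-8 it can only begin a three-byte sequence, so a lone 0xe4 followed by ordinary letters is rejected. A minimal sketch of the difference, using a made-up sample string (raw.decode(...) is equivalent to the unicode(raw, ...) call above):

```python
# Hypothetical sample: the word "März" as ISO-8859-15 bytes.
raw = b"M\xe4rz"

# Treating the bytes as UTF-8 fails, just as PostgreSQL complained.
try:
    raw.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False

# Decoding with the encoding the page really uses gives proper text...
text = raw.decode("iso-8859-15")

# ...which can then be re-encoded as UTF-8 for a UTF-8 database.
utf8_bytes = text.encode("utf-8")

print(valid_utf8)        # False
print(repr(utf8_bytes))  # the "ä" is now the two bytes \xc3\xa4
```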
Pet

Paul Boddie · May 26, 2009

> After some time, I tried converting the result with unicode(result,
> 'ISO-8859-15'), and that fixed it.

I haven't really investigated having unicode_results set to false (or
the default) with a database containing UTF-8 (or any non-ASCII
encoded) text, since it's always desirable to manipulate Unicode
internally in one's programs: I don't want plain strings containing
various encoded sequences of bytes when I'm dealing with characters.
That said, if one were consuming XML/HTML and then putting it in raw
form into a database (including the tags), I could understand that
Unicode objects might then seem like a distraction.
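
The urlencode failure mentioned earlier in the thread is consistent with keeping Unicode internally: in Python 2, urllib.urlencode applies str() to each value, which fails for Unicode objects containing non-ASCII characters. One way around it, sketched here under the assumption that the receiving site expects UTF-8 in its query strings, is to encode explicitly at that boundary. The helper name urlencode_utf8 is made up for this example.

```python
try:
    from urllib import urlencode        # Python 2
except ImportError:
    from urllib.parse import urlencode  # Python 3

TEXT_TYPE = type(u"")  # unicode on Python 2, str on Python 3


def urlencode_utf8(values):
    """URL-encode a dict whose values may be Unicode text."""
    encoded = {}
    for key, value in values.items():
        # Encode text explicitly rather than relying on an implicit
        # ASCII conversion inside urlencode.
        if isinstance(value, TEXT_TYPE):
            value = value.encode("utf-8")
        encoded[key] = value
    return urlencode(encoded)


query = urlencode_utf8({"q": u"M\u00e4rz"})
print(query)  # q=M%C3%A4rz
```

This keeps Unicode objects everywhere inside the program and produces byte strings only at the point where a library insists on them.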

> I had assumed it was already UTF-8, because of the charset declared in
> the <meta> tag of the web page I'm fetching.

There are lots of caveats about Web page encodings - which metadata
actually indicates the encoding - but I still regard the best approach
to involve converting text to Unicode as soon as possible, then
presenting Unicode objects to the database. This way, you can separate
the decisions about which encodings the Web pages are using and which
encoding the database is using.
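
As a rough illustration of decoding early, the sketch below tries the charset a page declares and falls back to another encoding when the bytes do not fit. The function name decode_page and the choice of ISO-8859-15 as the fallback are assumptions made for this example (the fallback simply mirrors the page in this thread); it is a heuristic, not a reliable detector.

```python
def decode_page(raw, declared=None, fallback="iso-8859-15"):
    """Return (text, encoding_used) for the raw bytes of a fetched page."""
    if declared:
        try:
            return raw.decode(declared), declared
        except (UnicodeDecodeError, LookupError):
            # The declared charset does not match the bytes,
            # or Python does not know an encoding by that name.
            pass
    return raw.decode(fallback), fallback


# A page that claims UTF-8 but actually contains ISO-8859-15 bytes.
text, used = decode_page(b"M\xe4rz", declared="utf-8")
print(used)  # iso-8859-15

# A page whose declaration is correct is decoded as declared.
text_ok, used_ok = decode_page(b"M\xc3\xa4rz", declared="utf-8")
print(used_ok)  # utf-8
```

Note that a single-byte fallback such as ISO-8859-15 accepts any byte sequence, so it always produces some text even when it is the wrong guess. Once the text is a Unicode object, the database encoding becomes a separate concern handled by the driver.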

Paul
