unicode confusing


someone

Hi,

I'm reading the content of a web page (encoded in UTF-8) with urllib2,
but I can't get the parsed data into the DB.

Exception:

  File "/usr/lib/python2.5/site-packages/pyPgSQL/PgSQL.py", line 3111, in execute
    raise OperationalError, msg
libpq.OperationalError: ERROR:  invalid UTF-8 byte sequence detected near byte 0xe4

I've already checked several Python Unicode tutorials, but I have no
idea how to solve my problem.

I'd appreciate your help.
Pet
 

Paul Boddie


With pyPgSQL, there are a few tricks that you have to take into
account:

1. With PostgreSQL, it would appear advantageous to create databases
using the "-E unicode" option.

2. When connecting, use the client_encoding and unicode_results
arguments for the connect function call:

  connection = PgSQL.connect(client_encoding="utf-8",
                             unicode_results=1)

3. After connecting, it appears necessary to set the client encoding
explicitly:

  connection.cursor().execute("set client_encoding to unicode")

I'd appreciate any suggestions which improve on the above, but what
this should allow you to do is to present Unicode objects to the
database and to receive such objects from queries. Whether you can
relax this and pass UTF-8-encoded strings instead of Unicode objects
is not something I can guarantee, but it's usually recommended that
you manipulate Unicode objects in your program where possible, and
here you should be able to let pyPgSQL deal with the encodings
preferred by the database.
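
The "Unicode inside the program, encodings only at the edges" advice can be sketched as follows. This is only an illustration under stated assumptions: it is written for Python 3 (where `str` plays the role of the Python 2 Unicode object), it uses the standard-library `sqlite3` module as a stand-in for pyPgSQL, and the table and column names are made up for the example.

```python
import sqlite3

# Bytes as they might arrive from the network; UTF-8 is assumed here.
raw = "Gr\u00fc\u00dfe aus K\u00f6ln".encode("utf-8")

# Decode once, at the boundary, and work with text from here on.
text = raw.decode("utf-8")

# sqlite3 is only a stand-in for the real PostgreSQL connection.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pages (body TEXT)")  # hypothetical table

# Pass the text object as a bound parameter and let the driver
# deal with whatever encoding the database uses.
conn.execute("INSERT INTO pages (body) VALUES (?)", (text,))
stored = conn.execute("SELECT body FROM pages").fetchone()[0]
conn.close()
```

The value read back is again a text object, so no encoded byte strings leak into the rest of the program.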

Paul
 

Pet

> With pyPgSQL, there are a few tricks that you have to take into
> account:
>
> 1. With PostgreSQL, it would appear advantageous to create databases
> using the "-E unicode" option.

Hi,

the DB is in UTF8.

> 2. When connecting, use the client_encoding and unicode_results
> arguments for the connect function call:
>
>   connection = PgSQL.connect(client_encoding="utf-8",
>                              unicode_results=1)

If I set unicode_results=1, I get exceptions in other places, e.g.
urllib.urlencode(values) can't encode the values.

> 3. After connecting, it appears necessary to set the client encoding
> explicitly:
>
>   connection.cursor().execute("set client_encoding to unicode")

I've tried this as well, but I still get exceptions.

> I'd appreciate any suggestions which improve on the above, but what
> this should allow you to do is to present Unicode objects to the
> database and to receive such objects from queries. Whether you can
> relax this and pass UTF-8-encoded strings instead of Unicode objects
> is not something I can guarantee, but it's usually recommended that
> you manipulate Unicode objects in your program where possible, and
> here you should be able to let pyPgSQL deal with the encodings
> preferred by the database.

Thanks for your suggestions! Sadly, I can't solve my problem...
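
A likely reason for the urlencode failure is that Python 2's urllib.urlencode converts each value with str(), which fails for Unicode objects containing non-ASCII characters. The usual workaround is to encode the values to UTF-8 bytes just before building the query string. The sketch below shows that idea in Python 3 syntax with `urllib.parse.urlencode`; the `values` dictionary is invented for the example, and choosing UTF-8 for the query string is an assumption about what the target site expects.

```python
from urllib.parse import urlencode

# Hypothetical form values; "M\u00e4rz" contains a non-ASCII character.
values = {"q": "M\u00e4rz", "page": 2}

# Encode text values to UTF-8 bytes explicitly, leave other types alone.
encoded_values = {
    key: value.encode("utf-8") if isinstance(value, str) else value
    for key, value in values.items()
}

query = urlencode(encoded_values)
```

In Python 3 the explicit encoding step is optional, since urlencode accepts text values and takes an `encoding` argument, but spelling it out keeps the choice of encoding visible.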

Pet
 

Pet


After some time, I tried converting the result with unicode(result,
'ISO-8859-15'), and that was it :)
I had thought it was already UTF-8, because of the charset declared in
the <meta> tag of the web page I'm fetching.
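
The error message and the fix fit together: 0xe4 is "ä" in ISO-8859-15, but on its own it is not a valid UTF-8 sequence. A small sketch of this, written for Python 3 (the Python 2 spelling of the decode step is the unicode(result, 'ISO-8859-15') call above); the sample bytes are made up for illustration.

```python
# "M\u00e4rz" as ISO-8859-15 bytes; 0xe4 is the single byte for "ä".
raw = b"M\xe4rz"

# Treating these bytes as UTF-8 fails, which is what the database reported.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Decoding with the encoding the page really uses gives proper text ...
text = raw.decode("iso-8859-15")

# ... which can then be encoded as valid UTF-8 for the database.
utf8_bytes = text.encode("utf-8")
```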
Pet
 

Paul Boddie

> After some time, I've tried, to convert result with unicode(result,
> 'ISO-8859-15') and that was it :)

I haven't really investigated having unicode_results set to false (or
the default) with a database containing UTF-8 (or any non-ASCII
encoded) text, since it's always desirable to manipulate Unicode
internally in one's programs: I don't want plain strings containing
various encoded sequences of bytes when I'm dealing with characters.
That said, if one were consuming XML/HTML and then putting it in raw
form into a database (including the tags), I could understand that
Unicode objects might then seem like a distraction.

> I've thought it was already utf-8, because of charset defining in
> <meta> of webpage I'm fetching

There are lots of caveats about Web page encodings - which metadata
actually indicates the encoding - but I still regard the best approach
to involve converting text to Unicode as soon as possible, then
presenting Unicode objects to the database. This way, you can separate
the decisions about which encodings the Web pages are using and which
encoding the database is using.
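
One way to keep the two decisions separate is a small helper that turns fetched bytes into text before anything else sees them. The following is a rough Python 3 sketch, not a complete solution: the function names and the fallback list are invented for the example, it only looks at the charset in a Content-Type header value, and, as this thread shows, a declared charset can simply be wrong.

```python
from email.message import Message


def charset_from_content_type(content_type):
    """Return the charset named in a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


def to_text(raw, content_type, candidates=("utf-8", "iso-8859-15")):
    """Decode raw bytes, trying the declared charset first, then fallbacks."""
    declared = charset_from_content_type(content_type)
    to_try = ([declared] if declared else []) + list(candidates)
    for name in to_try:
        try:
            return raw.decode(name)
        except (UnicodeDecodeError, LookupError):
            # Wrong or unknown encoding: move on to the next candidate.
            continue
    raise ValueError("could not decode body with any candidate encoding")


# The page claims UTF-8 but really contains ISO-8859-15 bytes.
page_text = to_text(b"M\xe4rz", "text/html; charset=utf-8")
```

Because ISO-8859-15 can decode any byte sequence, it has to come last in the fallback list; placed earlier, it would mask every other candidate.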

Paul
 
