Unicode & mx.ODBC module

Chuck Bearden

I'm having a tough time understanding how to manage Unicode when loading
data into an MS SQL server. I'm still pretty new to Unicode, but I
think I have a grasp of the basic concepts. I'm running ActivePython
2.3.2 Build 230 on Windows XP. I have the eGenix mx.ODBC package
version 2.0.1 (thanks, Marc-Andre).

I have a script that is loading the contents of selected HTML files into
a database, along with information identifying the file. Here is a
sample script:

-------------------------begin snippet-------------------------
import sys
import mx.ODBC.Windows

#-- initialize the db connection
dbname = 'theDb'
uname = 'theUser'
password = 'thePassword'
dsn = "DSN=%s;UID=%s;PWD=%s" % (dbname, uname, password)
con = mx.ODBC.Windows.DriverConnect(dsn)

#-- handle UTF-8 encoded Unicode; this worked when loading XML files
con.encoding = 'utf-8'
con.stringformat = mx.ODBC.Windows.UNICODE_STRINGFORMAT

cur = con.cursor()

#-- get the contents of our file (crudely: filename is 2nd arg)
html_f = open(sys.argv[1], 'r')
htmldata = html_f.read()
html_f.close()

#-- make statement string and insert values tuple, and execute
stmnt = """
INSERT INTO pmLinkHTML
(PMID, Ord, HTML, HTMLlen)
VALUES
(?, ?, ?, ?)
"""
val_t = (549, 0, htmldata, len(htmldata))
cur.execute(stmnt, val_t)

cur.close()
con.close()
--------------------------end snippet--------------------------

For my pains I am rewarded with:

Traceback (most recent call last):
  File "./unitest.py", line 27, in ?
    cur.execute(stmnt, val_t)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xbe in position 45662: unexpected code byte

Byte 45662 of the HTML file is indeed "\xBE". I don't think that should
be a problem.

What am I doing wrong? I have spent a fair bit of time googling the
ng in various ways, and consulting Python in a Nutshell and the online
standard library docs at python.org. It may be something quite
obvious to a better-informed coder, but I am prepared to learn.

Many thanks in advance.
Chuck Bearden
 
vincent wehren

| I'm having a tough time understanding how to manage Unicode when loading
| data into an MS SQL server. <snipped for brevity>

....

| html_f = open(sys.argv[1], 'r')
| htmldata = html_f.read()
| html_f.close()
|
| #-- make statement string and insert values tuple, and execute
| stmnt = """
| INSERT INTO pmLinkHTML
| (PMID, Ord, HTML, HTMLlen)
| VALUES
| (?, ?, ?, ?)
| """
| val_t = (549, 0, htmldata, len(htmldata))
| cur.execute(stmnt, val_t)
|
| cur.close()
| con.close()
| --------------------------end snippet--------------------------
|
| For my pains I am rewarded with:
|
| Traceback (most recent call last):
| File "./unitest.py", line 27, in ?
| cur.execute(stmnt, val_t)
| UnicodeDecodeError: 'utf8' codec can't decode byte 0xbe in position
| 45662: unexpected code byte
|
| Byte 45662 of the HTML file is indeed "\xBE". I don't think that should
| be a problem.
|
| What am I doing wrong?

What happens if you decode htmldata first by using

enc = "iso-8859-1" #change to whatever the input file's encoding is
htmldata = unicode(htmldata, enc)

?

Vincent Wehren
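
A minimal sketch of why the explicit decode step helps, assuming the file really is ISO-8859-1 (the thread never states the actual encoding, so that is a guess). The byte 0xBE is a UTF-8 continuation byte, so it cannot be decoded as UTF-8 when it appears without a valid lead byte, whereas ISO-8859-1 maps every byte to a character. The sample bytes are made up for illustration, and `bytes.decode` is used instead of `unicode(data, enc)` because the latter exists only in Python 2.

```python
# Hypothetical sample: one ISO-8859-1 byte (0xBE) embedded in ASCII HTML.
raw = b"<p>3\xbe cups</p>"

# Decoding as UTF-8 fails: 0xBE is a continuation byte with no lead byte.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Decoding with the file's real encoding (assumed ISO-8859-1 here) succeeds
# and yields a character string that can then be re-encoded as needed.
text = raw.decode("iso-8859-1")

print(utf8_ok)                           # False
print(text == u"<p>3\u00be cups</p>")    # True
```

The decoded `text` holds characters rather than bytes, so a later conversion to UTF-8 has well-defined input.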


| I have spent a fair bit of time googling the
| ng in various ways, and consulting Python in a Nutshell and the online
| standard library docs at python.org. It may be something quite
| obvious to a better-informed coder, but I am prepared to learn.
|
| Many thanks in advance.
| Chuck Bearden
|
|
 
Chuck Bearden

| I'm having a tough time understanding how to manage Unicode when loading
| data into an MS SQL server. <snipped for brevity>

...

| html_f = open(sys.argv[1], 'r')
| htmldata = html_f.read()
| html_f.close()
|
| #-- make statement string and insert values tuple, and execute
| stmnt = """
| INSERT INTO pmLinkHTML
| (PMID, Ord, HTML, HTMLlen)
| VALUES
| (?, ?, ?, ?)
| """
| val_t = (549, 0, htmldata, len(htmldata))
| cur.execute(stmnt, val_t)
|
| cur.close()
| con.close()
| --------------------------end snippet--------------------------
|
| For my pains I am rewarded with:
|
| Traceback (most recent call last):
| File "./unitest.py", line 27, in ?
| cur.execute(stmnt, val_t)
| UnicodeDecodeError: 'utf8' codec can't decode byte 0xbe in position
| 45662: unexpected code byte
|
| Byte 45662 of the HTML file is indeed "\xBE". I don't think that should
| be a problem.
|
| What am I doing wrong?

| What happens if you decode htmldata first by using
|
| enc = "iso-8859-1" #change to whatever the input file's encoding is
| htmldata = unicode(htmldata, enc)
|
| ?

Thanks. That was simple. It feels so good when you stop beating your
head against a brick wall. After using your tip to make my
simplified code above work, I was able to figure out how to apply it
to my more complex real project.

I think I'm still not entirely clear on when Unicode encoding &
decoding happen in Python and for what reasons. In my searching on this
problem I kept my eye open for a nice, systematic treatment of Unicode
in Python, but I haven't found anything yet.

Again, many thanks for your response.
Best wishes,
Chuck
 
Scott David Daniels

Chuck said:
| I think I'm still not entirely clear on when Unicode encoding &
| decoding happen in Python and for what reasons. In my searching on this
| problem I kept my eye open for a nice, systematic treatment of Unicode
| in Python, but I haven't found anything yet.

It is not really tough, but you need to understand some facts that you
won't want to believe.

1) You are normally (when using str's) dealing with _bytes_, not
_characters_ in strings. Just because your system can print them
doesn't mean someone else's system will print the same thing.

2) Unicode is a coding system for _characters_ and not binary values.
Especially if you wander into the stranger sections of unicode, a
single character may take several positions in a unicode string.

3) Deciding if two unicode strings are _the_same_ is a question of
philosophy, and not just programming.
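
Points 2 and 3 can be illustrated with a small sketch. The choice of "e with acute accent" and of NFC as the normalization form are assumptions made for the example; which form counts as "the same" depends on the application.

```python
import unicodedata

# The same visible character, e with an acute accent, spelled two ways.
composed = u"\u00e9"      # one code point: LATIN SMALL LETTER E WITH ACUTE
decomposed = u"e\u0301"   # 'e' followed by COMBINING ACUTE ACCENT

# One "character" on screen, but a different number of string positions.
print(len(composed), len(decomposed))   # 1 2

# A plain comparison sees two different strings.
print(composed == decomposed)           # False

# Normalizing to a common form (NFC here) makes them compare equal.
same = unicodedata.normalize("NFC", decomposed) == composed
print(same)                             # True
```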

OK, with those caveats, you can pretend --
unicode(some_byte_string, encoding) produces a unicode string.
The byte string has no coding -- it is a sequence of bytes. The
coding is how you interpret those bytes to determine the characters
that the bytes mean.

Unicode, on the other hand, is a _character_ encoding. In some
sense, you should expect the unicode expression "unicode(s, enc)"
to "mean" the same thing on all different computers that implement
python.

It really shouldn't matter what the bytes are in a unicode string,
just like it shouldn't matter what the characters are in a byte
string.
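
A short sketch of the bytes-versus-characters distinction, using the 0xBE byte from the traceback earlier in the thread. The two encodings compared are picked only for illustration, and `bytes.decode` stands in for the Python 2 call `unicode(s, enc)`.

```python
raw = b"\xbe"   # a byte string: just a number, no meaning attached yet

# The same byte is a different character under different encodings.
as_latin1 = raw.decode("iso-8859-1")    # U+00BE, vulgar fraction three quarters
as_latin9 = raw.decode("iso-8859-15")   # U+0178, capital Y with diaeresis
print(as_latin1 == as_latin9)           # False

# The same character becomes different bytes under different encodings.
ch = u"\u00be"
print(ch.encode("iso-8859-1") == b"\xbe")    # True
print(ch.encode("utf-8") == b"\xc2\xbe")     # True
```

Decoding picks the interpretation of the bytes; encoding picks the byte representation of the characters. Neither changes what the character string itself "means".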

Please let me know whether this is:
A) obvious,
B) clear,
C) comprehensible with effort, or
D) gibberish
 
