Unicode / cx_Oracle problem

  • Thread starter Richard Schulman

Richard Schulman

Sorry to be back at the goodly well so soon, but...

...when I execute the following -- variable mean_eng_txt being
utf-16LE and its datatype nvarchar2(79) in Oracle:

cursor.execute("""INSERT INTO mean (mean_id, mean_eng_txt)
               VALUES (:id, :mean)""", id=id, mean=mean)

I not surprisingly get this error message:

"cx_Oracle.NotSupportedError: Variable_TypeByValue(): unhandled data
type unicode"

But when I try putting a codecs.BOM_UTF16_LE in various plausible
places, I just end up generating different errors.

Recommendations, please?

TIA,
Richard Schulman
(Remove xx for email reply)
 

Diez B. Roggisch

Richard said:
Sorry to be back at the goodly well so soon, but...

...when I execute the following -- variable mean_eng_txt being
utf-16LE and its datatype nvarchar2(79) in Oracle:

cursor.execute("""INSERT INTO mean (mean_id, mean_eng_txt)
               VALUES (:id, :mean)""", id=id, mean=mean)

I not surprisingly get this error message:

"cx_Oracle.NotSupportedError: Variable_TypeByValue(): unhandled data
type unicode"

But when I try putting a codecs.BOM_UTF16_LE in various plausible
places, I just end up generating different errors.

Show us the alleged plausible places, and the different errors.
Otherwise it's crystal ball time again.

Diez
 

Richard Schulman

Diez:
Show us the alleged plausible places, and the different errors.
Otherwise it's crystal ball time again.

More usefully, let's just try to fix the code above. Here's the error
message I get:

NotSupportedError: Variable_TypeByValue(): unhandled data type unicode

Traceback (innermost last):
  File "c:\pythonapps\LoadMeanToOra.py", line 1, in ?
    # LoadMeanToOra reads a UTF-16LE input file one record at a time
  File "c:\pythonapps\LoadMeanToOra.py", line 23, in ?
    cursor.execute("""INSERT INTO mean (mean_id,mean_eng_txt)

What I can't figure out is whether cx_Oracle is saying it can't handle
Unicode for an Oracle nvarchar2 data type or whether it can handle the
input but that it needs to be in a specific format that I'm not
supplying.

- Richard Schulman
 

Diez B. Roggisch

Richard said:
More usefully, let's just try to fix the code above. Here's the error
message I get:

NotSupportedError: Variable_TypeByValue(): unhandled data type unicode

Traceback (innermost last):
  File "c:\pythonapps\LoadMeanToOra.py", line 1, in ?
    # LoadMeanToOra reads a UTF-16LE input file one record at a time
  File "c:\pythonapps\LoadMeanToOra.py", line 23, in ?
    cursor.execute("""INSERT INTO mean (mean_id,mean_eng_txt)

What I can't figure out is whether cx_Oracle is saying it can't handle
Unicode for an Oracle nvarchar2 data type or whether it can handle the
input but that it needs to be in a specific format that I'm not
supplying.

What does

print repr(mean)

give you?

It _looks_ to me (I don't have an Oracle available right now) as if it
is a unicode object, which you should consider an abstract string
representation. That means it has to be encoded in some way before
being sent over the wire. There might exist DB-API bindings that can
deal with unicode objects, by applying a default encoding or somehow
figuring out what encoding the DB expects. But I don't see any
references to unicode in PEP 249, so I presume you can't rely on that
-- which seems to be the case here.

The Oracle NLS is a sometimes tricky beast: it tries to be clever and
assigns an existing connection some encoding based on the user's or
machine's locale. That can yield unexpected results, such as
"Dusseldorf" instead of "Düsseldorf" when querying a German city list
with an English locale.

So you have to figure out what encoding your db-connection expects.
You can do so by issuing some queries against the session tables, I
believe -- I don't have my Oracle resources at home, but googling will
get you there; the important Oracle term is NLS.

Then you need to encode the unicode string before passing it - something
like this:

mean = mean.encode("latin1")

That should help.
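In sketch form (modern Python 3 syntax, whereas the thread's code is Python 2; the "latin1" choice is purely illustrative, since the real target encoding depends on the connection's NLS settings):

```python
# Sketch: encode the unicode value to bytes before binding it.
# "latin1" is only an example; the encoding the connection actually
# expects depends on the Oracle NLS configuration.
mean = u"sadness, lament; sympathize with, pity"
encoded = mean.encode("latin1")
print(type(encoded).__name__, encoded)
```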

Diez
 

Richard Schulman

What does print repr(mean) give you?

That is a useful suggestion.

For context, I reproduce the source code:

in_file = codecs.open("c:\\pythonapps\\mean.my", encoding="utf_16_LE")
connection = cx_Oracle.connect("username", "password")
cursor = connection.cursor()
for row in in_file:
    id = row[0]
    mean = row[1]
    print "Value of row is ", repr(row)                   #debug line
    print "Value of the variable 'id' is ", repr(id)      #debug line
    print "Value of the variable 'mean' is ", repr(mean)  #debug line
    cursor.execute("""INSERT INTO mean (mean_id, mean_eng_txt)
                   VALUES (:id, :mean)""", id=id, mean=mean)

Here is the result from the print repr() statements:

Value of row is u"\ufeff(3,'sadness, lament; sympathize with,
pity')\r\n"
Value of the variable 'id' is u'\ufeff'
Value of the variable 'mean' is u'('

Clearly, the values loaded into the 'id' and 'mean' variables are not
satisfactory but are picking up the BOM.
...
The Oracle NLS is a sometimes tricky beast: it tries to be clever and
assigns an existing connection some encoding based on the user's or
machine's locale. That can yield unexpected results, such as
"Dusseldorf" instead of "Düsseldorf" when querying a German city list
with an English locale.
Agreed.

So you have to figure out what encoding your db-connection expects.
You can do so by issuing some queries against the session tables, I
believe -- I don't have my Oracle resources at home, but googling will
get you there; the important Oracle term is NLS.

It's very hard to figure out what to do on the basis of complexities
on the order of

http://download-east.oracle.com/docs/cd/B25329_01/doc/appdev.102/b25108/xedev_global.htm#sthref1042

(tiny equivalent: http://tinyurl.com/fnc54)

But I'm not even sure I got that far. My problems so far seem to come
earlier, in Python or in Python's cx_Oracle driver. To be candid, I'm
very tempted
at this point to abandon the Python effort and revert to an all-ucs2
environment, much as I dislike Java and C#'s complexities and the poor
support available for all-Java databases.
Then you need to encode the unicode string before passing it - something
like this:

mean = mean.encode("latin1")

I don't see how the Chinese characters embedded in the English text
will carry over if I do that.

In any case, thanks for your patient and generous help.

Richard Schulman
Delete the antispamming 'xx' characters for email reply
 

Diez B. Roggisch

Value of the variable 'id' is u'\ufeff'
Value of the variable 'mean' is u'('

So they both are unicode objects - as I presumed.

It's very hard to figure out what to do on the basis of complexities
on the order of

http://download-east.oracle.com/docs/cd/B25329_01/doc/appdev.102/b25108/xedev_global.htm#sthref1042

(tiny equivalent: http://tinyurl.com/fnc54)

Yes, that is somewhat intimidating.
But I'm not even sure I got that far. My problems so far seem to come
earlier, in Python or in Python's cx_Oracle driver. To be candid, I'm
very tempted
at this point to abandon the Python effort and revert to an all-ucs2
environment, much as I dislike Java and C#'s complexities and the poor
support available for all-Java databases.

That actually doesn't help you much, I guess: JDBC will convert Java's
unicode strings to byte strings behind the curtains, but you will
still lose encoding information nonetheless -- especially if the DB
itself isn't running an encoding that can represent all possible
unicode characters.
I don't see how the Chinese characters embedded in the English text
will carry over if I do that.

Me neither, but how could I have foreseen that? So use something else
instead - utf-8 for example, or whatever the oracle connection will grok.
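The point can be made concrete with a quick sketch (Python 3 syntax; the sample text is invented): latin1 simply cannot represent a Chinese character, while utf-8 round-trips it losslessly.

```python
text = u"sympathize with, pity \u54c0"  # English plus a Chinese character

# latin1 has no code point for the Chinese character at all...
try:
    text.encode("latin1")
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False

# ...while utf-8 encodes it and decodes it back unchanged.
round_trip = text.encode("utf-8").decode("utf-8")
print(latin1_ok, round_trip == text)
```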

I think you should read up on what unicode and encodings are, how they
work in Python, and unfortunately how they work in Oracle. Because
even if you use Java, not understanding how these things fit together
will bite you at some point.

Diez
 

John Machin

Richard said:
What does print repr(mean) give you?

That is a useful suggestion.

For context, I reproduce the source code:

in_file = codecs.open("c:\\pythonapps\\mean.my", encoding="utf_16_LE")
connection = cx_Oracle.connect("username", "password")
cursor = connection.cursor()
for row in in_file:
    id = row[0]
    mean = row[1]
    print "Value of row is ", repr(row)                   #debug line
    print "Value of the variable 'id' is ", repr(id)      #debug line
    print "Value of the variable 'mean' is ", repr(mean)  #debug line
    cursor.execute("""INSERT INTO mean (mean_id, mean_eng_txt)
                   VALUES (:id, :mean)""", id=id, mean=mean)

Here is the result from the print repr() statements:

Value of row is u"\ufeff(3,'sadness, lament; sympathize with,
pity')\r\n"
Value of the variable 'id' is u'\ufeff'
Value of the variable 'mean' is u'('

Clearly, the values loaded into the 'id' and 'mean' variables are not
satisfactory but are picking up the BOM.

Well of course they're "unsatisfactory" and this is absolutely nothing
to do with Oracle and cx_Oracle.

row is a string of characters. row[0] is the BOM. Read my lips (from a
previous thread):

"""
Use utf_16 -- it will strip off the BOM for you.
"""
and again:
"""
| >>> codecs.open('guff.utf16le', 'r', encoding='utf_16').read()
| u'abc\n\rdef\n\rghi' ######### Look, Mom, no BOM!
"""
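The difference between the two codecs can be shown with a small round trip (a sketch in Python 3 syntax; the temp-file path is arbitrary):

```python
import codecs
import os
import tempfile

# Write a little-endian UTF-16 file with a BOM, as many Windows tools do.
path = os.path.join(tempfile.gettempdir(), "guff.utf16le")
with open(path, "wb") as f:
    f.write(codecs.BOM_UTF16_LE + u"abc".encode("utf_16_le"))

# Reading with the endian-specific codec keeps the BOM as U+FEFF...
with codecs.open(path, "r", encoding="utf_16_le") as f:
    kept = f.read()

# ...while the generic utf_16 codec consumes the BOM for you.
with codecs.open(path, "r", encoding="utf_16") as f:
    stripped = f.read()

print(repr(kept))      # BOM still present as the first character
print(repr(stripped))  # Look, Mom, no BOM!
os.remove(path)
```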

row[1] is the first ***character*** of what looks suspiciously like the
Python representation of a tuple:

"""(3,'sadness, lament; sympathize with, pity')"""

Who wrote that like that??? If it is at all under your control, do it
like this:
Encode each Unicode text field in UTF-8. Write the file as a CSV file
using Python's csv module. Read the CSV file using the same module.
Decode the text fields from UTF-8.

You need to parse the incoming line into column values (the csv module
does this for you) and then convert each column value from
string/Unicode to a Python type that is compatible with the Oracle type
for that column.
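John's suggested round trip can be sketched like this (modern Python 3 syntax, where the csv module handles Unicode natively; under the thread's Python 2 you would encode each text field to UTF-8 before writing and decode after reading, as described above; the sample row is invented):

```python
import csv
import io

# Invented sample row: a numeric id plus meaning text with a CJK character.
rows = [(3, u"sadness, lament; sympathize with, pity \u54c0")]

# Write the rows as CSV; commas inside the text field get quoted for us.
buf = io.StringIO()
writer = csv.writer(buf)
for mean_id, mean in rows:
    writer.writerow([mean_id, mean])

# Read them back, converting each column to the type Oracle expects:
# an int for mean_id, a text value for mean_eng_txt.
buf.seek(0)
parsed = []
for record in csv.reader(buf):
    mean_id = int(record[0])   # numeric column: convert before binding
    mean = record[1]           # text column arrives intact
    parsed.append((mean_id, mean))

print(parsed)
```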

My guess (not having used cx_Oracle) is that the error is happening
because the column "id" has a numeric type and you are trying to jam a
Unicode string into it. IOW, nothing to do with the "mean" column
(yet!).

BTW, I've managed to decode that "eng" means English not engineering
and "mean" means meaning i.e. not average and not stingy. Holy
obfuscation, Batman!

HTH,
John
 

Richard Schulman

...
Encode each Unicode text field in UTF-8. Write the file as a CSV file
using Python's csv module. Read the CSV file using the same module.
Decode the text fields from UTF-8.

You need to parse the incoming line into column values (the csv module
does this for you) and then convert each column value from
string/Unicode to a Python type that is compatible with the Oracle type
for that column.
...

John, how am I to reconcile your suggestions above with my
ActivePython 2.4 documentation, which states:

<<12.20 csv -- CSV File Reading and Writing
<<New in version 2.3.
...
<<Note: This version of the csv module doesn't support Unicode input.
Also, there are currently some issues regarding ASCII NUL characters.
Accordingly, all input should generally be printable ASCII to be safe.
These restrictions will be removed in the future.>>

Regards,
Richard Schulman
 

John Machin

Richard said:
John, how am I to reconcile your suggestions above with my
ActivePython 2.4 documentation, which states:

<<12.20 csv -- CSV File Reading and Writing
<<New in version 2.3.
...
<<Note: This version of the csv module doesn't support Unicode input.
Also, there are currently some issues regarding ASCII NUL characters.
Accordingly, all input should generally be printable ASCII to be safe.
These restrictions will be removed in the future.>>

1. For "Unicode" read "UTF-16".

2. Unless you have \u0000 in your Unicode data, encoding it into UTF-8
won't cause any ASCII NUL bytes to appear. Ensuring that you don't have
NULs in your data is a good idea in general.

3. There are also evidently some issues regarding ASCII LF characters
embedded in fields (like when Excel users do Alt-Enter (Windows
version) to put a hard line break in their headings); see
http://docs.python.org/dev/whatsnew/modules.html of which the following
is an extract:
"""
The CSV parser is now stricter about multi-line quoted fields.
Previously, if a line ended within a quoted field without a terminating
newline character, a newline would be inserted into the returned field.
This behavior caused problems when reading files that contained
carriage return characters within fields, so the code was changed to
return the field without inserting newlines. As a consequence, if
newlines embedded within fields are important, the input should be
split into lines in a manner that preserves the newline characters.
"""

4. Provided your fields don't contain any of CR, LF, ctrl-Z (maybe),
and NUL, you should be OK. I can't understand the sentence
"Accordingly, all input should generally be printable ASCII to be
safe." -- especially the "accordingly". If it was running amok with
8-bit characters with ord(c) >= 128, there would have been loud shrieks
from several corners of the globe.

5. However, to be safe, you could go the next step and convert the
UTF-8 to base64 -- see
http://docs.python.org/dev/lib/module-base64.html -- BUT once you've
done that your encoded data doesn't even have commas and quotes in it,
so you can avoid the maybe unsafe csv module and just write your data
as ",".join(base64_encoded_fields).
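A sketch of that base64 variant (Python 3 syntax; the sample fields are invented):

```python
import base64

# Each field is UTF-8 encoded, then base64 encoded; the result is plain
# ASCII with no commas or quotes in it, so a bare join/split is safe.
fields = [u"3", u"sadness, lament; sympathize with, pity \u54c0"]
encoded = [base64.b64encode(f.encode("utf-8")).decode("ascii")
           for f in fields]
line = ",".join(encoded)

# Decoding reverses the two steps.
decoded = [base64.b64decode(p).decode("utf-8") for p in line.split(",")]
print(decoded)
```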

HTH,
John



 
