Conversion of Unicode to ASCII NIGHTMARE


ChaosKCW

Hi

I am reading from an Oracle database using cx_Oracle. I am writing to a
SQLite database using apsw.

The Oracle database is returning UTF-8 characters for European item
names, i.e. special characters from an ASCII perspective.

I get the following error:
SQLiteCur.execute(sql, row)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xdc in position 12: ordinal not in range(128)

I have googled for several days now and still can't get it to encode to
ASCII.

I encode the SQL as follows:

sql = "insert into %s values %s" % (SQLiteTable, paramstr)
sql.encode('ascii', 'ignore')

I then code each of the row values returned from Oracle like this:

row = map(encodestr, row)
SQLiteCur.execute(sql, row)

where encodestr is as follows:

import types
import unicodedata

def encodestr(item):
    if isinstance(item, types.StringTypes):
        return unicodedata.normalize('NFKD',
            unicode(item, 'utf-8', 'ignore')).encode('ASCII', 'ignore')
    else:
        return item

I have tried a thousand similar functions and permutations of the above
from various Google searches, but I still get the above exception on the
line:

SQLiteCur.execute(sql, row)

and the exception is related to the data in one field.

In the end I resorted to using Oracle's CONVERT function in the SQL
statement, but I would like to understand why this is happening and why
it's so hard to convert the string in Python. I have read many
complaints about this from other people, some of whom have written
custom stripping routines. I haven't tried a custom routine yet, because
I think it should be possible in Python.

Thanks,
 

Paul Boddie

ChaosKCW said:
Hi

I am reading from an Oracle database using cx_Oracle. I am writing to a
SQLite database using apsw.

The Oracle database is returning UTF-8 characters for European item
names, i.e. special characters from an ASCII perspective.

And does cx_Oracle return those as Unicode objects or as plain strings
containing UTF-8 byte sequences? It's very important to distinguish
between these two cases, and I don't have any experience with cx_Oracle
to be able to give advice here.
I get the following error:

It looks like you may have Unicode objects that you're presenting to
sqlite. In any case, with earlier versions of pysqlite that I've used,
you need to connect with a special unicode_results parameter, although
later versions should work with Unicode objects without special
configuration. See here for a thread (in which I seem to have
participated, coincidentally):

http://mail.python.org/pipermail/python-list/2002-June/107526.html
I have googled for several days now and still can't get it to encode to
ASCII.

This is a tough thing to find out - whilst previous searches did
uncover some discussions about it, I just tried and failed to find the
enlightening documents - and I certainly didn't see many references to
it on the official pysqlite site.

Paul
 

Robert Kern

ChaosKCW said:
Hi

I am reading from an Oracle database using cx_Oracle. I am writing to a
SQLite database using apsw.

The Oracle database is returning UTF-8 characters for European item
names, i.e. special characters from an ASCII perspective.

I'm not sure that you are using those terms correctly. From your description
below, it seems that your data is being returned from the Oracle database as
unicode strings rather than regular strings containing UTF-8 encoded data. These
European characters are not "special characters from an ASCII perspective;" they
simply aren't characters in the ASCII character set at all.
I get the following error:


I have googled for several days now and still can't get it to encode to
ASCII.

Don't. You can't. Those characters don't exist in the ASCII character set.
SQLite 3.0 deals with UTF-8 encoded SQL statements, though.

http://www.sqlite.org/version3.html
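For instance, in Python 2, a character like Ü encodes to UTF-8 without
complaint but has no ASCII equivalent:

u'\xdc'.encode('utf-8')   # '\xc3\x9c' - two bytes, no error
u'\xdc'.encode('ascii')   # raises UnicodeEncodeError: ordinal not in range(128)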
I encode the SQL as follows:

sql = "insert into %s values %s" % (SQLiteTable, paramstr)
sql.encode('ascii', 'ignore')

The .encode() method returns a new value; it does not change an object in place.

sql = sql.encode('utf-8')
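Strings are immutable, so the result has to be rebound (Python 2):

s = u'\xdc'
s.encode('utf-8')        # returns '\xc3\x9c', but the result is discarded
s = s.encode('utf-8')    # rebinding the name captures the new value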

--
Robert Kern
(e-mail address removed)

"I have come to believe that the whole world is an enigma, a harmless enigma
that is made terrible by our own mad attempt to interpret it as though it had
an underlying truth."
-- Umberto Eco
 

Diez B. Roggisch

Don't. You can't. Those characters don't exist in the ASCII character set.
SQLite 3.0 deals with UTF-8 encoded SQL statements, though.

That is not entirely correct - one can, if losing information is OK. The OP's
code normalizes the decoded text to NFKD form; an umlaut like ä is transformed
into a two-character sequence basically saying "a with two dots on top". With
'ignore' specified as a parameter to the encoder, this should result in
the letter a.
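
A minimal sketch of that lossy round trip (Python 2, matching the rest of
the thread):

import unicodedata

s = u'M\xfcller'                         # u'Müller'
nfkd = unicodedata.normalize('NFKD', s)  # u'Mu\u0308ller': 'u' plus combining diaeresis
print nfkd.encode('ascii', 'ignore')     # prints 'Muller' - the combining mark is dropped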


Regards,

Diez
 

Paul Boddie

Oh, and it occurs to me, as I seem to have mentioned a document about
PgSQL rather than pysqlite (although they both have the same principal
developer), that you might need to investigate the client_encoding
parameter when setting up your connection. The following message gives
some information (but not much):

http://groups.google.com/group/comp.lang.python/msg/f27fa9866c9b7b5f

Sadly, I can't find the information about getting result values as
Unicode objects, but I believe it involves some kind of SQL comment
that you send to the database system which actually tells pysqlite to
change its behaviour.

Paul
 

Roger Binns

Paul Boddie said:
It looks like you may have Unicode objects that you're presenting to
sqlite. In any case, with earlier versions of pysqlite that I've used,
you need to connect with a special unicode_results parameter,

He is using apsw. apsw correctly handles Unicode. In fact it won't
accept a str with bytes >127, as they would be in an unknown encoding and
SQLite only uses Unicode internally. It does have a blob type
using buffer for situations where binary data needs to be stored.
pysqlite's mishandling of Unicode is one of the things that drove
me to write apsw in the first place.
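
A short sketch of that blob path (assuming apsw's usual Connection/cursor
API; buffer() is the Python 2 built-in that marks a value as binary):

import apsw

con = apsw.Connection("test.db")
cur = con.cursor()
cur.execute("create table if not exists bin (data blob)")
# Wrapping the bytes in buffer() stores them as a blob, so no text
# encoding question ever arises.
cur.execute("insert into bin values (?)", (buffer('\x00\xdc\xff'),))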

Roger
 

ChaosKCW

Hi

Thanks for all the posts. I am still digesting it all but here are my
initial comments.
Don't. You can't. Those characters don't exist in the ASCII character set.
SQLite 3.0 deals with UTF-8 encoded SQL statements, though.
http://www.sqlite.org/version3.html

As mentioned by the next poster, there is: it's supposed to be encoded
with the 'ignore' option. Thus you lose data, but that's just dandy with
me. As for SQLite supporting Unicode, it probably does, but something
on the Python side (probably in apsw) converts it to ASCII at some
point before it's handed to SQLite.
The .encode() method returns a new value; it does not change an object inplace.
sql = sql.encode('utf-8')

Ah yes, big mistake on my part :-/
He is using apsw. apsw correctly handles Unicode. In fact it won't
accept a str with bytes >127, as they would be in an unknown encoding and
SQLite only uses Unicode internally. It does have a blob type
using buffer for situations where binary data needs to be stored.
pysqlite's mishandling of Unicode is one of the things that drove
me to write apsw in the first place.

OK, if SQLite uses Unicode internally, why do you need to ignore
everything greater than 127? The ASCII table (the 256-character one) fits
into Unicode just fine, as far as I recall. Or did I miss the boat here?

Thanks,
 

John Machin

"Thus you lose data, but thats just dandy with
me.": Please reconsider this attitude, before you perpetrate a nonsense
or even a disaster.

Wrt your last para:
1. Roger didn't say "ignore" -- he said "won't accept" (a major
difference).
2. The ASCII code comprises 128 characters, *NOT* 256.
3. What Roger means is: given a Python 8-bit string and no other
information, you don't have a clue what the encoding is. Most codes of
interest these days have the ASCII code (or a mild perversion thereof)
in the first 128 positions, but it's anyones guess what the writer of
the string had in mind with the next 128.
 

Robert Kern

Roger said:
He is using apsw. apsw correctly handles Unicode. In fact it won't
accept a str with bytes >127, as they would be in an unknown encoding and
SQLite only uses Unicode internally. It does have a blob type
using buffer for situations where binary data needs to be stored.
pysqlite's mishandling of Unicode is one of the things that drove
me to write apsw in the first place.

Ah, I misread the OP's traceback.

Okay, the OP is getting regular strings, which are probably encoded in
ISO-8859-1 if I had to guess, from the Oracle DB. He is trying to pass them in
to SQLiteCur.execute() which tries to make a unicode string from the input:

In [1]: unicode('\xdc')
---------------------------------------------------------------------------
exceptions.UnicodeDecodeError        Traceback (most recent call last)

/Users/kern/<ipython console>

UnicodeDecodeError: 'ascii' codec can't decode byte 0xdc in position 0: ordinal not in range(128)

*Now*, my advice to the OP is to figure out the encoding of the strings that are
being returned from Oracle. As I said, ISO-8859-1 is probably a good guess.
Then, he would *decode* the string to a unicode string using the encoding. E.g.:

row = row.decode('iso-8859-1')

Then everything should be peachy. I hope.
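
Applied per field, as in the OP's own code, that might look like this (a
sketch: decodestr is a hypothetical replacement for the OP's earlier
helper, and ISO-8859-1 is only the guess above):

def decodestr(item):
    # Decode 8-bit strings coming from Oracle into unicode objects;
    # substitute the real Oracle client charset if it isn't Latin-1.
    if isinstance(item, str):
        return item.decode('iso-8859-1')
    return item

row = map(decodestr, row)    # e.g. '\xdc' -> u'\xdc' (U WITH DIAERESIS)
SQLiteCur.execute(sql, row)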

--
Robert Kern
(e-mail address removed)

"I have come to believe that the whole world is an enigma, a harmless enigma
that is made terrible by our own mad attempt to interpret it as though it had
an underlying truth."
-- Umberto Eco
 

Paul Boddie

Note that I've since mentioned client_encoding which seems to matter
for pysqlite 1.x.

For pysqlite 2.x, it appears that Unicode objects can be handed
straight to the API methods, and I'd be interested to hear about your
problems with pysqlite, Unicode and what actually made you write apsw
instead.
Ah, I misread the OP's traceback.

Okay, the OP is getting regular strings, which are probably encoded in
ISO-8859-1 if I had to guess, from the Oracle DB. He is trying to pass them in
to SQLiteCur.execute() which tries to make a unicode string from the input:

[...]

There's an Oracle environment variable that appears to make a
difference: NLS_CHARSET, perhaps - it's been a while since I've had to
deal with Oracle, and I'm not looking for another adventure into
Oracle's hideous documentation to find out.
*Now*, my advice to the OP is to figure out the encoding of the strings that are
being returned from Oracle. As I said, ISO-8859-1 is probably a good guess.
Then, he would *decode* the string to a unicode string using the encoding. E.g.:

row = row.decode('iso-8859-1')

Then everything should be peachy. I hope.

Yes, just find out what Oracle wants first, then set it all up, noting
that without looking into the Oracle wrapper being used, I can't
suggest an easier way.

Paul
 

Roger Binns

ChaosKCW said:
me. As for SQLite supporting Unicode, it probably does,

No, SQLite *ONLY* supports Unicode. It will *only* accept
strings in Unicode and only produces strings in Unicode.
All the functionality built into SQLite such as comparison
operators operate only on Unicode strings.
but something
on the Python side (probably in apsw) converts it to ASCII at some
point before it's handed to SQLite.

No. APSW converts it *to* Unicode. SQLite only accepts Unicode,
so a Unicode string has to be supplied. If you supply a non-Unicode
string then a conversion has to happen. APSW asks Python to
supply the string in Unicode. If Python can't do that (e.g.
it doesn't know the encoding) then you get an error.

I strongly recommend reading this:

The Absolute Minimum Every Software Developer Absolutely,
Positively Must Know About Unicode and Character Sets

http://www.joelonsoftware.com/articles/Unicode.html
OK, if SQLite uses Unicode internally, why do you need to ignore
everything greater than 127?

I never said that. I said that a special case is made so that
if the string you supply only contains ASCII characters (i.e. <=127)
then the ASCII string is converted to Unicode. (In fact it is
valid UTF-8, hence the shortcut.)
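
That shortcut works because pure-ASCII bytes are already valid UTF-8
(Python 2):

print 'plain ascii'.decode('utf-8') == 'plain ascii'.decode('ascii')   # True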
The ASCII table (the 256-character one) fits into
Unicode just fine, as far as I recall?

No, ASCII characters have defined Unicode codepoints. The ASCII
character numbers just happen to be the same as the Unicode
codepoints. But there are only 128 ASCII characters.
Or did I miss the boat here?

For bytes greater than 127, what character set is used? There
are hundreds of character sets that define those characters.
You have to tell the computer which one to use. See the Unicode
article referenced above.
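
To make that concrete with the OP's 0xdc byte (Python 2) - the same byte
is a different character in each legacy charset:

print repr('\xdc'.decode('iso-8859-1'))   # u'\xdc'   - Ü in Latin-1
print repr('\xdc'.decode('iso-8859-7'))   # u'\u03ac' - Greek small alpha with tonos
print repr('\xdc'.decode('cp1251'))       # u'\u042c' - Cyrillic capital soft sign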

Roger
 

Fredrik Lundh

Roger said:
SQLite only accepts Unicode, so a Unicode string has to be supplied.

fact or FUD? let's see:

import pysqlite2.dbapi2 as DB

db = DB.connect("test.db")

cur = db.cursor()

cur.execute("create table if not exists test (col text)")
cur.execute("insert into test values (?)", ["this is an ascii string"])
cur.execute("insert into test values (?)", [u"this is a unicode string"])
cur.execute("insert into test values (?)", [u"thïs ïs ö unicöde strïng"])

cur.execute("select * from test")

for row in cur.fetchall():
    print row

prints

(u'this is an ascii string',)
(u'this is a unicode string',)
(u'th\xefs \xefs \xf6 unic\xf6de str\xefng',)

which is correct behaviour under Python's Unicode model.

</F>
 

Roger Binns

Fredrik Lundh said:
fact or FUD? let's see:

Note I said SQLite. For APIs that take/give strings, you can either
supply/get a UTF-8 encoded sequence of bytes, or a two-bytes-per-character
host-byte-order sequence. Any wrapper of SQLite that doesn't do
Unicode in/out is seriously breaking things.

I ended up using the UTF-8 versions of the API, as Python can't quite
make its mind up how to represent Unicode strings at the C API level.
You can have two bytes per char or four, and the handling/production
of byte order marks isn't that clear either.
import pysqlite2.dbapi2 as DB

pysqlite had several unicode problems in the past. It has since
been cleaned up as you saw.

Roger
 

Fredrik Lundh

Roger said:
Note I said SQLite. For APIs that take/give strings, you can either
supply/get a UTF-8 encoded sequence of bytes, or a two-bytes-per-character
host-byte-order sequence. Any wrapper of SQLite that doesn't do
Unicode in/out is seriously breaking things.

I ended up using the UTF-8 versions of the API, as Python can't quite
make its mind up how to represent Unicode strings at the C API level.
You can have two bytes per char or four, and the handling/production
of byte order marks isn't that clear either.

sounds like your understanding of Unicode and Python's Unicode system
is a bit unclear.

</F>
 

Roger Binns

Fredrik Lundh said:
sounds like your understanding of Unicode and Python's Unicode system
is a bit unclear.

Err, no. Relaying Unicode data between two disparate
C APIs requires being careful and thorough. That means
paying attention to when conversions happen, byte ordering
(compile time) and BOMs (run time), and, when the API
documentation isn't thorough, verifying the behaviour
yourself. That requires a very clear understanding of
Unicode in order to do the requisite test cases, as well
as reading what the code does.

Roger
 

Serge Orlov

Roger said:
Note I said SQLite. For APIs that take/give strings, you can either
supply/get a UTF-8 encoded sequence of bytes, or a two-bytes-per-character
host-byte-order sequence. Any wrapper of SQLite that doesn't do
Unicode in/out is seriously breaking things.

I ended up using the UTF-8 versions of the API, as Python can't quite
make its mind up how to represent Unicode strings at the C API level.
You can have two bytes per char or four, and the handling/production
of byte order marks isn't that clear either.

I have an impression that handling/production of byte order marks is
pretty clear: they are produced/consumed only by two codecs: utf-16 and
utf-8-sig. What is not clear?

Serge
 

Roger Binns

Serge Orlov said:
I have an impression that handling/production of byte order marks is
pretty clear: they are produced/consumed only by two codecs: utf-16 and
utf-8-sig. What is not clear?

Are you talking about the C APIs in Python/SQLite (that is what I
have been discussing) or the language level?

At the C level, SQLite doesn't accept BOMs. You have to
provide UTF-8 or host-byte-order two-bytes-per-char
UTF-16.

Roger
 

Serge Orlov

Roger said:
Are you talking about the C APIs in Python/SQLite (that is what I
have been discussing) or the language level?

Both. The documentation for PyUnicode_DecodeUTF16 and PyUnicode_EncodeUTF16
is pretty clear about when a BOM is produced/removed. The only problem is that
you have to find out host endianness yourself. In Python it's
sys.byteorder; in C you use a hack like

unsigned long one = 1;
int endianness = ((*(char *) &one) == 0) ? 1 : -1;  /* 1 = big-endian, -1 = little-endian */

And then pass endianness to PyUnicode_(De/En)codeUTF16. So I still don't
see what is unclear about BOM production/handling.
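
At the Python level the same distinction is spelled out in the codec
names (Python 2):

import sys

print sys.byteorder                       # e.g. 'little' on x86
print repr(u'abc'.encode('utf-16'))       # on a little-endian host: '\xff\xfea\x00b\x00c\x00' - BOM prepended
print repr(u'abc'.encode('utf-16-le'))    # 'a\x00b\x00c\x00' - no BOM, byte order explicit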

At the C level, SQLite doesn't accept BOMs.

It would be surprising if it did. Quote from
<http://www.unicode.org/faq/utf_bom.html>: "Where the data is typed,
such as a field in a database, a BOM is unnecessary"
 

ChaosKCW

Roger said:
No. APSW converts it *to* Unicode. SQLite only accepts Unicode,
so a Unicode string has to be supplied. If you supply a non-Unicode
string then a conversion has to happen. APSW asks Python to
supply the string in Unicode. If Python can't do that (e.g.
it doesn't know the encoding) then you get an error.

If what you say is true, I have to ask why I get a conversion error
which states it can't convert to ASCII, not that it can't convert to Unicode?

I never said that. I said that a special case is made so that
if the string you supply only contains ASCII characters (i.e. <=127)
then the ASCII string is converted to Unicode. (In fact it is
valid UTF-8, hence the shortcut.)


No, ASCII characters have defined Unicode codepoints. The ASCII
character numbers just happen to be the same as the Unicode
codepoints. But there are only 128 ASCII characters.


For bytes greater than 127, what character set is used? There
are hundreds of character sets that define those characters.
You have to tell the computer which one to use. See the Unicode
article referenced above.

Yes, I know there are a million "extended" ASCII character sets, which
happen to be the bane of all existence. Most computers deal in bytes
natively, and the 7-bit coding still causes problems to this day. But
since the error I get is a conversion error to ASCII, not from ASCII,
I am willing to accept loss of information. You can't code Unicode into
ASCII without loss of information or two-character codes. In my mind,
somewhere inside the "cursor.execute" function, it converts to ASCII. I
say this because of the error message received. So I am missing how a
function which supposedly converts everything to Unicode lands up doing
an ASCII conversion?
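
What is actually happening here (Python 2): when a str has to become
unicode and no encoding is given, Python falls back to its default codec,
which is 'ascii' - so the implicit *decode* step fails with a message
that names ASCII even though the target is Unicode:

unicode('\xdc')                 # raises UnicodeDecodeError: 'ascii' codec can't decode byte 0xdc
unicode('\xdc', 'iso-8859-1')   # u'\xdc' - fine once an encoding is supplied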
 

ChaosKCW

There's an Oracle environment variable that appears to make a
difference: NLS_CHARSET, perhaps - it's been a while since I've had to
deal with Oracle, and I'm not looking for another adventure into
Oracle's hideous documentation to find out.

That is an EVIL setting which should not be used. The NLS_CHARSET
environment variable causes so many headaches it's not worth playing
with it at all.
 
