Writing Unicode to database using ODBC

Discussion in 'Python' started by Mudcat, Sep 3, 2008.

  1. Mudcat

    Mudcat Guest

    In short what I'm trying to do is read a document using an xml parser
    and then upload that data back into a database. I've got the code more
    or less completed using xml.etree.ElementTree for the parser and dbi/
    odbc for my db connection.

    To fix problems with unicode I built a work-around by mapping unicode
    characters to equivalent ascii characters and then encoding everything
    to ascii. That allowed me to build the application and debug it
    without running into problems printing to file or stdout to screen.

    However, now that I've got all that working I'd like to simply take
    the unicode data from the xml parser and then pass it directly into
    the database (which is currently set up for unicode data). I've run
    into problems and just can't figure why this isn't working.

    The breakdown is occurring when I try to execute the db query:

    cur.execute( query )

    Fairly straightforward. I get the following error:

    File "atp_alt.py", line 273, in dbWrite
    cur.execute( query )
    UnicodeEncodeError: 'ascii' codec can't encode character u'\u201c' in
    position 379: ordinal not in range(128)

    I've verified that query is of type unicode by checking the type a
    statement or two earlier (output: <type 'unicode'>).

    So then I thought maybe the odbc execute just can't handle unicode
    data. But when I do the following command:

    query = query.encode('utf-8')

    It actually works. So apparently execute can handle unicode data. The
    problem now is that basically the data has been encoded twice and is
    in the wrong format when I pull it from the database:

    >>> a

    u'+CMGL: (\xe2\u20ac\u0153REC UNREAD\xe2\u20ac\x9d,\xe2\u20ac\x9dREC
    READ\xe2\u20ac\x9d,\xe2\u20ac\x9dSTO UNSENT\xe2\u20ac\x9d,\xe2\u20ac
    \x9dSTO SENT\xe2\u20ac\x9d,\xe2\u20ac\x9dALL\xe2\u20ac\x9d) OK'
    >>> print a

    +CMGL: (“REC UNREADâ€,â€REC READâ€,â€STO UNSENTâ€,â€STO SENTâ
    €,â€ALLâ€) OK

    The non-alpha characters should be double-quotes. It works correctly
    if I copy/paste into the editor:

    >>> d

    u'\u201cREC'
    >>> print d

    “REC
    >>> d.encode('utf-8')

    '\xe2\x80\x9cREC'
    >>> type(d.encode('utf-8'))

    <type 'str'>


    I can then decode the string to get back the proper unicode data. I
    can't do that with the data out of the db, because it comes back as a
    unicode object already holding the mangled characters, not a str of
    UTF-8 bytes that I could decode.

    I think the problem is that I'm having to encode data again to force
    it into the database, but how can I use the odbc.execute() function
    without having to do that?
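    [Editor's note: the double-mangling described above can be reproduced in
    a few lines. This is a sketch in Python 3 syntax (where str is the old
    unicode type); the cp1252 decode is an assumption about what the
    driver/database did on the way back out, not something confirmed in the
    thread.]

```python
# A left double quote, U+201C, as in the +CMGL output above
original = "\u201c"                    # Python 2 spelling: u'\u201c'

# Step 1: the query text is encoded to UTF-8 before cur.execute()
utf8_bytes = original.encode("utf-8")  # b'\xe2\x80\x9c'

# Step 2 (assumed): somewhere in the round-trip those UTF-8 bytes are
# re-decoded with a legacy 8-bit codec such as cp1252
mojibake = utf8_bytes.decode("cp1252")

print(repr(mojibake))  # '\xe2\u20ac\u0153' -- the rubbish seen in the db
```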
    Mudcat, Sep 3, 2008
    #1

  2. John Machin

    John Machin Guest

    On Sep 4, 3:11 am, Mudcat <> wrote:
    [snip]

    > However, now that I've got all that working I'd like to simply take
    > the unicode data from the xml parser and then pass it directly into
    > the database (which is currently set up for unicode data). I've run


    What database? What does "set up for unicode data" mean? If you are
    using MS SQL Server, are your text columns defined to be varchar or
    nvarchar or something else?

    > into problems and just can't figure why this isn't working.
    >
    > The breakdown is occurring when I try to execute the db query:
    >
    >           cur.execute( query )
    >
    > Fairly straightforward. I get the following error:
    >
    >   File "atp_alt.py", line 273, in dbWrite
    >     cur.execute( query )
    > UnicodeEncodeError: 'ascii' codec can't encode character u'\u201c' in
    > position 379: ordinal not in range(128)
    >
    > I've verified that query is of type unicode by checking the type a
    > statement or two earlier (output: <type 'unicode'>).


    > So then I thought maybe the odbc execute just can't handle unicode
    > data.


    It appears to be expecting a str object, not a unicode object.

    > But when I do the following command:
    >
    >           query = query.encode('utf-8')
    >
    > It actually works. So apparently execute can handle unicode data.


    "not crashing" != "works"

    > The
    > problem now is that basically the data has been encoded twice and is
    > in the wrong format when I pull it from the database:


    No, your utf8 string has been DEcoded using some strange encoding.

    >
    > >>> a

    >
    > u'+CMGL: (\xe2\u20ac\u0153REC UNREAD\xe2\u20ac\x9d,\xe2\u20ac\x9dREC
    > READ\xe2\u20ac\x9d,\xe2\u20ac\x9dSTO UNSENT\xe2\u20ac\x9d,\xe2\u20ac
    > \x9dSTO SENT\xe2\u20ac\x9d,\xe2\u20ac\x9dALL\xe2\u20ac\x9d) OK'
    >
    > >>> print a
    >
    > +CMGL: (“REC UNREADâ€,â€REC READâ€,â€STO UNSENTâ€,â€STO SENTâ
    > €,â€ALLâ€) OK


    It would help very much if you showed the repr() of your input unicode
    text.

    Observation: the first bunch of rubbish output (\xe2\u20ac\u0153)
    differs from all the others (\xe2\u20ac\x9d).

    >
    > The non-alpha characters should be double-quotes.


    What "double-quotes" character(s)? Unicode has several: U+0022
    (unoriented), U+201C (left), U+201D (right), plus more exotic ones.
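    [Editor's note: to make those candidates concrete, here they are with
    their code points (Python 3 syntax):]

```python
# The three most common double-quote code points
for ch in ('"', "\u201c", "\u201d"):
    print("U+%04X %s" % (ord(ch), ch))
# U+0022 is the plain unoriented quote; U+201C/U+201D are the
# left/right "smart" quotes appearing in the poster's data
```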

    > It works correctly
    > if I copy/paste into the editor:
    >
    > >>> d

    > u'\u201cREC'
    > >>> print d

    > “REC
    > >>> d.encode('utf-8')

    > '\xe2\x80\x9cREC'


    More observations:

    >>> u'\u201c'.encode('utf8')

    '\xe2\x80\x9c'
    >>> u'\u201c'.encode('utf8').decode('cp1252')

    u'\xe2\u20ac\u0153'

    Aha! The first load of rubbish! However:

    >>> u'\u201d'.encode('utf8')

    '\xe2\x80\x9d'
    >>> u'\u201d'.encode('utf8').decode('cp1252')

    Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
    File "C:\Python25\lib\encodings\cp1252.py", line 15, in decode
    return codecs.charmap_decode(input,errors,decoding_table)
    UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position
    2: character maps to <undefined>
    >>>


    Hmmm, try the ferschlugginer mbcs encoding:

    >>> u'\u201d'.encode('utf8').decode('mbcs')

    u'\xe2\u20ac\x9d'
    >>> u'\u201c'.encode('utf8').decode('mbcs')

    u'\xe2\u20ac\u0153'
    >>>


    So, if you must persist with the odbc module, either encode your
    unicode text with mbcs, not utf8, or find out how to "set up for
    unicode data" so that utf8 is the default.

    You may like to consider using pyODBC or mxODBC instead of odbc.
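    [Editor's note: with a DB-API module that accepts unicode parameters,
    the text can be bound as a query parameter instead of being encoded and
    spliced into the SQL string. A minimal illustration using the stdlib
    sqlite3 module, not the poster's actual database; the table and column
    names are made up:]

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE messages (body TEXT)")

text = "+CMGL: (\u201cREC UNREAD\u201d) OK"
# Bind the unicode value as a parameter; no manual .encode() needed
conn.execute("INSERT INTO messages (body) VALUES (?)", (text,))

(row,) = conn.execute("SELECT body FROM messages").fetchone()
assert row == text  # the text round-trips unchanged
```

    The same parameter-binding style (`cur.execute(sql, params)`) is what
    pyODBC and mxODBC offer, which sidesteps the encode-the-whole-query
    problem entirely.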

    HTH,
    John
    John Machin, Sep 4, 2008
    #2
