Python 2.4 vs 2.5 - Unicode error

Discussion in 'Python' started by Gaurav Veda, Jan 21, 2009.

  1. Gaurav Veda

    Gaurav Veda Guest

    Hi,

    I am trying to put some webpages into a mysql database using python
    (after some processing on the text). If I use Python 2.4.2, it works
    without a fuss. However, on Python 2.5, I get the following error:

    UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position
    4357: ordinal not in range(128)

    Before sending the (insert) query to the mysql server, I do the
    following which I think should've taken care of this problem:
    sqlStr = sqlStr.replace('\\', '\\\\')

    (where sqlStr is the query).

    Any suggestions?

    Thanks!
    Gaurav
     
    Gaurav Veda, Jan 21, 2009
    #1
    1. Advertising

  2. Gaurav Veda

    John Machin Guest

    On Jan 22, 4:49 am, Gaurav Veda <> wrote:
    > Hi,
    >
    > I am trying to put some webpages into a mysql database using python
    > (after some processing on the text). If I use Python 2.4.2, it works
    > without a fuss. However, on Python 2.5, I get the following error:
    >
    > UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position
    > 4357: ordinal not in range(128)
    >
    > Before sending the (insert) query to the mysql server, I do the
    > following which I think should've taken care of this problem:
    >  sqlStr = sqlStr.replace('\\', '\\\\')
    >
    > (where sqlStr is the query).
    >
    > Any suggestions?


    The 0xc2 strongly suggests that you are feeding the beast data encoded
    in UTF-8 while giving it no reason to believe that it is in fact not
    encoded in ASCII. Curiously the first errant byte is a long way (4KB)
    into your data. Consider doing
    print repr(data)
    to see what you've actually got there.

    I'm a little skeptical about the "2.4 works, 2.5 doesn't" notion --
    different versions of mysql, perhaps?

    Show at the very least the full traceback that you get. Try to write a
    short script that demonstrates the problem with 2.5 and no problem
    with 2.4, so that (a) it is apparent what you are doing (b) the
    problem can be reproduced if necessary by someone with access to
    mysql.

    You might like to explain why you think that doubling backslashes in
    your SQL is a good idea, and amplify "some processing on the text".

    HTH,
    John
     
    John Machin, Jan 21, 2009
    #2
    1. Advertising

  3. Gaurav Veda

    Gaurav Veda Guest

    > The 0xc2 strongly suggests that you are feeding the beast data encoded
    > in UTF-8 while giving it no reason to believe that it is in fact not
    > encoded in ASCII. Curiously the first errant byte is a long way (4KB)
    > into your data. Consider doing
    > print repr(data)
    > to see what you've actually got there.


    >>> sqlStr[4352:4362]

    ' and 25\xc2\xb0F'

    All I want to do is to just replace all the non-ascii characters by a
    space.

    > I'm a little skeptical about the "2.4 works, 2.5 doesn't" notion --
    > different versions of mysql, perhaps?


    I am trying to put content into the mysql server running on machine A,
    from machine B & machine C with different versions of python. So I
    don't think this is a mysql issue.

    > Show at the very least the full traceback that you get. Try to write a
    > short script that demonstrates the problem with 2.5 and no problem
    > with 2.4, so that (a) it is apparent what you are doing (b) the
    > problem can be reproduced if necessary by someone with access to
    > mysql.


    Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
    File "putDataIntoDB.py", line 164, in <module>
    cursor.execute(sqlStr)
    File "/usr/lib64/python2.5/site-packages/MySQLdb/cursors.py", line
    146, in execute
    query = query.encode(charset)
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position
    4359: ordinal not in range(128)

    > You might like to explain why you think that doubling backslashes in
    > your SQL is a good idea, and amplify "some processing on the text".


    I thought this will achieve 2 things.
    a) It will escape any unicode character (obviously, I was wrong. Got
    carried away by the display. I thought \xc2 will get escaped to \\xc2,
    which is completely preposterous).
    b) It will make sure that the escape sequences in the string (e.g.
    '\n') are received by mysql as an escape sequence.

    Thanks for your reply!
    Gaurav

    > HTH,
    > John
     
    Gaurav Veda, Jan 21, 2009
    #3
  4. Gaurav Veda

    John Machin Guest

    On Jan 22, 9:50 am, Gaurav Veda <> wrote:
    > > The 0xc2 strongly suggests that you are feeding the beast data encoded
    > > in UTF-8 while giving it no reason to believe that it is in fact not
    > > encoded in ASCII. Curiously the first errant byte is a long way (4KB)
    > > into your data. Consider doing
    > >     print repr(data)
    > > to see what you've actually got there.
    > >>> sqlStr[4352:4362]

    >
    > ' and 25\xc2\xb0F'


    That's the UTF-8 version of ' and 25°F' where the character between
    the 25 and the F is U+00B0 DEGREE SIGN ... interesting stuff to have
    in an SQL query string.

    >
    > All I want to do is to just replace all the non-ascii characters by a
    > space.


    I can't imagine why you would want to do that to data, let alone to an
    SQL query.

    I can't see any evidence that you actually tried to do that, anyway.

    To replace non-ASCII characters in a UTF-8-encoded string by spaces:
    | >>> u8 = ' and 25\xc2\xb0F'
    | >>> u = u8.decode('utf8')
    | >>> ''.join([chr(ord(c)) if c <= u'\x7f' else ' ' for c in u])
    | ' and 25 F'

    >
    > > I'm a little skeptical about the "2.4 works, 2.5 doesn't" notion --
    > > different versions of mysql, perhaps?

    >
    > I am trying to put content into the mysql server running on machine A,
    > from machine B & machine C with different versions of python. So I
    > don't think this is a mysql issue.


    Terminology confusion. Consider the possibility of different versions
    of MySQLdb (the client interface package) on the client machines B and
    C.

    Also consider the possibility that you didn't run exactly the same
    code on B and C.

    > > Show at the very least the full traceback that you get. Try to write a
    > > short script that demonstrates the problem with 2.5 and no problem
    > > with 2.4, so that (a) it is apparent what you are doing (b) the
    > > problem can be reproduced if necessary by someone with access to
    > > mysql.


    How about a very small script which includes the minimum necessary to
    run these two lines (with appropriate substitutions for column_x and
    table_y:
    sql_str = "select column_x from table_y where column_x = '\xc2\xb0'"
    cursor.execute(sql_str)

    and run that on B and C

    >
    > Traceback (most recent call last):
    >   File "<stdin>", line 1, in <module>
    >   File "putDataIntoDB.py", line 164, in <module>
    >     cursor.execute(sqlStr)
    >   File "/usr/lib64/python2.5/site-packages/MySQLdb/cursors.py", line
    > 146, in execute
    >     query = query.encode(charset)
    > UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position
    > 4359: ordinal not in range(128)
    >
    > > You might like to explain why you think that doubling backslashes in
    > > your SQL is a good idea, and amplify "some processing on the text".

    >
    > I thought this will achieve 2 things.
    > a) It will escape any unicode character (obviously, I was wrong. Got
    > carried away by the display. I thought \xc2 will get escaped to \\xc2,
    > which is completely preposterous).
    > b) It will make sure that the escape sequences in the string (e.g.
    > '\n') are received by mysql as an escape sequence.


    Run-time programmatic fiddling with an SQL query string is dangerous
    and tricky at the best of times, worse when you don't inspect the
    result before you press the launch button.

    Cheers,
    John
     
    John Machin, Jan 22, 2009
    #4
  5. On Mittwoch, 21. Januar 2009, Gaurav Veda wrote:
    > UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position
    > 4357: ordinal not in range(128)
    >
    > Before sending the (insert) query to the mysql server, I do the
    > following which I think should've taken care of this problem:
    > sqlStr = sqlStr.replace('\\', '\\\\')


    you might consider using what mysql offers about unicode: save
    all strings encoded as unicode. Might be more work now but I think
    it would be a good investment in the future.

    have a look at the mysql documentation for

    mysql_real_escape_string() takes care of quoted chars.

    mysql_set_character_set() for setting the character set used
    by the database connection

    you can ensure that the web page is unicode by doing something
    like

    charsetregex = re.compile(r'charset=(.*?)[\"&]')
    charsetmatch = charsetregex.search(page)
    if charsetmatch:
    charset=charsetmatch.group(1)
    utf8Text = unicode(page,charset)

    --
    Wolfgang
     
    Wolfgang Rohdewald, Jan 22, 2009
    #5
  6. Gaurav Veda

    Gaurav Veda Guest

    On Jan 21, 7:08 pm, John Machin <> wrote:
    >
    > To replace non-ASCII characters in a UTF-8-encoded string by spaces:
    > | >>> u8 = ' and 25\xc2\xb0F'
    > | >>> u = u8.decode('utf8')
    > | >>> ''.join([chr(ord(c)) if c <= u'\x7f' else ' ' for c in u])
    > | ' and 25 F'


    Thanks John for your reply. This is what I needed.

    Cheers,
    Gaurav
     
    Gaurav Veda, Jan 23, 2009
    #6
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. Robert Mark Bram
    Replies:
    0
    Views:
    4,037
    Robert Mark Bram
    Sep 28, 2003
  2. ygao

    unicode wrap unicode object?

    ygao, Apr 8, 2006, in forum: Python
    Replies:
    6
    Views:
    588
    =?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=
    Apr 8, 2006
  3. Gabriele *darkbard* Farina

    Unicode digit to unicode string

    Gabriele *darkbard* Farina, May 16, 2006, in forum: Python
    Replies:
    2
    Views:
    560
    Gabriele *darkbard* Farina
    May 16, 2006
  4. Grzegorz ¦liwiñski
    Replies:
    2
    Views:
    1,024
    Grzegorz ¦liwiñski
    Jan 19, 2011
  5. Terry Reedy
    Replies:
    0
    Views:
    91
    Terry Reedy
    Jan 7, 2014
Loading...

Share This Page