handling unicode data

Discussion in 'Python' started by Filipe, Jun 28, 2006.

  1. Filipe

    Filipe Guest

    Hi all,

    I'm starting to learn python but am having some difficulties with how
    it handles the encoding of data I'm reading from a database. I'm using
    pymssql to access data stored in a SqlServer database, and the
    following is the script I'm using for testing purposes.

    -----------------------------------------------------------------------------
    import pymssql

    mssqlConnection =
    pymssql.connect(host='localhost',user='sa',password='password',database='TestDB')
    cur = mssqlConnection.cursor()
    query="Select ID, Term from TestTable where ID > 200 and ID < 300;"
    cur.execute(query)
    row = cur.fetchone()
    results = []
    while row is not None:
    term = row[1]
    print type(row[1])
    print term
    results.append(term)
    row = cur.fetchone()
    cur.close()
    mssqlConnection.close()
    print results
    -----------------------------------------------------------------------------

    In the console output, for a record where I expected to see "França"
    I'm getting the following:

    "<type 'str'>" - When I print the type (print type(row[1]))
    "Fran+a" - When I print the "term" variable (print term)
    "Fran\xd8a" - When I print all the query results (print results)


    The values in "Term" column in "TestTable" are stored as unicode (the
    column's datatype is nvarchar), yet, the python data type of the values
    I'm reading is not unicode.
    It all seems to be an encoding issue, but I can't see what I'm doing
    wrong..
    Any thoughts?

    thanks in advance,
    Filipe
     
    Filipe, Jun 28, 2006
    #1
    1. Advertising

  2. Filipe wrote:

    > In the console output, for a record where I expected to see "França"
    > I'm getting the following:
    >
    > "<type 'str'>" - When I print the type (print type(row[1]))
    > "Fran+a" - When I print the "term" variable (print term)
    > "Fran\xd8a" - When I print all the query results (print results)
    >
    > The values in "Term" column in "TestTable" are stored as unicode (the
    > column's datatype is nvarchar), yet, the python data type of the values
    > I'm reading is not unicode.
    > It all seems to be an encoding issue, but I can't see what I'm doing
    > wrong..


    looks like the DB-API driver returns 8-bit ISO-8859-1 strings instead of Unicode
    strings. there might be some configuration option for this; see

    in worst case, you could do something like

    def unicodify(value):
    if isinstance(value, str):
    value = unicode(value, "iso-8859-1")
    return value

    term = unicodify(row[1])

    but it's definitely better if you can get the DB-API driver to do the right thing.

    </F>
     
    Fredrik Lundh, Jun 28, 2006
    #2
    1. Advertising

  3. Fredrik Lundh wrote:
    > looks like the DB-API driver returns 8-bit ISO-8859-1 strings instead of Unicode
    > strings. there might be some configuration option for this; see
    >


    Where did you want to point the OP here?

    > in worst case, you could do something like
    >
    > def unicodify(value):
    > if isinstance(value, str):
    > value = unicode(value, "iso-8859-1")
    > return value
    >
    > term = unicodify(row[1])
    >
    > but it's definitely better if you can get the DB-API driver to do the right thing.


    It seems pymssql does not support such a thing.

    Also, it appears that DB-Library (the API used by pymssql) always
    returns CP_ACP characters (unless ANSI-to-OEM conversion is enabled);
    so the "right" encoding to use is "mbcs".

    Notice that Microsoft plans to abandon DB-Library, so it might be
    best to switch to a different module for SQL Server access.

    Regards,
    Martin
     
    =?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=, Jun 28, 2006
    #3
  4. Filipe

    Filipe Guest

    Hi Fredrik,

    Thanks for the reply.
    Instead of:
    term = row[1]
    I tried:
    term = unicode(row[1], "iso-8859-1")

    but the following error was returned when printing "term":
    Traceback (most recent call last):
    File "test.py", line 11, in ?
    print term
    File "c:\Program Files\Python24\lib\encodings\cp437.py", line 18, in
    encode
    return codecs.charmap_encode(input,errors,encoding_map)
    UnicodeEncodeError: 'charmap' codec can't encode character u'\xd8' in
    position 31: character maps to <undefined>

    Is it possible some unicode strings are not printable to the console?
    It's odd, because I can manually write in the console the same string
    I'm trying to print.
    I also tried other encodings, besides iso-8859-1, but got the same
    error.

    Do you think this has something to do with the DB-API driver? I don't
    even know where to start if I have to change something in there :|

    Cheers,
    Filipe
     
    Filipe, Jun 28, 2006
    #4
  5. Filipe wrote:

    > Thanks for the reply.
    > Instead of:
    > term = row[1]
    > I tried:
    > term = unicode(row[1], "iso-8859-1")
    >
    > but the following error was returned when printing "term":
    > Traceback (most recent call last):
    > File "test.py", line 11, in ?
    > print term
    > File "c:\Program Files\Python24\lib\encodings\cp437.py", line 18, in
    > encode
    > return codecs.charmap_encode(input,errors,encoding_map)
    > UnicodeEncodeError: 'charmap' codec can't encode character u'\xd8' in
    > position 31: character maps to <undefined>


    works for me, given your example:

    >>> s = "Fran\xd8a"
    >>> unicode(s, "iso-8859-1")

    u'Fran\xd8a'

    what does

    print repr(row[1])

    print in this case ?

    </F>
     
    Fredrik Lundh, Jun 28, 2006
    #5
  6. Filipe

    Filipe Guest

    Hi,

    Martin v. Löwis wrote:
    > Also, it appears that DB-Library (the API used by pymssql) always
    > returns CP_ACP characters (unless ANSI-to-OEM conversion is enabled);
    > so the "right" encoding to use is "mbcs".


    do you mean using something like the following line?
    term = unicode(row[1], "mbcs")

    What do you mean by "ANSI-to-OEM conversion is enabled"? (sorry, I'm
    quite a newbie to python)

    > Notice that Microsoft plans to abandon DB-Library, so it might be
    > best to switch to a different module for SQL Server access.


    I've done some searching and settled for pymssql, but it's not too late
    to change yet.
    I've found these options to connect to a MSSqlServer database:

    Pymssql
    http://pymssql.sourceforge.net/

    ADODB for Python (windows only)
    http://phplens.com/lens/adodb/adodb-py-docs.htm

    SQLServer for Python (discontinued?)
    http://www.object-craft.com.au/projects/mssql/

    mxODBC (commercial license)
    http://www.egenix.com/files/python/mxODBC.html

    ASPN Recipe
    http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/144183


    Pymssql seemed like the best choice. The ASPN Recipe I mention doesn't
    look bad either, but there doesn't seem to be as many people using it
    as using pymssql. I'll look a little further though.
     
    Filipe, Jun 28, 2006
    #6
  7. Filipe

    Filipe Guest

    Fredrik Lundh wrote:
    > works for me, given your example:
    > >>> s = "Fran\xd8a"
    > >>> unicode(s, "iso-8859-1")

    > u'Fran\xd8a'
    >
    > what does
    > print repr(row[1])
    >
    > print in this case ?


    It prints:
    'Fran\xd8a'

    The error I'm getting is beeing thrown when I print the value to the
    console. If I just convert it to unicode all seems ok (except for not
    beeing able to show it in the console, that is... :).

    For example, when I try this:
    print unicode("Fran\xd8a", "iso-8859-1")

    I get the error:
    Traceback (most recent call last):
    File "a.py", line 1, in ?
    print unicode("Fran\xd8a", "iso-8859-1")
    File "c:\Program Files\Python24\lib\encodings\cp437.py", line 18, in
    encode
    return codecs.charmap_encode(input,errors,encoding_map)
    UnicodeEncodeError: 'charmap' codec can't encode character u'\xd8' in
    position 4
    : character maps to <undefined>
     
    Filipe, Jun 28, 2006
    #7
  8. In <>, Filipe wrote:

    > The error I'm getting is beeing thrown when I print the value to the
    > console. If I just convert it to unicode all seems ok (except for not
    > beeing able to show it in the console, that is... :).
    >
    > For example, when I try this:
    > print unicode("Fran\xd8a", "iso-8859-1")
    >
    > I get the error:
    > Traceback (most recent call last):
    > File "a.py", line 1, in ?
    > print unicode("Fran\xd8a", "iso-8859-1")
    > File "c:\Program Files\Python24\lib\encodings\cp437.py", line 18, in
    > encode
    > return codecs.charmap_encode(input,errors,encoding_map)
    > UnicodeEncodeError: 'charmap' codec can't encode character u'\xd8' in
    > position 4
    > : character maps to <undefined>


    The `unicode()` call doesn't fail here but the ``print`` because printing
    unicode strings means they have to be encoded into a byte string again.
    And whatever encoding the target of the print (your console) uses, it
    does not contain the unicode character u'\xd8'. From the traceback it
    seems your terminal uses `cp437` as encoding.

    As you can see here: http://www.wordiq.com/definition/CP437 there's no Ø
    in that character set.

    Ciao,
    Marc 'BlackJack' Rintsch
     
    Marc 'BlackJack' Rintsch, Jun 28, 2006
    #8
  9. Filipe wrote:
    > Hi,
    >
    > I've done some searching and settled for pymssql, but it's not too late
    > to change yet.
    > I've found these options to connect to a MSSqlServer database:
    >
    > Pymssql
    > http://pymssql.sourceforge.net/
    >
    > ADODB for Python (windows only)
    > http://phplens.com/lens/adodb/adodb-py-docs.htm
    >
    > SQLServer for Python (discontinued?)
    > http://www.object-craft.com.au/projects/mssql/
    >
    > mxODBC (commercial license)
    > http://www.egenix.com/files/python/mxODBC.html
    >
    > ASPN Recipe
    > http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/144183
    >


    You did not mention the odbc module from Mark Hammond's win32
    extensions. This is what I use, and it works for me. I believe it is
    not 100% DB-API 2.0 compliant, but I have not had any problems.

    I have not tried connecting to the database from a Linux box (or from
    another Windows box, for that matter). I don't know if there are any
    implications there.

    Frank Millman
     
    Frank Millman, Jun 29, 2006
    #9
  10. Filipe wrote:
    >> Also, it appears that DB-Library (the API used by pymssql) always
    >> returns CP_ACP characters (unless ANSI-to-OEM conversion is enabled);
    >> so the "right" encoding to use is "mbcs".

    >
    > do you mean using something like the following line?
    > term = unicode(row[1], "mbcs")


    Correct.

    > What do you mean by "ANSI-to-OEM conversion is enabled"? (sorry, I'm
    > quite a newbie to python)


    It's an SQL server thing more than a Python thing. See AutoAnsiToOem
    in

    http://support.microsoft.com/default.aspx?scid=KB;EN-US;199819

    Regards,
    Martin
     
    =?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=, Jun 29, 2006
    #10
  11. Filipe

    Filipe Guest

    Frank Millman wrote:
    > You did not mention the odbc module from Mark Hammond's win32
    > extensions. This is what I use, and it works for me. I believe it is
    > not 100% DB-API 2.0 compliant, but I have not had any problems.
    >
    > I have not tried connecting to the database from a Linux box (or from
    > another Windows box, for that matter). I don't know if there are any
    > implications there.


    According to sourceforge's project page
    (https://sourceforge.net/projects/pywin32/) it seems to only work on
    windows.

    There's also adodbapi (http://adodbapi.sourceforge.net/), that also
    depends on PyWin32, but it would be very handy if I could run this code
    on a linux box, and with these libs I wouldn't be able to. Still,
    options are always good to have around :)
     
    Filipe, Jun 30, 2006
    #11
  12. Filipe

    Filipe Guest

    Marc 'BlackJack' Rintsch wrote:
    > The `unicode()` call doesn't fail here but the ``print`` because printing
    > unicode strings means they have to be encoded into a byte string again.
    > And whatever encoding the target of the print (your console) uses, it
    > does not contain the unicode character u'\xd8'. From the traceback it
    > seems your terminal uses `cp437` as encoding.
    >
    > As you can see here: http://www.wordiq.com/definition/CP437 there's no Ø
    > in that character set.


    somethings are much, much, clearer to me now. thanks!

    For future reference, these links may also help:
    http://www.jorendorff.com/articles/unicode/python.html
    http://www.thescripts.com/forum/thread23314.html

    I've changed my windows console copdepage to latin1 and the following
    prints are now outputting "França", as expected:
    print unicode("Fran\x87a", "cp850").encode("iso-8859-1")
    print unicode("Fran\xe7a", "iso-8859-1").encode("iso-8859-1")

    However, I don't yet fully understand what's happening with Pymssql.
    The encoding I expected to be receiving from MSSqlServer was cp850 (the
    column in question uses the collation SQL_Latin1_General_CP850_CS_AS),
    but it doesn't seem to be what the query is returning. I tried
    converting to a unicode string from a few different encodings, but none
    of them seems to be the right one. For example, for cp850, using a
    latin1 console:

    --------------------------------------------------------
    term = unicode(row[1], "cp850")
    print repr(term)
    print term

    ---- output -------------------------------------------
    u'Fran\xcfa'
    FranÏa
    --------------------------------------------------------


    And for iso-8859-1 (also got the same result for mbcs):
    --------------------------------------------------------
    term = unicode(row[1], "iso-8859-1")
    print repr(term)
    print term

    ---- output -------------------------------------------
    u'Fran\xd8a'
    FranØa
    --------------------------------------------------------


    What do you think? Might it be Pymssql doing something wrong?
     
    Filipe, Jun 30, 2006
    #12
  13. Filipe

    Filipe Guest

    Martin v. Löwis wrote:
    > > What do you mean by "ANSI-to-OEM conversion is enabled"?

    >
    > See AutoAnsiToOem in
    > http://support.microsoft.com/default.aspx?scid=KB;EN-US;199819
    >


    I checked the registry key
    "HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\MSSQLServer\Client\DB-Lib", and
    verified AutoAnsiToOem was 'ON'.

    I also tried assuming mbcs as the encoding but didn't got very far
    (please see my other post).
     
    Filipe, Jun 30, 2006
    #13
  14. Filipe wrote:
    > ---- output -------------------------------------------
    > u'Fran\xd8a'
    > FranØa
    > --------------------------------------------------------
    >
    >
    > What do you think? Might it be Pymssql doing something wrong?


    I think the data in your database is already wrong. Are you
    sure the value in question is really "França" in the database?

    Regards,
    Martin
     
    =?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=, Jul 1, 2006
    #14
  15. Filipe

    Filipe Guest

    Martin v. Löwis wrote:
    > Filipe wrote:
    > > ---- output -------------------------------------------
    > > u'Fran\xd8a'
    > > FranØa
    > > --------------------------------------------------------
    > >
    > > What do you think? Might it be Pymssql doing something wrong?

    >
    > I think the data in your database is already wrong. Are you
    > sure the value in question is really "França" in the database?
    >


    yes, I'm pretty sure. There's an application that was built to run on
    top of this database and it correctly reads as writes data to the DB. I
    also used SqlServer's Query Analyzer to select the data and it
    displayed fine.

    I've done some more tests and I think I'm very close to finding what
    the problem is. The tests I had done before were executed from the
    windows command line. I tried printing the following (row[1] is a value
    I selected from the database) in two distinct environments, from within
    an IDE (Pyscripter) and from the command line:

    import sys
    import locale
    print getattr(sys.stdout,'encoding',None)
    print locale.getdefaultlocale()[1]
    print sys.getdefaultencoding()
    term = "Fran\x87a"
    print repr(term)
    term = row[1]
    print repr(term)

    output I got in Pyscripter's interpreter window:
    None
    cp1252
    ascii
    'Fran\x87a'
    'Fran\x87a'

    output I got in the command line:
    cp1252
    cp1252
    ascii
    'Fran\x87a'
    'Fran\xd8a'

    I'd expect "print" to behave differently according with the console's
    encoding, but does this mean this happens with repr() too?
    in which way?

    thanks,
    Filipe
     
    Filipe, Jul 4, 2006
    #15
  16. Filipe wrote:
    > term = row[1]
    > print repr(term)
    >
    > output I got in Pyscripter's interpreter window:
    > 'Fran\x87a'
    >
    > output I got in the command line:
    > 'Fran\xd8a'
    >
    > I'd expect "print" to behave differently according with the console's
    > encoding, but does this mean this happens with repr() too?


    repr always generates ASCII bytes. They are not effected by the
    console's encoding. If you get different output, it really means
    that the values are different (check ord(row[1][4]) to be sure)

    What is the precise sequence of statements that you used to
    set the "row" variable?

    Regards,
    Martin
     
    =?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=, Jul 4, 2006
    #16
  17. Filipe

    Filipe Guest

    Martin v. Löwis wrote:
    > Filipe wrote:
    > > term = row[1]
    > > print repr(term)
    > >
    > > output I got in Pyscripter's interpreter window:
    > > 'Fran\x87a'
    > >
    > > output I got in the command line:
    > > 'Fran\xd8a'
    > >
    > > I'd expect "print" to behave differently according with the console's
    > > encoding, but does this mean this happens with repr() too?

    >
    > repr always generates ASCII bytes. They are not effected by the
    > console's encoding. If you get different output, it really means
    > that the values are different (check ord(row[1][4]) to be sure)


    They do, in fact, output different values. The value outputed by
    pyscripter was "135" (x87) while the value outputed in the command line
    was "216" (xd8). I can't understand why though, because the script
    being run is precisely the same on both environments.

    > What is the precise sequence of statements that you used to
    > set the "row" variable?


    The complete script follows:
    -----------------------------------------------------------------------
    import sys
    import locale
    print getattr(sys.stdout,'encoding',None)
    print locale.getdefaultlocale()[1]
    print sys.getdefaultencoding()

    import pymssql
    mssqlConnection =
    pymssql.connect(host='localhost',user='sa',password='password',database='TestDB')
    cur = mssqlConnection.cursor()
    query="Select ID, Term from TestTable where ID = 204;"
    cur.execute(query)
    row = cur.fetchone()
    results = []
    while row is not None:
    term = unicode(row[1], "cp850")
    print ord(row[1][4])
    print ord(term[4])
    print term
    results.append(term)
    row = cur.fetchone()
    cur.close()
    mssqlConnection.close()
    print results
    -----------------------------------------------------------------------


    The values outputed were, in pyscripter:
    None
    cp1252
    ascii
    135
    231
    França
    [uFran\xe7a']

    and in the command line
    cp850
    cp1252
    ascii
    216
    207
    FranÏa
    [u'Fran\xcfa']

    regards,
    Filipe
     
    Filipe, Jul 5, 2006
    #17
  18. Filipe wrote:
    > They do, in fact, output different values. The value outputed by
    > pyscripter was "135" (x87) while the value outputed in the command line
    > was "216" (xd8). I can't understand why though, because the script
    > being run is precisely the same on both environments.


    That's indeed surprising, and doesn't really increase trust into
    pymssql.

    If we look at the values of

    > print ord(row[1][4])

    (where row is the actual data read from the database)

    we get

    > The values outputed were, in pyscripter:
    > 135


    Here, 135==0x87 really is LATIN SMALL LETTER C WITH CEDILLA in
    code page 850.

    > and in the command line
    > 216


    216==0xd8 is not LATIN SMALL LETTER C WITH CEDILLA in any
    encode I know, so it appears that this value is bogus.
    One would have to ask the authors of pymssql, or Microsoft,
    why that happens; alternatively, you have to run pymssql
    in a debugger to find out yourself.

    Regards,
    Martin
     
    =?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=, Jul 5, 2006
    #18
  19. Filipe

    Filipe Guest

    Hi Martin,

    > One would have to ask the authors of pymssql, or Microsoft,
    > why that happens; alternatively, you have to run pymssql
    > in a debugger to find out yourself.


    Tried running pymssql in a debugger, but I felt a bit lost. There are
    too many things I would need to understand about pymssql first.

    Meanwhile, I got to some very interesting conclusions. Remember the
    "ANSI-to-OEM conversion" option you mentioned before? I began reading
    some docs about it and this description turned up:

    "The ANSI to OEM conversion translates data coming back from SQL Server
    into the local code page used by your client."

    which seemed exactly what I don't want.. so I turned it to "OFF" (by
    using the registry key
    HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\MSSQLServer\Client\DB-Lib\AutoAnsiToOem)
    and everything started working the way I was originally expecting!

    I think that the best way to actually solve this will be to override
    AutoAnsiToOem (instead of using the registry setting) from within
    pymssql, or find a way to specify what the "local code page" should be.
    If not, I will have to have the pain of verifying this setting in every
    system where the code I'm developing will be deployed. Which reminds
    me... that this setting doesn't exist on non-windows environments (ie,
    no registry on Linux) so I'm not sure how will it all work there.
    Anyone with experience in using DB-Library that can confirm how it
    works (namely, on linux)?
    (but perhaps this is outside the scope of this newsgroup.. )

    I got in touch with Andrzej Kukula, the current developer of pymssql,
    who has also been very helpful, and knows what we've been discussing
    over here.


    thanks for all the help,
    Filipe
     
    Filipe, Jul 6, 2006
    #19
  20. On 6 Jul 2006 08:42:32 -0700, "Filipe" <> declaimed
    the following in comp.lang.python:


    > system where the code I'm developing will be deployed. Which reminds
    > me... that this setting doesn't exist on non-windows environments (ie,
    > no registry on Linux) so I'm not sure how will it all work there.


    The setting most likely has to be made on the machine running the
    server -- and M$ SQL Server doesn't exist on Linux either <G>

    If the conversion was being done by some client library on Windows,
    then again, since that library probably doesn't exist on Linux, the
    conversion probably is not done.

    All conjecture...

    --
    Wulfraed Dennis Lee Bieber KD6MOG

    HTTP://wlfraed.home.netcom.com/
    (Bestiaria Support Staff: )
    HTTP://www.bestiaria.com/
     
    Dennis Lee Bieber, Jul 6, 2006
    #20
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. Robert Mark Bram
    Replies:
    0
    Views:
    4,054
    Robert Mark Bram
    Sep 28, 2003
  2. ygao

    unicode wrap unicode object?

    ygao, Apr 8, 2006, in forum: Python
    Replies:
    6
    Views:
    591
    =?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=
    Apr 8, 2006
  3. Gabriele *darkbard* Farina

    Unicode digit to unicode string

    Gabriele *darkbard* Farina, May 16, 2006, in forum: Python
    Replies:
    2
    Views:
    566
    Gabriele *darkbard* Farina
    May 16, 2006
  4. Tim Golden

    RE: handling unicode data

    Tim Golden, Jun 29, 2006, in forum: Python
    Replies:
    1
    Views:
    871
    Filipe
    Jun 30, 2006
  5. joy99
    Replies:
    2
    Views:
    431
    Mark Tolonen
    Sep 12, 2009
Loading...

Share This Page