the unicode saga continues...

Discussion in 'Python' started by Ethan Furman, Nov 14, 2009.

  1. Ethan Furman

    Ethan Furman Guest

    So I've added unicode support to my dbf package, but I also have some
    rather large programs that aren't ready to make the switch over yet. So
    as a workaround I added a (rather lame) option to convert the
    unicode-ified data that was decoded from the dbf table back into an
    encoded format.

    Here's the fun part: in figuring out what the option should be for use
    with my system, I tried some tests...

    Python 2.5.4 (r254:67916, Dec 23 2008, 15:10:54) [MSC v.1310 32 bit
    (Intel)] on win32
    Type "help", "copyright", "credits" or "license" for more information.
    >>> print u'\xed'

    í
    >>> print u'\xed'.encode('cp437')

    í
    >>> print u'\xed'.encode('cp850')

    í
    >>> print u'\xed'.encode('cp1252')

    φ
    >>> import locale
    >>> locale.getdefaultlocale()

    ('en_US', 'cp1252')

    My confusion lies in my apparent codepage (cp1252), and the discrepancy
    with character u'\xed', which is absolutely an i with an accent; yet when
    I encode with cp1252 and print it, I get an o with a line.

    Can anybody clue me in to what's going on here?

    ~Ethan~
    Ethan Furman, Nov 14, 2009
    #1

  2. Ethan Furman wrote:
    > Python 2.5.4 (r254:67916, Dec 23 2008, 15:10:54) [MSC v.1310 32 bit
    > (Intel)] on win32
    > Type "help", "copyright", "credits" or "license" for more information.
    > >>> print u'\xed'

    > í
    > >>> print u'\xed'.encode('cp437')

    > í
    > >>> print u'\xed'.encode('cp850')

    > í
    > >>> print u'\xed'.encode('cp1252')

    > φ
    > >>> import locale
    > >>> locale.getdefaultlocale()

    > ('en_US', 'cp1252')
    >
    > My confusion lies in my apparent codepage (cp1252), and the discrepancy
    > with character u'\xed' which is absolutely an i with an accent; yet when
    > I encode with cp1252 and print it, I get an o with a line.

    ^^^^^^^^^^^^^^^^^^^^^^
    For the record: I read a small Greek letter phi in your posting, not an o
    with a line. If I encode according to my default locale (UTF-8), I get the
    letter i with the accent. If I encode with codepage 1252, I get a marker for
    an invalid character on my terminal. This is using Debian though, not MS
    Windows.

    Try printing the repr() of that. The point is that internally, you have the
    code point U+00ED (u'\xed'). You then encode it in various codepages, which
    yields a string of bytes representing it ('\xa1', '\xa1' and '\xed'
    respectively), useful for storing on disk when the file uses said codepage,
    or for other forms of IO.
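
    A minimal sketch of that check, written for Python 3 as an assumption (the
    session above is Python 2.5, where the literal would be u'\xed' and the
    encoded results would be plain str objects). It prints the repr() of each
    encoded form, then decodes the cp1252 byte as cp437 to imitate what a cp437
    console would display:

```python
# Python 3 sketch; names below are illustrative only.
ch = '\xed'  # U+00ED, LATIN SMALL LETTER I WITH ACUTE

# The same character becomes different bytes in different code pages.
encoded = {}
for codec in ('cp437', 'cp850', 'cp1252'):
    encoded[codec] = ch.encode(codec)
    print(codec, repr(encoded[codec]))

# A console that interprets bytes as cp437 reads the cp1252 byte 0xED
# as a different character: GREEK SMALL LETTER PHI (U+03C6).
misread = encoded['cp1252'].decode('cp437')
print(repr(misread))
```

    Looking at the repr() takes the terminal out of the picture, so only the
    bytes themselves are compared.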

    Now, with a Unicode string, print knows what to do: it encodes the string
    according to the default locale and sends the resulting bytes to stdout.
    With a byte string, I think it forwards the content directly to stdout.

    Note:
    * If you want to verify your code, rather use 'print repr(..)'.
    * I could imagine that your locale is simply not set up correctly.

    Uli
    Ulrich Eckhardt, Nov 14, 2009
    #2

  3. > Can anybody clue me in to what's going on here?

    It's as Mark says: the console encoding is cp437 on your system, not
    cp1252.

    Windows has *two* default code pages at any point in time: the
    OEM code page, and the ANSI code page. Either one depends on the
    Windows release (Western, Japanese, etc.), and can be set by the
    administrator. The OEM code page is primarily used for the console
    (and then also as the encoding on the FAT filesystem); the ANSI
    code page is used in all other places (that don't use Unicode APIs).

    In addition, the console code page may deviate from the OEM code
    page, if you run chcp.exe.
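
    To see these values on a given machine, something along the following
    lines might help. It is a sketch for Python 3, and report_encodings is a
    made-up helper name. The Windows branch calls the Win32 functions GetACP,
    GetOEMCP and GetConsoleOutputCP through ctypes; elsewhere only the
    portable values are reported.

```python
import locale
import sys


def report_encodings():
    """Collect the encodings Python and (on Windows) the OS report."""
    info = {
        # Encoding print uses when writing text to stdout (may be None).
        'stdout': sys.stdout.encoding,
        # Encoding suggested by the locale settings.
        'preferred': locale.getpreferredencoding(False),
    }
    if sys.platform == 'win32':
        import ctypes
        kernel32 = ctypes.windll.kernel32
        info['ansi'] = 'cp%d' % kernel32.GetACP()      # ANSI code page
        info['oem'] = 'cp%d' % kernel32.GetOEMCP()     # OEM code page
        # Console output code page; 0 when no console is attached.
        info['console'] = 'cp%d' % kernel32.GetConsoleOutputCP()
    return info


info = report_encodings()
for name in sorted(info):
    print(name, info[name])
```

    On a setup like the one in the original post, the ANSI entry would be
    expected to read cp1252 and the OEM entry cp437. Recent Python 3 releases
    may report utf-8 for stdout on a Windows console, so the stdout value need
    not match the console code page.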

    Regards,
    Martin
    Martin v. Löwis, Nov 14, 2009
    #3