Reading Windows CSV file with LCID entries under Linux.

Discussion in 'Python' started by Thomas Troeger, Sep 22, 2008.

  1. Dear all,

    I've stumbled over a problem with Windows Locale ID information and
    codepages. I'm writing a Python application that parses a CSV file in
    which each line has the format "LCID;Text1;Text2". Each line can
    contain a different locale ID (LCID), and the text fields contain data
    encoded in the codepage associated with that LCID. My current data
    file contains the codes 1031 for German and 1033 for English US (as
    listed in
    http://www.microsoft.com/globaldev/reference/lcid-all.mspx).
    Unfortunately, I cannot find out which codepage (like cp-1252 or
    whatever) belongs to which LCID.

    My question is: How can I convert this data into something more
    reasonable like unicode? Basically, what I want is something like
    "Text1;Text2", both fields encoded as UTF-8. Can this be done with
    Python? How can I find out which codepage I have to use for 1033 and 1031?

    Any help appreciated,
    Thomas.
    Thomas Troeger, Sep 22, 2008
    #1

  2. Skip

    Guest

    Thomas> My question is: How can I convert this data into something more
    Thomas> reasonable like unicode? Basically, what I want is something
    Thomas> like "Text1;Text2", both fields encoded as UTF-8. Can this be
    Thomas> done with Python? How can I find out which codepage I have to
    Thomas> use for 1033 and 1031?

    There are examples at the end of the csv module documentation which show
    how to create Unicode readers and writers. You can extend the
    UnicodeReader class to peek at the LCID field and use the corresponding
    codepage to decode the remainder of the line. (This assumes that no field
    in your CSV files contains an embedded newline, so that each line read is
    a complete record.)
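
    A minimal sketch of that idea, assuming Python 3 (where the csv module
    works on str directly, so no UnicodeReader class is needed). The names
    LCID_TO_CODEC, decode_lines and convert are made up for illustration, and
    the table only covers the two LCIDs from the question, both of which use
    the Windows-1252 ANSI codepage:

```python
import csv

# Assumed mapping, limited to the LCIDs mentioned in the thread.
# 1031 (German, Germany) and 1033 (English, US) both use Windows-1252.
LCID_TO_CODEC = {"1031": "cp1252", "1033": "cp1252"}


def decode_lines(raw_lines, default_codec="cp1252"):
    """Yield each raw byte line decoded with the codec its LCID selects."""
    for raw in raw_lines:
        # The LCID is plain ASCII digits, so it can be read before decoding.
        lcid = raw.split(b";", 1)[0].strip().decode("ascii", "replace")
        yield raw.decode(LCID_TO_CODEC.get(lcid, default_codec))


def convert(src, dst):
    """Read 'LCID;Text1;Text2' byte lines from src, write 'Text1;Text2' to dst."""
    reader = csv.reader(decode_lines(src), delimiter=";")
    writer = csv.writer(dst, delimiter=";", lineterminator="\n")
    for row in reader:
        if len(row) >= 3:  # skip blank or malformed lines
            writer.writerow(row[1:3])
```

    To get UTF-8 output, open the source in binary mode and the destination
    in text mode, e.g. convert(open("in.csv", "rb"), open("out.csv", "w",
    encoding="utf-8", newline="")). As noted above, this only works while no
    field contains an embedded newline.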

    Skip
    Skip, Sep 22, 2008
    #2

  3. Tim Golden

    Guest

    Thomas Troeger wrote:
    > I've stumbled over a problem with Windows Locale ID information and
    > codepages. I'm writing a Python application that parses a CSV file,
    > the format of a line in this file is "LCID;Text1;Text2". Each line can
    > contain a different locale id (LCID) and the text fields contain data
    > that is encoded in some codepage which is associated with this LCID. My
    > current data file contains the codes 1031 for German and 1033 for
    > English US (as listed in
    > http://www.microsoft.com/globaldev/reference/lcid-all.mspx).
    > Unfortunately, I cannot find out which Codepage (like cp-1252 or
    > whatever) belongs to which LCID.
    >
    > My question is: How can I convert this data into something more
    > reasonable like unicode? Basically, what I want is something like
    > "Text1;Text2", both fields encoded as UTF-8. Can this be done with
    > Python? How can I find out which codepage I have to use for 1033 and 1031?



    The GetLocaleInfo API call can do that lookup (LCID to codepage):

    http://msdn.microsoft.com/en-us/library/ms776270(VS.85).aspx

    You'll need to use ctypes (or write a C extension) to
    use it. Be aware that if it doesn't succeed you may need
    to fall back on cp 65001 (UTF-8).
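
    A rough sketch of that lookup via ctypes. GetLocaleInfoW only exists on
    Windows, so on other platforms this sketch falls back to a small
    hand-written table; the table and the function names ansi_codepage and
    codec_name are assumptions added for illustration:

```python
import sys

LOCALE_IDEFAULTANSICODEPAGE = 0x1004  # LCTYPE constant from the Windows SDK

# Assumed fallback table for non-Windows hosts; only the LCIDs from the thread.
STATIC_CODEPAGES = {1031: 1252, 1033: 1252}


def ansi_codepage(lcid):
    """Return the ANSI codepage number for lcid, or None if it is unknown."""
    if sys.platform == "win32":
        import ctypes

        buf = ctypes.create_unicode_buffer(6)  # up to 5 digits plus NUL
        n = ctypes.windll.kernel32.GetLocaleInfoW(
            lcid, LOCALE_IDEFAULTANSICODEPAGE, buf, len(buf)
        )
        # A return value of 0 means failure; a result of "0" means the
        # locale has no ANSI codepage of its own.
        if n > 0 and buf.value.isdigit() and int(buf.value) != 0:
            return int(buf.value)
        return None
    return STATIC_CODEPAGES.get(lcid)


def codec_name(lcid):
    """Map an LCID to a Python codec name, falling back to UTF-8."""
    cp = ansi_codepage(lcid)
    return "utf-8" if cp is None else "cp%d" % cp
```

    Since the file is being read under Linux, one option is to run the
    Windows branch once on a Windows machine for every LCID in the data and
    paste the results into the static table.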

    TJG
    Tim Golden, Sep 22, 2008
    #3