Opening multiple Files in Different Encoding

Discussion in 'Python' started by Subhabrata, Jul 10, 2012.

  1. Subhabrata

    Subhabrata Guest

    Dear Group,

    I kept a good number of files in a folder. Now I want to read all of
    them. They are in different formats and different encoding. Using
    listdir/glob.glob I am able to find the list but how to open/read or
    process them for different encodings?

    Can anyone help me out? I am using Python 3.2 on Windows.

    Regards,
    Subhabrata Banerjee.
    Subhabrata, Jul 10, 2012
    #1

  2. MRAB

    MRAB Guest

    On 10/07/2012 18:46, Subhabrata wrote:
    > Dear Group,
    >
    > I kept a good number of files in a folder. Now I want to read all of
    > them. They are in different formats and different encoding. Using
    > listdir/glob.glob I am able to find the list but how to open/read or
    > process them for different encodings?
    >
    > If any one can help me out.I am using Python3.2 on Windows.
    >

    You could try different encodings. If decoding raises a
    UnicodeDecodeError, then it's the wrong encoding. Otherwise, just look
    at the decoded result and see whether it "looks" OK.

    I believe that one method is to look at the frequency distribution of
    characters compared with sample texts.
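
    A minimal sketch of the trial-and-error approach, assuming you already
    have a short list of plausible encodings. The function names and the
    candidate list are made up for illustration. Note that latin-1 accepts
    every byte sequence, so it has to go last, and a clean decode only means
    the bytes were valid in that encoding, not that the text is right.

```python
def decode_with_fallback(data, encodings=("utf-8", "cp1252", "latin-1")):
    """Return (text, encoding) for the first candidate that decodes cleanly."""
    for name in encodings:
        try:
            return data.decode(name), name
        except UnicodeDecodeError:
            continue  # wrong guess, try the next candidate
    raise ValueError("none of the candidate encodings fit")


def read_text(path, encodings=("utf-8", "cp1252", "latin-1")):
    # Read the raw bytes once, then try each candidate in order.
    with open(path, "rb") as f:
        return decode_with_fallback(f.read(), encodings)
```

    You would still want to eyeball the result, as described above.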
    MRAB, Jul 10, 2012
    #2

  3. On Tue, 10 Jul 2012 10:46:08 -0700, Subhabrata wrote:

    > Dear Group,
    >
    > I kept a good number of files in a folder. Now I want to read all of
    > them. They are in different formats and different encoding. Using
    > listdir/glob.glob I am able to find the list but how to open/read or
    > process them for different encodings?


    open('first file', encoding='utf-8')
    open('second file', encoding='latin1')

    How you decide which encoding to use is up to you. Perhaps you can keep a
    mapping of {filename: encoding} somewhere.

    Or perhaps you can try auto-detecting the encodings. The chardet module
    should help you there.
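
    The {filename: encoding} mapping could look roughly like this. The
    table contents, the default and the helper names are placeholders, not
    anything standard:

```python
import os

# Hypothetical table; fill it in from whatever you know about your files.
KNOWN_ENCODINGS = {
    "first file": "utf-8",
    "second file": "latin-1",
}


def encoding_for(path, default="utf-8"):
    """Look up the encoding by file name, falling back to a default."""
    return KNOWN_ENCODINGS.get(os.path.basename(path), default)


def read_known(path):
    # Text mode: open() decodes the bytes with the encoding we looked up.
    with open(path, encoding=encoding_for(path)) as f:
        return f.read()
```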



    --
    Steven
    Steven D'Aprano, Jul 11, 2012
    #3
  4. Subhabrata

    Guest

    On Tuesday, July 10, 2012 11:16:08 PM UTC+5:30, Subhabrata wrote:
    > Dear Group,
    >
    > I kept a good number of files in a folder. Now I want to read all of
    > them. They are in different formats and different encoding. Using
    > listdir/glob.glob I am able to find the list but how to open/read or
    > process them for different encodings?
    >
    > If any one can help me out.I am using Python3.2 on Windows.
    >
    > Regards,
    > Subhabrata Banerjee.

    Dear Group,

    No, in general I know glob.glob and encodings well, as I work a lot with non-ASCII text. But I recently ran into an interesting issue: suppose there are .doc, .docx, .txt, .xls and .pdf files with different encodings.
    1) First I have to determine the file type on the fly.
    2) I cannot hard-code encoding="..."; whatever the encoding is, I have to read the file.

    Any ideas? I am still thinking about it.

    Thanks in Advance,
    Regards,
    Subhabrata Banerjee.
    , Jul 11, 2012
    #4
  5. On Wed, 11 Jul 2012 11:15:02 -0700 (PDT),
    declaimed the following in gmane.comp.python.general:

    > No generally I know the glob.glob or the encodings as I work lot on non-ASCII stuff, but I recently found an interesting issue, suppose there are .doc,.docx,.txt,.xls,.pdf files with different encodings.
    > 1) First I have to determine on the fly the file type.
    > 2) I can not assign encoding="..." whatever be the encoding I have to read it.
    >


    Many of those are (semi) proprietary formats (M$ Office <G>).

    DOCX (and XLSX) are, as I recall, ZIP-compressed XML formats -- and I
    think that also implies UTF-8 (once you manage to decompress them)...
    Note that, for a test, I renamed a .docx to .zip and opened it in
    PowerArchiver... It generates 19 files in a multi-level tree -- one of
    which is named
    [Content_Types].xml
    and contains
    <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
    <Types
    xmlns="http://schemas.openxmlformats.org/package/2006/content-types">
    <Override PartName="/word/footnotes.xml"
    ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.footnotes+xml"/>
    <Default Extension="rels"
    ContentType="application/vnd.openxmlformats-package.relationships+xml"/>
    <Default Extension="xml" ContentType="application/xml"/>
    <Override PartName="/word/document.xml"
    ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml"/>
    <Override PartName="/word/numbering.xml"
    ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.numbering+xml"/>
    <Override PartName="/word/styles.xml"
    ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.styles+xml"/>
    <Override PartName="/word/endnotes.xml"
    ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.endnotes+xml"/>
    <Override PartName="/docProps/app.xml"
    ContentType="application/vnd.openxmlformats-officedocument.extended-properties+xml"/>
    <Override PartName="/word/settings.xml"
    ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.settings+xml"/>
    <Override PartName="/word/footer2.xml"
    ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.footer+xml"/>
    <Override PartName="/docProps/custom.xml"
    ContentType="application/vnd.openxmlformats-officedocument.custom-properties+xml"/>
    <Override PartName="/word/footer1.xml"
    ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.footer+xml"/>
    <Override PartName="/word/theme/theme1.xml"
    ContentType="application/vnd.openxmlformats-officedocument.theme+xml"/>
    <Override PartName="/word/fontTable.xml"
    ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.fontTable+xml"/>
    <Override PartName="/word/webSettings.xml"
    ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.webSettings+xml"/>
    <Override PartName="/word/header1.xml"
    ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.header+xml"/>
    <Override PartName="/docProps/core.xml"
    ContentType="application/vnd.openxmlformats-package.core-properties+xml"/>
    </Types>

    That should also apply to the rest of the new Office document
    formats.
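
    The same inspection can be done from Python with the standard zipfile
    module instead of renaming the file. This is only a sketch: the function
    names are invented, and the text extraction assumes the usual
    WordprocessingML namespace and simply concatenates the text runs, so
    paragraph breaks are lost.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by the w:t (text run) elements in word/document.xml.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def list_docx_parts(path):
    """Return the names of the files stored inside the package."""
    with zipfile.ZipFile(path) as package:
        return package.namelist()


def docx_text(path, part="word/document.xml"):
    """Concatenate the text runs found in the main document part."""
    with zipfile.ZipFile(path) as package:
        raw = package.read(part)  # bytes; still XML markup
    # The parser reads the encoding from the XML declaration itself.
    root = ET.fromstring(raw)
    return "".join(node.text or "" for node in root.iter(W_NS + "t"))
```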

    Plain DOC format could be a mishmash of three or four binary formats
    (Word6 being the last compatible with 16-bit Windows 3.x Word). I
    believe one Office version assigned DOC to what were really RTF format
    files rather than the binary (yes, binary -- there is no guarantee that
    you can find meaningful text without being able to parse a binary file
    format).

    PDF contents can be binary compressed; again there is no guarantee
    you can find meaningful text without being able to parse the contents.
    http://partners.adobe.com/public/developer/en/pdf/PDFReference.pdf
    (an older version than current standard, I suspect)... Heck, many of the
    cheaper PDF conversions basically embed each page as a graphical
    (bitmap) image, not as text.

    For the Office documents, if you are running on a Windows system (or
    can open them in something like OpenOffice), your best chance is
    likely to be to open them programmatically in the application and then
    do a "save as..." TXT (for Word) and CSV (for Excel) -- then process the
    TXT/CSV files (or save as RTF if that is an option -- that's usually in
    whatever the locale-specific Windows code page is, if not plain
    ASCII).

    I believe there is a library to read Excel files directly:
    http://pypi.python.org/pypi/xlrd/

    For PDF, I don't know if Acrobat Reader supports automation, to
    programmatically load and "save as text".
    http://p2p.wrox.com/vb-net-2002-2003-basics/39037-acrobat-reader-automation.html
    implies an ability to automate on Windows, so using the win32 extension
    library or ctypes may give you access to work with the files.


    --
    Wulfraed Dennis Lee Bieber AF6VN
    HTTP://wlfraed.home.netcom.com/
    Dennis Lee Bieber, Jul 11, 2012
    #5
  6. On Wed, 11 Jul 2012 11:15:02 -0700, subhabangalore wrote:

    > On Tuesday, July 10, 2012 11:16:08 PM UTC+5:30, Subhabrata wrote:
    >> Dear Group,
    >>
    >> I kept a good number of files in a folder. Now I want to read all of
    >> them. They are in different formats and different encoding. Using
    >> listdir/glob.glob I am able to find the list but how to open/read or
    >> process them for different encodings?
    >>
    >> If any one can help me out.I am using Python3.2 on Windows.
    >>
    >> Regards,
    >> Subhabrata Banerjee.

    > Dear Group,
    >
    > No generally I know the glob.glob or the encodings as I work lot on
    > non-ASCII stuff, but I recently found an interesting issue, suppose
    > there are .doc,.docx,.txt,.xls,.pdf files with different encodings.


    You can have text files with different encodings, but not the others.

    .doc, .docx, .xls and .pdf are all binary files. You don't specify an
    encoding when you read them, because they aren't text -- encodings are
    for mapping bytes to text, not bytes to binary formats.

    In particular, .docx is compressed XML, so once you have uncompressed it,
    the contents are XML, which is normally UTF-8 (the package format also
    permits UTF-16, and the XML declaration says which).


    > 1) First I have to determine on the fly the file type.


    Which is a different problem from your first post.

    On Windows, you determine the file type using the file extension.

    import os
    name, ext = os.path.splitext("my_file_name.bmp")

    will give you ext = ".bmp".

    Then what do you expect to do? You can open the file as a binary blob,
    but what do you expect then?

    f = open("my_file_name.bmp", "rb")

    Now what do you want to do with it?
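
    If the extension can't be trusted, one thing you can do with the open
    binary file is look at its first few bytes. A rough sketch for the
    formats mentioned in this thread; the labels and function names are
    invented, and the signature table is far from complete. A .docx or
    .xlsx only shows up as "zip" and a legacy .doc or .xls as "ole2", so
    this narrows the type down rather than settling it.

```python
# Leading bytes ("magic numbers") of a few container formats.
SIGNATURES = [
    (b"%PDF-", "pdf"),
    (b"PK\x03\x04", "zip"),                       # .docx, .xlsx, plain .zip
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "ole2"),  # legacy .doc, .xls
    (b"{\\rtf", "rtf"),                           # RTF, sometimes named .doc
]


def sniff(head):
    """Guess the container type from the first bytes of a file."""
    for magic, label in SIGNATURES:
        if head.startswith(magic):
            return label
    return "unknown"


def sniff_file(path):
    # Eight bytes is enough for the longest signature in the table.
    with open(path, "rb") as f:
        return sniff(f.read(8))
```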


    > 2) I can not assign
    > encoding="..." whatever be the encoding I have to read it.


    You can't set the encoding when you open files in binary mode, but binary
    files don't have an encoding.
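
    You can see this for yourself with a quick check. The helper name below
    is made up, and os.devnull is used only so that the path exists on any
    platform:

```python
import os


def binary_open_rejects_encoding(path=os.devnull):
    """Return True if open() refuses encoding= together with mode 'rb'."""
    try:
        open(path, "rb", encoding="utf-8").close()
    except ValueError:
        # Python 3 rejects the combination outright.
        return True
    return False
```

    If you do want text out of bytes you read yourself, call .decode() on
    them explicitly with the encoding you believe applies.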



    --
    Steven
    Steven D'Aprano, Jul 12, 2012
    #6