Opening Multiple Files in Different Encodings

Subhabrata

Dear Group,

I kept a good number of files in a folder. Now I want to read all of
them. They are in different formats and different encoding. Using
listdir/glob.glob I am able to find the list but how to open/read or
process them for different encodings?

Can anyone help me out? I am using Python 3.2 on Windows.

Regards,
Subhabrata Banerjee.
 
MRAB

You could try different encodings. If decoding raises a UnicodeDecodeError,
then it's the wrong encoding. Otherwise, just look at the decoding
result and see whether it "looks" OK.

I believe that one method is to look at the frequency distribution of
characters compared with sample texts.
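As a minimal sketch of the try-and-see approach, the function below decodes the raw bytes with each candidate in turn and returns the first one that does not raise. The name read_with_fallback and the candidate list are assumptions for illustration, not something from this thread. Note that latin-1 accepts every byte sequence, so it belongs last, and a clean decode only means the bytes were valid for that encoding, not that the guess was right.

```python
def read_with_fallback(path, encodings=("utf-8", "cp1252", "latin-1")):
    """Return (text, encoding) for the first candidate that decodes cleanly.

    The candidate order is an assumption: strict encodings go first,
    latin-1 goes last because it never raises UnicodeDecodeError.
    """
    with open(path, "rb") as f:
        raw = f.read()
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue  # wrong guess; try the next candidate
    raise ValueError("none of %r could decode %s" % (encodings, path))
```

The returned encoding name is worth logging, so that a result that "looks" wrong can be traced back to the guess that produced it.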
 
Steven D'Aprano

open('first file', encoding='utf-8')
open('second file', encoding='latin1')

How you decide which encoding to use is up to you. Perhaps you can keep a
mapping of {filename: encoding} somewhere.

Or perhaps you can try auto-detecting the encodings. The chardet module
should help you there.
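The {filename: encoding} mapping could look something like the sketch below. The function name read_all, the default encoding, and the idea of keying on the bare file name are assumptions for illustration; an auto-detector could be used to fill in the mapping instead of writing it by hand.

```python
import os


def read_all(folder, encodings, default="utf-8"):
    """Read every regular file in folder as text.

    encodings maps a bare file name to its encoding; files that are
    not listed are opened with the default (an assumed fallback).
    """
    texts = {}
    for name in sorted(os.listdir(folder)):
        path = os.path.join(folder, name)
        if not os.path.isfile(path):
            continue  # skip sub-folders
        with open(path, encoding=encodings.get(name, default)) as f:
            texts[name] = f.read()
    return texts
```

A call such as read_all(folder, {"second file": "latin-1"}) would then open "second file" as Latin-1 and everything else as UTF-8.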
 
subhabangalore

Dear Group,

No, generally I know how to use glob.glob and how to handle encodings, as I work a lot with non-ASCII material, but I recently ran into an interesting issue: suppose there are .doc, .docx, .txt, .xls and .pdf files with different encodings.
1) First I have to determine the file type on the fly.
2) I cannot hard-code encoding="..."; whatever the encoding is, I have to read the file.

Any ideas? I am still thinking about it.

Thanks in Advance,
Regards,
Subhabrata Banerjee.
 
Dennis Lee Bieber

Many of those are (semi) proprietary formats (M$ Office <G>).

DOCX (and XLSX) are, as I recall, ZIP-compressed XML formats -- and I
think that also implies UTF-8 (once you manage to decompress them)...
Note that, for a test, I renamed a .docx to .zip and opened it in
PowerArchiver... It generates 19 files in a multi-level tree -- one of
which is named
[Content_Types].xml
and contains
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Types
xmlns="http://schemas.openxmlformats.org/package/2006/content-types">
<Override PartName="/word/footnotes.xml"
ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.footnotes+xml"/>
<Default Extension="rels"
ContentType="application/vnd.openxmlformats-package.relationships+xml"/>
<Default Extension="xml" ContentType="application/xml"/>
<Override PartName="/word/document.xml"
ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml"/>
<Override PartName="/word/numbering.xml"
ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.numbering+xml"/>
<Override PartName="/word/styles.xml"
ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.styles+xml"/>
<Override PartName="/word/endnotes.xml"
ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.endnotes+xml"/>
<Override PartName="/docProps/app.xml"
ContentType="application/vnd.openxmlformats-officedocument.extended-properties+xml"/>
<Override PartName="/word/settings.xml"
ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.settings+xml"/>
<Override PartName="/word/footer2.xml"
ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.footer+xml"/>
<Override PartName="/docProps/custom.xml"
ContentType="application/vnd.openxmlformats-officedocument.custom-properties+xml"/>
<Override PartName="/word/footer1.xml"
ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.footer+xml"/>
<Override PartName="/word/theme/theme1.xml"
ContentType="application/vnd.openxmlformats-officedocument.theme+xml"/>
<Override PartName="/word/fontTable.xml"
ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.fontTable+xml"/>
<Override PartName="/word/webSettings.xml"
ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.webSettings+xml"/>
<Override PartName="/word/header1.xml"
ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.header+xml"/>
<Override PartName="/docProps/core.xml"
ContentType="application/vnd.openxmlformats-package.core-properties+xml"/>
</Types>

That should also apply to the rest of the new Office document
formats.
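The same inspection can be done from Python without renaming anything, since the standard zipfile module opens a .docx directly. This is only a sketch: the helper names are made up for illustration, the part name word/document.xml is taken from the listing above, and the decode step assumes the part really is UTF-8 as its XML declaration says.

```python
import zipfile


def list_docx_parts(path):
    """Return the names of all files stored inside the .docx archive."""
    with zipfile.ZipFile(path) as zf:
        return zf.namelist()


def read_docx_part(path, part="word/document.xml"):
    """Return one part of the archive as text.

    Assumes the part is UTF-8 encoded XML; check its XML declaration
    if the result looks wrong.
    """
    with zipfile.ZipFile(path) as zf:
        return zf.read(part).decode("utf-8")
```

The result is raw XML markup, so getting at the document text still means parsing it, for example with xml.etree.ElementTree.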

Plain DOC format could be a mishmash of three or four binary formats
(Word6 being the last compatible with 16-bit Windows 3.x Word). I
believe one Office version assigned DOC to what were really RTF format
files rather than the binary (yes, binary -- there is no guarantee that
you can find meaningful text without being able to parse a binary file
format).

PDF contents can be binary compressed; again there is no guarantee
you can find meaningful text without being able to parse the contents.
http://partners.adobe.com/public/developer/en/pdf/PDFReference.pdf
(an older version than current standard, I suspect)... Heck, many of the
cheaper PDF conversions basically embed each page as a graphical
(bitmap) image, not as text.

For the Office documents, if you are running on a Windows system (or
can open them in something like OpenOffice), your best chance is
likely to be to open them programmatically in the application and then do a
"save as..." TXT (for Word) and CSV (for Excel) -- then process the
TXT/CSV files (or save as RTF if that is an option -- that's usually in
whatever the locale-specific Windows code page is, if not plain
ASCII).

I believe there is a library to read Excel files directly:
http://pypi.python.org/pypi/xlrd/

For PDF; I don't know if Acrobat Reader supports automation, to
programmatically load and "save as text".
http://p2p.wrox.com/vb-net-2002-2003-basics/39037-acrobat-reader-automation.html
implies an ability to automate on Windows, so using the win32 extension
library or ctypes may give you access to work with the files.
 
Steven D'Aprano

Dear Group,

No generally I know the glob.glob or the encodings as I work lot on
non-ASCII stuff, but I recently found an interesting issue, suppose
there are .doc,.docx,.txt,.xls,.pdf files with different encodings.

You can have text files with different encodings, but not the others.

.doc, .docx, .xls and .pdf are all binary files. You don't specify an
encoding when you read them, because they aren't text -- encodings are
for mapping bytes to text, not bytes to binary formats.

In particular, .docx is compressed XML, so once you have uncompressed it,
the contents are XML, which is almost always UTF-8 (the XML declaration
names the encoding).

1) First I have to determine on the fly the file type.

Which is a different problem from your first post.

On Windows, you determine the file type using the file extension.

import os
name, ext = os.path.splitext("my_file_name.bmp")

will give you ext = ".bmp".

Then what do you expect to do? You can open the file as a binary blob,
but what do you expect then?

f = open("my_file_name.bmp", "rb")

Now what do you want to do with it?

2) I can not assign
encoding="..." whatever be the encoding I have to read it.

You can't set the encoding when you open files in binary mode, but binary
files don't have an encoding.
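Putting the two points together, a rough sketch of dispatching on the extension might look like this. The set of text extensions, the function name read_file, and the default encoding are all assumptions for illustration; anything not known to be text is returned as bytes for a format-specific parser to deal with.

```python
import os

# Assumed for illustration: only .txt is treated as plain text.
TEXT_EXTENSIONS = {".txt"}


def read_file(path, text_encoding="utf-8"):
    """Return str for known text files, bytes for everything else."""
    ext = os.path.splitext(path)[1].lower()
    if ext in TEXT_EXTENSIONS:
        # Text mode: an encoding applies here.
        with open(path, encoding=text_encoding) as f:
            return f.read()
    # Binary mode: no encoding argument, the caller gets raw bytes.
    with open(path, "rb") as f:
        return f.read()
```

Checking isinstance(result, bytes) afterwards tells the caller whether the file still needs a dedicated parser.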
 
