No generally I know the glob.glob or the encodings as I work lot on non-ASCII stuff, but I recently found an interesting issue, suppose there are .doc,.docx,.txt,.xls,.pdf files with different encodings.
1) First I have to determine on the fly the file type.
2) I can not assign encoding="..." whatever be the encoding I have to read it.
Many of those are (semi) proprietary formats (M$ Office <G>).
DOCX (and XLSX) are, as I recall ZIP-compressed XML formats -- and I
think that also implies UTF-8 (once you manage to decompress them)...
Note that, for a test, I renamed a .docx to .zip and opened it in
PowerArchiver... It generates 19 files in a multi-level tree -- one of
which is named
[content_types].xml
and contains
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Types
xmlns="
http://schemas.openxmlformats.org/package/2006/content-types">
<Override PartName="/word/footnotes.xml"
ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.footnotes+xml"/>
<Default Extension="rels"
ContentType="application/vnd.openxmlformats-package.relationships+xml"/>
<Default Extension="xml" ContentType="application/xml"/>
<Override PartName="/word/document.xml"
ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml"/>
<Override PartName="/word/numbering.xml"
ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.numbering+xml"/>
<Override PartName="/word/styles.xml"
ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.styles+xml"/>
<Override PartName="/word/endnotes.xml"
ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.endnotes+xml"/>
<Override PartName="/docProps/app.xml"
ContentType="application/vnd.openxmlformats-officedocument.extended-properties+xml"/>
<Override PartName="/word/settings.xml"
ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.settings+xml"/>
<Override PartName="/word/footer2.xml"
ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.footer+xml"/>
<Override PartName="/docProps/custom.xml"
ContentType="application/vnd.openxmlformats-officedocument.custom-properties+xml"/>
<Override PartName="/word/footer1.xml"
ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.footer+xml"/>
<Override PartName="/word/theme/theme1.xml"
ContentType="application/vnd.openxmlformats-officedocument.theme+xml"/>
<Override PartName="/word/fontTable.xml"
ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.fontTable+xml"/>
<Override PartName="/word/webSettings.xml"
ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.webSettings+xml"/>
<Override PartName="/word/header1.xml"
ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.header+xml"/>
<Override PartName="/docProps/core.xml"
ContentType="application/vnd.openxmlformats-package.core-properties+xml"/>
</Types>
That should also apply to the rest of the new Office document
formats.
Plain DOC format could be a mishmash of three or four binary formats
(Word6 being the last compatible with 16-bit Windows 3.x Word). I
believe one Office version assigned DOC to what were really RTF format
files rather than the binary (yes, binary -- there is no guarantee that
you can find meaningful text without being able to parse a binary file
format).
PDF contents can by binary compressed; again there is no guarantee
you can find meaningful text without being able to parse the contents.
http://partners.adobe.com/public/developer/en/pdf/PDFReference.pdf
(an older version than current standard, I suspect)... Heck, many of the
cheaper PDF conversions basically embed each page as a graphical
(bitmap) image, not as text.
For the Office documents, if you are running on a Windows system (or
can open them in something like OpenOffice), your best chances are
likely to be programmatically open them in the application and then do a
"save as..." TXT (for Word) and CSV (for Excel) -- then process the
TXT/CSV files (or save as RTF if that is an option -- that's usually in
whatever the locale specific Windows code page contains, if not plain
ASCII).
I believe there is a library to read Excel files directly:
http://pypi.python.org/pypi/xlrd/
For PDF; I don't know if Acrobat Reader supports automation, to
programmatically load and "save as text".
http://p2p.wrox.com/vb-net-2002-2003-basics/39037-acrobat-reader-automation.html
implies an ability to automate on Windows, so using the win32 extension
library or ctypes may give you access to work with the files.