Determining when a file is an Open Office Document

tubby

Silly question, but here goes... what's a good way to determine when a
file is an Open Office document? I could look at the file extension, but
it seems there would be a better way. VI shows this info in the files:

mimetypeapplication/vnd.oasis.opendocument.textPK
mimetypeapplication/vnd.oasis.opendocument.presentationPK
etc.

Not really a Python specific question but, how do you guys do this sort
of thing? I've figured out how to break out the content.xml file in the
new OOo XML format, and do re searching and matching on that, now I just
need a fast, reliable way to determine when I need to do that versus
just reading the file.

Thanks,
Tubby
 
Ben Finney

tubby said:
Silly question, but here goes... what's a good way to determine when
a file is an Open Office document? I could look at the file
extension, but it seems there would be a better way.

Yes, the name of a file may be useful for communicating with humans
about that file's intended use, but is a lousy, unreliable way to make
a definite statement about the actual contents of the file.

The Unix 'file' command determines the type of a file by its contents,
not its name. This functionality is essentially a database of "magic"
byte patterns mapping to file types, and is provided by a library
called "libmagic", distributed with most GNU/Linux distributions.

<URL:http://packages.debian.org/testing/source/file>
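
As a rough illustration of the "database of magic byte patterns" idea, here is a minimal pure-Python sketch. It is not libmagic or its API: the table, the function names and the probe size are all made up for this example, and the real database is far larger and more subtle.

```python
# Hypothetical mini "magic" table: (offset, byte pattern, description).
# This only illustrates the lookup idea; it is not libmagic's database.
MAGIC_TABLE = [
    (0, b"PK\x03\x04", "ZIP archive"),
    (0, b"%PDF-", "PDF document"),
    (0, b"GIF87a", "GIF image"),
    (0, b"GIF89a", "GIF image"),
    (0, b"\x89PNG\r\n\x1a\n", "PNG image"),
]


def guess_type_from_bytes(data):
    """Return the description of the first matching pattern, or None."""
    for offset, pattern, description in MAGIC_TABLE:
        if data[offset:offset + len(pattern)] == pattern:
            return description
    return None


def guess_type_from_file(path, probe_size=64):
    """Classify a file by reading only its first probe_size bytes."""
    with open(path, "rb") as handle:
        return guess_type_from_bytes(handle.read(probe_size))
```

The file name never enters into the decision; only the leading bytes do.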

There is a Python interface to the "magic" functionality. It's in
Debian; I'm not sure if it's part of the "magic" code base, or written
separately to interface with it. Either way, you can get the source
for those packages and find out more.

<URL:http://packages.debian.org/unstable/python/python-magic>
 
Ross Ridge

tubby said:
Silly question, but here goes... what's a good way to determine when a
file is an Open Office document? I could look at the file extension, but
it seems there would be a better way. VI shows this info in the files:

mimetypeapplication/vnd.oasis.opendocument.textPK

It's a ZIP archive. What you've found is the file name
"mimetype", the uncompressed contents of that file
"application/vnd.oasis.opendocument.text", and part of the ZIP magic
number "PK". You should be able to use the "zipfile" module to check
whether the file is a ZIP file, whether it has a member named "mimetype",
and whether the contents of that member match one of the OpenOffice MIME
types.
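
The three checks above might be sketched like this. The function name odf_mimetype and the choice to accept anything starting with "application/vnd.oasis.opendocument." are assumptions made for the example; a stricter version could compare against an explicit list of MIME types.

```python
import zipfile

# Assumption: any MIME type with this prefix counts as an OpenDocument file.
ODF_PREFIX = "application/vnd.oasis.opendocument."


def odf_mimetype(path):
    """Return the MIME type declared inside the file, or None if the file
    is not a ZIP archive with an OpenDocument "mimetype" member."""
    if not zipfile.is_zipfile(path):
        return None
    try:
        with zipfile.ZipFile(path) as archive:
            declared = archive.read("mimetype")
    except (KeyError, zipfile.BadZipFile):
        # No "mimetype" member, or the archive is damaged.
        return None
    try:
        declared = declared.decode("ascii").strip()
    except UnicodeDecodeError:
        return None
    return declared if declared.startswith(ODF_PREFIX) else None
```

A caller would then break out content.xml only when odf_mimetype(filename) returns something other than None, and read the file directly otherwise.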

Ross Ridge
 
tubby

Ross said:
It's a ZIP archive.

Thanks, I used this approach:

import zipfile
if zipfile.is_zipfile(filename):
    ...

Now, if only I could do something like that on PDF files :)
 
Steven D'Aprano

Ben Finney said:
Yes, the name of a file may be useful for communicating with humans
about that file's intended use, but is a lousy, unreliable way to make
a definite statement about the actual contents of the file.

The Unix 'file' command determines the type of a file by its contents,
not its name. This functionality is essentially a database of "magic"
byte patterns mapping to file types,

Ah, another lousy, unreliable way to make a definite statement about the
actual contents of a file. Looking at magic bytes inside a file is hardly
bullet-proof (although file seems to be moderately reliable in practice,
at least under Linux).

Simple example: is the file consisting of two bytes "\x09\x0A" meant to be a
text file with a tab and a newline, or a binary file consisting of a
single two-byte int? There's no way to tell just from the contents.
It's a circular problem: to be sure what the file is ("it's a two-byte
int") one has to understand the contents ("the integer 2314", if read
big-endian) -- but you can only understand the contents if you know what
the file is.
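
A small sketch of that ambiguity, using only the standard library. The variable names are arbitrary; the point is that the same two bytes decode cleanly under every interpretation.

```python
import struct

data = b"\x09\x0a"

# Read as ASCII text: a tab followed by a newline.
as_text = data.decode("ascii")

# Read as one unsigned 16-bit integer, in either byte order.
as_big_endian = struct.unpack(">H", data)[0]
as_little_endian = struct.unpack("<H", data)[0]

print(repr(as_text), as_big_endian, as_little_endian)
# prints: '\t\n' 2314 2569
```

None of the three readings raises an error, so nothing in the bytes themselves says which one the file's creator intended.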

There are only two ways out of this vicious circle:

(1) Have the creator of the file unambiguously label it. Some file systems
associate file-type metadata with files (e.g. Classic Apple Macintosh did
that), but sadly the main file systems in use today do not.

(2) Make an educated guess from various heuristics and conventions. The
old DOS 8.3 naming system is one such convention, and modern operating
systems tend to follow it. The Unix "file" utility's database of magic
bytes is such a heuristic.
 
Ross Ridge

tubby said:
Now, if only I could do something like that on PDF files :)

PDF files should begin with "%PDF-" followed by a version number, e.g.
"%PDF-1.4". The PDF Reference notes that Adobe Acrobat Reader is a bit
more flexible about what it will accept:

13. Acrobat viewers require only that the header appear
somewhere within the first 1024 bytes of the file.
14. Acrobat viewers also accept a header of the form
%!PS-Adobe-N.n PDF-M.m

So identifying PDF files is pretty easy. If you want to examine the
contents of a PDF file you're better off using Postscript, Ghostscript
specifically, since PDF is essentially Postscript with a special
dictionary of commands.
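
A header check along those lines could look roughly like this. The function names and the strict flag are invented for the example, and the version pattern (digits, a dot, digits) is an assumption about what a version number looks like. With strict=True only a header at the very start of the file counts; with strict=False the two looser forms from the notes above are also accepted within the first 1024 bytes.

```python
import re

# Header forms taken from the notes quoted above.
PDF_HEADER = re.compile(rb"%PDF-(\d+\.\d+)")
PS_STYLE_HEADER = re.compile(rb"%!PS-Adobe-\d+\.\d+ PDF-(\d+\.\d+)")


def pdf_version_from_bytes(head, strict=True):
    """Return the declared PDF version as a string, or None."""
    if strict:
        # Only accept "%PDF-M.m" at offset 0.
        match = PDF_HEADER.match(head)
    else:
        # Accept either header form anywhere in the first 1024 bytes.
        window = head[:1024]
        match = PDF_HEADER.search(window) or PS_STYLE_HEADER.search(window)
    return match.group(1).decode("ascii") if match else None


def pdf_version(path, strict=True):
    """Read the start of the file and look for a PDF header."""
    with open(path, "rb") as handle:
        return pdf_version_from_bytes(handle.read(1024), strict)
```

This only says that the file claims to be a PDF; it does not validate anything after the header.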

Ross Ridge
 
Robert Marshall

Steven D'Aprano said:
Ah, another lousy, unreliable way to make a definite statement about
the actual contents of a file. Looking at magic bytes inside a file
is hardly bullet-proof (although file seems to be moderately
reliable in practice, at least under Linux).

Simple example: is the file consisting of two bytes "\x09\x0A" meant
to be a text file with a tab and a newline, or a binary file
consisting of a single two-byte int? There's no way to tell just
from the contents.

And see for example the problem that development versions of emacs are
(were?) having with C files that started with #define and were then
treated as graphics files!

http://thread.gmane.org/gmane.emacs.devel/64823/focus=65228


Robert
 
Steven D'Aprano

Ross Ridge said:
PDF files should begin with "%PDF-" followed by a version number, e.g.
"%PDF-1.4". The PDF Reference notes that Adobe Acrobat Reader is a bit
more flexible about what it will accept:

13. Acrobat viewers require only that the header appear
somewhere within the first 1024 bytes of the file.
14. Acrobat viewers also accept a header of the form
%!PS-Adobe-N.n PDF-M.m

So identifying PDF files is pretty easy.

Sure. MIS-identifying PDF files is pretty easy. Identifying them is not.
Consider this example:

$ cat not_a_pdf
%PDF-1.4
This is not a pdf file.
$ file not_a_pdf
not_a_pdf: PDF document, version 1.4

Is there a security vulnerability buried in the detection of file types by
magic bytes? I don't know, but I wouldn't be surprised if there were.

Here's another example:

$ cat not_a_gif.txt
GIF89a is the header used to define a GIF file.
$ file not_a_gif.txt
not_a_gif.txt: GIF image data, version 89a, 26912 x 8307
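
The odd dimensions in that output are not random. A GIF stores its width and height as two little-endian 16-bit integers right after the six-byte signature, so a detector that trusts the signature will read the next four characters of the sentence, " is ", as the image size. A short sketch, assuming the file holds exactly the sentence shown above followed by a newline:

```python
import struct

data = b"GIF89a is the header used to define a GIF file.\n"

signature = data[:6]                              # b"GIF89a"
# Bytes 6..9 are the text " is ", read here as two little-endian uint16s.
width, height = struct.unpack("<HH", data[6:10])

print(signature, width, height)
# prints: b'GIF89a' 26912 8307
```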

Any file system that doesn't have file type metadata is reduced to
guessing the type of the file, and guesses can be wrong. As heuristics go,
"look at the characters after the dot in the file name" is not that much
worse than "look at the bytes at offset X through Y inside the file", and
has the significant advantage that it is visible and easy to change for
the end user.
 
Ross Ridge

Steven D'Aprano said:
Ross said:
So identifying PDF files is pretty easy.

Sure. MIS-identifying PDF files is pretty easy. Identifying them is not.
Consider this example:

Your contrived example doesn't show how a PDF file would be
misidentified, it only shows how a file deliberately made to look like
a PDF file would be "misidentified". Since that was the intent of
crafting such a file, I don't see the problem.

Steven D'Aprano said:
Is there a security vulnerability buried in the detection of file types by
magic bytes? I don't know, but I wouldn't be surprised if there were.

There's only a security vulnerability if you choose to trust a file
based on its assumed file type. Since PDF files generally aren't
trusted, it's not likely to be an issue for whatever application tubby
has in mind.

Steven D'Aprano said:
Any file system that doesn't have file type metadata is reduced to
guessing the type of the file, and guesses can be wrong.

File type metadata can also be wrong. You can give any file a .PDF
extension and Windows will believe it's a PDF file. On Mac OS, if a file
has the signature "CARO"/"PDF ", the system will believe it's a PDF file
regardless of its contents. Metadata doesn't make programs any less
vulnerable to deliberate attempts to fool them.

Ross Ridge
 
