File type recognition

Stanimir Stamenkov

Although this is a more general topic, I'm considering it in the
context of Java, which is why I'm posting it here.

The standard 'File' class doesn't provide a means (a property or
method) to unambiguously identify a file's data type. By 'type' I
mean a MIME type, since MIME types are the only standard,
widespread and definitive identification.

AFAIK, most file systems don't store per-file MIME type (or
analogous) information, and there are generally two common approaches
to identifying a file's type: file name patterns
(suffixes/extensions) and "magic numbers" (data pattern matching).

So my question is: What would be the better algorithm for
identifying a file's data type in the absence of MIME type information?

Possible variants are:

1. If the file name matches a known pattern, use the associated
type; otherwise, try matching a data pattern.

2. If the file data matches a known pattern, use the associated
type; otherwise, try matching a name pattern.

3. If the file name matches a known pattern, try to match the data
pattern associated with the corresponding type. If no type matching
both criteria is found, use (1).

4. More suggestions?
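
Variant 3 only differs from variant 1 when a name pattern maps to more than one candidate type. A minimal sketch of that idea follows. The class name 'TypeGuesser', the tiny lookup tables and the ambiguous ".pic" extension are assumptions made up for illustration; only the PDF, GIF and PNG signatures are real.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper illustrating variant 3 with a fallback to variant 1.
final class TypeGuesser {
    // Extension -> candidate MIME types (illustrative sample only).
    private static final Map<String, List<String>> BY_EXTENSION = new LinkedHashMap<>();
    // MIME type -> leading "magic" bytes.
    private static final Map<String, byte[]> MAGIC = new LinkedHashMap<>();

    static {
        BY_EXTENSION.put("pdf", Arrays.asList("application/pdf"));
        BY_EXTENSION.put("gif", Arrays.asList("image/gif"));
        BY_EXTENSION.put("png", Arrays.asList("image/png"));
        // Made-up ambiguous extension, only to show where variant 3 matters.
        BY_EXTENSION.put("pic", Arrays.asList("image/gif", "image/png"));

        MAGIC.put("application/pdf", new byte[] {'%', 'P', 'D', 'F'});
        MAGIC.put("image/gif", new byte[] {'G', 'I', 'F', '8'});
        MAGIC.put("image/png", new byte[] {(byte) 0x89, 'P', 'N', 'G'});
    }

    // All types associated with the file name's extension (possibly none).
    static List<String> candidatesByName(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0) return Collections.<String>emptyList();
        String ext = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        List<String> found = BY_EXTENSION.get(ext);
        return found != null ? found : Collections.<String>emptyList();
    }

    // First type whose magic bytes prefix the data, or null.
    static String byData(byte[] head) {
        for (Map.Entry<String, byte[]> e : MAGIC.entrySet()) {
            if (startsWith(head, e.getValue())) return e.getKey();
        }
        return null;
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        return data.length >= prefix.length
                && Arrays.equals(Arrays.copyOf(data, prefix.length), prefix);
    }

    // Variant 3: prefer a type that matches both the name and the data.
    static String guess(String fileName, byte[] head) {
        List<String> candidates = candidatesByName(fileName);
        for (String type : candidates) {
            byte[] magic = MAGIC.get(type);
            if (magic != null && startsWith(head, magic)) return type;
        }
        // Fallback to variant 1: trust the name if it is known, else the data.
        if (!candidates.isEmpty()) return candidates.get(0);
        return byData(head);
    }
}
```

With these tables, 'TypeGuesser.guess("scan.pic", pngBytes)' would pick "image/png" because name and data agree, while a PDF renamed to "report.gif" would still be reported as "image/gif", since the fallback to (1) trusts the name.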


As a side note: in my experiments I've used a 'URLConnection' to
obtain the 'contentType' and 'InputStream' for a file. I've noticed
there is a "content-types.properties" file in my <JRE_HOME>/lib
directory which governs the result of the
'URLConnection.getContentType()' method. It generally describes
mappings from MIME types to file name extensions, but I've heard that
on Linux (not sure if it applies to all *IX-like systems) only
"magic numbers" are used (I don't know whether the Linux Java
implementation uses them, though).
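
For reference, 'java.net.URLConnection' also exposes two static helpers, 'guessContentTypeFromName' and 'guessContentTypeFromStream', that cover the name-based and content-based guesses separately. Below is a rough sketch that chains them. The wrapper class 'JdkGuess' is a made-up name, and which types each helper recognizes depends on the JRE in use, so treat the results as hints only.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URLConnection;

// Hypothetical wrapper around the two guessing helpers of URLConnection.
final class JdkGuess {
    // Tries the file name first, then the leading bytes; may return null.
    static String guess(String fileName, InputStream in) throws IOException {
        String type = URLConnection.guessContentTypeFromName(fileName);
        if (type == null) {
            // guessContentTypeFromStream requires mark/reset support,
            // so wrap streams that lack it.
            InputStream marked = in.markSupported() ? in : new BufferedInputStream(in);
            type = URLConnection.guessContentTypeFromStream(marked);
        }
        return type;
    }
}
```

Swapping the two calls gives the "data first, name second" order instead.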
 
Jacob

Stanimir said:
So my question is: What would be the better algorithm when identifying
file's data type in the absence of MIME type information?

The problem is that, in general, you can't. All you can
do is make some educated guesses. It would be helpful
to know in what context you need this information.

On Unix-like systems, the use of magic numbers has been
standardized, and you can obtain type information through
the "file" command, as Liz suggests. This is not portable,
however, and the concept is not foolproof anyway, as it
depends on everyone (i.e. file producers) following the
rules and on file types having been pre-classified.

So then you are left with pattern matching as you suggest.
I use the following sequence to classify:

1. File name pattern match.
2. Read a small portion of the file and look for known patterns.
3. Ask the user.

If I start accessing a file which has been incorrectly
classified (a PDF file with a .gif extension, for instance),
I quickly get an exception. In my case this is good enough;
it might not be for you.

As an implementation note, what I have done is to make a
small class for each file type I'd like to support, and
pass a File argument to these classes to get a boolean
answer on whether the file is of the given type. Then I have a
master (factory-like) class that finds all these classes
(by introspection) and passes the File object to them in
sequence until one reports a match. I can then dynamically
add types without changing the factory class, but of course
my approach will never be able to classify a type I haven't
pre-classified.
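
A rough sketch of this kind of design is shown below. It is not Jacob's actual code: the names 'TypeDetector', 'SignatureDetector' and 'DetectorRegistry' are invented here, a single signature-based detector class stands in for the per-type classes, and detectors are registered explicitly instead of being discovered by introspection.

```java
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// One detector per supported type: answers "is this file of my type?".
interface TypeDetector {
    String mimeType();
    boolean matches(File file) throws IOException;
}

// Detector that compares the first bytes of the file with a fixed signature.
class SignatureDetector implements TypeDetector {
    private final String mimeType;
    private final byte[] signature;

    SignatureDetector(String mimeType, byte[] signature) {
        this.mimeType = mimeType;
        this.signature = signature.clone();
    }

    public String mimeType() {
        return mimeType;
    }

    public boolean matches(File file) throws IOException {
        try (InputStream in = Files.newInputStream(file.toPath())) {
            byte[] head = new byte[signature.length];
            int read = 0;
            while (read < head.length) {
                int n = in.read(head, read, head.length - read);
                if (n < 0) return false; // file shorter than the signature
                read += n;
            }
            return Arrays.equals(head, signature);
        }
    }
}

// Factory-like master class: asks each detector in turn.
class DetectorRegistry {
    private final List<TypeDetector> detectors = new ArrayList<>();

    void register(TypeDetector detector) {
        detectors.add(detector);
    }

    // MIME type of the first detector that reports a match, or null.
    String classify(File file) throws IOException {
        for (TypeDetector detector : detectors) {
            if (detector.matches(file)) return detector.mimeType();
        }
        return null;
    }
}
```

A null result from 'classify' is the point where the "ask the user" step would take over.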
 
Stanimir Stamenkov

/Jacob/:
The problem is that, in general, you can't. All you can
do is make some educated guesses. It would be helpful
to know in what context you need this information.

I don't have an exact context in mind right now, but one may think
of a general file browser/viewer application, for example. It is just
an issue I've encountered many times, and I've wondered what solutions
there are.

So then you are left with pattern matching as you suggest.
I use the following sequence to classify:

1. File name pattern match.
2. Read a small portion of the file and look for known patterns.
3. Ask the user.

Thank you for your input.
 
Liz

Stanimir Stamenkov said:
/Liz/:


I don't understand - could you be more specific?

Unix/Linux has a command/program called "file" that does a bunch of
stuff to try to figure out what type a file is; essentially, it does
this work for you. You can call it from Java and look at its output.
It is not always perfect. I have a version from MKS ported to the PC.
Sample output (my version is old and sometimes inaccurate):
c:\>file *
Plot1.java: ASCII text
LM.java: c program text
LinearSystem.java: fortran program text
newfile.csv: ASCII text
NewDraw.xlr: ASCII text with control characters
NewDraw.zip: Zip archive (at least v2.0 to extract)
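
Calling "file" from Java could look roughly like the sketch below. It assumes a "file" executable is on the PATH and that it accepts the "-b" (brief) option, which suppresses the leading file name; the class name 'FileCommand' is made up for this example. When the command cannot be started or fails, the method returns null so the caller can fall back to another strategy.

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

// Hypothetical wrapper that shells out to the external "file" command.
final class FileCommand {
    // First line printed by `file -b <path>`, or null if the command is
    // unavailable, is interrupted, or exits with a non-zero status.
    static String describe(File target) {
        ProcessBuilder pb = new ProcessBuilder("file", "-b", target.getAbsolutePath());
        pb.redirectErrorStream(true); // merge stderr so the pipe cannot fill up unread
        try {
            Process process = pb.start();
            String firstLine;
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(process.getInputStream(), StandardCharsets.UTF_8))) {
                firstLine = reader.readLine();
                while (reader.readLine() != null) {
                    // drain any remaining output before waiting for exit
                }
            }
            return process.waitFor() == 0 ? firstLine : null;
        } catch (IOException e) {
            return null; // "file" is not installed or not on the PATH
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return null;
        }
    }
}
```

The returned text is a human-readable description like the sample output above, so it still has to be mapped to whatever type names the application uses.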
 
Omar Khan

Peace be unto you.

How Mozilla determines MIME Types
by Christian Biesinger
http://www.mozilla.org/docs/web-developer/mimetypes.html

The above site links to this one
Appendix A: MIME Type Detection in Internet Explorer
http://msdn.microsoft.com/workshop/networking/moniker/overview/appendix_a.asp

The Mozilla document is more interesting: it links to C++ code
http://lxr.mozilla.org/seamonkey/source/netwerk/streamconv/converters/nsUnknownDecoder.cpp
and the source code is well documented and easy to follow.

In both cases, after extensive checks, if no match is found, they fall
back to classifying the file as either text/plain or
application/octet-stream (the latter if the file seems to be filled
with non-readable characters in the Latin range, or something similar).
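
A simplified version of such a last-resort check might look like the sketch below. This is not the actual Mozilla or Internet Explorer logic; the class name 'FallbackType' and the choice of which control characters count as "readable" are assumptions made for illustration.

```java
// Hypothetical last-resort classifier: text/plain vs application/octet-stream.
final class FallbackType {
    // Inspects the leading bytes of a file; any control character other than
    // tab, LF, CR, form feed or ESC is taken as a sign of binary data.
    static String guess(byte[] head) {
        for (byte b : head) {
            int c = b & 0xFF;
            boolean allowedControl =
                    c == '\t' || c == '\n' || c == '\r' || c == 0x0C || c == 0x1B;
            if (c < 0x20 && !allowedControl) {
                return "application/octet-stream";
            }
        }
        return "text/plain";
    }
}
```

Passing only the first few hundred bytes of the file keeps the check cheap.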

Nevertheless, there is a bias towards recognizing markup languages
in the implementation.

Stanimir said:
O.k. My question was which would be more sensible:

I guess it is up to you and the user/business requirements.
As you can see in the source
http://lxr.mozilla.org/seamonkey/source/uriloader/exthandler/nsExternalHelperAppService.cpp#368
some of the MIME types are even hard-coded to speed up the search.