File type recognition

Stanimir Stamenkov · Aug 11, 2004

Although it's a more general topic, I consider it in the context of
Java, and that's why I'm posting it here.

The standard 'File' class doesn't provide a means (a property or
method) to unambiguously identify a file's data type. By 'type' I
mean a MIME type, since MIME types are the only standard,
widespread and definitive identification.

AFAIK, most file systems don't store per-file MIME type (or
analogous) information, and there are generally two common approaches
to identifying a file's type: file name patterns
(suffixes/extensions) and "magic numbers" (data pattern matching).

So my question is: What would be the better algorithm for
identifying a file's data type in the absence of MIME type information?

Possible variants are:

1. If the file name matches a known pattern, use the associated
type; else try matching a data pattern.

2. If the file data matches a known pattern, use the associated
type; else try matching a name pattern.

3. If the file name matches a known pattern, try to match the data
pattern associated with the corresponding type. If no type matching
both criteria is found, use (1).

4. More suggestions?
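
For illustration, variant 3 could be sketched roughly as below. This is only a sketch under assumptions: the class name, the tiny lookup tables and the choice of ".doc" as an extension shared by two types are all made up for the example, and it assumes Java 9 or later for Map.of and List.of. Variant 3 only differs from variant 1 when one name pattern maps to several candidate types, so the table stores a list per extension.

```java
import java.util.List;
import java.util.Locale;
import java.util.Map;

final class TypeGuesser {
    // Illustrative tables only (assumption); a real application would load many more entries.
    static final Map<String, List<String>> TYPES_BY_EXTENSION = Map.of(
        "gif", List.of("image/gif"),
        "pdf", List.of("application/pdf"),
        "doc", List.of("application/msword", "application/rtf"));

    static final Map<String, byte[]> MAGIC_BY_TYPE = Map.of(
        "image/gif", new byte[] {'G', 'I', 'F', '8'},
        "application/pdf", new byte[] {'%', 'P', 'D', 'F'},
        "application/msword", new byte[] {(byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0},
        "application/rtf", new byte[] {'{', '\\', 'r', 't', 'f'});

    /** True if data begins with the given magic bytes. */
    static boolean startsWith(byte[] data, byte[] magic) {
        if (magic == null || data.length < magic.length) return false;
        for (int i = 0; i < magic.length; i++) {
            if (data[i] != magic[i]) return false;
        }
        return true;
    }

    /** Lower-cased text after the last dot, or "" if there is no dot. */
    static String extensionOf(String fileName) {
        int dot = fileName.lastIndexOf('.');
        return dot < 0 ? "" : fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
    }

    /** Returns a MIME type, or null when neither the name nor the data is recognized. */
    static String guess(String fileName, byte[] head) {
        List<String> candidates =
            TYPES_BY_EXTENSION.getOrDefault(extensionOf(fileName), List.of());
        // Variant 3: a candidate suggested by the name and confirmed by the data.
        for (String type : candidates) {
            if (startsWith(head, MAGIC_BY_TYPE.get(type))) return type;
        }
        // Fall back to variant 1: trust the name first...
        if (!candidates.isEmpty()) return candidates.get(0);
        // ...and only then look at the data alone.
        for (Map.Entry<String, byte[]> e : MAGIC_BY_TYPE.entrySet()) {
            if (startsWith(head, e.getValue())) return e.getKey();
        }
        return null;
    }
}
```

Here 'head' is meant to be the first few bytes of the file. Swapping the last two steps of guess() would give the variant 2 fallback instead.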

As side notes: In my experiments I've used a 'URLConnection' to
obtain the 'contentType' and 'InputStream' for a file. I've noticed
there is a "content-types.properties" file in my <JRE_HOME>/lib
directory which governs the result of the
'URLConnection.getContentType()' method. It generally describes
mappings from MIME types to file name extensions, but I've heard that
on Linux (not sure if it applies to all *IX-like systems) only
"magic numbers" are used (I don't know whether the Linux Java
implementation uses them, though).
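
As a small sketch of the two JDK-provided guesses, 'URLConnection' has two static helpers, guessContentTypeFromName and guessContentTypeFromStream, which can be tried without opening a connection. The wrapper class and method names below are made up for the example. Both helpers may return null, and the stream-based one only recognizes a limited, built-in set of signatures.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URLConnection;

final class JdkGuessDemo {
    /** Name-based guess using the JRE's content-type table. May return null. */
    static String byName(String fileName) {
        return URLConnection.guessContentTypeFromName(fileName);
    }

    /**
     * Data-based guess from the first bytes of the stream. May return null.
     * The JDK method needs mark/reset support, so other streams are wrapped;
     * note the wrapper may read ahead, so keep using the wrapper afterwards
     * if the rest of the data is needed.
     */
    static String byData(InputStream in) throws IOException {
        InputStream markable = in.markSupported() ? in : new BufferedInputStream(in);
        return URLConnection.guessContentTypeFromStream(markable);
    }
}
```

Calling byName first and byData second corresponds to variant 1 above; reversing the calls corresponds to variant 2.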

Liz · Aug 12, 2004

try the 'file' command

Stanimir Stamenkov · Aug 12, 2004

/Liz/:

try the 'file' command

I don't understand - could you be more specific?

Jacob · Aug 12, 2004

Stanimir said:
So my question is: What would be the better algorithm when identifying
file's data type in the absence of MIME type information?

The problem is that, in general, you can't. All you can
do is make some educated guesses. It would be helpful
to know in what context you need this information.

On Unix-like systems, the use of magic numbers has been
standardized, and you can obtain type information through
the "file" command as Liz suggests. This is not portable,
however, and the concept is not foolproof anyway, as it
depends on everyone (i.e. file producers) following the
rules and on file types having been pre-classified.

So then you are left with pattern matching as you suggest.
I use the following sequence to classify:

1. File name pattern match.
2. Read small portion of file and look for known patterns
3. Ask user.

If I start accessing a file which has been incorrectly
classified (a PDF file with a .gif extension, for instance),
I quickly get an exception. In my case this is good enough;
it might not be for you.

As an implementation note, what I have done is to make a
small class for each file type I'd like to support, and
pass a File argument to these classes and get a boolean
answer if the file is of the given type. Then I have a
master (factory-like) class that finds all these classes
(by introspection) and passes the File object to them in
sequence until one reports a match. I can then dynamically
add types without changing the factory class, but of course,
my approach will never be able to classify a type I haven't
pre-classified.
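
A rough sketch of that structure might look like the following. It is not the actual code described above: all names are hypothetical, the detectors check only leading bytes, and explicit registration stands in for the discovery by introspection (a swapped-in simplification). It assumes Java 9 or later for readNBytes.

```java
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical per-type detector: one small class per supported file type. */
interface FileTypeDetector {
    String typeName();
    boolean matches(File file) throws IOException;
}

final class PdfDetector implements FileTypeDetector {
    public String typeName() { return "application/pdf"; }
    public boolean matches(File file) throws IOException {
        return DetectorRegistry.startsWith(file, new byte[] {'%', 'P', 'D', 'F'});
    }
}

final class GifDetector implements FileTypeDetector {
    public String typeName() { return "image/gif"; }
    public boolean matches(File file) throws IOException {
        return DetectorRegistry.startsWith(file, new byte[] {'G', 'I', 'F', '8'});
    }
}

/** Factory-like master class; detectors are registered by hand in this sketch. */
final class DetectorRegistry {
    private final List<FileTypeDetector> detectors = new ArrayList<>();

    void register(FileTypeDetector detector) {
        detectors.add(detector);
    }

    /** Asks each detector in turn; null means no match (the "ask user" case). */
    String classify(File file) throws IOException {
        for (FileTypeDetector detector : detectors) {
            if (detector.matches(file)) return detector.typeName();
        }
        return null;
    }

    /** Reads only as many bytes as the magic pattern needs and compares them. */
    static boolean startsWith(File file, byte[] magic) throws IOException {
        byte[] head = new byte[magic.length];
        try (InputStream in = Files.newInputStream(file.toPath())) {
            int read = in.readNBytes(head, 0, head.length);
            return read == magic.length && Arrays.equals(head, magic);
        }
    }
}
```

New types are added by writing another detector class and registering it; the registry itself does not change.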

Stanimir Stamenkov · Aug 12, 2004

/Jacob/:

The problem is that, in general, you can't. All you can
do is make some educated guesses. It would be helpful
to know in what context you need this information.

I don't have an exact context in mind right now, but one may think
of a general file browser/viewer application, for example. It is just
an issue I've encountered many times, and I wondered what solutions
there are.

So then you are left with pattern matching as you suggest.
I use the following sequence to classify:

1. File name pattern match.
2. Read small portion of file and look for known patterns
3. Ask user.

Thank you for your input.

Liz · Aug 12, 2004

Stanimir Stamenkov said:
/Liz/:

I don't understand - could you be more specific?

Unix/Linux has a command/program called "file" and it
does a bunch of stuff to try to figure out what type the
file is - essentially it does this work for you. You
can call it from Java and look at its output. It is not
always perfect. I have a version from MKS ported to the PC.
Sample output (my version is old and it is sometimes inaccurate):
c:\>file *
Plot1.java: ASCII text
LM.java: c program text
LinearSystem.java: fortran program text
newfile.csv: ASCII text
NewDraw.xlr: ASCII text with control characters
NewDraw.zip: Zip archive (at least v2.0 to extract)
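
Calling it from Java could look roughly like the sketch below. It assumes a "file" executable is on the PATH (otherwise starting the process throws an IOException) and that the output has the "name: description" shape shown above. The class and method names are invented for the example.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

final class FileCommand {
    /** Extracts the description from a line such as "NewDraw.zip: Zip archive". */
    static String descriptionOf(String outputLine) {
        int sep = outputLine.indexOf(": ");
        return sep < 0 ? outputLine.trim() : outputLine.substring(sep + 2).trim();
    }

    /** Runs "file <path>" and returns its description, or null if the command fails. */
    static String describe(String path) throws IOException, InterruptedException {
        Process process = new ProcessBuilder("file", path)
            .redirectErrorStream(true)   // merge stderr so the child cannot block on it
            .start();
        String firstLine;
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(process.getInputStream(), StandardCharsets.UTF_8))) {
            firstLine = reader.readLine();
        }
        int exitCode = process.waitFor();
        return (exitCode == 0 && firstLine != null) ? descriptionOf(firstLine) : null;
    }
}
```

The description is free-form text rather than a MIME type, so an application would still need its own mapping from descriptions to types.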

Omar Khan · Aug 15, 2004

Peace be unto you.

How Mozilla determines MIME Types
by Christian Biesinger
http://www.mozilla.org/docs/web-developer/mimetypes.html

The above site links to this one
Appendix A: MIME Type Detection in Internet Explorer
http://msdn.microsoft.com/workshop/networking/moniker/overview/appendix_a.asp

The Mozilla document is more interesting - there are links
to C++ code
http://lxr.mozilla.org/seamonkey/source/netwerk/streamconv/converters/nsUnknownDecoder.cpp
and the source code is well documented and
easy to follow.

In both cases, after extensive checks, if no match is found, they fall
back to classifying the file as either text/plain or
application/octet-stream (the latter if the file seems to be filled
with characters that are not readable in the Latin range, or similar).

Nevertheless, there is a bias towards recognizing markup languages
in the implementation.
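
The final text-versus-binary fallback described above can be approximated in Java as below. This is a simplified sketch, not the Mozilla or Internet Explorer algorithm: the class name and the exact rule (any control byte other than tab, line feed, form feed or carriage return means binary) are assumptions made for the example.

```java
final class FallbackType {
    /**
     * Last-resort guess for data that matched no known pattern.
     * Bytes of 0x80 and above are treated as text here; an empty
     * sample is also reported as text/plain (an arbitrary choice).
     */
    static String fallback(byte[] sample) {
        for (byte b : sample) {
            int c = b & 0xFF;   // view the byte as unsigned
            boolean whitespace = c == '\t' || c == '\n' || c == '\f' || c == '\r';
            boolean control = (c < 0x20 && !whitespace) || c == 0x7F;
            if (control) return "application/octet-stream";
        }
        return "text/plain";
    }
}
```

In practice only the first few hundred bytes of the file would be passed in as the sample.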

Stanimir said:
O.k. My question was which would be more sensible:

I guess it is up to you and the user/business requirements.
As you can see in the source
http://lxr.mozilla.org/seamonkey/source/uriloader/exthandler/nsExternalHelperAppService.cpp#368
some of the MIME types are even hard-coded to speed up the search.
