Stanimir Stamenkov
Although this is a more general topic, I'm considering it in the
context of Java, which is why I'm posting it here.
The standard 'File' class doesn't provide a means (a property or
method) to unambiguously identify a file's data type. By 'type' I
mean a MIME type, since MIME types are the only standard,
widespread and definitive form of identification.
AFAIK, most file systems don't store per-file MIME type (or
analogous) information, and there are generally two common approaches
to identifying a file's type: file name patterns
(suffixes/extensions) and "magic numbers" (data pattern matching).
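To make the two approaches concrete, here is a minimal sketch. The
class name 'TypeGuess', its method names and the three-entry tables
are made up for illustration; a real implementation would load much
larger tables from configuration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

public class TypeGuess {
    // Name-based: map a file name suffix to a MIME type.
    // The table is illustrative only (an assumption, not a complete list).
    static String byName(String fileName) {
        String n = fileName.toLowerCase(Locale.ROOT);
        if (n.endsWith(".png")) return "image/png";
        if (n.endsWith(".gif")) return "image/gif";
        if (n.endsWith(".pdf")) return "application/pdf";
        return null; // unknown suffix
    }

    // Data-based: compare the leading bytes with well-known signatures.
    static String byMagic(byte[] head) {
        if (startsWith(head, new byte[] {(byte) 0x89, 'P', 'N', 'G'})) return "image/png";
        if (startsWith(head, ascii("GIF8"))) return "image/gif";
        if (startsWith(head, ascii("%PDF-"))) return "application/pdf";
        return null; // no known signature
    }

    static byte[] ascii(String s) {
        return s.getBytes(StandardCharsets.US_ASCII);
    }

    static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) return false;
        }
        return true;
    }
}
```

Both methods return null when nothing matches, so a caller can decide
in which order to consult them.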
So my question is: what would be the best algorithm for
identifying a file's data type in the absence of MIME type information?
Possible variants are:
1. If the file name matches a known pattern, use the associated
type; otherwise, try matching a data pattern.
2. If the file data matches a known pattern, use the associated
type; otherwise, try matching a name pattern.
3. If the file name matches a known pattern, try to match the data
pattern associated with the corresponding type. If no type matching
both criteria is found, use (1).
4. More suggestions?
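The three variants above can be sketched as follows. All names here
('DetectionOrder', 'nameFirst', 'dataFirst', 'confirmed') are
hypothetical, and I assume the name and data lookups have already been
done and return null when nothing matches. Note that variant 3 only
differs from variant 1 when a name pattern maps to more than one
candidate type, which is why it takes a list.

```java
import java.util.List;
import java.util.function.Predicate;

public class DetectionOrder {
    // Variant 1: trust the name first, fall back to the data.
    static String nameFirst(String byName, String byData) {
        return byName != null ? byName : byData;
    }

    // Variant 2: trust the data first, fall back to the name.
    static String dataFirst(String byName, String byData) {
        return byData != null ? byData : byName;
    }

    // Variant 3: among the types the name suggests, prefer one whose
    // data pattern also matches; if none is confirmed, act like variant 1.
    static String confirmed(List<String> nameCandidates,
                            Predicate<String> dataMatchesType,
                            String byData) {
        for (String type : nameCandidates) {
            if (dataMatchesType.test(type)) return type;
        }
        String firstByName = nameCandidates.isEmpty() ? null : nameCandidates.get(0);
        return nameFirst(firstByName, byData);
    }
}
```

Here 'dataMatchesType' stands for "does the file's content match the
data pattern registered for this type", however that check is done.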
As a side note: in my experiments I've used a 'URLConnection' to
obtain the 'contentType' and 'InputStream' for a file. I've noticed
I have a "content-types.properties" file in my <JRE_HOME>/lib
directory which governs the result of the
'URLConnection.getContentType()' method. It generally describes
mappings from MIME types to file name extensions, but I've heard that
on Linux (not sure if it applies to all *IX-like systems) only
"magic numbers" are used (I don't know whether the Linux Java
implementation uses them, though).
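For what it's worth, 'java.net.URLConnection' also exposes both kinds
of guessing as static helpers, so they can be tried without opening a
connection. A small sketch follows; the wrapper class 'JdkGuess' and
its method names are mine, and what the name-based lookup returns
depends on the content-type table of the installed JRE.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.URLConnection;

public class JdkGuess {
    // Name-based guess; may return null for suffixes the JRE doesn't know.
    static String fromName(String fileName) {
        return URLConnection.guessContentTypeFromName(fileName);
    }

    // Data-based guess from the leading bytes of a file. The stream passed
    // to guessContentTypeFromStream must support mark/reset, hence the
    // BufferedInputStream wrapper.
    static String fromData(byte[] head) {
        try (InputStream in = new BufferedInputStream(new ByteArrayInputStream(head))) {
            return URLConnection.guessContentTypeFromStream(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The data-based helper only recognizes a small built-in set of
signatures, so a null result doesn't mean the data is untyped.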