Recognising file type (ascii/binary)

Bruce Lee

Is there any easy way to get Java to determine whether a file is a binary
file or plain text ascii file?
 
Matt Humphrey

Files are simply sequences of (binary) bytes--there's no way to tell whether
it's supposed to contain only bytes that represent printable ascii (or
unicode) or any particular binary pattern. You can read the file to find
out--if you find values that signify unlikely or non-printable characters
you can deem the file binary or corrupt. Similarly, there are heuristics
(based on convention) for guessing the "type" of the file based on the first
few bytes, but there's no guarantee these are correct either. (And files
with 2-byte UNICODE characters can really confuse things.)
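
As a rough illustration of the "first few bytes" heuristic, here is a minimal sketch that compares the start of a file against a handful of well-known signatures (PNG, PDF, ZIP, GIF). The class and method names are made up for this example, and the table is nowhere near complete. As noted above, a match is only a guess: a text file that happens to begin with "PK" would be misreported.

```java
final class MagicSniffer {
    // A few widely documented signatures; real tools know many more.
    private static final byte[] PNG = {(byte) 0x89, 'P', 'N', 'G'};
    private static final byte[] PDF = {'%', 'P', 'D', 'F'};
    private static final byte[] ZIP = {'P', 'K', 3, 4};
    private static final byte[] GIF = {'G', 'I', 'F', '8'};

    /** Guesses a type from the leading bytes, or returns "unknown". */
    static String guessType(byte[] head) {
        if (startsWith(head, PNG)) return "png";
        if (startsWith(head, PDF)) return "pdf";
        if (startsWith(head, ZIP)) return "zip";
        if (startsWith(head, GIF)) return "gif";
        return "unknown";
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) return false;
        }
        return true;
    }
}
```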

Of course, you could require that text files end in "txt" or something--it's
no worse than any of the above and significantly easier.

What are you trying to do?

Cheers,
 
Oliver Wong

Matt Humphrey is completely correct. However, as an additional check on
top of the heuristic of looking for unprintable characters, another trick is
to check whether the newline string is consistent. It should always be either
"\n" (for UNIX-like systems), "\r" (for classic Mac systems) or "\r\n" (for
Windows-like systems). If the file switches around between these, it probably
isn't a plain text file produced on any of the above three platforms.
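
A minimal sketch of that newline check, assuming the data (or a prefix of it) has already been read into a byte array. The class and method names are invented for illustration. One caveat: if only a prefix is examined and it ends in "\r", that byte may be the first half of a "\r\n" that got cut off.

```java
final class NewlineCheck {
    /** Returns true if the data uses at most one of "\n", "\r", "\r\n". */
    static boolean hasConsistentNewlines(byte[] data) {
        int lf = 0, cr = 0, crlf = 0;
        for (int i = 0; i < data.length; i++) {
            if (data[i] == '\r') {
                if (i + 1 < data.length && data[i + 1] == '\n') {
                    crlf++;
                    i++; // skip the '\n' that belongs to this "\r\n"
                } else {
                    cr++;
                }
            } else if (data[i] == '\n') {
                lf++;
            }
        }
        int styles = (lf > 0 ? 1 : 0) + (cr > 0 ? 1 : 0) + (crlf > 0 ? 1 : 0);
        return styles <= 1;
    }
}
```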

You could also disregard 2-byte UNICODE characters as being "non-ASCII",
and lump them in with the category of "binary files".

- Oliver
 
Bruce Lee

Matt Humphrey said:
What are you trying to do?

To see if a url is binary or not without relying on the header.

I'm using something like this:

// needs: java.io.BufferedReader, java.io.InputStream,
//        java.io.InputStreamReader, java.net.URL
protected boolean isBinary(String url) {
    boolean isbin = false;
    InputStream in = null;

    try {
        URL bin_url = new URL(url);
        in = bin_url.openStream();
        BufferedReader r = new BufferedReader(new InputStreamReader(in));

        char[] cc = new char[255]; // do a peek
        int n = r.read(cc, 0, 255); // may be fewer than 255, or -1 if empty

        double prob_bin = 0;
        for (int i = 0; i < n; i++) {
            int j = (int) cc[i];
            // with Chinese and other such languages it might flag them
            // as binary - ideally needs another check
            if (j < 32 || j > 127) {
                prob_bin++;
            }
        }

        if (n > 0) {
            double pb = prob_bin / n; // fraction of the chars actually read
            if (pb > 0.5) {
                // System.out.println("probably binary at " + pb);
                isbin = true;
            }
        }
    } catch (Exception ee) {
        System.out.println("WARN! Couldn't find isBinary() content-" + url);
        isbin = false; // error - likely broken link - so return false
    } finally {
        try {
            if (in != null) in.close(); // close exactly once, even on error
        } catch (Exception e) {}
    }

    System.out.println("url isBinary():" + url + ":" + isbin);
    return isbin;
}

I read somewhere that finding \n's might work as well.

Also, are ASCII 7bit and binary 8bit or something? Is there a way to find
this out - like analyse a byte?
 
Roedy Green

A practical test is to scan the first N bytes for a 0. If you find
one, it is binary; if not, text.
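
A minimal sketch of that zero-byte scan, assuming the caller supplies an InputStream and picks N (the `limit` parameter here). The names are invented for the example. Note that UTF-16 text made of mostly ASCII characters contains zero bytes, so this test would call it binary.

```java
import java.io.IOException;
import java.io.InputStream;

final class NulScan {
    /** Returns true if a zero byte occurs within the first 'limit' bytes. */
    static boolean looksBinary(InputStream in, int limit) throws IOException {
        for (int i = 0; i < limit; i++) {
            int b = in.read();
            if (b == -1) break;       // end of stream before the limit
            if (b == 0) return true;  // a NUL byte: treat as binary
        }
        return false;
    }
}
```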

It actually becomes a judgment call.

Let us say you define a text file as containing only 7-bit ASCII, no
control chars but \t space \n \r.

Then you find an 0x01 char somewhere in the file. Does that make it a
binary format?
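
One way to turn that judgment call into a number is to measure what fraction of bytes fall outside the definition, and let the caller decide how much to tolerate. This is only a sketch under that definition (printable 7-bit ASCII plus \t, \n and \r); the class and method names are hypothetical and any threshold is up to you.

```java
final class StrictAscii {
    /** True for \t, \n, \r and printable ASCII 0x20..0x7E; b is 0..255. */
    static boolean isAllowed(int b) {
        return b == '\t' || b == '\n' || b == '\r' || (b >= 0x20 && b <= 0x7E);
    }

    /** Fraction of bytes that fall outside the definition above. */
    static double disallowedFraction(byte[] data) {
        if (data.length == 0) return 0.0;
        int bad = 0;
        for (byte raw : data) {
            if (!isAllowed(raw & 0xFF)) bad++; // mask to get 0..255
        }
        return (double) bad / data.length;
    }
}
```

A single stray 0x01 in a long file gives a tiny fraction, so a caller could still treat the file as text, while a file that is mostly disallowed bytes would not pass.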

Unfortunately not all OS's track the format/MIME etc of each file.
There is no universal scheme of embedded id signatures. It is a mess.
You have to do something seat of the pants yourself.

You can't even tell which encoding is used for a pure text file.
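
A small demonstration of that last point, using only the standard charsets: the same two bytes are valid in both UTF-8 and ISO-8859-1 but decode to different text, and nothing in the bytes themselves says which reading was intended. The class name is invented for the example.

```java
import java.nio.charset.StandardCharsets;

final class EncodingAmbiguity {
    static String asUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    static String asLatin1(byte[] bytes) {
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        byte[] bytes = {(byte) 0xC3, (byte) 0xA9};
        // One character (U+00E9) as UTF-8, two characters as ISO-8859-1.
        System.out.println(asUtf8(bytes).length());
        System.out.println(asLatin1(bytes).length());
    }
}
```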
 
Oliver Wong

Bruce Lee said:
Also, are ASCII 7bit and binary 8bit or something?

There is no "bit length" associated with the concept of "binary". The
question is equivalent to "Is decimal 5 digits long or 7 digits long?" A
number written in decimal can be any number of digits long.

Bruce Lee said:
Is there a way to find
this out - like analyse a byte?
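
On the narrow question of analysing a byte: ASCII only defines code points 0 to 127, so any byte with its top bit set cannot be ASCII. A minimal sketch of that test follows (the names are made up for illustration). Passing it only shows that the bytes could be ASCII, not that they were meant as text.

```java
final class SevenBit {
    /** True if the top bit is clear, i.e. the value is in 0..127. */
    static boolean isSevenBit(byte b) {
        return (b & 0x80) == 0;
    }

    /** True if every byte fits in 7 bits. */
    static boolean allSevenBit(byte[] data) {
        for (byte b : data) {
            if (!isSevenBit(b)) return false;
        }
        return true;
    }
}
```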

This is reminiscent of a discussion Roedy and I had about ASCII versus
binary formats. My position was that all data stored on a computer is stored
in binary (i.e. it is stored using bits), and one form of binary encoding
is called "ASCII". It was a poor choice of wording to use "binary" to
mean "non-ASCII".

I'm assuming you don't directly care whether a given bitstream is ASCII
or non-ASCII; rather, you want this information so that you can solve
another problem. What is the real problem you are trying to solve? Perhaps
we can offer you solutions that don't involve distinguishing between ASCII
and non-ASCII bitstreams.

- Oliver

* The reason you may want to avoid distinguishing ASCII and non-ASCII
bitstreams is that, in general, it is completely impossible. There may exist
binary file formats out there which, given appropriate data to represent,
yield bits which can legally be decoded into only printable characters using
the ASCII table, even though the semantic information in the file was never
meant to be text.
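
A tiny illustration of that footnote, assuming a made-up "format" that just stores one 32-bit integer in big-endian order: for the right value, every byte is a printable ASCII character, so any byte-level test would call it text. The class name is hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

final class AmbiguousBytes {
    /** Serialises an int as 4 big-endian bytes, then decodes them as ASCII. */
    static String asAscii(int value) {
        // ByteBuffer is big-endian by default.
        byte[] bytes = ByteBuffer.allocate(4).putInt(value).array();
        return new String(bytes, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        // 0x4A 0x61 0x76 0x61 are the ASCII codes for 'J', 'a', 'v', 'a'.
        System.out.println(asAscii(0x4A617661));
    }
}
```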
 
