Detecting Binary content in files

ritu

Hi,

I'm wondering if Python has a utility to detect binary content in
files? Or if anyone has any ideas on how that can be accomplished? I
haven't been able to find any useful information to accomplish this
(my other option is to fire off a Perl script from within my Python
script that will tell me whether the file is binary), so any pointers
will be appreciated.

Thanks,
Ritu
 
Matt Nordhoff

There isn't any perfect test. The usual heuristic is to check if there
are any NUL bytes in the file:

That can fail, of course. UTF-16-encoded text will have tons of NUL
bytes, and some binary files may not have any.
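
A minimal sketch of that NUL-byte check, assuming Python 3 (the rest of this thread uses Python 2). The function name `contains_nul` and the chunk size are illustrative choices, not a standard-library API:

```python
def contains_nul(path, blocksize=8192):
    """Return True if the file at `path` contains at least one NUL byte."""
    with open(path, "rb") as f:
        # Read in chunks so a large file is never loaded into memory at once.
        for chunk in iter(lambda: f.read(blocksize), b""):
            if b"\0" in chunk:
                return True
    return False
```

As noted above, this is only a heuristic: UTF-16 text would be reported as binary, and a binary file with no NUL bytes would pass as text.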
 
Josh Dukes

There might be another way but off the top of my head:

#!/usr/bin/env python

def isbin(filename):
    fd=open(filename,'rb')
    for b in fd.read():
        if ord(b) > 127:
            fd.close()
            return True
    fd.close()
    return False

for f in ['/bin/bash', '/etc/passwd']:
    print "%s is binary: " % f, isbin(f)


Of course this would detect Unicode files as being binary and maybe
that's not what you want. How are you thinking about doing it in
Perl exactly?
 
Josh Dukes

s/if ord(b) > 127/if ord(b) > 127 or ord(b) < 32/

 
Josh Dukes

or rather:

#!/usr/bin/env python
import string

def isbin(filename):
    fd=open(filename,'rb')
    for b in fd.read():
        if b not in string.printable and b not in string.whitespace:
            fd.close()
            return True
    fd.close()
    return False

for f in ['/bin/bash', '/etc/passwd']:
    print "%s is binary: " % f, isbin(f)


whatever... basically it's what everyone else said, every file is
binary so it all depends on your definition of binary.

 
Dave Angel

All files are binary, but probably by binary you mean non-text.

There are lots of ways to decide if a file is non-text, but I don't know
of any "standard" way. You can detect a file as non-ASCII by simply
searching for any byte greater than 0x7f. But that doesn't handle
a UTF-8 file, which is an 8-bit text file representing Unicode.
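
To illustrate the ASCII versus UTF-8 distinction, here is a rough sketch assuming Python 3. The name `classify_text` and its three return labels are made up for this example:

```python
def classify_text(data):
    """Classify a bytes object as 'ascii', 'utf-8', or 'other'."""
    try:
        data.decode("ascii")
        return "ascii"      # no byte above 0x7f
    except UnicodeDecodeError:
        pass
    try:
        data.decode("utf-8")
        return "utf-8"      # high bytes present, but they form valid UTF-8
    except UnicodeDecodeError:
        return "other"      # neither ASCII nor valid UTF-8
```

Note that this only checks the encoding. A buffer full of NULs or other control characters still counts as 'ascii' here, and decoding a block cut from the middle of a file can split a multi-byte UTF-8 sequence and be misreported as 'other'.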

The way I've seen it done many times is to look for regular occurrences
of the end-of-line character and an absence of nulls. Most "binary" files
will have more nulls than linefeeds, and any null could be considered a
marker for a non-text file.
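
One way the linefeed-and-null heuristic above might be sketched, assuming Python 3. The function name `looks_like_text` and the `max_avg_line` threshold of 1000 bytes are arbitrary assumptions for illustration, not established values:

```python
def looks_like_text(data, max_avg_line=1000):
    """Guess whether a bytes object is text.

    Any NUL byte marks the data as non-text. Otherwise the data is
    accepted only if linefeeds occur regularly enough that the average
    line length stays at or below `max_avg_line` bytes.
    """
    if not data:
        return True   # assumption: treat an empty buffer as text
    if b"\0" in data:
        return False
    lines = data.count(b"\n") + 1   # +1 covers a final unterminated line
    return len(data) / lines <= max_avg_line
```

For example, `looks_like_text(open(path, "rb").read(4096))` would test only the first 4096 bytes of a file, which keeps the check cheap on large files.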

If you're happy with your particular Perl script, it could probably be
readily translated to Python.
 
ritu

With Perl, I'm thinking of doing something like the below:

if ( ( -B $filename ||
       $filename =~ /\.pdf$/ ) &&
     -s $filename > 0 ) {
    return(1);
}

So my isbin method should return true for any file that isn't
entirely ASCII text, so I guess for my purposes classifying a Unicode
file as 'binary' would be alright. Thanks much for your response.
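
A rough Python 3 sketch of that requirement (true for a non-empty file that is either named `*.pdf` or not entirely ASCII). It is not a translation of what Perl's `-B` does internally, and `bytes.isascii()` assumes Python 3.7 or later:

```python
import os

def isbin(filename):
    """Return True for a non-empty file that is a PDF by name
    or contains any non-ASCII byte."""
    if os.path.getsize(filename) == 0:
        return False                  # mirrors the `-s $filename > 0` test
    if filename.endswith(".pdf"):     # case-sensitive, like /\.pdf$/
        return True
    with open(filename, "rb") as f:
        data = f.read()               # reads the whole file into memory
    return not data.isascii()
```

One caveat: NUL and other control characters are ASCII, so a file made up only of such bytes would be reported as non-binary by this sketch.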
 
Steven D'Aprano

Hi,

I'm wondering if Python has a utility to detect binary content in files?

Define binary content.

Or if anyone has any ideas on how that can be accomplished?

Step one: read the file.

Step two: does any of the data you have read match your definition of
binary content? If so, then you have detected binary content.

Step three: there is no step three.

I haven't
been able to find any useful information to accomplish this (my other
option is to fire off a Perl script from within my Python script that
will tell me whether the file is binary), so any pointers will be
appreciated.

Look at the Perl script and see how it does it. Does it give false
positives for Unicode text files?
 
Dennis Lee Bieber

if ( ( -B $filename ||
       $filename =~ /\.pdf$/ ) &&
     -s $filename > 0 ) {
    return(1);
}

According to my old copy of the Camel, -B only reads the "first
block" of the file. If the block contains a <NUL>, or if ~30% of the
block consists of bytes >127 or bytes from some (undefined) set of
control characters (I expect it does not count <LF>, <CR>, <TAB>, <VT>,
<FF>, maybe some others), the file is treated as binary. So...

def isbin(fid):
    fin = open(fid, "r")
    block = fin.read(1024) #what is the size of a "block" these days
    binary = "\0" in block
    if not binary:
        mrkrs = [b for b in block
                 if b > 127
                 or b in [ "\r", "\n", "\t" ] ] #add needed
        binary = (float(len(mrkrs)) / len(block)) > 0.30
    fin.close()
    return binary
 
John Machin

        According to my old copy of the Camel, -B only reads the "first
block" of the file. If the block contains a <NUL>, or if ~30% of the
block contains bytes >127 or from some (undefined) set of control
characters (that is, I expect it does not count <LF>, <CR>, <TAB>, <VT>,
<FF>, maybe some others)... So...

Not sure whether this is meant to be rough pseudocode or an April 1
"jeu d'esprit" or ...
def isbin(fid):
        fin = open(fid, "r")

(1) mode = "rb" might be better
        block = fin.read(1024)  #what is the size of a "block" these days
        binary = "\0" in block
        if not binary:
                mrkrs = [b for b in block
                                        if b > 127

(2) [assuming Python 2.x]
b is a str object; change 127 to "\x3f"
                                                or b in [ "\r", "\n", "\t" ]      ]       #add needed

(3) surely you mean "b not in"

(4) possible improvements on ["\r", etc etc] :
(4a) use tuple ("\r", etc etc)
(4b) use string "\r\n\t"
(you don't really want to build that list from scratch for each byte
tested, do you?)
                binary = (float(len(mrkrs)) / len(block)) > 0.30
        fin.close()
        return binary

Cheers,
John
 
John Machin

(2) [assuming Python 2.x]
b is a str object; change 127 to "\x3f"

Gah ... it must be gamma rays from outer space! Trying again:

change 127 to "\x7f" (and actually "\x7e" would be a better choice)
                                                or b in [ "\r", "\n", "\t" ]      ]       #add needed

(3) surely you mean "b not in"

take 2:

surely you mean
... or b < "\x20" and b not in "\r\n\t"

and at that stage the idea of making a set of chars before entering the
loop has some attraction :)
 
Dennis Lee Bieber

Not sure whether this is meant to be rough pseudocode or an April 1
"jeu d'esprit" or ...
It was basically an off-the-cuff attempt at coding something similar
to the documented behavior of the Perl -B operation.
(1) mode = "rb" might be better
Though having it modify line-endings is probably not going to cause
a false-positive.
        block = fin.read(1024)  #what is the size of a "block" these days
        binary = "\0" in block
        if not binary:
                mrkrs = [b for b in block
                                        if b > 127

(2) [assuming Python 2.x]
b is a str object; change 127 to "\x3f"

Uhm... "\x7F"
                                                or b in [ "\r", "\n", "\t" ]      ]       #add needed

(3) surely you mean "b not in"
Granted...

(you don't really want to build that list from scratch for each byte
tested, do you?)
In truth, I'd have predefined a "constant"... call it "TEXTCONTROLS"
maybe?
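
Pulling the corrections from this exchange together, a sketch of the -B-style heuristic might look like the following. It assumes Python 3 (where iterating over bytes yields integers) rather than the Python 2 discussed above. The 1024-byte block, the 0.30 threshold, and the contents of `TEXTCONTROLS` are guesses carried over from the posts, not Perl's actual values:

```python
# Control bytes that are normal in text files (assumption; extend as needed).
TEXTCONTROLS = frozenset(b"\r\n\t")

def isbin(path, blocksize=1024, threshold=0.30):
    """Guess whether a file is binary by inspecting its first block."""
    with open(path, "rb") as fin:      # binary mode: no newline translation
        block = fin.read(blocksize)
    if not block:
        return False                   # assumption: empty file is not binary
    if b"\0" in block:
        return True                    # any NUL marks the file as binary
    # Count bytes above 0x7e, plus control bytes not in TEXTCONTROLS.
    suspicious = sum(
        1 for b in block
        if b > 0x7E or (b < 0x20 and b not in TEXTCONTROLS)
    )
    return suspicious / len(block) > threshold
```

Building the set once at module level avoids recreating the list of allowed control characters for every byte tested.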
 