Re: Search for a string in binary files

  • Thread starter François Pinard

François Pinard

[hokieghal99]
How could I use python to search for a string in binary files? From the
command line, I would do something like this on a Linux machine to find
this string:
grep -a "Microsoft Excel" *.xls
How can I do this in Python?

Quite easily. To get you started, here is an untested draft; I leave it to
you to try and debug. :)

import glob

# Read each file whole and look for the substring anywhere in it.
for name in glob.glob('*.xls'):
    if file(name, 'rb').read().find('Microsoft Excel') >= 0:
        print "Found in", name
 

John Hunter

hokieghal99> And, would it be more efficient (faster) to just call
hokieghal99> grep from python to do the searching?

Depending on how you call grep, probably. If you respawn grep for
each file, it might be slower than the python solution. If you first
build the file list of all the files you want to search and then call
grep on all the files simultaneously, it will likely be a good bit
faster. But you will have to deal with issues like quoting spaces in
filenames, etc....
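A rough sketch of that second approach (not from the thread; it assumes GNU
grep is on the PATH and a Python recent enough to have the subprocess
module). Passing the filenames as a list bypasses the shell entirely, so
spaces in names need no quoting:

import glob
import subprocess

# Collect all the candidate files first, then spawn grep once for the
# whole list instead of once per file.
names = glob.glob('*.xls')
if names:
    # -a: treat binary files as text, -l: print only matching filenames
    subprocess.call(['grep', '-al', 'Microsoft Excel'] + names)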

JDH
 

François Pinard

[hokieghal99]
One last question: does grep actually open files when it searches them?

I have not looked at the `grep' sources for a good while and might not
remember correctly, so read me with caution. `grep' may try to `mmap' each
file when the file (and the underlying system) allows it, and there is system
overhead associated with that call, just as there is with `open'.
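For comparison, Python can do the same kind of mapped search through its
mmap module. A rough sketch in the thread's Python 2 style (not code from
the thread):

import glob
import mmap
import os

for name in glob.glob('*.xls'):
    # mmap cannot map an empty file, so skip those.
    if os.path.getsize(name) == 0:
        continue
    f = file(name, 'rb')
    try:
        # Map the whole file read-only; length 0 means "the entire file".
        m = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        try:
            if m.find('Microsoft Excel') >= 0:
                print "Found in", name
        finally:
            m.close()
    finally:
        f.close()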
[hokieghal99]
And, would it be more efficient (faster) to just call grep from python to
do the searching?

There is no doubt in my mind that calling `grep' _instead_ of Python is more
efficient. However, if Python is already started, it is more efficient to do
the work from within Python than to launch an external program such as
`grep', as there is non-negligible system overhead in doing so. (Yet for
only a few files, launching `grep' is fast enough that the user would not
notice it anyway.)

Still, there are special cases, unusual in practice, when `grep' might be
faster despite the overhead of calling it. When the file is long enough,
and the string to be searched for meets some special conditions, the
Boyer-Moore algorithm might progressively beat the likely more simple-minded
search technique used within `string.find'. Yet if Python's `string.find'
relies on `strstr' in GNU `libc', it might be quite fast already. The
implementation of such basic routines in `libc' has varied over time; at
least at one point they were extremely well implemented for speed, cleverly
using bits of assembler here and there. For `strstr' in particular, there
was once some good code from Stephen van den Berg. I do not know what
`libc' uses nowadays, nor whether Python takes advantage of it.
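For illustration only, here is a rough pure-Python sketch of the Horspool
simplification of Boyer-Moore (not code from the thread; a pure-Python
version is of course far slower than `string.find', the pay-off comes only
from a compiled implementation):

def horspool_find(text, pattern):
    # Boyer-Moore-Horspool: slide a window of len(pattern) over the text
    # and, on a mismatch, skip ahead by the bad-character shift of the
    # last character in the window.
    m = len(pattern)
    n = len(text)
    if m == 0:
        return 0
    # Characters appearing in the pattern (except in its last position)
    # get a shift smaller than the default of len(pattern).
    shift = {}
    for i in range(m - 1):
        shift[pattern[i]] = m - 1 - i
    i = 0
    while i <= n - m:
        if text[i:i + m] == pattern:
            return i
        i += shift.get(text[i + m - 1], m)
    return -1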

Finally, for huge files, proper reading in Python has to be done in chunks,
and the string to be searched for may happen to span two chunks. Handling
that properly requires a bit more care than one might think at first. But
in practice, on average, for reasonable files, staying in Python wins.
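One way to handle the chunk-boundary problem (a sketch in the thread's
Python 2 style, not from the thread) is to keep the last len(pattern) - 1
bytes of the previous chunk and prepend them to the next one:

def find_in_file(name, pattern, chunk_size=1024 * 1024):
    # Read the file in fixed-size chunks, keeping the last
    # len(pattern) - 1 bytes of the previous chunk so that a match
    # spanning a chunk boundary is not missed.  chunk_size is assumed
    # to be much larger than the pattern.
    overlap = len(pattern) - 1
    f = file(name, 'rb')
    try:
        tail = ''
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                return False
            if (tail + chunk).find(pattern) >= 0:
                return True
            if overlap:
                tail = chunk[-overlap:]
            else:
                tail = ''
    finally:
        f.close()

import glob
for name in glob.glob('*.xls'):
    if find_in_file(name, 'Microsoft Excel'):
        print "Found in", name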
 
