Re: Search for a string in binary files

Discussion in 'Python' started by François Pinard, Jul 22, 2003.

  1. [hokieghal99]

    > How could I use python to search for a string in binary files? From the
    > command line, I would do something like this on a Linux machine to find
    > this string:

    > grep -a "Microsoft Excel" *.xls

    > How can I do this in Python?

    Quite easily. To get you started, here is an untested draft; I leave it to
    you to try and debug. :)

    import glob

    for name in glob.glob('*.xls'):
        # Read the whole file as bytes and look for the byte string in it.
        with open(name, 'rb') as f:
            if b'Microsoft Excel' in f.read():
                print("Found in", name)

    François Pinard
    François Pinard, Jul 22, 2003

  2. John Hunter


    >>>>> "hokieghal99" == hokieghal99 writes:

    hokieghal99> And, would it be more efficient (faster) to just call
    hokieghal99> grep from python to do the searching?

    Depending on how you call grep, probably. If you respawn grep for
    each file, it might be slower than the python solution. If you first
    build the list of all the files you want to search and then call
    grep once on the whole list, it will likely be a good bit faster.
    But you will have to deal with issues like quoting spaces in
    filenames, etc.
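
    A minimal sketch of the "build the list, call grep once" approach. The
    helper names `build_grep_command' and `grep_files' are hypothetical, and
    the sketch assumes a `grep' that accepts -a (as GNU grep does). Passing
    the arguments as a list, with no shell involved, sidesteps the quoting
    problem for spaces in filenames:

```python
import subprocess


def build_grep_command(pattern, names):
    """Build one grep invocation covering every file in `names`."""
    # -a: treat binary files as text      -l: print only names of matching files
    # -F: pattern is a fixed string       -e: the next argument is the pattern
    # "--" ends the options, so a filename starting with "-" is not misread.
    return ["grep", "-a", "-l", "-F", "-e", pattern, "--", *names]


def grep_files(pattern, names):
    """Return the names of the files in which grep finds `pattern`."""
    if not names:
        return []
    # No shell is involved, so each name reaches grep as a single argument.
    result = subprocess.run(
        build_grep_command(pattern, names), capture_output=True, text=True
    )
    # grep exits with 0 on a match, 1 on no match, and 2 on an error.
    if result.returncode > 1:
        raise RuntimeError(result.stderr.strip())
    return result.stdout.splitlines()
```

    It could be called as grep_files("Microsoft Excel", glob.glob("*.xls")).
    The sketch does not handle filenames containing newlines, nor file lists
    long enough to exceed the system's limit on command-line length.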

    John Hunter, Jul 22, 2003

  3. [hokieghal99]

    > One last question: does grep actually open files when it searches them?

    I have not looked at the `grep' sources for a good while and might not
    remember correctly, so read me with caution. `grep' may try to `mmap' the
    files when the file (and the underlying system) allows it, and that call
    carries system overhead of its own, just as `open' does.
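
    For comparison, Python can map a file as well, through the standard
    `mmap' module. The following is only a sketch (the function name
    `mmap_contains' is hypothetical), and it says nothing about what `grep'
    itself does; note that the file still has to be opened first:

```python
import mmap


def mmap_contains(path, needle):
    """Return True if the bytes `needle` occur in the file at `path`."""
    with open(path, "rb") as f:
        try:
            # Length 0 maps the whole file; the mapping is read-only.
            mapped = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        except ValueError:
            # An empty file cannot be mapped; only an empty needle matches it.
            return not needle
        with mapped:
            # mmap objects offer find(), like bytes objects.
            return mapped.find(needle) != -1
```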

    > And, would it be more efficient (faster) to just call grep from python to
    > do the searching?

    I have no doubt that running `grep' alone is more efficient than starting
    Python to do the same job. However, if Python is already running, doing the
    work within Python is more efficient than launching an external program
    such as `grep', as there is non-negligible system overhead in doing so.
    (Yet for only a few files, launching `grep' is fast enough that the user
    would not notice it anyway.)

    Still, there are special cases, unusual in practice, when `grep' might be
    faster despite the overhead of calling it. When the file is long enough,
    and the string to be searched for meets some special conditions, the
    Boyer-Moore algorithm might progressively beat the likely more
    simple-minded search technique used within `string.find'. Yet if Python's
    `string.find' relies on `strstr' in GNU `libc', it might be quite fast
    already. The implementation of such basic routines in `libc' has varied
    over time; at one point they were extremely well implemented for speed,
    cleverly using bits of assembler here and there. For `strstr' in
    particular, there was once some good code from Stephen van den Berg. I do
    not know what `libc' uses nowadays, nor whether Python takes advantage of
    it.
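
    To show why a Boyer-Moore style search can win on long inputs, here is a
    sketch of the Horspool simplification of that algorithm (not full
    Boyer-Moore, and not a claim about what `grep' or Python actually
    implement). The function name `horspool_find' is hypothetical. The longer
    the needle, the further the window can jump after a mismatch:

```python
def horspool_find(haystack, needle):
    """Return the lowest index of `needle` in `haystack`, or -1 if absent."""
    m, n = len(needle), len(haystack)
    if m == 0:
        return 0
    if m > n:
        return -1
    # Shift table: for each byte of the needle except the last, the distance
    # from its last occurrence to the end of the needle.
    shift = {byte: m - 1 - i for i, byte in enumerate(needle[:-1])}
    pos = 0
    while pos <= n - m:
        if haystack[pos:pos + m] == needle:
            return pos
        # Look at the byte under the window's last position; a byte that does
        # not occur in the needle lets the window skip a full needle length.
        pos += shift.get(haystack[pos + m - 1], m)
    return -1
```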

    Finally, for huge files, reading in Python has to be done in chunks, and
    the string to be searched for may happen to span two chunks. Doing it
    properly requires more care than one might think at first. But in
    practice, on average, for files of reasonable size, staying in Python wins.
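
    One way to handle the chunk boundary, as a sketch only: carry the last
    len(needle) - 1 bytes of each window over to the next read, so that a
    match straddling two chunks still shows up whole. The function name
    `file_contains' and the default chunk size are assumptions made for
    illustration:

```python
def file_contains(path, needle, chunk_size=1 << 20):
    """Return True if the bytes `needle` occur anywhere in the file at `path`."""
    if not needle:
        return True
    overlap = len(needle) - 1
    tail = b""
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                return False  # end of file reached without a match
            window = tail + chunk
            if needle in window:
                return True
            # Keep the last len(needle) - 1 bytes: long enough to complete a
            # match that started in this window, too short to hold a full one.
            tail = window[-overlap:] if overlap else b""
```

    Memory use stays bounded by roughly chunk_size + len(needle) bytes,
    whatever the size of the file.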

    François Pinard
    François Pinard, Jul 22, 2003
