Scanning a file

pinkfloydhomer

I want to scan a file byte by byte for occurrences of the four-byte
pattern 0x00000100. I've tried this:

# start
import sys

numChars = 0
startCode = 0
count = 0

inputFile = sys.stdin

while True:
    ch = inputFile.read(1)
    numChars += 1

    if len(ch) < 1: break

    startCode = ((startCode << 8) & 0xffffffff) | (ord(ch))
    if numChars < 4: continue

    if startCode == 0x00000100:
        count = count + 1

print count
# end

But it is very slow. What is the fastest way to do this? Using some
native call? Using a buffer? Using whatever?

/David

Gerhard Häring

I want to scan a file byte by byte [...]
while True:
    ch = inputFile.read(1)
[...] But it is very slow. What is the fastest way to do this? Using some
native call? Using a buffer? Using whatever?

Read in blocks, not byte for byte. I had good experiences with block
sizes like 4096 or 8192.

-- Gerhard

pinkfloydhomer

Okay, how do I do this?

Also, if you look at the code, I build a 32-bit unsigned integer from
the bytes I read, and the 32-bit pattern I am looking for can start on
_any_ byte boundary in the file. It would be nice if I could somehow
scan for that pattern directly, without having to build a 32-bit
integer first; if I could just tell Python: "scan this file for the
bytes 0, 0, 1, 0 in succession. How many did you find?"

/David

Paul Rubin

I want to scan a file byte by byte for occurrences of the four-byte
pattern 0x00000100. I've tried with this:

Use re.search or string.find. The simplest way is just to read the whole
file into memory first. If the file is too big, you have to read it in
chunks and include some hair to notice if the four-byte pattern straddles
two adjoining chunks.

pinkfloydhomer

I'm now down to:

f = open("filename", "rb")
s = f.read()
sub = "\x00\x00\x01\x00"
count = s.count(sub)
print count

Which is quite fast. The only problem is that the file might be huge.
I really have no need for reading the entire file into a string as I am
doing here. All I want is to count occurrences of this substring. Can I
somehow count occurrences in a file without reading it into a string
first?

/David

Guest

f = open("filename", "rb")
s = f.read()
sub = "\x00\x00\x01\x00"
count = s.count(sub)
print count

That's a lot of lines. This is a bit off topic, but I just can't stand
unnecessary local variables.

print file("filename", "rb").read().count("\x00\x00\x01\x00")

Jorge Godoy

Which is quite fast. The only problem is that the file might be huge.
I really have no need for reading the entire file into a string as I am
doing here. All I want is to count occurrences of this substring. Can I
somehow count occurrences in a file without reading it into a string
first?

How about iterating through the file? You can read it line by line, two lines
at a time. Pseudocode follows:

line1 = read_line
while line2 = read_line:
    line_to_check = ''.join([line1, line2])
    check_for_desired_string
    line1 = line2

With that you always have two lines in the buffer and you can check all of
them for your desired string, no matter what the size of the file is.


Be seeing you,

Bernhard Herzog

Jorge Godoy said:
How about iterating through the file? You can read it line by line, two lines
at a time. Pseudocode follows:

line1 = read_line
while line2 = read_line:
    line_to_check = ''.join([line1, line2])
    check_for_desired_string
    line1 = line2

With that you always have two lines in the buffer and you can check all of
them for your desired string, no matter what the size of the file is.

This will fail if the string to search for is e.g. "\n\n\n\n" and it
actually occurs in the file.

Bernhard

pinkfloydhomer

First of all, this isn't a text file; it is a binary file. Secondly,
substrings can overlap. In the sequence 0010010, the substring 0010
occurs twice.
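Worth noting for the .count()-based answers in this thread: str.count (and bytes.count in Python 3) only counts non-overlapping occurrences, so it undercounts exactly the case described here. A find() loop that advances one byte per hit handles it (Python 3 sketch; the helper name is made up):

```python
def count_overlapping(data, pattern):
    """Count occurrences of pattern in data, including overlapping ones."""
    count = 0
    pos = data.find(pattern)
    while pos != -1:
        count += 1
        # restart the search one byte past the last hit, not past the
        # whole match, so overlapping occurrences are still found
        pos = data.find(pattern, pos + 1)
    return count
```

For the sequence above, b"0010010".count(b"0010") gives 1, while this gives 2.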

/David

Andrew McCarthy

I'm now down to:

f = open("filename", "rb")
s = f.read()
sub = "\x00\x00\x01\x00"
count = s.count(sub)
print count

Which is quite fast. The only problem is that the file might be huge.
I really have no need for reading the entire file into a string as I am
doing here. All I want is to count occurrences of this substring. Can I
somehow count occurrences in a file without reading it into a string
first?

Yes - use memory mapping (the mmap module). An mmap object is like a
cross between a file and a string, but the data is only read into RAM
when, and for as long as, necessary. An mmap object doesn't have a
count() method, but you can just use find() in a while loop instead.

Andrew

Jeremy Sanders

Gerhard said:
I want to scan a file byte by byte [...]
while True:
    ch = inputFile.read(1)
[...] But it is very slow. What is the fastest way to do this? Using some
native call? Using a buffer? Using whatever?

Read in blocks, not byte for byte. I had good experiences with block
sizes like 4096 or 8192.

It's difficult to handle overlaps. The four byte sequence may occur at the
end of one block and beginning of the next. You'd need to check for these
special cases.

Jeremy

Paul Watson

I want to scan a file byte by byte for occurrences of the four-byte
pattern 0x00000100. I've tried with this:

# start
import sys

numChars = 0
startCode = 0
count = 0

inputFile = sys.stdin

while True:
    ch = inputFile.read(1)
    numChars += 1

    if len(ch) < 1: break

    startCode = ((startCode << 8) & 0xffffffff) | (ord(ch))
    if numChars < 4: continue

    if startCode == 0x00000100:
        count = count + 1

print count
# end

But it is very slow. What is the fastest way to do this? Using some
native call? Using a buffer? Using whatever?

/David

How about something like:

#!/usr/bin/env python

import sys

fn = 't.dat'
ss = '\x00\x00\x01\x00'

be = len(ss) - 1 # length of overlap to check
blocksize = 4 * 1024 # need to ensure that blocksize > overlap

fp = open(fn, 'rb')
b = fp.read(blocksize)
found = 0
while len(b) > be:
    if b.find(ss) != -1:
        found = 1
        break
    b = b[-be:] + fp.read(blocksize)
fp.close()
sys.exit(found)

James Stroud

That's a lot of lines. This is a bit off topic, but I just can't stand
unnecessary local variables.

print file("filename", "rb").read().count("\x00\x00\x01\x00")


The "f" is not entirely unnecessary, because the part of the code you
didn't see was

f.close()

which would be considered good practice.

James

--
James Stroud
UCLA-DOE Institute for Genomics and Proteomics
Box 951570
Los Angeles, CA 90095

http://www.jamesstroud.com/

Mike Meyer

Andrew McCarthy said:
Yes - use memory mapping (the mmap module). An mmap object is like a
cross between a file and a string, but the data is only read into RAM
when, and for as long as, necessary. An mmap object doesn't have a
count() method, but you can just use find() in a while loop instead.

Except that if you can't read the file into memory because it's too
large, there's a pretty good chance you won't be able to mmap it
either. To deal with huge files, the only option is to read the file
in chunks, count the occurrences in each chunk, and then do some
fiddling to deal with the pattern landing on a boundary.

<mike

Kent Johnson

I want to scan a file byte by byte for occurrences of the four-byte
pattern 0x00000100.

data = sys.stdin.read()
print data.count('\x00\x00\x01\x00')

Kent

Paul Watson

I want to scan a file byte by byte for occurrences of the four-byte
pattern 0x00000100. I've tried with this:

# start
import sys

numChars = 0
startCode = 0
count = 0

inputFile = sys.stdin

while True:
    ch = inputFile.read(1)
    numChars += 1

    if len(ch) < 1: break

    startCode = ((startCode << 8) & 0xffffffff) | (ord(ch))
    if numChars < 4: continue

    if startCode == 0x00000100:
        count = count + 1

print count
# end

But it is very slow. What is the fastest way to do this? Using some
native call? Using a buffer? Using whatever?

/David

Here is an attempt at counting, using the mmap facility. There appear
to be some serious backward compatibility issues: I tried Python 2.1 on
Windows and AIX and had some odd results. If you are on 2.4.1 or higher
that should not be a problem.

#!/usr/bin/env python
import sys
import os
import mmap

fn = 't.dat'
ss = '\x00\x00\x01\x00'

fp = open(fn, 'rb')
b = mmap.mmap(fp.fileno(), os.stat(fp.name).st_size,
              access=mmap.ACCESS_READ)

count = 0
foundpoint = b.find(ss, 0)
while foundpoint != -1 and (foundpoint + 1) < b.size():
    count = count + 1
    foundpoint = b.find(ss, foundpoint + 1)
b.close()

print count

fp.close()
sys.exit(0)

Steven D'Aprano

Which is quite fast. The only problem is that the file might be huge.

What *you* call huge and what *Python* calls huge may be very different
indeed. What are you calling huge?
I really have no need for reading the entire file into a string as I am
doing here. All I want is to count occurrences of this substring. Can I
somehow count occurrences in a file without reading it into a string
first?

Magic?

You have to read the file into memory at some stage, otherwise how can you
see what value the bytes are? The only question is, can you read it all
into one big string (in which case, your solution is unlikely to be
beaten), or do you have to read the file in chunks and deal with the
boundary cases (which is harder)?

Here is another thought. What are you going to do with the count when you
are done? That sounds to me like a pretty pointless result: "Hi user, the
file XYZ has 27 occurrences of bitpattern \x00\x00\x01\x00. Would you like
to do another file?"

If you are planning to use this count to do something, perhaps there is a
more efficient way to combine the two steps into one -- especially
valuable if your files really are huge.

Steven D'Aprano

That's a lot of lines. This is a bit off topic, but I just can't stand
unnecessary local variables.

print file("filename", "rb").read().count("\x00\x00\x01\x00")

Funny you should say that, because I can't stand unnecessary one-liners.

In any case, you are assuming that Python will automagically close the
file when you are done. That's good enough for a script, but not best
practice.

f = open("filename", "rb")
print f.read().count("\x00\x00\x01\x00")
f.close()

is safer, has no unnecessary local variables, and is not unnecessarily
terse. Unfortunately, it doesn't solve the original poster's problem,
because his file is too big to read into memory all at once -- or so he
tells us.
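For completeness: since Python 2.5 (with a __future__ import; natively from 2.6), the with statement gives the same guarantee without an explicit close(). A sketch, using a throwaway temp file in place of the real one so it is self-contained:

```python
import os
import tempfile

# stand-in for the real "filename" so the sketch is runnable as-is
tmp = tempfile.NamedTemporaryFile(delete=False)
tmp.write(b"\x00\x00\x01\x00" * 2)
tmp.close()

with open(tmp.name, "rb") as f:
    # f is closed automatically when the block exits, even on an exception
    count = f.read().count(b"\x00\x00\x01\x00")
print(count)

os.unlink(tmp.name)
```

It still reads the whole file into memory, so the huge-file caveat above stands.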

Alex Martelli

Mike Meyer said:
Except that if you can't read the file into memory because it's too
large, there's a pretty good chance you won't be able to mmap it
either. To deal with huge files, the only option is to read the file
in chunks, count the occurrences in each chunk, and then do some
fiddling to deal with the pattern landing on a boundary.

That's the kind of things generators are for...:

def byblocks(f, blocksize, overlap):
    block = f.read(blocksize)
    yield block
    while block:
        more = f.read(blocksize - overlap)
        if not more:
            break  # at EOF, stop instead of re-yielding the tail forever
        block = block[-overlap:] + more
        yield block

Now, to look for a substring of length N in an open binary file f:

f = open(whatever, 'rb')
count = 0
for block in byblocks(f, 1024*1024, len(subst)-1):
    count += block.count(subst)
f.close()

not much "fiddling" needed, as you can see, and what little "fiddling"
is needed is entirely encompassed by the generator...


Alex

Paul Watson

Here is a better one that counts, and not just detects, the substring. This
is -much- faster than using mmap, especially for a large file that may cause
paging to start. Using mmap can be -very- slow.

#!/usr/bin/env python
import sys

fn = 't2.dat'
ss = '\x00\x00\x01\x00'

be = len(ss) - 1 # length of overlap to check
blocksize = 64 * 1024 # need to ensure that blocksize > overlap

fp = open(fn, 'rb')
b = fp.read(blocksize)
count = 0
while len(b) > be:
    count += b.count(ss)
    b = b[-be:] + fp.read(blocksize)
fp.close()

print count
sys.exit(0)