Scanning a file

pinkfloydhomer

I want to scan a file byte by byte for occurrences of the four-byte
pattern 0x00000100. I've tried this:

# start
import sys

numChars = 0
startCode = 0
count = 0

inputFile = sys.stdin

while True:
    ch = inputFile.read(1)
    numChars += 1

    if len(ch) < 1: break

    startCode = ((startCode << 8) & 0xffffffff) | (ord(ch))
    if numChars < 4: continue

    if startCode == 0x00000100:
        count = count + 1

print count
# end

But it is very slow. What is the fastest way to do this? Using some
native call? Using a buffer? Using whatever?

/David

Gerhard Häring

I want to scan a file byte by byte [...]
while True:
    ch = inputFile.read(1)
[...] But it is very slow. What is the fastest way to do this? Using some
native call? Using a buffer? Using whatever?

Read in blocks, not byte for byte. I had good experiences with block
sizes like 4096 or 8192.

-- Gerhard

pinkfloydhomer

Okay, how do I do this?

Also, if you look at the code, I build a 32-bit unsigned integer from
the bytes I read, and the 32-bit pattern I am looking for can start on
_any_ byte boundary in the file. It would be nice if I could somehow
scan for that pattern directly, without having to build a 32-bit
integer first; if I could just tell Python: "scan this file for the
bytes 0, 0, 1, 0 in succession. How many did you find?"

/David

Paul Rubin

I want to scan a file byte by byte for occurrences of the four-byte
pattern 0x00000100. I've tried with this:

Use re.search or string.find. The simplest way is just to read the whole
file into memory first. If the file is too big, you have to read it in
chunks and include some hair to notice if the four-byte pattern straddles
two adjoining chunks.

pinkfloydhomer

I'm now down to:

f = open("filename", "rb")
s = f.read()
sub = "\x00\x00\x01\x00"
count = s.count(sub)
print count

Which is quite fast. The only problem is that the file might be huge.
I really have no need for reading the entire file into a string as I am
doing here. All I want is to count occurrences of this substring. Can I
somehow count occurrences in a file without reading it into a string
first?

/David

Guest

f = open("filename", "rb")
s = f.read()
sub = "\x00\x00\x01\x00"
count = s.count(sub)
print count

That's a lot of lines. This is a bit off topic, but I just can't stand
unnecessary local variables.

print file("filename", "rb").read().count("\x00\x00\x01\x00")

Jorge Godoy

Which is quite fast. The only problem is that the file might be huge.
I really have no need for reading the entire file into a string as I am
doing here. All I want is to count occurrences of this substring. Can I
somehow count occurrences in a file without reading it into a string
first?

How about iterating through the file? You can read it line by line, two lines
at a time. Pseudocode follows:

line1 = read_line
while line2 = read_line:
    line_to_check = ''.join([line1, line2])
    check_for_desired_string
    line1 = line2

With that you always have two lines in the buffer and you can check all of
them for your desired string, no matter what the size of the file is.


Be seeing you,

Bernhard Herzog

Jorge Godoy said:
How about iterating through the file? You can read it line by line, two lines
at a time. Pseudocode follows:

line1 = read_line
while line2 = read_line:
    line_to_check = ''.join([line1, line2])
    check_for_desired_string
    line1 = line2

With that you always have two lines in the buffer and you can check all of
them for your desired string, no matter what the size of the file is.

This will fail if the string to search for is e.g. "\n\n\n\n" and it
actually occurs in the file.

Bernhard

pinkfloydhomer

First of all, this isn't a text file; it is a binary file. Secondly,
substrings can overlap. In the sequence 0010010, the substring 0010
occurs twice.
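Worth noting for the .count()-based answers in this thread: str.count (and bytes.count in Python 3) only counts non-overlapping occurrences, so it undercounts exactly the case described here. A find() loop that advances one byte per hit handles it (Python 3 sketch; the helper name is made up):

```python
def count_overlapping(data, pattern):
    """Count occurrences of pattern in data, including overlapping ones."""
    count = 0
    pos = data.find(pattern)
    while pos != -1:
        count += 1
        # restart the search one byte past the last hit, not past the
        # whole match, so overlapping occurrences are still found
        pos = data.find(pattern, pos + 1)
    return count
```

For the sequence above, b"0010010".count(b"0010") gives 1, while this gives 2.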

/David

Andrew McCarthy

I'm now down to:

f = open("filename", "rb")
s = f.read()
sub = "\x00\x00\x01\x00"
count = s.count(sub)
print count

Which is quite fast. The only problem is that the file might be huge.
I really have no need for reading the entire file into a string as I am
doing here. All I want is to count occurrences of this substring. Can I
somehow count occurrences in a file without reading it into a string
first?

Yes - use memory mapping (the mmap module). An mmap object is like a
cross between a file and a string, but the data is only read into RAM
when, and for as long as, necessary. An mmap object doesn't have a
count() method, but you can just use find() in a while loop instead.

Andrew

Jeremy Sanders

Gerhard said:
I want to scan a file byte by byte [...]
while True:
    ch = inputFile.read(1)
[...] But it is very slow. What is the fastest way to do this? Using some
native call? Using a buffer? Using whatever?

Read in blocks, not byte for byte. I had good experiences with block
sizes like 4096 or 8192.

It's difficult to handle overlaps. The four byte sequence may occur at the
end of one block and beginning of the next. You'd need to check for these
special cases.

Jeremy

Paul Watson

I want to scan a file byte by byte for occurrences of the four-byte
pattern 0x00000100. I've tried with this:

# start
import sys

numChars = 0
startCode = 0
count = 0

inputFile = sys.stdin

while True:
    ch = inputFile.read(1)
    numChars += 1

    if len(ch) < 1: break

    startCode = ((startCode << 8) & 0xffffffff) | (ord(ch))
    if numChars < 4: continue

    if startCode == 0x00000100:
        count = count + 1

print count
# end

But it is very slow. What is the fastest way to do this? Using some
native call? Using a buffer? Using whatever?

/David

How about something like:

#!/usr/bin/env python

import sys

fn = 't.dat'
ss = '\x00\x00\x01\x00'

be = len(ss) - 1 # length of overlap to check
blocksize = 4 * 1024 # need to ensure that blocksize > overlap

fp = open(fn, 'rb')
b = fp.read(blocksize)
found = 0
while len(b) > be:
    if b.find(ss) != -1:
        found = 1
        break
    b = b[-be:] + fp.read(blocksize)
fp.close()
sys.exit(found)

James Stroud

That's a lot of lines. This is a bit off topic, but I just can't stand
unnecessary local variables.

print file("filename", "rb").read().count("\x00\x00\x01\x00")


The "f" is not entirely unnecessary, because the part of the code you
didn't see was

f.close()

which would be considered good practice.

James

--
James Stroud
UCLA-DOE Institute for Genomics and Proteomics
Box 951570
Los Angeles, CA 90095

http://www.jamesstroud.com/

Mike Meyer

Andrew McCarthy said:
Yes - use memory mapping (the mmap module). An mmap object is like a
cross between a file and a string, but the data is only read into RAM
when, and for as long as, necessary. An mmap object doesn't have a
count() method, but you can just use find() in a while loop instead.

Except that if you can't read the file into memory because it's too
large, there's a pretty good chance you won't be able to mmap it
either. To deal with huge files, the only option is to read the file
in chunks, count the occurrences in each chunk, and then do some
fiddling to deal with the pattern landing on a boundary.

<mike

Kent Johnson

I want to scan a file byte by byte for occurrences of the four-byte
pattern 0x00000100.

data = sys.stdin.read()
print data.count('\x00\x00\x01\x00')

Kent

Paul Watson

I want to scan a file byte by byte for occurrences of the four-byte
pattern 0x00000100. I've tried with this:

# start
import sys

numChars = 0
startCode = 0
count = 0

inputFile = sys.stdin

while True:
    ch = inputFile.read(1)
    numChars += 1

    if len(ch) < 1: break

    startCode = ((startCode << 8) & 0xffffffff) | (ord(ch))
    if numChars < 4: continue

    if startCode == 0x00000100:
        count = count + 1

print count
# end

But it is very slow. What is the fastest way to do this? Using some
native call? Using a buffer? Using whatever?

/David

Here is an attempt at counting, using the mmap facility. There appear
to be some serious backward compatibility issues: I tried Python 2.1 on
Windows and AIX and had some odd results. If you are on 2.4.1 or higher
that should not be a problem.

#!/usr/bin/env python
import sys
import os
import mmap

fn = 't.dat'
ss = '\x00\x00\x01\x00'

fp = open(fn, 'rb')
b = mmap.mmap(fp.fileno(), os.stat(fp.name).st_size,
              access=mmap.ACCESS_READ)

count = 0
foundpoint = b.find(ss, 0)
while foundpoint != -1 and (foundpoint + 1) < b.size():
    count = count + 1
    foundpoint = b.find(ss, foundpoint + 1)
b.close()

print count

fp.close()
sys.exit(0)

Steven D'Aprano

Which is quite fast. The only problem is that the file might be huge.

What *you* call huge and what *Python* calls huge may be very different
indeed. What are you calling huge?
I really have no need for reading the entire file into a string as I am
doing here. All I want is to count occurrences of this substring. Can I
somehow count occurrences in a file without reading it into a string
first?

Magic?

You have to read the file into memory at some stage, otherwise how can you
see what value the bytes are? The only question is, can you read it all
into one big string (in which case, your solution is unlikely to be
beaten), or do you have to read the file in chunks and deal with the
boundary cases (which is harder)?

Here is another thought. What are you going to do with the count when you
are done? That sounds to me like a pretty pointless result: "Hi user, the
file XYZ has 27 occurrences of bitpattern \x00\x00\x01\x00. Would you like
to do another file?"

If you are planning to use this count to do something, perhaps there is a
more efficient way to combine the two steps into one -- especially
valuable if your files really are huge.

Steven D'Aprano

That's a lot of lines. This is a bit off topic, but I just can't stand
unnecessary local variables.

print file("filename", "rb").read().count("\x00\x00\x01\x00")

Funny you should say that, because I can't stand unnecessary one-liners.

In any case, you are assuming that Python will automagically close the
file when you are done. That's good enough for a script, but not best
practice.

f = open("filename", "rb")
print f.read().count("\x00\x00\x01\x00")
f.close()

is safer, has no unnecessary local variables, and is not unnecessarily
terse. Unfortunately, it doesn't solve the original poster's problem,
because his file is too big to read into memory all at once -- or so he
tells us.
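For completeness: since Python 2.5 (with a __future__ import; natively from 2.6), the with statement gives the same guarantee without an explicit close(). A sketch, using a throwaway temp file in place of the real one so it is self-contained:

```python
import os
import tempfile

# stand-in for the real "filename" so the sketch is runnable as-is
tmp = tempfile.NamedTemporaryFile(delete=False)
tmp.write(b"\x00\x00\x01\x00" * 2)
tmp.close()

with open(tmp.name, "rb") as f:
    # f is closed automatically when the block exits, even on an exception
    count = f.read().count(b"\x00\x00\x01\x00")
print(count)

os.unlink(tmp.name)
```

It still reads the whole file into memory, so the huge-file caveat above stands.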

Alex Martelli

Mike Meyer said:
Except that if you can't read the file into memory because it's too
large, there's a pretty good chance you won't be able to mmap it
either. To deal with huge files, the only option is to read the file
in chunks, count the occurrences in each chunk, and then do some
fiddling to deal with the pattern landing on a boundary.

That's the kind of things generators are for...:

def byblocks(f, blocksize, overlap):
    block = f.read(blocksize)
    yield block
    while block:
        more = f.read(blocksize - overlap)
        if not more:
            break  # at EOF, stop instead of re-yielding the tail forever
        block = block[-overlap:] + more
        yield block

Now, to look for a substring of length N in an open binary file f:

f = open(whatever, 'rb')
count = 0
for block in byblocks(f, 1024*1024, len(subst)-1):
    count += block.count(subst)
f.close()

not much "fiddling" needed, as you can see, and what little "fiddling"
is needed is entirely encompassed by the generator...


Alex

Paul Watson

Here is a better one that counts, and not just detects, the substring. This
is -much- faster than using mmap, especially for a large file that may cause
paging to start. Using mmap can be -very- slow.

#!/usr/bin/env python
import sys

fn = 't2.dat'
ss = '\x00\x00\x01\x00'

be = len(ss) - 1 # length of overlap to check
blocksize = 64 * 1024 # need to ensure that blocksize > overlap

fp = open(fn, 'rb')
b = fp.read(blocksize)
count = 0
while len(b) > be:
    count += b.count(ss)
    b = b[-be:] + fp.read(blocksize)
fp.close()

print count
sys.exit(0)