Scanning a file


Mike Meyer

Paul Watson said:
Here is a better one that counts, and not just detects, the substring. This
is -much- faster than using mmap; especially for a large file that may cause
paging to start. Using mmap can be -very- slow.

#!/usr/bin/env python
import sys

fn = 't2.dat'
ss = '\x00\x00\x01\x00'

be = len(ss) - 1 # length of overlap to check
blocksize = 64 * 1024 # need to ensure that blocksize > overlap

fp = open(fn, 'rb')
b = fp.read(blocksize)
count = 0
while len(b) > be:
    count += b.count(ss)
    b = b[-be:] + fp.read(blocksize)
fp.close()

print count
sys.exit(0)

Did you do timings on it vs. mmap? Having to copy the data multiple
times to deal with the overlap - thanks to strings being immutable -
would seem to be a lose, and makes me wonder how it could be faster
than mmap in general.

Thanks,
<mike
 

Fredrik Lundh

Mike said:
Did you do timings on it vs. mmap? Having to copy the data multiple
times to deal with the overlap - thanks to strings being immutable -
would seem to be a lose, and makes me wonder how it could be faster
than mmap in general.

if you use "mmap" to read large files sequentially, without calling "madvise",
the system may try to keep old pages around just in case, and won't do as
much read-ahead as it can do. if you're low on memory, that means that
the system may waste some time swapping out data for other applications,
rather than throwing away data that you know that you will never look at
again.

if you have reasonably large files and you're not running on an overcrowded
machine, this is usually not a problem.

(does Python's mmap module support madvise, btw? doesn't look like it does...)
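
for reference, where such a hint is exposed it looks roughly like the sketch
below -- the madvise method and the MADV_SEQUENTIAL constant here are
assumptions about a much later mmap API, not something the module discussed
in this thread provides:

import mmap

f = open("t2.dat", "rb")
m = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
# tell the kernel the mapping will be read front to back, so it can read
# ahead aggressively and drop pages once we are past them
m.madvise(mmap.MADV_SEQUENTIAL)
data = m.read()     # note: this pulls the whole mapping into one big string
print(data.count(b"\x00\x00\x01\x00"))
m.close()
f.close()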

</F>
 

Bengt Richter

First of all, this isn't a text file, it is a binary file. Secondly,
substrings can overlap. In the sequence 0010010 the substring 0010
occurs twice.
ISTM you better let others know exactly what you mean by this, before
you use the various solutions suggested or your own ;-)

a) Are you talking about bit strings or byte strings?
b) Do you want to _count_ overlapping substrings??!!
Note result of s.count on your example:

>>> s = '0010010'
>>> sub = '0010'
>>> s.count(sub)
1

vs. brute force counting overlapped substrings (not tested beyond what you see ;-)

>>> def countsubs(s, sub):   # function name filled in for illustration
...     start = count = 0
...     while True:
...         start = s.find(sub, start) + 1
...         if start==0: break
...         count += 1
...     return count
...
>>> countsubs(s, sub)
2

Regards,
Bengt Richter
 

Bengt Richter

Mike Meyer said:
Except if you can't read the file into memory because it's too large,
there's a pretty good chance you won't be able to mmap it either. To
deal with huge files, the only option is to read the file in chunks,
count the occurrences in each chunk, and then do some fiddling
to deal with the pattern landing on a boundary.

That's the kind of thing generators are for...:

def byblocks(f, blocksize, overlap):
    block = f.read(blocksize)
    yield block
    while block:
        block = block[-overlap:] + f.read(blocksize-overlap)
        if block: yield block

Now, to look for a substring of length N in an open binary file f:

f = open(whatever, 'b')
count = 0
for block in byblocks(f, 1024*1024, len(subst)-1):
    count += block.count(subst)
f.close()

not much "fiddling" needed, as you can see, and what little "fiddling"
is needed is entirely encompassed by the generator...
Do I get a job at google if I find something wrong with the above? ;-)

Regards,
Bengt Richter
 

Peter Otten

Bengt said:
Mike Meyer said:
Except if you can't read the file into memory because it's too large,
there's a pretty good chance you won't be able to mmap it either. To
deal with huge files, the only option is to read the file in chunks,
count the occurrences in each chunk, and then do some fiddling
to deal with the pattern landing on a boundary.

That's the kind of thing generators are for...:

def byblocks(f, blocksize, overlap):
    block = f.read(blocksize)
    yield block
    while block:
        block = block[-overlap:] + f.read(blocksize-overlap)
        if block: yield block

Now, to look for a substring of length N in an open binary file f:

f = open(whatever, 'b')
count = 0
for block in byblocks(f, 1024*1024, len(subst)-1):
    count += block.count(subst)
f.close()

not much "fiddling" needed, as you can see, and what little "fiddling"
is needed is entirely encompassed by the generator...
Do I get a job at google if I find something wrong with the above? ;-)

Try it with a subst of length 1. Seems like you missed an opportunity :)

Peter
 

Bengt Richter

Bengt said:
...
Except if you can't read the file into memory because it's too large,
there's a pretty good chance you won't be able to mmap it either. To
deal with huge files, the only option is to read the file in chunks,
count the occurrences in each chunk, and then do some fiddling
to deal with the pattern landing on a boundary.

That's the kind of thing generators are for...:

def byblocks(f, blocksize, overlap):
    block = f.read(blocksize)
    yield block
    while block:
        block = block[-overlap:] + f.read(blocksize-overlap)
        if block: yield block

Now, to look for a substring of length N in an open binary file f:

f = open(whatever, 'b')
count = 0
for block in byblocks(f, 1024*1024, len(subst)-1):
    count += block.count(subst)
f.close()

not much "fiddling" needed, as you can see, and what little "fiddling"
is needed is entirely encompassed by the generator...
Do I get a job at google if I find something wrong with the above? ;-)

Try it with a subst of length 1. Seems like you missed an opportunity :)
I was thinking this was an example a la Alex's previous discussion
of interviewee code challenges ;-)

What struck me was
>>> gen = byblocks(StringIO.StringIO('no'),1024,len('end?')-1)
>>> [gen.next() for i in xrange(10)]
['no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no']

Regards,
Bengt Richter
 

Alex Martelli

Bengt Richter said:
while block:
    block = block[-overlap:] + f.read(blocksize-overlap)
    if block: yield block
...
I was thinking this was an example a la Alex's previous discussion
of interviewee code challenges ;-)

What struck me was
gen = byblocks(StringIO.StringIO('no'),1024,len('end?')-1)
[gen.next() for i in xrange(10)]
['no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no']

Heh, OK, I should get back into the habit of adding a "warning: untested
code" when I post code (particularly when it's late and I'm
jetlagged;-). The code I posted will never exit, since block always
keeps the last overlap bytes; it needs to be changed into something like
(warning -- untested code!-)

if overlap > 0:
    while True:
        next = f.read(blocksize-overlap)
        if not next: break
        block = block[-overlap:] + next
        yield block
else:
    while True:
        next = f.read(blocksize)
        if not next: break
        yield next

(the if/else is needed to handle requests for overlaps <= 0, if desired;
I think it's clearer to split the cases rather than to test inside the
loop's body).
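
Put back into the full generator, together with the first-block read and the
counting loop, the fixed version would look something like this (again
untested, and with the file and pattern names from earlier in the thread
filled in):

def byblocks(f, blocksize, overlap):
    block = f.read(blocksize)
    if block:
        yield block
    if overlap > 0:
        while True:
            next = f.read(blocksize - overlap)
            if not next:
                break
            # carry the last `overlap` bytes forward, so a pattern of
            # overlap+1 bytes that straddles a block boundary is seen
            # exactly once
            block = block[-overlap:] + next
            yield block
    else:
        while True:
            block = f.read(blocksize)
            if not block:
                break
            yield block

subst = '\x00\x00\x01\x00'
f = open('t2.dat', 'rb')
count = 0
for block in byblocks(f, 1024*1024, len(subst) - 1):
    count += block.count(subst)
f.close()
print count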


Alex
 

Paul Watson

Mike Meyer said:
Did you do timings on it vs. mmap? Having to copy the data multiple
times to deal with the overlap - thanks to strings being immutable -
would seem to be a lose, and makes me wonder how it could be faster
than mmap in general.

The only thing copied is a string one byte less than the search string for
each block.

I did not do due diligence with respect to timings. Here is a small
dataset read sequentially and using mmap.

$ ls -lgG t.dat
-rw-r--r-- 1 16777216 Oct 28 16:32 t.dat
$ time ./scanfile.py
1048576
0.80s real 0.64s user 0.15s system
$ time ./scanfilemmap.py
1048576
20.33s real 6.09s user 14.24s system

With a larger file, the system time skyrockets. I assume that to be the
paging mechanism in the OS. This is Cygwin on Windows XP.

$ ls -lgG t2.dat
-rw-r--r-- 1 268435456 Oct 28 16:33 t2.dat
$ time ./scanfile.py
16777216
28.85s real 16.37s user 0.93s system
$ time ./scanfilemmap.py
16777216
323.45s real 94.45s user 227.74s system
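
For comparison, an mmap-based counter along the lines being timed here might
look like the following (a sketch of the approach only, not the actual
scanfilemmap.py):

#!/usr/bin/env python
import mmap, os

fn = 't2.dat'
ss = '\x00\x00\x01\x00'

fp = open(fn, 'rb')
m = mmap.mmap(fp.fileno(), os.path.getsize(fn), access=mmap.ACCESS_READ)

count = 0
pos = m.find(ss)
while pos != -1:
    count += 1
    pos = m.find(ss, pos + len(ss))   # non-overlapping, like str.count

print count
m.close()
fp.close()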
 

netvaibhav

I think implementing a finite state automaton would be a good (best?)
solution. I have drawn an FSM for you; 0 and 1 below stand for the bytes
\x00 and \x01, and state 1 is the start state. Just increment the count
when you reach state 5.

state 1 --0--> state 2      state 1 --1--> state 1
state 2 --0--> state 3      state 2 --1--> state 1
state 3 --0--> state 3      state 3 --1--> state 4
state 4 --0--> state 5      state 4 --1--> state 1
state 5 --0--> state 3      state 5 --1--> state 1

If you don't understand FSM's, try getting a book on computational
theory (the book by Hopcroft & Ullman is great.)

With this approach there are no special cases whether you read in blocks or
read the whole file at once, since you only ever need one byte at a time.
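
A rough byte-at-a-time rendering of that automaton in Python (the table and
the count_with_fsm name are just for illustration; any byte other than \x00
or \x01 sends you back to state 1):

# transition table: (state, byte) -> next state; anything missing
# falls back to state 1
NEXT = {(1, '\x00'): 2, (2, '\x00'): 3,
        (3, '\x00'): 3, (3, '\x01'): 4,
        (4, '\x00'): 5, (5, '\x00'): 3}

def count_with_fsm(f, blocksize=64*1024):
    state = 1
    count = 0
    while True:
        block = f.read(blocksize)
        if not block:
            break
        for ch in block:
            state = NEXT.get((state, ch), 1)
            if state == 5:      # the whole pattern has just been seen
                count += 1
    return count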

Vaibhav
 

Steve Holden

Peter said:
Bengt Richter wrote:

What struck me was

gen = byblocks(StringIO.StringIO('no'),1024,len('end?')-1)
[gen.next() for i in xrange(10)]

['no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no']


Ouch. Seems like I spotted the subtle cornercase error and missed the big
one.

No, you just realised subconsciously that we'd all spot the obvious one
and decided to point out the bug that would remain after the obvious one
had been fixed.

regards
Steve
 

Tim Roberts

Steven D'Aprano said:
Funny you should say that, because I can't stand unnecessary one-liners.

In any case, you are assuming that Python will automagically close the
file when you are done.

Nonsense. This behavior is deterministic. At the end of that line, the
anonymous file object goes out of scope, the object is deleted, and the
file is closed.
 

Scott David Daniels

Paul said:
Here is a better one that counts, and not just detects, the substring. This
is -much- faster than using mmap; especially for a large file that may cause
paging to start. Using mmap can be -very- slow.

> <ss = pattern, be = len(ss) - 1>
...
b = fp.read(blocksize)
count = 0
while len(b) > be:
    count += b.count(ss)
    b = b[-be:] + fp.read(blocksize)
...
In cases where that one wins and blocksize is big,
this should do even better:
...
block = fp.read(blocksize)
count = 0
while len(block) > be:
    count += block.count(ss)
    lead = block[-be:]
    block = fp.read(blocksize)
    count += (lead + block[:be]).count(ss)
...
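
Filled out with the setup from Paul's script, that variant reads roughly as
follows (a sketch, not Scott's literal code). The point is that only the
2*be seam bytes are copied per block, instead of rebuilding the whole block
each time:

#!/usr/bin/env python
fn = 't2.dat'
ss = '\x00\x00\x01\x00'
be = len(ss) - 1
blocksize = 64 * 1024

fp = open(fn, 'rb')
block = fp.read(blocksize)
count = 0
while len(block) > be:
    count += block.count(ss)
    lead = block[-be:]                    # last be bytes of the old block
    block = fp.read(blocksize)
    # any match straddling the boundary lies entirely in these 2*be bytes
    count += (lead + block[:be]).count(ss)
fp.close()
print count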
 

Alex Martelli

Tim Roberts said:
Nonsense. This behavior is deterministic. At the end of that line, the
anonymous file object goes out of scope, the object is deleted, and the
file is closed.

In today's implementations of Classic Python, yes. In other equally
valid implementations of the language, such as Jython, IronPython, or,
for all we know, some future implementation of Classic, that may well
not be the case. Many, quite reasonably, dislike relying on a specific
implementation's peculiarities, and prefer to write code that relies
only on what the _language_ specs guarantee.


Alex
 

Steve Holden

I think implementing a finite state automaton would be a good (best?)
solution. I have drawn an FSM for you; 0 and 1 below stand for the bytes
\x00 and \x01, and state 1 is the start state. Just increment the count
when you reach state 5.

state 1 --0--> state 2      state 1 --1--> state 1
state 2 --0--> state 3      state 2 --1--> state 1
state 3 --0--> state 3      state 3 --1--> state 4
state 4 --0--> state 5      state 4 --1--> state 1
state 5 --0--> state 3      state 5 --1--> state 1

If you don't understand FSM's, try getting a book on computational
theory (the book by Hopcroft & Ullman is great.)

With this approach there are no special cases whether you read in blocks or
read the whole file at once, since you only ever need one byte at a time.

Indeed, but reading one byte at a time is about the slowest way to
process a file, in Python or any other language, because it fails to
amortize the overhead cost of function calls over many characters.

Buffering wasn't invented because early programmers had nothing better
to occupy their minds, remember :)
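
A quick throwaway way to see the difference on the t.dat file from earlier
(just a sketch; the exact numbers will of course vary):

import time

def time_reads(fname, chunk):
    # read the whole file `chunk` bytes at a time, return the elapsed time
    f = open(fname, 'rb')
    start = time.time()
    while f.read(chunk):
        pass
    f.close()
    return time.time() - start

print "1-byte reads: %.2fs" % time_reads('t.dat', 1)
print "64 KB reads:  %.2fs" % time_reads('t.dat', 64 * 1024)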

regards
Steve
 

Paul Watson

In today's implementations of Classic Python, yes. In other equally
valid implementations of the language, such as Jython, IronPython, or,
for all we know, some future implementation of Classic, that may well
not be the case. Many, quite reasonably, dislike relying on a specific
implementation's peculiarities, and prefer to write code that relies
only on what the _language_ specs guarantee.

How could I identify when Python code does not close files and depends on
the runtime to take care of this? I want to know that the code will work
well under other Python implementations, and under future implementations
that may not provide this behavior.
 

Paul Rubin

Paul Watson said:
How could I identify when Python code does not close files and depends on
the runtime to take care of this? I want to know that the code will work
well under other Python implementations, and under future implementations
that may not provide this behavior.

There is nothing in the Python language reference that guarantees the
files will be closed when the references go out of scope. That
CPython does it is simply an implementation artifact. If you want to
make sure they get closed, you have to close them explicitly. There
are some Python language extensions in the works to make this more
convenient (PEP 343) but for now you have to do it by hand.
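
For the curious, the PEP 343 spelling would make it look roughly like this
(a sketch of the proposed syntax, not something current Python accepts):

# the file is closed when the block is left, even if an exception was
# raised inside it
with open("t2.dat", "rb") as f:
    count = f.read().count("\x00\x00\x01\x00")
print count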
 

David Rasmussen

I think implementing a finite state automaton would be a good (best?)
solution. I have drawn an FSM for you; 0 and 1 below stand for the bytes
\x00 and \x01, and state 1 is the start state. Just increment the count
when you reach state 5.

state 1 --0--> state 2      state 1 --1--> state 1
state 2 --0--> state 3      state 2 --1--> state 1
state 3 --0--> state 3      state 3 --1--> state 4
state 4 --0--> state 5      state 4 --1--> state 1
state 5 --0--> state 3      state 5 --1--> state 1

If you don't understand FSM's, try getting a book on computational
theory (the book by Hopcroft & Ullman is great.)

I already have that book. The above solution is very slow in practice. None
of the solutions presented in this thread is nearly as fast as the

print file("filename", "rb").read().count("\x00\x00\x01\x00")

/David
 

Mike Meyer

Paul Watson said:
The only thing copied is a string one byte less than the search string for
each block.

Um - you removed the code, but I could have *sworn* that it did
something like:

buf = buf[testlen:] + f.read(bufsize - testlen)

which should cause the creation of three strings: the last few
bytes of the old buffer, a new bufferful from the read, then the sum
of those two - created by copying the first two into a new string. So
you wind up copying all the data.

Which, as you showed, doesn't take nearly as much time as using mmap.

Thanks,
<mike
 

Steven D'Aprano

Nonsense. This behavior is deterministic. At the end of that line, the
anonymous file object goes out of scope, the object is deleted, and the
file is closed.

That is an implementation detail. CPython may do that, but JPython does
not -- or at least it did not last time I looked. JPython doesn't
guarantee that the file will be closed at any particular time, just that
it will be closed eventually.

If all goes well. What if you have a circular dependence and the file
reference never gets garbage-collected? What happens if the JPython
process gets killed before the file is closed? You might not care about
one file not being released, but what if it is hundreds of files?

In general, it is best practice to release external resources as soon as
you're done with them, and not rely on a garbage collector which may or
may not release them in a timely manner.

There are circumstances where things do not go well and the file never
gets closed cleanly -- for example, when your disk is full, and the
buffer is only written to disk when you close the file. Would you
prefer that error to raise an exception, or to pass silently? If you want
close exceptions to pass silently, then by all means rely on the garbage
collector to close the file.

You might not care about these details in a short script -- when I'm
writing a use-once-and-throw-away script, that's what I do. But it isn't
best practice: explicit is better than implicit.

I should also point out that for really serious work, the idiom:

f = file("parrot")
handle(f)
f.close()

is insufficiently robust for production level code. That was a detail I
didn't think I needed to drop on the original newbie poster, but depending
on how paranoid you are, or how many exceptions you want to insulate the
user from, something like this might be needed:

try:
    f = file("parrot")
    try:
        handle(f)
    finally:
        try:
            f.close()
        except:
            print "The file could not be closed; see your sys admin."
except:
    print "The file could not be opened."
 
