Scanning a file


David Rasmussen

Steven said:
What *you* call huge and what *Python* calls huge may be very different
indeed. What are you calling huge?

I'm not saying that it is too big for Python. I am saying that it is too
big for the systems it is going to run on. These files can be 22 MB or 5
GB or ..., depending on the situation. It might not be okay to run a
tool that claims that much memory, even if it is available.

That would be nice :)

But you misunderstand me...
You have to read the file into memory at some stage, otherwise how can you
see what value the bytes are?

I haven't said that I would like to scan the file without reading it. I
am just saying that the .count() functionality implemented into strings
could just as well be applied to some abstraction such as a stream (I
come from C++). In C++, the count() functionality would be separated as
much as possible from any concrete datatype (such as a string),
precisely because it is a concept that is applicable at a more abstract
level. I should be able to say "count the substring occurrences of this
stream" or "using this iterator" or something to that effect. If I could say

print file("filename", "rb").count("\x00\x00\x01\x00")

(or something like that)

instead of the original

print file("filename", "rb").read().count("\x00\x00\x01\x00")

it would be exactly what I am after. What is the conceptual difference?
The first solution should be at least as fast as the second. I have to
read and compare the characters anyway. I just don't need to store them
in a string. In essence, I should be able to use the "count occurrences"
functionality on more things, such as a file, or even better, a file
read through a buffer with a size specified by me.
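
Something along these lines (untested, and the helper name is mine) is
roughly what I mean -- a count that never holds more than a caller-chosen
buffer plus a few bytes of overlap in memory:

def count_in_file(filename, pattern, bufsize=1024*1024):
    overlap = len(pattern) - 1      # enough to catch matches that straddle
    count = 0                       # a buffer boundary
    tail = ''
    f = open(filename, 'rb')
    try:
        while True:
            chunk = f.read(bufsize)
            if not chunk:
                break
            block = tail + chunk
            count += block.count(pattern)
            if overlap:
                tail = block[-overlap:]
        return count
    finally:
        f.close()

print count_in_file("filename", "\x00\x00\x01\x00")

(Counting across block boundaries has subtleties when the pattern can
overlap itself, but it shows the interface I am after.)
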
Here is another thought. What are you going to do with the count when you
are done? That sounds to me like a pretty pointless result: "Hi user, the
file XYZ has 27 occurrences of bitpattern \x00\x00\x01\x00. Would you like
to do another file?"

It might sound pointless to you, but it is not pointless for my purposes :)

If you must know, the above one-liner actually counts the number of
frames in an MPEG2 file. I want to know this number for a number of
files for various reasons. I don't want it to take forever.
If you are planning to use this count to do something, perhaps there is a
more efficient way to combine the two steps into one -- especially
valuable if your files really are huge.

Of course, but I don't need to do anything else in this case.

/David
 

Alex Martelli

Paul Watson said:
How could I identify when Python code does not close files and depends on
the runtime to take care of this? I want to know that the code will work
well under other Python implementations, and under future implementations
which may not provide this.

Then you should use try/finally (to have your code run correctly in all
of today's implementations; Python 2.5 will have a 'with' statement to
offer nicer syntax sugar for that, but it will be a while before all the
implementations get around to adding it).
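
That is, something like this (schematic; the file name and
do_something_with are placeholders):

f = open("thefile.txt")
try:
    do_something_with(f)
finally:
    f.close()       # runs whether or not do_something_with raised

and in 2.5 the same thing will be spellable as a 'with' block.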

If you're trying to test your code to ensure it explicitly closes all
files, you could (from within your tests) rebind built-ins 'file' and
'open' to be a class wrapping the real thing, and adding a flag to
remember if the file is open; at __del__ time it would warn if the file
had not been explicitly closed. E.g. (untested code):

import __builtin__
import warnings

_f = __builtin__.file
class testing_file(_f):
    def __init__(self, *a, **k):
        _f.__init__(self, *a, **k)
        self._opened = True
    def close(self):
        _f.close(self)
        self._opened = False
    def __del__(self):
        if self._opened:
            warnings.warn(...)
            self.close()

__builtin__.file = __builtin__.open = testing_file
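
Then any test run that forgets to close a file should produce the warning
as soon as the forgotten object is finalized, e.g. (again untested, file
name invented):

def sloppy_reader(path):
    return open(path).read()     # oops: never calls close()

sloppy_reader("some_test_file.txt")
import gc; gc.collect()          # force finalization of anything forgotten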



Alex
 

Alex Martelli

Steven D'Aprano said:
I should also point out that for really serious work, the idiom:

f = file("parrot")
handle(f)
f.close()

is insufficiently robust for production level code. That was a detail I
didn't think I needed to drop on the original newbie poster, but depending
on how paranoid you are, or how many exceptions you want to insulate the
user from, something like this might be needed:

try:
    f = file("parrot")
    try:
        handle(f)
    finally:
        try:
            f.close()
        except:
            print "The file could not be closed; see your sys admin."
except:
    print "The file could not be opened."

The inner try/finally is fine, but both the try/except are total, utter,
unmitigated disasters: they will hide a lot of information about
problems, let the program continue in a totally erroneous state, give
mistaken messages if handle(f) causes any kind of error totally
unrelated to opening the file (or if the user hits control-C during a
lengthy run of handle(f)), emit messages that can erroneously end up in
the redirected stdout of your program... VERY, VERY bad things.

Don't ever catch and ``handle'' exceptions in such ways. In particular,
each time you're thinking of writing a bare 'except:' clause, think
again, and you'll most likely find a much better approach.
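
For instance (schematic), if the only failure you actually expect is the
open itself, catch just that, send the diagnostic somewhere sane, and let
everything else propagate:

import sys

try:
    f = file("parrot")
except IOError, e:
    print >>sys.stderr, "could not open parrot:", e
    raise                    # or do whatever recovery really makes sense
else:
    try:
        handle(f)
    finally:
        f.close()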


Alex
 

netvaibhav

Steve said:
Indeed, but reading one byte at a time is about the slowest way to
process a file, in Python or any other language, because it fails to
amortize the overhead cost of function calls over many characters.

Buffering wasn't invented because early programmers had nothing better
to occupy their minds, remember :)

Buffer, and then read one byte at a time from the buffer.
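
Something like this (untested; process() stands in for the real per-byte
work):

def process(byte):
    pass                        # whatever you need to do per byte

f = open('somefile.bin', 'rb')
try:
    while True:
        buf = f.read(64*1024)   # one read() call per 64 KB chunk...
        if not buf:
            break
        for byte in buf:        # ...then iterate the chunk in memory
            process(byte)
finally:
    f.close()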

Vaibhav
 

Fredrik Lundh

Paul said:
This is Cygwin on Windows XP.

using cygwin to analyze performance characteristics of portable API:s
is a really lousy idea.

here are corresponding figures from a real operating system:

using a 16 MB file:

$ time python2.4 scanmap.py
real 0m0.080s
user 0m0.070s
sys 0m0.010s

$ time python2.4 scanpaul.py
real 0m0.458s
user 0m0.450s
sys 0m0.010s

using a 256 MB file (50% of available memory):

$ time python2.4 scanmap.py
real 0m0.913s
user 0m0.810s
sys 0m0.100s

$ time python2.4 scanpaul.py
real 0m7.149s
user 0m6.950s
sys 0m0.200s

using a 1024 MB file (200% of available memory):

$ time python2.4 scanpaul.py
real 0m34.274s
user 0m28.030s
sys 0m1.350s

$ time python2.4 scanmap.py
real 0m20.221s
user 0m3.120s
sys 0m1.520s

(Intel(R) Pentium(R) 4 CPU 2.80GHz, 512 MB RAM, relatively slow ATA
disks, relatively recent Linux, best result of multiple mixed runs shown.
scanmap performance would probably improve if Python supported the
"madvise" API, but I don't have time to test that today...)

</F>
 

Peter Otten

David said:
None of the solutions presented in this thread is nearly as fast as the

print file("filename", "rb").read().count("\x00\x00\x01\x00")

Have you already timed Scott David Daniels' approach with a /large/
blocksize? It looks promising.

Peter
 
 

Steven D'Aprano

The inner try/finally is fine, but both the try/except are total, utter,
unmitigated disasters: they will hide a lot of information about
problems, let the program continue in a totally erroneous state, give
mistaken messages if handle(f) causes any kind of error totally
unrelated to opening the file (or if the user hits control-C during a
lengthy run of handle(f)), emit messages that can erroneously end up in
the redirected stdout of your program... VERY, VERY bad things.

Of course. That's why I said "something like this" and not "do this" :) I
don't expect people to "handle" exceptions with a print statement in
anything more serious than a throw-away script.

For serious, production-level code, more often than not you will end up
spending more time and effort handling errors than you spend on handling
the task your application is actually meant to do. But I suspect I'm not
telling Alex anything he doesn't already know.

Don't ever catch and ``handle'' exceptions in such ways. In particular,
each time you're thinking of writing a bare 'except:' clause, think
again, and you'll most likely find a much better approach.

What would you -- or anyone else -- recommend as a better approach?

Is there a canonical list somewhere that states every possible exception
from a file open or close?
 

Alex Martelli

Steven D'Aprano said:
What would you -- or anyone else -- recommend as a better approach?

That depends on your application, and what you're trying to accomplish
at this point.

Is there a canonical list somewhere that states every possible exception
from a file open or close?

No. But if you get a totally unexpected exception, something that shows
the world has gone crazy and most likely any further action you perform
would run the risk of damaging the user's persistent data since the
machine appears to be careening wildly out of control... WHY would you
want to perform any further action? Crashing and burning (ideally
leaving as detailed a core-dump as feasible for later post-mortem)
appears to be preferable. (Detailed information for post-mortem
purposes is best dumped in a sys.excepthook handler, since wild
unexpected exceptions may occur anywhere and it's impractical to pepper
your application code with bare except clauses for such purposes).
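
E.g., a schematic excepthook (log file name invented) might be:

import sys, time, traceback

def dump_and_die(etype, value, tb):
    log = open('crash.log', 'a')
    log.write('--- unexpected error at %s ---\n' % time.ctime())
    traceback.print_exception(etype, value, tb, None, log)
    log.close()
    print >>sys.stderr, "internal error; details logged to crash.log"

sys.excepthook = dump_and_die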

Obviously, if your program is so life-crucial that it cannot be missing
for a long period of time, you will have separately set up a "hot spare"
system, ready to take over at the behest of a separate monitor program
as soon as your program develops problems of such magnitude (a
"heartbeat" system helps with monitoring). You do need redundant
hardware for that, since the root cause of unexpected problems may well
be in a hardware fault -- the disk has crashed, a memory chip just
melted, the CPU's on strike, locusts...! Not stuff any program can do
much about in the short term, except by switching to a different
machine.


Alex
 

John J. Lee

If you're trying to test your code to ensure it explicitly closes all
files, you could (from within your tests) rebind built-ins 'file' and
'open' to be a class wrapping the real thing, and adding a flag to
remember if the file is open; at __del__ time it would warn if the file
had not been explicitly closed. E.g. (untested code):
[...]

In general __del__ methods interfere with garbage collection, don't
they? I guess in the case of file objects this is unlikely to be
problematic (because unlikely to be any reference cycles), but I
thought it might be worth warning people that in general this
debugging strategy might get rather confusing since the __del__ could
actually change the operation of the program by preventing garbage
collection of the objects whose lifetime you're trying to
investigate...


John
 

Steven D'Aprano

That depends on your application, and what you're trying to accomplish
at this point.



No. But if you get a totally unexpected exception,

I'm more concerned about getting an expected exception -- or more
accurately, *missing* an expected exception. Matching on Exception is too
high. EOFError will probably need to be handled separately, since it often
isn't an error at all, just a change of state. IOError is the bad one.
What else can go wrong?

something that shows
the world has gone crazy and most likely any further action you perform
would run the risk of damaging the user's persistent data since the
machine appears to be careening wildly out of control... WHY would you
want to perform any further action?

In circumstances where, as you put it, the hard disk has crashed, the CPU
is on strike, or the memory has melted down, not only can you not recover
gracefully, but you probably can't even fail gracefully -- at least not
without a special purpose fail-safe operating system.

I'm not concerned with mylist.append(None) unexpectedly -- and
impossibly? -- raising an ImportError. I can't predict every way things
can explode, and even if they do, I can't safely recover from them. But I
can fail gracefully from *expected* errors: rather than the application
just crashing, I can at least try to put up a somewhat informative dialog
box, or at least print something to the console. If opening a preferences
file fails, I can fall back on default settings. If writing the
preferences file fails, I can tell the user that their settings won't be
saved, and continue. Just because the disk is full or their disk quota is
exceeded, doesn't necessarily mean the app can't continue with the job on
hand.
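
For example (schematic; the file format and defaults are invented):

DEFAULTS = {'colour': 'blue', 'width': '80'}

def load_prefs(path):
    prefs = dict(DEFAULTS)
    try:
        f = open(path)
    except IOError:
        return prefs                 # no usable prefs file: fall back to defaults
    try:
        for line in f:
            if '=' in line:
                key, value = line.split('=', 1)
                prefs[key.strip()] = value.strip()
    finally:
        f.close()
    return prefs

prefs = load_prefs('.myapp_prefs')   # either way, the app carries on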
 

Alex Martelli

John J. Lee said:
If you're trying to test your code to ensure it explicitly closes all
files, you could (from within your tests) rebind built-ins 'file' and
'open' to be a class wrapping the real thing, and adding a flag to
remember if the file is open; at __del__ time it would warn if the file
had not been explicitly closed. E.g. (untested code):
[...]

In general __del__ methods interfere with garbage collection, don't

Yes, cyclic gc only.
they? I guess in the case of file objects this is unlikely to be
problematic (because unlikely to be any reference cycles), but I
thought it might be worth warning people that in general this
debugging strategy might get rather confusing since the __del__ could
actually change the operation of the program by preventing garbage
collection of the objects whose lifetime you're trying to
investigate...

Yeah, but you'll find them all listed in gc.garbage for your perusal --
they DO get collected, but they get put in gc.garbage rather than FREED,
that's all. E.g., with two toy classes a and b whose __del__ is just

def __del__(self): print 'del', self.__class__.__name__

a reference cycle between their instances leaves this in gc.garbage after
a collection:

[<__main__.a object at 0x64cf0>, <__main__.b object at 0x58510>]

So, no big deal -- run a gc.collect() and parse through gc.garbage for
any instances of your "wrapper of file" class, and you'll find ones that
were forgotten as part of a cyclic garbage loop and you can check
whether they were explicitly closed or not.
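
E.g. (untested), at the end of a test run:

import gc

gc.collect()
for obj in gc.garbage:
    if isinstance(obj, testing_file) and getattr(obj, '_opened', False):
        print 'never explicitly closed:', obj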


Alex
 

Alex Martelli

Steven D'Aprano said:
I'm more concerned about getting an expected exception -- or more
accurately, *missing* an expected exception. Matching on Exception is too
high. EOFError will probably need to be handled separately, since it often
isn't an error at all, just a change of state. IOError is the bad one.
What else can go wrong?

Lots of things, but not ones you should WISH your application to
survive.
In circumstances where, as you put it, the hard disk has crashed, the CPU
is on strike, or the memory has melted down, not only can you not recover
gracefully, but you probably can't even fail gracefully -- at least not
without a special purpose fail-safe operating system.

Right: as I explained, all you can do is switch operation over to a "hot
spare" server (a different machine), through the good services of a
"monitor" machine - and hope the redundancy you've built into your
storage system (presumably a database with good solid mirroring running
on other machines yet, if your app is as critical as that) has survived.

I'm not concerned with mylist.append(None) unexpectedly -- and
impossibly? -- raising an ImportError. I can't predict every way things
can explode, and even if they do, I can't safely recover from them. But I

Exactly my point -- and yet if you use "except:", that's exactly what
you're TRYING to do, rather than let your app die.
can fail gracefully from *expected* errors: rather than the application
just crashing, I can at least try to put up a somewhat informative dialog
box, or at least print something to the console.

You can do that from your sys.excepthook routine, if it's OK for your
app to die afterwards. What we're discussing here are only, and
strictly, cases in which you wish your app to NOT die, but rather keep
processing (not just give nice error diagnostics and then terminate,
that's what sys.excepthook is for). And what I'm saying is that you
should keep processing only for errors you *DO* expect, and should not
even try if the error is NOT expected.
If opening a preferences
file fails, I can fall back on default settings. If writing the
preferences file fails, I can tell the user that their settings won't be
saved, and continue. Just because the disk is full or their disk quota is
exceeded, doesn't necessarily mean the app can't continue with the job on
hand.

Sure, that's why you catch IOError, which covers these _expected_ cases
(or, if you want to be a bit wider, maybe OSError) AND check its errno
attribute to ensure you haven't mistakenly caught something you did NOT
in fact expect (and thus can't really handle), using a bare raise
statement to re-raise the "accidentally caught" exception if needed.
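
E.g. (schematic; prefs_path and the two helpers are made-up names):

import errno

try:
    f = open(prefs_path)
except IOError, e:
    if e.errno == errno.ENOENT:      # the one case we really expected
        prefs = default_prefs()
    else:
        raise                        # accidentally caught: re-raise untouched
else:
    try:
        prefs = parse_prefs(f)
    finally:
        f.close()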

But if something goes wrong that you had NOT anticipated, just log as
much info as you can for the postmortem, give nice diagnostics to the
user if you wish, and do NOT keep processing -- and for these
diagnostic-only purposes, use sys.excepthook, not a slew of try/except
all over your application making it crufty and unmaintainable.


Alex
 

Bengt Richter

Peter said:
Bengt Richter wrote:

What struck me was


>>> gen = byblocks(StringIO.StringIO('no'),1024,len('end?')-1)
>>> [gen.next() for i in xrange(10)]
['no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no']


Ouch. Seems like I spotted the subtle cornercase error and missed the big
one.

No, you just realised subconsciously that we'd all spot the obvious one
and decided to point out the bug that would remain after the obvious one
had been fixed.
I still smelled a bug in the counting of substring in the overlap region,
and you motivated me to find it (obvious in hindsight, but aren't most ;-)

A substring can get over-counted if the "overlap" region joins infelicitously with the next
input. E.g., try counting 'xx' in 10*'xx' with a read chunk of 4 instead of 1024*1024:

Assuming corrections so far posted as I understand them:

def byblocks(f, blocksize, overlap):
    block = f.read(blocksize)
    yield block
    if overlap>0:
        while True:
            next = f.read(blocksize-overlap)
            if not next: break
            block = block[-overlap:] + next
            yield block
    else:
        while True:
            next = f.read(blocksize)
            if not next: break
            yield next

def countsubst(f, subst, blksize):    # function name assumed
    count = 0
    for block in byblocks(f, blksize, len(subst)-1):
        count += block.count(subst)
    f.close()
    return count

With a chunk of 4, the blocks come out as

['xxxx', 'xxxx', 'xxxx', 'xxxx', 'xxxx', 'xxxx', 'xx']

and counting 'xx' per block totals 13 instead of the 10 you get from
counting in the whole string at once.

Of course, a large read chunk will make the problem either go away (the
count then comes back as the expected 10) or at least make it low
probability, depending on the data.

Regards,
Bengt Richter
 

Tony Nelson

Buffer, and then read one byte at a time from the buffer.

Have you measured it?

#!/usr/bin/python
'''Time some file scanning.
'''

import sys, time

f = open(sys.argv[1])
t = time.time()
while True:
    b = f.read(256*1024)
    if not b:
        break
print 'initial read', time.time() - t
f.close()

f = open(sys.argv[1])
t = time.time()
while True:
    b = f.read(256*1024)
    if not b:
        break
print 'second read', time.time() - t
f.close()

if 1:
    f = open(sys.argv[1])
    t = time.time()
    while True:
        b = f.read(256*1024)
        if not b:
            break
        for c in b:
            pass
    print 'third chars', time.time() - t
    f.close()

f = open(sys.argv[1])
t = time.time()
n = 0
srch = '\x00\x00\x01\x00'
laplen = len(srch)-1
lap = ''
while True:
    b = f.read(256*1024)
    if not b:
        break
    n += (lap+b[:laplen]).count(srch)
    n += b.count(srch)
    lap = b[-laplen:]
print 'fourth scan', time.time() - t, n
f.close()


On my (old) system, with a 512 MB file so it won't all buffer, the
second time I get:

initial read 14.513395071
second read 14.8771388531
third chars 178.250257969
fourth scan 26.1602909565 1
________________________________________________________________________
TonyN.:' *firstname*nlsnews@georgea*lastname*.com
' <http://www.georgeanelson.com/>
 

Paul Watson

Fredrik said:
using cygwin to analyze performance characteristics of portable API:s
is a really lousy idea.

Ok. So, I agree. That is just what I had at hand. Here are some other
numbers to which due diligence has also not been applied. Source code
is at the bottom for both file and mmap process. I would be willing for
someone to tell me what I could improve.

$ python -V
Python 2.4.1

$ uname -a
Linux ruth 2.6.13-1.1532_FC4 #1 Thu Oct 20 01:30:08 EDT 2005 i686

$ cat /proc/meminfo|head -2
MemTotal: 514232 kB
MemFree: 47080 kB

$ time ./scanfile.py
16384

real 0m0.06s
user 0m0.03s
sys 0m0.01s

$ time ./scanfilemmap.py
16384

real 0m0.10s
user 0m0.06s
sys 0m0.00s

Using a ~ 250 MB file, not even half of physical memory.

$ time ./scanfile.py
16777216

real 0m11.19s
user 0m10.98s
sys 0m0.17s

$ time ./scanfilemmap.py
16777216

real 0m55.09s
user 0m43.12s
sys 0m11.92s

==============================

$ cat scanfile.py
#!/usr/bin/env python

import sys

fn = 't.dat'
ss = '\x00\x00\x01\x00'
ss = 'time'

be = len(ss) - 1 # length of overlap to check
blocksize = 64 * 1024 # need to ensure that blocksize > overlap

fp = open(fn, 'rb')
b = fp.read(blocksize)
count = 0
while len(b) > be:
    count += b.count(ss)
    b = b[-be:] + fp.read(blocksize)
fp.close()

print count
sys.exit(0)

===================================

$ cat scanfilemmap.py
#!/usr/bin/env python

import sys
import os
import mmap

fn = 't.dat'
ss = '\x00\x00\x01\x00'
ss='time'

fp = open(fn, 'rb')
b = mmap.mmap(fp.fileno(), os.stat(fp.name).st_size,
              access=mmap.ACCESS_READ)

count = 0
foundpoint = b.find(ss, 0)
while foundpoint != -1 and (foundpoint + 1) < b.size():
    #print foundpoint
    count = count + 1
    foundpoint = b.find(ss, foundpoint + 1)
b.close()

print count

fp.close()
sys.exit(0)
 
 

Steven D'Aprano

Alex said:
I'm more concerned about getting an expected exception -- or more
accurately, *missing* an expected exception. Matching on Exception is too
high. EOFError will probably need to be handled separately, since it often
isn't an error at all, just a change of state. IOError is the bad one.
What else can go wrong?


Lots of things, but not ones you should WISH your application to
survive.
[snip]

Sure, that's why you catch IOError, which covers these _expected_ cases
(or, if you want to be a bit wider, maybe OSError) AND check its errno
attribute to ensure you haven't mistakenly caught something you did NOT
in fact expect (and thus can't really handle), using a bare raise
statement to re-raise the "accidentally caught" exception if needed.

Ah, that's precisely the answer I was looking for:
IOError and OSError, and then check the errno.
Excellent: thanks for that.
But if something goes wrong that you had NOT anticipated, just log as
much info as you can for the postmortem, give nice diagnostics to the
user if you wish, and do NOT keep processing -- and for these
diagnostic-only purposes, use sys.excepthook, not a slew of try/except
all over your application making it crufty and unmaintainable.

Agreed -- I never intended people to draw the
conclusion that every line, or even every logical block
of lines, should be wrapped in try/except.
 

Peter Otten

Bengt said:
I still smelled a bug in the counting of substring in the overlap region,
and you motivated me to find it (obvious in hindsight, but aren't most ;-)

A substring can get over-counted if the "overlap" region joins
infelicitously with the next input. E.g., try counting 'xx' in 10*'xx'
with a read chunk of 4 instead of 1024*1024:

Assuming corrections so far posted as I understand them:

def byblocks(f, blocksize, overlap):
    block = f.read(blocksize)
    yield block
    if overlap>0:
        while True:
            next = f.read(blocksize-overlap)
            if not next: break
            block = block[-overlap:] + next
            yield block
    else:
        while True:
            next = f.read(blocksize)
            if not next: break
            yield next

def countsubst(f, subst, blksize):    # function name assumed
    count = 0
    for block in byblocks(f, blksize, len(subst)-1):
        count += block.count(subst)
    f.close()
    return count

With a chunk of 4, the blocks come out as

['xxxx', 'xxxx', 'xxxx', 'xxxx', 'xxxx', 'xxxx', 'xx']

and counting 'xx' per block totals 13 instead of the expected 10.

Of course, a large read chunk will make the problem either go away (the
count then comes back as the expected 10) or at least make it low
probability, depending on the data.

[David Rasmussen]
First of all, this isn't a text file, it is a binary file. Secondly,
substrings can overlap. In the sequence 0010010 the substring 0010
occurs twice.

Coincidentally the "always overlap" case seems the easiest to fix. It
suffices to replace the count() method with

def count_overlap(s, token):
    pos = -1
    n = 0
    while 1:
        try:
            pos = s.index(token, pos+1)
        except ValueError:
            break
        n += 1
    return n

Or so I hope, without the thorough tests that are indispensable as we should
have learned by now...
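
For instance, on the 0010010 example from earlier in the thread:

print count_overlap('0010010', '0010')   # 2 (overlapping matches counted)
print '0010010'.count('0010')            # 1 (plain count misses the second)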

Peter
 
