Scanning a file


Lasse Vågsæther Karlsen

David Rasmussen wrote:
If you must know, the above one-liner actually counts the number of
frames in an MPEG2 file. I want to know this number for a number of
files for various reasons. I don't want it to take forever.
<snip>

Don't you risk getting more "frames" than the file actually has? What
if the encoded data happens to have the magic byte values for something
else?
 

Steven D'Aprano

David said:
I'm not saying that it is too big for Python. I am saying that it is too
big for the systems it is going to run on. These files can be 22 MB or 5
GB or ..., depending on the situation. It might not be okay to run a
tool that claims that much memory, even if it is available.

If your files can reach multiple gigabytes, you will
definitely need an algorithm that avoids reading the
entire file into memory at once.


[snip]
print file("filename", "rb").count("\x00\x00\x01\x00")

(or something like that)

instead of the original

print file("filename", "rb").read().count("\x00\x00\x01\x00")

it would be exactly what I am after.

I think I can say, without risk of contradiction, that
there is no built-in method to do that.

> What is the conceptual difference?
The first solution should be at least as fast as the second. I have to
read and compare the characters anyway. I just don't need to store them
in a string. In essence, I should be able to use the "count occurrences"
functionality on more things, such as a file, or even better, a file
read through a buffer with a size specified by me.

Of course, if you feel like coding the algorithm and
submitting it to be included in the next release of
Python... :)


I can't help feeling that a generator with a buffer is
the way to go, but I just can't *quite* deal with the
case where the pattern overlaps the boundary... it is
very annoying.

But not half as annoying as it must be to you :)

However, there may be a simpler solution *fingers
crossed* -- you are searching for a sub-string
"\x00\x00\x01\x00", which is hex 0x100. Surely you
don't want any old substring of "\x00\x00\x01\x00", but
only the ones which align on word boundaries?

So "ABCD\x00\x00\x01\x00" would match (in hex, it is
0x41424344 0x100), but "AB\x00\x00\x01\x00CD" should
not, because that is 0x41420000 0x1004344 in hex.

If that is the case, your problem is simpler: you don't
have to worry about the pattern crossing a boundary, so
long as your buffer is a multiple of four bytes.
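(Editorial aside: a minimal sketch of what that aligned-only count could look like, in the Python 2 style of the thread. David says below that the marker is in fact not guaranteed to be aligned, so this is purely illustrative; the function name and chunk size are made up.)

def count_aligned(fileobj, marker='\x00\x00\x01\x00', chunksize=1 << 20):
    # chunksize is a multiple of len(marker), so an aligned marker can
    # never straddle a chunk boundary and no overlap handling is needed
    assert chunksize % len(marker) == 0
    count = 0
    while True:
        chunk = fileobj.read(chunksize)
        if not chunk:
            return count
        # look only at word-aligned positions
        for i in xrange(0, len(chunk), len(marker)):
            if chunk[i:i + len(marker)] == marker:
                count += 1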
 

Paul Watson

Alex Martelli wrote:
....
[<__main__.a object at 0x64cf0>, <__main__.b object at 0x58510>]

So, no big deal -- run a gc.collect() and parse through gc.garbage for
any instances of your "wrapper of file" class, and you'll find ones that
were forgotten as part of a cyclic garbage loop and you can check
whether they were explicitly closed or not.


Alex
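(Editorial illustration of the inspection Alex describes; WrappedFile stands in for whatever "wrapper of file" class is in use and is not from the thread.)

import gc

gc.collect()                      # force a collection so cyclic garbage is found
for obj in gc.garbage:            # uncollectable cycles end up here
    # assumes the wrapper exposes a 'closed' flag like a real file object
    if isinstance(obj, WrappedFile) and not obj.closed:
        print "unclosed file left in a garbage cycle:", obj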

Since everyone needs this, how about building it in such that files
which are closed by the runtime, and not user code, are reported or
queryable? Perhaps a command line switch to either invoke or suppress
reporting them on exit.

Is there any facility for another program to peer into the state of a
Python program? Would this be a security problem?
 

Steve Holden

Paul said:
Alex Martelli wrote:
...
>>> gc.garbage
[<__main__.a object at 0x64cf0>, <__main__.b object at 0x58510>]

So, no big deal -- run a gc.collect() and parse through gc.garbage for
any instances of your "wrapper of file" class, and you'll find ones that
were forgotten as part of a cyclic garbage loop and you can check
whether they were explicitly closed or not.


Alex


Since everyone needs this, how about building it in such that files
which are closed by the runtime, and not user code, are reported or
queryable? Perhaps a command line switch to either invoke or suppress
reporting them on exit.
This is a rather poor substitute for correct program design and
implementation. It also begs the question of exactly what constitutes a
"file". What about a network socket that the user has run makefile() on?
What about a pipe to another process? This suggestion is rather ill-defined.
Is there any facility for another program to peer into the state of a
Python program? Would this be a security problem?

It would indeed be a security problem, and there are enough of those
already without adding more.

regards
Steve
 

Bengt Richter

David Rasmussen wrote:

<snip>

Don't you risk getting more "frames" than the file actually has? What
if the encoded data happens to have the magic byte values for something
else?
Good point, but perhaps the bit pattern the OP is looking for is guaranteed
(e.g. by some kind of HDLC-like bit or byte stuffing or escaping) not to occur
except as frame marker (which might make sense re the problem of re-synching
to frames in a glitched video stream).

The OP probably knows. I imagine this thread would have gone differently if the
title had been "How to count frames in an MPEG2 file?" and the OP had supplied
the info about what marks a frame and whether it is guaranteed not to occur
in the data ;-)

Regards,
Bengt Richter
 

Bengt Richter

Bengt said:
I still smelled a bug in the counting of substring in the overlap region,
and you motivated me to find it (obvious in hindsight, but aren't most ;-)

A substring can get over-counted if the "overlap" region joins
infelicitously with the next input. E.g., try counting 'xx' in 10*'xx'
with a read chunk of 4 instead of 1024*1024:

Assuming corrections so far posted as I understand them:
>>> def byblocks(f, blocksize, overlap):
...     block = f.read(blocksize)
...     yield block
...     if overlap>0:
...         while True:
...             next = f.read(blocksize-overlap)
...             if not next: break
...             block = block[-overlap:] + next
...             yield block
...     else:
...         while True:
...             next = f.read(blocksize)
...             if not next: break
...             yield next
...
>>> def countsubst(f, subst, blksize=1024*1024):
...     count = 0
...     for block in byblocks(f, blksize, len(subst)-1):
...         count += block.count(subst)
...     f.close()
...     return count
...
>>> from StringIO import StringIO as S
>>> countsubst(S('xx'*10), 'xx', 4)    # over-counts; the correct total is 10
13
>>> list(byblocks(S('xx'*10), 4, len('xx')-1))
['xxxx', 'xxxx', 'xxxx', 'xxxx', 'xxxx', 'xxxx', 'xx']

Of course, a large read chunk will make the problem either go away:

>>> countsubst(S('xx'*10), 'xx', 1024)
10

or might make it low probability depending on the data.
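(Editorial sketch, not from the thread: one way to make the chunked count exact for any chunk size is to walk the matches with find() and carry forward only the unmatched tail, so nothing is ever counted twice. Python 2 style to match the rest of the thread; use bytes literals on Python 3.)

def count_in_chunks(fileobj, marker, chunksize=1024*1024):
    # Non-overlapping count, equivalent to reading the whole file and
    # calling .count(marker), but with bounded memory use.
    count = 0
    buf = ''
    while True:
        chunk = fileobj.read(chunksize)
        if not chunk:
            return count
        buf += chunk
        pos = 0
        while True:
            hit = buf.find(marker, pos)
            if hit == -1:
                break
            count += 1
            pos = hit + len(marker)      # skip past the match (non-overlapping)
        # keep just enough unmatched bytes to catch a marker straddling
        # the next chunk boundary
        buf = buf[max(pos, len(buf) - len(marker) + 1):]

Counting 'xx' in StringIO('xx'*10) with this gives 10 for a chunk size of 3, 4 or a megabyte alike.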

[David Rasmussen]
First of all, this isn't a text file, it is a binary file. Secondly,
substrings can overlap. In the sequence 0010010 the substring 0010
occurs twice.
The OP didn't reply to my post re the above for some reason
http://groups.google.com/group/comp.lang.python/msg/dd4125bf38a54b7c?hl=en&
Coincidentally the "always overlap" case seems the easiest to fix. It
suffices to replace the count() method with

def count_overlap(s, token):
    pos = -1
    n = 0
    while 1:
        try:
            pos = s.index(token, pos+1)
        except ValueError:
            break
        n += 1
    return n

Or so I hope, without the thorough tests that are indispensable as we should
have learned by now...
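(Quick editorial sanity check, not part of the original exchange: it gives the expected answer for David's example, and 19 rather than 10 for the 'xx' case above, since every start position counts once overlaps are allowed.)

>>> count_overlap('0010010', '0010')
2
>>> count_overlap('xx' * 10, 'xx')
19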
Unfortunately, there is such a thing as a correct implementation of an incorrect spec ;-)
I have some doubts about the OP's really wanting to count overlapping patterns as above,
which is what I asked about in the above referenced post. Elsewhere he later reveals:

[David Rasmussen]
I am not too sure about the details, but I've been told from a reliable
source that 0x00000100 only occurs as a "begin frame" marker, and not
anywhere else.

In which case I doubt whether he wants to count as above. Scanning for the
particular 4 bytes would assume that non-frame-marker data is escaped
one way or another so it can't contain the marker byte sequence.
(If it did, you'd want to skip it, not count it, I presume). Robust streaming video
format would presumably be designed for unambiguous re-synching, meaning
the data stream can't contain the sync mark. But I don't know if that
is guaranteed in conversion from file to stream a la HDLC or some link packet protocol
or whether it is actually encoded with escaping in the file. If framing in the file is with
length-specifying packet headers and no data escaping, then the filebytes.count(pattern)
approach is not going to do the job reliably, as Lasse was pointing out.

Requirements, requirements ;-)

Regards,
Bengt Richter
 

Paul Watson

Steve said:
This is a rather poor substitute for correct program design and
implementation. It also begs the question of exactly what constitutes a
"file". What about a network socket that the user has run makefile() on?
What about a pipe to another process? This suggestion is rather
ill-defined.


It would indeed be a security problem, and there are enough of those
already without adding more.

regards
Steve

All I am looking for is the runtime to tell me when it is doing things
that are outside the language specification and that the developer
should have coded.

How "ill" will things be when large bodies of code cannot run
successfully on a future version of Python or a non-CPython
implementation which does not close files? Might as well put file
closing on exit into the specification.

The runtime knows it is doing it. Please allow the runtime to tell me
what it knows it is doing. Thanks.
 

Steve Holden

Paul said:
All I am looking for is the runtime to tell me when it is doing things
that are outside the language specification and that the developer
should have coded.

How "ill" will things be when large bodies of code cannot run
successfully on a future version of Python or a non-CPython
implementation which does not close files? Might as well put file
closing on exit into the specification.

The runtime knows it is doing it. Please allow the runtime to tell me
what it knows it is doing. Thanks.

In point of fact I don't believe the runtime does any such thing
(though I must admit I haven't checked the source, so you may prove me
wrong).

As far as I know, Python simply relies on the operating system to close
files left open at the end of the program.

regards
Steve
 

John J. Lee

Paul Watson said:
How "ill" will things be when large bodies of code cannot run
successfully on a future version of Python or a non-CPython
implementation which does not close files? Might as well put file
closing on exit into the specification.
[...]

There are many, many ways of making a large body of code "ill".

Closing off this particular one would make it harder to get benefit of
non-C implementations of Python, so it has been judged "not worth it".
I think I agree with that judgement.


John
 

Paul Rubin

Closing off this particular one would make it harder to get benefit of
non-C implementations of Python, so it has been judged "not worth it".
I think I agree with that judgement.

The right fix is PEP 343.
 

Alex Martelli

Steve Holden said:
In point of fact I don't believe the runtime does any such thing
(though I must admit I haven't checked the source, so you may prove me
wrong).

As far as I know, Python simply relies on the operating system to close
files left open at the end of the program.

Nope, see
<http://cvs.sourceforge.net/viewcvs.py/python/python/dist/src/Objects/fi
leobject.c?rev=2.164.2.3&view=markup> :

"""
static void
file_dealloc(PyFileObject *f)
{
int sts = 0;
if (f->weakreflist != NULL)
PyObject_ClearWeakRefs((PyObject *) f);
if (f->f_fp != NULL && f->f_close != NULL) {
Py_BEGIN_ALLOW_THREADS
sts = (*f->f_close)(f->f_fp);
"""
etc.

Exactly how the OP wants to "allow the runtime to tell [him] what it
knows it is doing", that is not equivalent to reading the freely
available sources of that runtime, is totally opaque to me, though.

"The runtime" (implementation of built-in object type `file`) could be
doing or not doing a bazillion things (in its ..._dealloc function as
well as many other functions), up to and including emailing the OP's
cousin if it detects the OP is up later than his or her bedtime -- the
language specs neither mandate nor forbid such behavior. How, exactly,
does the OP believe the language specs should "allow" (presumably,
REQUIRE) ``the runtime'' to communicate the sum total of all that it's
doing or not doing (beyond whatever the language specs themselves may
require or forbid it to do) on any particular occasion...?!


Alex
 

Paul Watson

Paul said:
The right fix is PEP 343.

I am sure you are right. However, PEP 343 will not change the existing
body of Python source code. Nor will it, alone, change the existing
body of Python programmers who are writing code which does not close files.
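(For reference, an editorial sketch of the PEP 343 form; at the time it needed Python 2.5 and a __future__ import. The whole-file read is kept only for brevity; see the chunked approaches earlier for huge files.)

from __future__ import with_statement   # only needed on Python 2.5

def count_marker(fpath, marker='\x00\x00\x01\x00'):
    with open(fpath, 'rb') as f:
        data = f.read()
    # f is already closed here, even if an exception was raised inside the
    # with block -- no reliance on refcounting or dealloc behaviour.
    return data.count(marker)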
 

Paul Watson

Alex said:
In point of fact I don't believe the runtime does any such thing
(though I must admit I haven't checked the source, so you may prove me
wrong).

As far as I know, Python simply relies on the operating system to close
files left open at the end of the program.


Nope, see
<http://cvs.sourceforge.net/viewcvs.py/python/python/dist/src/Objects/fi
leobject.c?rev=2.164.2.3&view=markup> :

"""
static void
file_dealloc(PyFileObject *f)
{
int sts = 0;
if (f->weakreflist != NULL)
PyObject_ClearWeakRefs((PyObject *) f);
if (f->f_fp != NULL && f->f_close != NULL) {
Py_BEGIN_ALLOW_THREADS
sts = (*f->f_close)(f->f_fp);
"""
etc.

Exactly how the OP wants to "allow the runtime to tell [him] what it
knows it is doing", that is not equivalent to reading the freely
available sources of that runtime, is totally opaque to me, though.

"The runtime" (implementation of built-in object type `file`) could be
doing or not doing a bazillion things (in its ..._dealloc function as
well as many other functions), up to and including emailing the OP's
cousin if it detects the OP is up later than his or her bedtime -- the
language specs neither mandate nor forbid such behavior. How, exactly,
does the OP believe the language specs should "allow" (presumably,
REQUIRE) ``the runtime'' to communicate the sum total of all that it's
doing or not doing (beyond whatever the language specs themselves may
require or forbid it to do) on any particular occasion...?!


Alex

The OP wants to know which files the runtime is closing automatically.
This may or may not occur on other or future Python implementations.
Identifying this condition will accelerate remediation efforts to avoid
the deleterious impact of failure to close().


The mechanism to implement such a capability might be similar to the -v
switch which traces imports, reporting to stdout. It might be a
callback function.
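(Editorial sketch of what such reporting could look like in user code today, in Python 2 style; all names are illustrative. It only covers files opened via the builtin open(), and the strong references it keeps will themselves delay collection, so it is a debugging aid rather than runtime support.)

import atexit
import sys
import __builtin__

_tracked = []

class _TrackedFile(file):
    # remembers whether close() was ever called explicitly
    def __init__(self, *args):
        file.__init__(self, *args)
        self._closed_explicitly = False
        _tracked.append(self)
    def close(self):
        self._closed_explicitly = True
        file.close(self)

def _report_unclosed():
    for f in _tracked:
        if not f._closed_explicitly:
            print >> sys.stderr, "never explicitly closed:", f.name

atexit.register(_report_unclosed)
__builtin__.open = _TrackedFile        # every open() from here on is tracked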
 

David Rasmussen

Lasse said:
David Rasmussen wrote:


Don't you risk getting more "frames" than the file actually has? What
if the encoded data happens to have the magic byte values for something
else?

I am not too sure about the details, but I've been told from a reliable
source that 0x00000100 only occurs as a "begin frame" marker, and not
anywhere else. So far, it has been true on the files I have tried it on.

/David
 

David Rasmussen

Bengt said:
Good point, but perhaps the bit pattern the OP is looking for is guaranteed
(e.g. by some kind of HDLC-like bit or byte stuffing or escaping) not to occur
except as frame marker (which might make sense re the problem of re-synching
to frames in a glitched video stream).

Exactly.

The OP probably knows. I imagine this thread would have gone differently if the
title had been "How to count frames in an MPEG2 file?" and the OP had supplied
the info about what marks a frame and whether it is guaranteed not to occur
in the data ;-)

Sure, but I wanted to ask the general question :) I am new to Python and
I want to learn about the language.

/David
 

David Rasmussen

Steven said:
However, there may be a simpler solution *fingers crossed* -- you are
searching for a sub-string "\x00\x00\x01\x00", which is hex 0x100.
Surely you don't want any old substring of "\x00\x00\x01\x00", but only
the ones which align on word boundaries?

Nope, sorry. On the files I have tried this on, the pattern could occur
on any byte boundary.

/David
 

Steven D'Aprano

David said:
I am not too sure about the details, but I've been told from a reliable
source that 0x00000100 only occurs as a "begin frame" marker, and not
anywhere else. So far, it has been true on the files I have tried it on.

Not too reliable then.

0x00000100 is one of a number of unique start codes in
the MPEG2 standard. It is guaranteed to be unique in
the video stream; however, when searching for codes
within the video stream, make sure you're in the video
stream!

See, for example,
http://forum.doom9.org/archive/index.php/t-29262.html

"Actually, one easy way (DVD specific) is to look for
00 00 01 e0 at byte offset 00e of the pack. Then look
at byte 016, it contains the size of the extension.
Resume your scan at 017 + contents of 016."

Right. Glad that's the easy way.
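(An editorial, untested transcription of that advice into code, just to show the bookkeeping involved. The 2048-byte pack size and the reading of the offsets as hex are assumptions, not from the post.)

PACK_SIZE = 2048                     # standard DVD sector/pack size (assumed)

def iter_video_payloads(f):
    while True:
        pack = f.read(PACK_SIZE)
        if len(pack) < PACK_SIZE:
            return
        if pack[0x0e:0x12] == '\x00\x00\x01\xe0':   # video PES start code at 0x0e
            ext_len = ord(pack[0x16])               # size of the header extension
            yield pack[0x17 + ext_len:]             # resume scanning from here

Frame start codes would then be counted only inside the yielded payloads.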

I really suspect that you need a proper MPEG2 parser,
and not just blindly counting bytes -- at least if you
want reliable, accurate counts and not just "number of
frames, plus some file-specific random number". And
heaven help you if you want to support MPEGs that are
slightly broken...

(It has to be said, depending on your ultimate needs,
"close enough" may very well be, um, close enough.)

Good luck!
 

Bengt Richter

I am sure you are right. However, PEP 343 will not change the existing
body of Python source code. Nor will it, alone, change the existing
body of Python programmers who are writing code which does not close files.

It might be possible to recompile existing code (unchanged) to capture most
typical cpython use cases, I think...

E.g., I can imagine a family of command line options based on hooking import on
startup and passing option info to the selected and hooked import module,
which module would do extra things at the AST stage of compiling and executing modules
during import, to accomplish various things.

(I did a little proof of concept a while back, see

http://mail.python.org/pipermail/python-list/2005-August/296594.html

that gives me the feeling I could do this kind of thing).

E.g., for the purposes of guaranteeing close() on files opened in typical cpython
one-liners (or single-suiters) like e.g.

for i, line in enumerate(open(fpath)):
    print '%04d: %s' %(i, line.rstrip())

I think a custom import could recognize the open call
in the AST and extract it and wrap it up in a try/finally AST structure implementing
something like the following in place of the above:

__f = open(fpath) # (suitable algorithm for non-colliding __f names is required)
try:
    for i, line in enumerate(__f):
        print '%04d: %s' %(i, line.rstrip())
finally:
    __f.close()

In this case, the command line info passed to the special import might look like
python -with open script.py

meaning calls of open in a statement/suite should be recognized and extracted like
__f = open(fpath) above, and the try/finally be wrapped around the use of it.

I think this would capture a lot of typical usage, but of course I haven't bumped into
the gotchas yet, since I haven't implemented it ;-)

On a related note, I think one could implement macros of a sort in a similar way.
The command line parameter would pass the name of a class which is actually extracted
at AST-time, and whose methods and other class variables represent macro definitions
to be used in the processing of the rest of the module's AST, before compilation per se.

Thus you could implement e.g. in-lining, so that

----
#example.py
class inline:
    def mac(acc, x, y):
        acc += x*y

tot = 0
for i in xrange(10):
    mac(tot, i, i)
----

Could be run with

python -macros inline example.py

and get the same identical .pyc as you would with the source

----
#example.py
tot = 0
for i in xrange(10):
    tot += i*i
----

IOW, a copy of the macro body AST is substituted for the macro call AST, with
parameter names translated to actual macro call arg names. (Another variant
would also permit putting the macros in a separate module, and recognize their
import into other modules, and "do the right thing" instead of just translating
the import. Maybe specify the module by python -macromodule inline example.py
and then recognize "import inline" in example.py's AST).

Again, I just have a hunch I could make this work (and a number of people
here could beat me to it if they were motivated, I'm sure). Also have a hunch
I might need some flame shielding. ;-)

OTOH, it could be an easy way to experiment with some kinds of language
tweaks. The only limitation really is the necessity for the source to
look legal enough that an AST is formed and preserves the requisite info.
After that, there's no limit to what an AST-munger could do, especially
if it is allowed to call arbitrary tools and create auxiliary files such
as e.g. .dlls for synthesized imports plugging stuff into the final translated context ;-)
(I imagine this is essentially what the various machine code generating optimizers do).

IMO the concept of modules and their (optionally specially controlled) translation
and use could evolve in many interesting directions. E.g., __import__ could grow
keyword parameters too ... Good thing there is a BDFL with a veto, eh? ;-)

Should I bother trying to implement this import for with and macros from
the pieces I have (plus imp, to do it "right") ?

BTW, I haven't experimented with command line dependent site.py/sitecustomize.py stuff.
Would that be a place to do sessionwise import hooking and could one rewrite sys.argv
so the special import command line opts would not be visible to subsequent
processing (and the import hook would be in effect)? IWT so, but probably should read
site.py again and figure it out, but appreciate any hints on pitfalls ;-)

Regards,
Bengt Richter
 

David Rasmussen

Steven said:
0x00000100 is one of a number of unique start codes in the MPEG2
standard. It is guaranteed to be unique in the video stream, however
when searching for codes within the video stream, make sure you're in
the video stream!

I know I am in the cases I am interested in.
And heaven help you if you want to support MPEGs that are slightly
broken...

I don't. This tool is for use in house only. And on MPEGs that are
generated in house too.

/David
 
