Buffer pair for lexical analysis of raw binary data

A

Angus Rodgers

Partly as an educational exercise, and partly for its practical
benefit, I'm trying to pick up a programming project from where
I left off in 2001. It implemented in slightly generalised form
the "buffer pair" scheme for lexical analysis described on pp.
88--92 of Aho et al., /Compilers: Principles, Techniques and
Tools/ (1986). (I'm afraid I don't have a page reference for the
2007 second edition. Presumably it's also in Knuth somewhere.)

Documentation for one of the C++ header files describes it thus
(but I never quite got the hang of C++, so some of the language-
specific details may be very poorly conceived):

"An <ipfile> object incorporates a handle to a file, opened in
read-only mode, and a buffer containing (by default) raw binary
data from that file. The constructor also has an option to open
a file in text mode.

The buffer may, optionally, consist of several segments, linked
to one another in cyclic sequence. The number of segments is a
constant class member, nblocks (1 <= nblocks <= 32,767). A second
constant class member, block (1 <= block <= 32,767) gives the size
of each of the segments in bytes.

The purpose of creating a buffer in cyclically linked segments
is to allow reference to the history of reading the file, even
though it is being read sequentially. The bare class <ipfile>
does not do this itself, but is designed so that classes derived
from it may incorporate one or more pointers to parts of the buffer
that have already been read (assuming these parts have not yet been
overwritten).

If there were only one segment, the length of available history
would periodically be reduced to zero, when the buffer is re-
freshed. In general, the available history occupies at least
a fraction (nblocks - 1)/nblocks of a full buffer."

Aho et al. describe the scheme thus (p. 90):

"Two pointers to the input buffer are maintained. The string
of characters between the two pointers is the current lexeme.
Initially, both pointers point to the first character of the
next lexeme to be found. One, called the forward pointer, scans
ahead until a match for a pattern is found. Once the next lexeme
is determined, the forward pointer is set to the character at
its right end. After the lexeme is processed, both pointers
are set to the character immediately past the lexeme."

[There follows a description of the use of "sentinels" to test
efficiently for pointers moving past the end of input to date.]

I seem to remember (but my memory is still very hazy) that there
was some annoying difficulty in coding the raw binary input file
reading operation in C++ in an implementation-independent way;
and I'm reluctant to go back and perhaps get bogged down again
in whatever way I got bogged down before; so I would prefer to
use Python for the whole thing, if possible (either using some
existing library, or else by recoding it all myself in Python).

Does some Python library already provide some functionality like
this? (It's enough to do it with nblocks = 2, as in Aho et al.)

If not, is this a reasonable thing to try to program in Python?
(At the same time as learning the language, and partly as a
fairly demanding exercise intended to help me to learn it.)

Or should I just get my hands dirty with some C++ compiler or
other, and get my original code working on my present machine
(possibly in ANSI C instead of C++), and call it from Python?
 
A

Aahz

Partly as an educational exercise, and partly for its practical
benefit, I'm trying to pick up a programming project from where
I left off in 2001. It implemented in slightly generalised form
the "buffer pair" scheme for lexical analysis described on pp.
88--92 of Aho et al., /Compilers: Principles, Techniques and
Tools/ (1986). (I'm afraid I don't have a page reference for the
2007 second edition. Presumably it's also in Knuth somewhere.)

[...]

Does some Python library already provide some functionality like
this? (It's enough to do it with nblocks = 2, as in Aho et al.)

Not AFAIK, but there may well be something in the recipes or PyPI; have
you tried searching them?
If not, is this a reasonable thing to try to program in Python?

Definitely! It should be lots easier to program this with Python.
 
A

Angus Rodgers

Partly as an educational exercise, and partly for its practical
benefit, I'm trying to pick up a programming project from where
I left off in 2001. It implemented in slightly generalised form
the "buffer pair" scheme for lexical analysis described on pp.
88--92 of Aho et al., /Compilers: Principles, Techniques and
Tools/ (1986). (I'm afraid I don't have a page reference for the
2007 second edition. Presumably it's also in Knuth somewhere.)

[...]

Does some Python library already provide some functionality like
this? (It's enough to do it with nblocks = 2, as in Aho et al.)

Not AFAIK, but there may well be something in the recipes or PyPI; have
you tried searching them?

Searching for "buffer" at <http://pypi.python.org/pypi> (which I
didn't know about) gives quite a few hits (including reflex 0.1,
"A lightweight regex-based lexical scanner library").

By "recipes", do you mean
<http://code.activestate.com/recipes/langs/python/> (also new to me)?

There is certainly a lot of relevant code there (e.g. "Recipe 392150:
Buffered Stream with Multiple Forward-Only Readers"), which I can try
to learn from, even if I can't use it directly.

Thanks!
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,769
Messages
2,569,579
Members
45,053
Latest member
BrodieSola

Latest Threads

Top