Approach to tokenizing

Charlton Wilbur

All,

I'm working on something that requires a data format that is
relatively easy to tokenize and parse. I've solved it in C, but I'm
interested in how people with a richer toolset would approach it. In
short, it's based on Lisp S-expressions, with a rough spec like this:

* ( and ) are tokens.
* Whitespace is not significant, except that it separates tokens.
* Any other sequence of characters that contains no whitespace or
parentheses is a token.
* ; introduces a comment that lasts to the end of the line.
* The exception to all of this is a double-quoted character string,
which is a single token from opening quote to closing quote.
Quotes that need to appear in the string can be escaped with a
backslash.
* If a double-quoted string extends beyond a line boundary, or is open
at the end of the file, a warning should be issued.
* If a quote is found in the middle of a non-quoted token, a warning
should be issued.

So, as an example, an input file like this:

(a b (c d) "quoted string"
non-quoted-long-token)))

would produce the following list of tokens (separated by semicolons):

(; a; b; (; c; d; ); quoted string; non-quoted-long-token; ); ); )

And an input file like this:

embedded"quote ("with \"quotes\"") "this
spans two lines"

would produce the following list of tokens:

embedded"quote; (; with "quotes"; ); this
spans two lines

and issue two warnings: one for the quote in mid-token and one for
the newline inside the quoted string.

Note that matching parentheses is *not* required; that's handled in
another layer. All this layer needs to do is split the input file
into tokens.
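
For illustration only, here is a minimal sketch of the rules above as a
single regex-driven scanner. It is written in Python rather than Perl,
but the pattern should carry over to a Perl m/\G.../gc loop with little
change. The names TOKEN_RE and tokenize are made up for this sketch, and
it makes two assumptions the spec leaves open: a ; outside a quoted
string always starts a comment (even directly after a token), and a
backslash inside a quoted string protects the next character, with only
\" being unescaped in the result.

```python
import re

# One alternative per rule in the spec. The alternatives start with
# mutually exclusive characters, so every input character is consumed
# by exactly one of them.
TOKEN_RE = re.compile(r'''
    (?P<ws>      \s+ )                   # whitespace only separates tokens
  | (?P<comment> ;[^\n]* )               # comment runs to end of line
  | (?P<paren>   [()] )                  # each parenthesis is its own token
  | " (?P<string> (?:\\.|[^"\\])* ) (?P<close> "? )  # quoted string, maybe unterminated
  | (?P<bare>    [^\s()";][^\s();]* )    # anything else (assumption: ';' ends it)
''', re.VERBOSE | re.DOTALL)


def tokenize(text):
    """Return (tokens, warnings) for the given input text."""
    tokens, warnings = [], []
    line = 1  # line on which the current match starts
    for m in TOKEN_RE.finditer(text):
        if m.group('paren'):
            tokens.append(m.group('paren'))
        elif m.group('string') is not None:
            body = m.group('string')
            if '\n' in body:
                warnings.append(f'line {line}: quoted string spans a line boundary')
            if not m.group('close'):
                warnings.append(f'line {line}: quoted string still open at end of file')
            # Assumption: only \" is unescaped; other backslashes are kept.
            tokens.append(body.replace('\\"', '"'))
        elif m.group('bare'):
            if '"' in m.group('bare'):
                warnings.append(f'line {line}: quote in the middle of an unquoted token')
            tokens.append(m.group('bare'))
        # Whitespace and comments fall through: they produce no token.
        line += m.group(0).count('\n')
    return tokens, warnings
```

Calling tokenize on the second example input returns the token list
together with two warnings, one for the mid-token quote and one for the
multi-line string. Parentheses are not matched here, in keeping with the
note above; the sketch returns plain strings, so a later layer that needs
to tell quoted tokens from bare ones would have to carry a token kind
alongside each value.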

Assume for the sake of making things interesting that
Parse::RecDescent isn't available. How would you solve this problem
Perlishly? Efficiency counts, but clarity and elegance count for more.

Charlton
 
