Charlton Wilbur
All,
I'm working on something that requires a data format that is
relatively easy to tokenize and parse. I've solved it in C, but I'm
interested in how people with a richer toolset would approach it. In
short, it's based on LISP S-expressions, with a rough spec like this:
* ( and ) are tokens.
* Whitespace is not significant.
* Any other sequence of characters containing no whitespace or
parentheses is a token.
* ; introduces a comment that lasts to the end of the line.
* The exception to all of this is double-quoted character strings,
which are an entire token from opening quote to closing quote.
Quotes that need to be in the string can be backslashed.
* If a double-quoted string extends beyond a line boundary, or is open
at the end of the file, a warning should be issued.
* If a quote is found in the middle of a non-quoted token, a warning
should be issued.
So, as an example, an input file like this:
(a b (c d) "quoted string"
non-quoted-long-token)))
would produce the following list of tokens (separated by semicolons):
(; a; b; (; c; d; ); quoted string; non-quoted-long-token; ); ); )
And an input file like this:
embedded"quote ("with \"quotes\"") "this
spans two lines"
would produce the following list of tokens:
embedded"quote; (; with "quotes"; ); "this
spans two lines"
and issue two warnings: one for the quote in mid-token and one for
the newline inside the quoted string.
Note that matching parentheses is *not* required; that's handled in
another layer. All this layer needs to do is split the input file
into tokens.
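To pin the rules down, here is a rough reference sketch in Python. It
is not my C code and not the Perl answer I'm after, just one reading
of the spec above. Assumptions that the spec doesn't settle: the
surrounding quotes are stripped from string tokens and a backslash
makes the next character literal; a ; or a parenthesis ends a bare
token; warnings are collected in a list rather than printed. The name
tokenize and the warning texts are made up for illustration.

```python
def tokenize(text):
    """Split text into tokens; return (tokens, warnings).

    Illustrative sketch only: the function name, the warning wording and
    the handling of edge cases are assumptions, not part of the spec.
    """
    tokens, warnings = [], []
    i, n, line = 0, len(text), 1
    while i < n:
        ch = text[i]
        if ch == "\n":
            line += 1
            i += 1
        elif ch.isspace():
            i += 1
        elif ch == ";":
            # Comment: skip up to (not including) the end of the line.
            while i < n and text[i] != "\n":
                i += 1
        elif ch in "()":
            tokens.append(ch)
            i += 1
        elif ch == '"':
            start_line = line
            buf, closed, multiline = [], False, False
            i += 1  # skip the opening quote
            while i < n:
                c = text[i]
                if c == '"':
                    closed = True
                    i += 1
                    break
                if c == "\\" and i + 1 < n:
                    # Assumption: backslash makes the next char literal.
                    i += 1
                    c = text[i]
                if c == "\n":
                    line += 1
                    multiline = True
                buf.append(c)
                i += 1
            tokens.append("".join(buf))  # assumption: quotes are stripped
            if multiline:
                warnings.append(
                    f"line {start_line}: string spans a line boundary")
            if not closed:
                warnings.append(
                    f"line {start_line}: string still open at end of file")
        else:
            # Bare token: runs until whitespace, a parenthesis, or ';'.
            start = i
            while i < n and not text[i].isspace() and text[i] not in "();":
                i += 1
            tok = text[start:i]
            if '"' in tok:
                # The token did not start with a quote, so this one is
                # necessarily in mid-token.
                warnings.append(f"line {line}: quote inside token {tok!r}")
            tokens.append(tok)
    return tokens, warnings


if __name__ == "__main__":
    toks, warns = tokenize('(a b (c d) "quoted string"\nlong-token)))')
    print("; ".join(toks))
    for w in warns:
        print("warning:", w)
```

It walks the input one character at a time and never looks at
parenthesis nesting, which matches the requirement that this layer
only splits the input into tokens.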
Assume for the sake of making things interesting that
Parse::RecDescent isn't available. How would you solve this problem
Perlishly? Efficiency counts, but clarity and elegance count for more.
Charlton