Approach to tokenizing

Charlton Wilbur · Aug 29, 2003

All,

I'm working on something that requires a data format that is
relatively easy to tokenize and parse. I've solved it in C, but I'm
interested in how people with a richer toolset would approach it. In
short, it's based on LISP S-expressions, with a rough spec like this:

* ( and ) are tokens.
* Whitespace is not significant.
* Any other sequence of non-space characters (parentheses excluded)
is a token.
* ; introduces a comment that lasts to the end of the line.
* The exception to all of this is double-quoted character strings,
each of which is a single token from opening quote to closing quote.
Quotes that need to be in the string can be escaped with a backslash.
* If a double-quoted string extends beyond a line boundary, or is open
at the end of the file, a warning should be issued.
* If a quote is found in the middle of a non-quoted token, a warning
should be issued.

So, as an example, an input file like this:

(a b (c d) "quoted string"
non-quoted-long-token)))

would produce the following list of tokens (separated by semicolons):

(; a; b; (; c; d; ); quoted string; non-quoted-long-token; ); ); )

And an input file like this:

embedded"quote ("with \"quotes\"") "this
spans two lines"

would produce the following list of tokens:

embedded"quote; (; with "quotes"; ); "this
spans two lines"

and issue warnings for the quote in mid-token and for the newline
inside the quoted string.

Note that matching parentheses is *not* required; that's handled in
another layer. All this layer needs to do is split the input file
into tokens.
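
To make the rules concrete, here is a minimal sketch of a hand-written
scanner in Python. It is neither the C version mentioned above nor a
Perl answer, just one possible reading of the spec. The function name
tokenize, the wording of the warnings, and the line-number prefix are
all made up for illustration. It also assumes a few things the spec
leaves open: a ; anywhere outside a string starts a comment, a
backslash inside a string keeps the next character literally, and the
surrounding quotes are dropped from every string token, including one
that spans a line.

```python
def tokenize(text):
    """Split text into tokens; return (tokens, warnings)."""
    tokens, warnings = [], []

    def warn(pos, message):
        # Line numbers are computed lazily, since warnings should be rare.
        line = text.count("\n", 0, pos) + 1
        warnings.append(f"line {line}: {message}")

    i, n = 0, len(text)
    while i < n:
        ch = text[i]
        if ch.isspace():
            i += 1                      # whitespace is not significant
        elif ch == ";":
            while i < n and text[i] != "\n":
                i += 1                  # comment runs to end of line
        elif ch in "()":
            tokens.append(ch)           # each parenthesis is its own token
            i += 1
        elif ch == '"':
            start = i
            i += 1
            chars, closed = [], False
            while i < n:
                c = text[i]
                if c == "\\" and i + 1 < n:
                    chars.append(text[i + 1])   # assumed: \x means literal x
                    i += 2
                elif c == '"':
                    closed = True
                    i += 1
                    break
                else:
                    chars.append(c)
                    i += 1
            body = "".join(chars)
            if not closed:
                warn(start, "string still open at end of input")
            elif "\n" in body:
                warn(start, "string spans a line boundary")
            tokens.append(body)
        else:
            start = i
            # A bare token ends at whitespace, a parenthesis, or a comment.
            while i < n and not text[i].isspace() and text[i] not in "();":
                i += 1
            word = text[start:i]
            if '"' in word:
                warn(start, "quote in the middle of an unquoted token")
            tokens.append(word)
    return tokens, warnings


if __name__ == "__main__":
    sample = '(a b (c d) "quoted string"\nnon-quoted-long-token)))'
    tokens, warnings = tokenize(sample)
    print("; ".join(tokens))
    # (; a; b; (; c; d; ); quoted string; non-quoted-long-token; ); ); )
```

Run on the second example, this sketch returns five tokens and two
warnings, both reported against line 1, since that is where the
offending token and the offending string each begin.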

Assume for the sake of making things interesting that
Parse::RecDescent isn't available. How would you solve this problem
Perlishly? Efficiency counts, but clarity and elegance count for more.

Charlton
