Approach to tokenizing

Discussion in 'Perl Misc' started by Charlton Wilbur, Aug 29, 2003.

  1. All,

    I'm working on something that requires a data format that is
    relatively easy to tokenize and parse. I've solved it in C, but I'm interested
    in how people with a richer toolset would approach it. In short, it's
    based on LISP S-expressions, with a rough spec like this:

    * ( and ) are tokens.
    * Whitespace is not significant.
    * Any other sequence of non-space, non-parenthesis characters is a token.
    * ; introduces a comment that lasts to the end of the line.
    * The exception to all of this is double-quoted character strings,
    which are an entire token from opening quote to closing quote.
    Quotes that need to be in the string can be backslashed.
    * If a double-quoted string extends beyond a line boundary, or is open
    at the end of the file, a warning should be issued.
    * If a quote is found in the middle of a non-quoted token, a warning
    should be issued.

    So, as an example, an input file like this:

    (a b (c d) "quoted string"
    non-quoted-long-token)))

    would produce the following list of tokens (separated by semicolons):

    (; a; b; (; c; d; ); quoted string; non-quoted-long-token; ); ); )

    And an input file like this:

    embedded"quote ("with \"quotes\"") "this
    spans two lines"

    would produce the following list of tokens:

    embedded"quote; (; with "quotes"; ); this
    spans two lines

    and issue two warnings: one for the quote in mid-token and one for
    the newline inside the quoted string.

    Note that matching parentheses is *not* required; that's handled in
    another layer. All this layer needs to do is split the input file
    into tokens.

    Assume for the sake of making things interesting that
    Parse::RecDescent isn't available. How would you solve this problem
    Perlishly? Efficiency counts, but clarity and elegance count for more.

    Charlton


    --
    cwilbur at chromatico dot net
    cwilbur at mac dot com
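
One possible way to approach the spec above in plain Perl, without Parse::RecDescent, is a single scanning loop built on \G and the /gc modifiers, so that each failed match leaves the position untouched and the next pattern gets its turn. This is only a sketch under some assumptions that are not in the original post: the sub name tokenize, the wording of the warnings, and the return shape (two array refs) are all made up for illustration, and the sketch assumes that ; ends a bare token and that backslash escapes any following character inside a quoted string.

```perl
use strict;
use warnings;

# Hypothetical helper: takes the whole input as one string and returns
# (\@tokens, \@warnings). Quoted tokens are returned without their outer
# quotes and with backslash escapes removed, matching the examples above.
sub tokenize {
    my ($text) = @_;
    my (@tokens, @warnings);
    my $line = 1;

    for ($text) {    # alias $_ so the bare /\G.../gc matches apply to $text
        while (1) {
            # Whitespace separates tokens; count newlines for warnings.
            if (/\G(\s+)/gc) { $line += ($1 =~ tr/\n//); next }

            # A comment runs from ; to the end of the line.
            if (/\G;[^\n]*/gc) { next }

            last if /\G\z/gc;    # end of input

            # Each parenthesis is a token on its own.
            if (/\G([()])/gc) { push @tokens, $1; next }

            # Quoted string: body is any run of non-quote, non-backslash
            # characters or backslash-escaped characters. The closing
            # quote is optional so an unterminated string still matches.
            if (/\G"((?:[^"\\]|\\.)*)("?)/gcs) {
                my ($body, $closed) = ($1, $2 ne '');
                push @warnings, "line $line: quoted string spans a line boundary"
                    if $body =~ /\n/;
                push @warnings, "line $line: quoted string is not closed"
                    unless $closed;
                $line += ($body =~ tr/\n//);
                $body =~ s/\\(.)/$1/gs;    # \" becomes ", \\ becomes \
                push @tokens, $body;
                next;
            }

            # Anything else is a bare token (assumption: it also stops at ;).
            /\G([^\s();]+)/gc or die "tokenizer stuck at line $line";
            my $tok = $1;
            push @warnings, "line $line: quote inside unquoted token"
                if $tok =~ /"/;
            push @tokens, $tok;
        }
    }
    return (\@tokens, \@warnings);
}

# Usage on the first example input from the post.
my ($demo_tokens, $demo_warnings) =
    tokenize(qq{(a b (c d) "quoted string"\nnon-quoted-long-token)))});
print join('; ', @$demo_tokens), "\n";
print "warning: $_\n" for @$demo_warnings;
```

Collecting warnings in a list instead of calling warn directly is a design choice made here so the caller can decide how to report them; swapping in warn would be a one-line change. Parenthesis matching is deliberately left out, since the post assigns that to another layer.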
