Very simple parser... not for me

Discussion in 'C Programming' started by pozz, Jan 27, 2014.

  1. pozz

    BartC Guest

    I'm thinking about a special case where character constants are used to
    specify each end of the range. If that's going to cause too many problems,
    then alternate range operators will do, even if something like
    letterrange('A','Z'); anything is better than having to write out 26 cases.
    Or perhaps allowing case "ABCDEF":
    Sure, but how many EBCDIC machines are around? I don't think I've ever come
    across any EBCDIC data (not since around 1977 or so). And in the 30+ years
    I've been programming, I've used ASCII. I would have been put out, if I'd be
    using C all that time, to have been denied the use of ranges as I've
    described, even when no character codes were involved.

    And then what do you do with
    In this non-special case (where the author most probably has ASCII, ANSI,
    UTF-8 or the bottom end of Unicode in mind), the range would be 65 to 127.
    But how would this be expressed now in C?
     
    BartC, Jan 29, 2014
    #21
    1. Advertisements

  2. pozz

    James Kuyper Guest


    if(nextchar >= '0' && nextchar <= '9')
    {
    }
    else switch(nextchar)
    {
    }

    It's a little more clumsy than specifying a range inside the switch, but
    it has the advantage of being fully supported by the current version of
    C, and it's a bit simpler than your current approach.
     
    James Kuyper, Jan 29, 2014
    #22
    1. Advertisements

  3. pozz

    pozzugno Guest

    Is there something tricky I've not noticed? It seems to me
    Yes, you're right, it's simple. I was thinking to change the state
    for each incoming character.

    Anyway this is a ad-hoc implementation. If I change the sequence of
    characters/fields admitted, the finite state machine implementation
    could change dramatically.

    I think it's better a "token approach", a function that parses the
    string and generates a list of tokens #, *, <c> and <g> with the
    associated values (for <c> and <g>). Depending on the state, I can
    process different sequences of tokens and discard others as invalid.
     
    pozzugno, Jan 29, 2014
    #23
  4. pozz

    Eric Sosman Guest

    You could do that, too, if it seems easier. There'd be
    more states, but they'd be simpler -- for example, "each state
    consumes exactly one character" might be a helpful assertion.
    Well, yeah. If you switch to accepting JSON strings the
    parser will have to change, too. What are the chances?
    One thing that makes your parser particularly easy is the
    fact that the grammar is so simple that it never requires any
    look-ahead or backtracking: Given any (valid) prefix, you need
    only examine the immediately following token to determine what
    to do next. That eases the <c> and <g> recognition, because you
    always know which one you're looking for. That is, you never
    need to look at a '2' and wonder whether it's a <g> or just the
    start of a <c>; you always know whether a <c> or <g> is due,
    and the question becomes simply "Is this the Right Stuff?" In
    parsing lingo your grammar can be recognized by an LR(0) parser
    rather than a more complicated LR(1) or LR(k). (For completeness:
    LR(k) is not the only family of parsing algorithms, just a very
    popular type.)
     
    Eric Sosman, Jan 29, 2014
    #24
  5. But it's wrong 100% of the time if your program runs on a system where
    'A'..'Z' are not contiguous.

    Correctness is not entirely about probabilities.
     
    Keith Thompson, Jan 29, 2014
    #25
  6. Or

    if (isdigit((unsigned char)nextchar)
    {
    /* ... */
    }
    else {
    switch (nextchar) {
    /* ... */
    }
    }

    (Yes, the need for the cast is annoying.)
     
    Keith Thompson, Jan 29, 2014
    #26
  7. pozz

    BartC Guest

    [reposted; may be duplicate]
    From: "Keith Thompson" <>
    Sent: Wednesday, January 29, 2014 4:52 PM
    Newsgroups: comp.lang.c
    Subject: Re: Very simple parser... not for me
    But it is worth worrying about to the extent that hundreds of thousands of
    programmers are forced to write things like:

    case 1: case 2: case 3: .... case 77:

    for decade after decade, just because someone might one day compile the
    source code on some IBM machine or whatever obscure hardware still uses
    EBCDIC? And this is an example that is unrelated to character set encoding
    so it wouldn't have made any difference.

    Besides I suggested a few ways of making 'A'...'Z' work even with EBCDIC.
     
    BartC, Jan 29, 2014
    #27
  8. pozz

    pozz Guest

    Il 29/01/2014 16:20, Eric Sosman ha scritto:
    You are right for the combinations I listed in my original post. What I
    was trying to do is to implement a more general parser that could be
    modified in a simple way if I (or someone) decides to change the valid
    combinations of tokens (I'm in a developing stage and I don't know
    exactly what will be the valid combinations, they may change during
    development of the product and this could depend on the customer too...
    of course the keypad has only 0-9, # and * keys, so I will never change
    the parser to understand JSON strings or similar complex grammars).

    For example, what you wrote is right for the following combinations:
    #
    *#
    *<g>#
    <c>*<g>#
    <c>#
    <c>*#
    What happens if the combinations would be the following:
    #
    *#
    <g>*#
    <c>*<g>#
    <c>#
    <c>*#
    Now there are two combinations where <g> and <c> could appear at the
    beginning of the string, so I have to look ahead to understand if I'm
    reading a <g> field (just one char '1', '2' or '3') or a <c> field (a
    numerical value of 4- to 6-digits).
     
    pozz, Jan 29, 2014
    #28
  9. pozz

    BartC Guest

    You can try a token-based parser, so, you'd have the symbols: c, g, star,
    hash and perhaps end-of-sequence and error. Or, more generally, digits, star
    and hash, where you have to analyze digits to find if they are a g or c
    field.

    Introducing a one-character look-ahead might be simpler if the format isn't
    going to get much more complicated than this.

    However we don't know that; could g become long enough to get confused with
    c? Could there be more different kinds of fields? Could repeated instances
    of * and # be meaningful? Could you have one group of digits followed
    immediately by another (delimited by a certain length perhaps), or could
    there be consecutive g's (in which case "1331" might be one c, or four g's)?
     
    BartC, Jan 29, 2014
    #29
  10. Well, that isn't so much of a problem, but 'J' might be more than
    one more than 'I', and, even more, '}' might be between 'I' and 'J'.

    -- glen
     
    glen herrmannsfeldt, Jan 30, 2014
    #30
  11. In practice, yes. In principle, though, the C standard provides no
    guarantees about the values of the 26 uppercase letters (beyond the
    guarantees it provides for all printable characters in the basic
    character set).

    In the real world, the vast majority of computer systems (of those that
    use character sets at all) use character sets based on ASCII, in which
    each of the ranges '0'..'9', 'A'..'Z', 'a'..'z' is contiguous.

    The vast majority of non-ASCII systems use some form of EBCDCIC, which
    has gaps between 'i' and 'j', between 'r' and 's', between 'I' and 'J',
    and between 'R' and 'S'.

    I've never heard of a C implementation using any other character set.

    The C standard probably could have gotten away with, for example,
    mandating that 'a'..'f' and 'A'..'F' are contiguous, or that the letters
    are in order even if they're not contiguous ('a' < 'z', etc.). At the
    time the standard was being written, it was less clear than it is now
    that ASCII-based systems were going to take over the world, with EBCDIC
    as the only significant competitor. (And if I had a time machine, I
    might go back and encourage IBM to adopt ASCII, which was developed at
    about the same time as EBCDIC.)

    But there's another point to consider. If you write something like

    if (c >= 'a' && c <= 'z')

    even if it never runs on an EBCDIC-based system, you're still ignoring
    everything other than the 26 letters of the Latin alphabet. (I count
    7689 occurrences of "LETTER" in a list of Unicode characters.)
     
    Keith Thompson, Jan 30, 2014
    #31
  12. (snip)
    For S/360, IBM tried to get ASCII-8, which isn't just regular
    ASCII (I think sometimes called ASCII-7) with an extra bit, but
    with half the table moved over. But it never happened.
    Well, in Java you can actually use those letters, for example, as
    variable names. You need a UTF8 or Unicode editor, though.

    I did try it once just to see it. There is a free, downloadable
    for Windows Unicode editor that will write files in a form that
    javac can read.

    And, as with C, there are methods to indicate which characters
    are letter, digit, etc. (There are more digit, too.)

    -- glen
     
    glen herrmannsfeldt, Jan 30, 2014
    #32
  13. pozz

    Eric Sosman Guest

    Only you can guess how complex the grammar might someday
    become, and whether it might grow gnarly enough to justify pulling
    out all the machinery of lex and yacc or their brethren. But for
    the teeny-tiny grammar at hand, and for the teeny-tiny repertoire
    of symbols and roles, my counsel would be "YAGNI." Or "KISS."
    Yeah. Note also that if you're getting the input one
    button-press at a time, the revised grammar makes it impossible
    to act upon a leading <g> until the next button is pressed,[*]
    no matter what kind of parser you use. So: If "button-by-button
    response" is an important feature (you haven't described the
    application) you simply *will not* make this modification. And
    if you *do* make it, you still have a simple grammar for which
    plugging in a new parser will be equally simple. (Look a bit at
    the parser I showed you; observe that it's not "random logic" but
    a systematic enumeration of prefixes. You can learn to do this
    yourself with little effort.)

    [*] There's an old story about a fellow who wanted to attend
    a lecture by an eminent speaker. The lecture, though, was to be
    delivered in German, which the listener didn't know. But he was
    so keen to attend that he arranged for a simultaneous translation:
    He hired a translator, bought him a second lecture ticket, and
    anticipated hearing the Great Man's German while the English
    equivalent was whispered in his ear. So the guy and his translator
    take their seats, the speaker takes the stage, and the lecture
    begins -- and the translator is silent. The speaker goes on and
    on and on, and the translator says nothing. "Psst, hey!" says the
    increasingly agitated attendee. "What's he saying?" "Patience,
    Sir," says the translator, "I am waiting for the verb."
     
    Eric Sosman, Jan 30, 2014
    #33
  14. (snip on ASCII and IBM)
    Yes, the seem to call it ASCII-8.

    But it isn't close enough if you are copying files back and forth.
    There would have been fewer problems with translation tables, though.

    Among others, EBCDIC has a logical NOT sign, used in PL/I for
    the logical and relational operators. (I think some other languages,
    such as REXX, also use it.) That complicates the translation
    tables for ordinary text files.

    -- glen
     
    glen herrmannsfeldt, Jan 30, 2014
    #34
  15. pozz

    Kaz Kylheku Guest

    If you use something like

    isalpha(ch);

    you're also ignoring everything other than the 26 letters of the Latin
    alphabet, unless you called setlocale somewhere.
     
    Kaz Kylheku, Jan 31, 2014
    #35
  16. Not necessarily ignoring. Treating as non-alphabetical.
    Sometimes that's right, sometimes it's wrong, quite often it's not
    well-defined, people often specify "alphanumerical characters" without
    being clear about which alphabets are allowed, or whether decorated
    Latin characters are the same as or different to their base letters.
    If you work in an English-only environment, it's easy to shunt that sort
    of issue into a siding, because you don't know or understand what non-English
    speakers might want, and you're not sure what language these putative
    non-Anglophones will speak.
     
    Malcolm McLean, Feb 1, 2014
    #36
  17. pozz

    Tim Rentsch Guest

    One approach:

    1. Devise a data structure that can represent all the
    different patterns you have to recognize;

    2. Write a function that can scan a string and see if
    it matches a particular pattern;

    3. Make an instance of the pattern structure for each
    of your allowed patterns;

    4. Call the scanning function repeatedly, for each pattern,
    until a match is found.

    For example, we might represent a pattern as a sequence (that is,
    an array) of "fields". A field is a set of allowed characters
    and a minimum and maximum number of characters allowed in the
    field:

    typedef struct {
    const char *allowed_set;
    unsigned char m, n;
    } Field;

    The final Field of a pattern has 'allowed_set' == 0.

    Now we can make an array to hold pointers to the first
    element of each legal pattern:

    static Field *patterns[] = {
    (Field[]){ {"*", 1,1 }, {"123", 1,1 }, {"#", 1,1 }, 0},
    (Field[]){ {"*", 1,1 }, {"#", 1,1 }, 0},
    (Field[]){ {"#", 1,1 }, 0},
    (Field[]){ {"0123456789", 4,6}, {"*", 1,1}, {"123", 1,1}, {"#", 1,1}, 0},
    (Field[]){ {"0123456789", 4,6}, {"*", 1,1}, {"#", 1,1}, 0},
    (Field[]){ {"0123456789", 4,6}, {"#", 1,1}, 0},
    };

    A function to scan a string and see if it matches a given pattern
    can be written very concisely by using tail recursion:

    int
    matches_string( Field f[], const char *s ){
    if( ! f->allowed_set ) return *s == 0;
    size_t k = strspn( s, f->allowed_set );
    return k < f->m || k > f->n ? 0 : matches_string( f+1, s+k );
    }

    (Incidentally I am using the standard library function strspn()
    here, but if it isn't available it should be easy enough to write
    a replacement - certainly no more than 10 lines of code.)

    Putting all this together, here is a program that looks at
    each of its arguments and reports which pattern is matched:

    #include <string.h>
    #include <stdio.h>

    typedef struct {
    const char *allowed_set;
    unsigned char m, n;
    } Field;

    static Field *patterns[] = {
    (Field[]){ {"*", 1,1 }, {"123", 1,1 }, {"#", 1,1 }, 0},
    (Field[]){ {"*", 1,1 }, {"#", 1,1 }, 0},
    (Field[]){ {"#", 1,1 }, 0},
    (Field[]){ {"0123456789", 4,6}, {"*", 1,1}, {"123", 1,1}, {"#", 1,1}, 0},
    (Field[]){ {"0123456789", 4,6}, {"*", 1,1}, {"#", 1,1}, 0},
    (Field[]){ {"0123456789", 4,6}, {"#", 1,1}, 0},
    };

    int which_pattern_matches( const char * );
    int matches_string( Field f[], const char *s );

    int
    main( int argc, char *argv[] ){
    int i;
    for( i = 1; i < argc; i++ ){
    int k = which_pattern_matches( argv );
    printf( "Argument %15s matches pattern %3d\n", argv, k );
    }
    return 0;
    }

    int
    which_pattern_matches( const char *string ){
    for( unsigned n = 0; n < sizeof patterns / sizeof patterns[0]; n++ ){
    if( matches_string( patterns[n], string ) ) return n;
    }
    return -1;
    }

    int
    matches_string( Field f[], const char *s ){
    if( ! f->allowed_set ) return *s == 0;
    size_t k = strspn( s, f->allowed_set );
    return k < f->m || k > f->n ? 0 : matches_string( f+1, s+k );
    }

    A sample compile/run command:

    gcc -std=c99 -pedantic example.c && a.out '#' '*#' '*1#' '01234*2#' 'foo'

    By the way, it might be useful for the scanning routine to
    remember the start of each field as it goes. That can be
    done by revising matches_string to take an extra parameter,
    the address of a buffer into which to record the field start
    locations:

    int
    matches_string( Field f[], const char *s, const char *r[] ){
    ... similar to above, with one extra line ...
    }
     
    Tim Rentsch, Feb 7, 2014
    #37
    1. Advertisements

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments (here). After that, you can post your question and our members will help you out.