Very simple parser... not for me

Discussion in 'C Programming' started by pozz, Jan 27, 2014.

  1. pozz

    pozz Guest

    I have a string that is generated from a numerical keypad with decimal
    numbers 0-9 and the two special characters * and #, similar to a phone
    keypad.

    I want to parse the string and make something or something else
    depending on the current state and the string itself. Generally the
    string can be:
    #
    *#
    *<g>#
    <c>*<g>#
    <c>#
    <c>*#
    where
    <c> is a 4- to 6-digits code
    <g> is the exact digit '1', '2' or '3'.

    Depending on the current state, some combinations are valid, others not.
    In some cases the code should be valid (already saved in the system),
    in other cases it could be new (when a new code must be entered to
    change the current one).

    I think I have solved the problem trying to split the string in to three
    fields: <c>, '*' and <g>. If the first char is a digit, it must be a 4-
    to 6- code, otherwise it must be a '*' or a '#', and so on...

    Even if my code works, I don't think this is an elegant solution. What
    happens if in the feature I decide to change one combination? For
    example, from "#" to "###"? Or from "<c>*#" to "*<c>#"?
    I have to recode everything, because my assumption the string can be
    split in three fields in that order is not true anymore.

    What is the best approach to implement a simple parser like this?
     
    pozz, Jan 27, 2014
    #1
    1. Advertisements

  2. Go to my website
    http://www.malcolmmclean.site11.com/www

    and check the MiniBasic pages to understand parsing in general, or buy a copy
    of the book. It's very cheap and money well spent.

    If you can guarantee that you always have to deal with special characters,
    4-6 digit cdes, or single digits, the main step is in the tokeniser.
    A token is # * <c> or <g>. You have to do a bit of lookahead to distinguish <c>
    from <g>, but that's easy enough to implement.
    So the tokenizer returns 1 of 6 states (the four token, error, and end of
    input). It also keeps track of the length of the token.
    You implement two functions in the toeknize, gettoken(), which returns the
    current token, and match(). Match() moves on the tokenizer to the next token,
    but only if passed the correct token type. Otherwise it flags an error.

    A tokenizer makes the parser very easy to write.
    Get the first token. # * and <c> are legal, anything else is a parse error.
    switch on the result, match the token you read, and read the next token.
    Some are legal, some are not. *<g>* is legal, but *<g>$ (where $ = end of
    input) is not. You can handle this easily by calling match('*'), without even
    reading it.

    Errors are sticky. So you pass an "error" structure to every function. If
    an error is set it doesn't report more errors and, generally, tries to
    return you as quickly as possible, but without funny jumps or other abnormal
    flow control. A good plan is to say if(error_reported) return '$' in the
    tokenizer.
     
    Malcolm McLean, Jan 27, 2014
    #2
    1. Advertisements

  3. It's not clear if these constraints need to be part of the parse. I'd
    argue that you should exclude such considerations from the parsing. If
    you don't, it can get very much more complex.

    You might be in danger of over-engineering this. The more future
    changes you try to incorporate, the less you'll ever do now! However,
    the pattern you've described is a classic application of regular
    expressions. There are many RE matching libraries available, and it's
    not too hard to write one if you can't use any of these due to space or
    other restrictions. But consider that you might have a suitable
    solution already doing it ad hoc.
     
    Ben Bacarisse, Jan 27, 2014
    #3
  4. pozz

    pozz Guest

    Il 27/01/2014 23:33, Ben Bacarisse ha scritto:
    Ok, exclude these constraints from the parsing process.

    Simple regular expressions, yes, you're right. My valid combinations of
    characters 0-9, *, # can be described as regular expressions.

    I'm using a very small microcontroller, how to code a simple regular
    expression matching algorithm?
     
    pozz, Jan 27, 2014
    #4
  5. pozz

    BartC Guest

    I don't know about regular expressions, but I can parse the above using
    random logic, and it's less than 100 lines. My test result in these numeric
    values:

    c: -1 if not present, or 0 to 999999 (or a bit more actually as I don't
    check the length. Is c a numeric value or a string consisting of digits? Is
    0000 allowed? If that is distinct from 00000, then you don't want c as a
    number, but you need a code then that tells you when c is not present)

    g: 0 if not present, or 1 to 3

    star: 0 if not present, or 1

    hash: always 1

    (I can't post the C at the minute because it's code-generated and I haven't
    time to tidy it up. It's not that great either.. But I don't think this task
    will present problems with a microcontroller.)
     
    BartC, Jan 27, 2014
    #5
  6. pozz

    Kaz Kylheku Guest

    What you can do is write the regexes out and possibly even debug them using
    some scripting language (that doesn't run on the embedded system, but on your
    development box).

    Then whiteboard the automata corresponding to the regexes and convert into a
    table-driven machine by hand.

    For a simple input language, I would skip the regexes and non-deterministic
    automata. Rather, I would start by diagramming a deterministic automaton right
    from the beginning, from which the state transition table pops out directly.

    In a deterministic finite automaton (DFA), there are no empty transitions
    (transitions from one state to another without consuming an input token), and
    no ambiguous transitions (transitions for the same input token, into different
    target states).

    If you can directly model your machine's actions as a DFA, things are easy.
     
    Kaz Kylheku, Jan 27, 2014
    #6
  7. Yes. In fact I said "You might be in danger of over-engineering this"
    before offering a suggestion, but the OP specifically asked for an
    alternative to the "obvious" ad-hoc parse.

    <snip>
     
    Ben Bacarisse, Jan 28, 2014
    #7
  8. pozz

    BartC Guest

    As I said, I don't know regular expressions, or whether they can do much
    beyond recognising a particular pattern, such as extracting the actual data
    to operate on (which I assumed can be represented by three values (C, G and
    'STAR').)

    A dedicated parser can take care of that easily. If the format changes, then
    yes some code changes are needed, but in the recursive descent approach I
    started to use (as I'm familiar with that; also not quite as random as I
    said), that's straightforward, unless the format will change every five
    minutes. I think also that other constraints could be applied more easily
    this way.

    (But we don't have all the info; maybe the details of each field in the
    string
    are less important than verifying or classifying the format.)
     
    BartC, Jan 28, 2014
    #8
  9. pozz

    pozz Guest

    Il 28/01/2014 00:42, BartC ha scritto:
    I'm sorry for the trivial question, but... what is "random logic"?

    Yes, the code should be treated as a string because the code "0000" is
    completely different than "000000".

    Is it code generated? Could you explain how, please?
     
    pozz, Jan 28, 2014
    #9
  10. pozz

    pozz Guest

    Il 28/01/2014 00:56, Kaz Kylheku ha scritto:
    How do you make this?

    I tried to represent my simple language in a deterministic finite state
    machine, but it's a too much complex task for me. I arrived to have too
    many states and automaton wasn't readable.
     
    pozz, Jan 28, 2014
    #10
  11. pozz

    Kaz Kylheku Guest

    "Random" is sometimes used in place of "ad hoc".
     
    Kaz Kylheku, Jan 28, 2014
    #11
  12. pozz

    Eric Sosman Guest

    Is there something tricky I've not noticed? It seems to me
    that a recognizer for these inputs would have rather few states,
    if we permit the recognition of <c> and <g> as "primitives." For
    reference, here's your grammar again:

    #
    *#
    *<g>#
    <c>*<g>#
    <c>#
    <c>*#
    where
    <c> is a 4- to 6-digits code
    <g> is the exact digit '1', '2' or '3'.

    .... and a recognizer might go something like this:

    State 0: // starting out
    if input is #, recognize "#" and stop
    if input is *, move ahead and enter State 1
    if input is <c>, move ahead and enter State 3
    else syntax error

    State 1: // saw "*", what follows?
    if input is #, recognize "*#" and stop
    if input is <g>, move ahead and enter State 2
    else syntax error

    State 2: // saw "*<g>", what follows?
    if input is #, recognize "*<g>#" and stop
    else syntax error

    State 3: // saw "<c>", what follows?
    if input is *, move ahead and enter State 4
    if input is #, recognize "<c>#" and stop
    else syntax error

    State 4: // saw "<c>*", what follows?
    if input is <g>, move ahead and enter State 5
    if input is #, recognize "<c>*#" and stop
    else syntax error

    State 5: // saw "<c>*<g>", what follows
    if input is #, recognize "<c>*<g>#" and stop
    else syntax error

    The "stop" action might (or might not) want to check that the
    input has been completely consumed; it depends on what else
    you're trying to do.
     
    Eric Sosman, Jan 28, 2014
    #12
  13. pozz

    Jorgen Grahn Guest

    Yes. And it doesn't strike me as something which will change very
    often! If this is something users will have to input, they need to
    learn the syntax; then they will also need to relearn it if it's
    changed. That in itself is a barrier to change.

    /Jorgen
     
    Jorgen Grahn, Jan 28, 2014
    #13
  14. pozz

    Jorgen Grahn Guest

    .
    Some of them can, but when you have verified that input indeed matches
    a certain pattern, it's IMO often easier to pick out the pieces you
    want using strchr() and strtoul() or whatever. Life is easy when
    there's no error handling to be done.

    /Jorgen
     
    Jorgen Grahn, Jan 28, 2014
    #14
  15. pozz

    BartC Guest

    Not a good term, perhaps 'dedicated' logic, ie. program code, is better. Or
    'ad hoc'...
    That's an unimportant detail; I was coding in a somewhat different language
    to C, which is translated to C using a tool. Anyway I've shown that C, a
    little tidied up but still odd in places, below. (This is rather more than
    100 lines, the original was a bit less.)

    The parsing is done in 'parseinput()'. Sample testing and display of results
    in main(). The actual code is unimportant, it's just to show what it might
    look like. Quite a bit depends on how the input is obtained, and what needs
    to be done with the output.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    static void resetinput(char *s);
    static void nextinput(void);
    static int readg(void);
    static int readc(char *c);
    static void error(char *m);
    static int parseinput(char *s,char *c,int *g,int *star);

    static char *inputptr; /* the parser uses these file-scope state variables
    *
    static char nextchar;

    static void resetinput(char *s) { /* initialise the parser */
    inputptr = s;
    }

    static void nextinput(void) { /* obtain next input character */
    if (*inputptr==0) {
    nextchar = 0;
    } else {
    nextchar = *inputptr++;
    }
    }

    static int readg(void) {
    int g=nextchar-'0';
    nextinput();
    return g;
    }

    static int readc(char *c) {
    int n=0;
    do {
    c[n++] = nextchar;
    if ((n>6)) {
    error("C too long");
    return 0;
    }
    nextinput();
    } while (!((nextchar<'0'||nextchar>'9')));
    if (n<4) {
    error("C too short");
    return 0;
    }
    c[n] = 0;
    return 1;
    }

    static void error(char *m) {
    printf("Error: %s",m);
    puts("");
    }

    static int parseinput(char *s,char *c,int *g,int *star) {
    resetinput(s);
    *c = 0;
    *g = 0;
    *star = 0;

    nextinput();

    switch (nextchar) {
    case '#':
    break;
    case '*':
    *star = 1;
    nextinput();
    if (nextchar>='1'&&nextchar<='3') {
    *g = readg();
    }
    if (nextchar!=35) {
    error("# expected");
    return 0;
    }
    break;
    case 48: case 49: case 50: case 51: case 52: case 53: case 54: case 55:
    case 56: case 57: // '0' to '9'
    if (!readc(c)) {
    return 0;
    }
    if (nextchar=='*') {
    *star = 1;
    nextinput();
    }
    if (nextchar>='1'&&nextchar<='3') {
    *g = readg();
    }
    if (nextchar!='#') {
    error("# expected");
    return 0;
    }
    break;
    default:
    error("Bad start symbol");
    return 0;
    }
    return 1;
    }

    int main(void) {
    char c[20];
    int g;
    int star;
    char input[]="9999*3#";
    // char input[]="*2#";

    printf("Parsing: %s",input);
    puts("");
    if (parseinput(input,c,&g,&star)) {
    printf("C =<%s> G=%d Star=%d",c,g,star);
    puts("");
    } else {
    printf("Couldn't parse");
    puts("");
    }
    }
     
    BartC, Jan 28, 2014
    #15
  16. (snip)
    As far as I know, the term is used to describe the way computers
    were built before microprogramming. If you look inside a computer
    (maybe in the 1950's) there were wires going all different directions,
    looking somewhat random.

    Microprogramming replaces random logic with much simpler logic
    and puts all the "randomness" into the microprogram.

    -- glen
     
    glen herrmannsfeldt, Jan 28, 2014
    #16
  17. pozz

    Ike Naar Guest

    Why not

    case '0': case '1': case '2': case '3': case '4':
    case '5': case '6': case '7': case '8': case '9':

    ?
     
    Ike Naar, Jan 29, 2014
    #17
  18. pozz

    BartC Guest

    (I mentioned this was code-generated, the original was '0' .. '9', which is
    expanded for the C version. Unfortunately the char constants are just mapped
    to integers so are difficult to restore. And it was a lot of fiddly typing
    to fix them all for this simple demo.

    Perhaps if C allowed such a range in switch cases, then I would have had
    less of an excuse, but it's been discussed here a few times and there are
    always arguments against it, mainly that it encourages people to type 'A' ..
    'Z', even though 99.99% of the time (or 100% of the time for 99.99% of
    programs), that would work as expected.)
     
    BartC, Jan 29, 2014
    #18
  19. pozz

    Paul N Guest

    Not that it helps in this case (where we're using a switch), but the C standard does guarantee that '0', '1', etc, up to '9', are consecutive numbers. Whereas there is no guarantee that, say, 'B' is one more than 'A'.
     
    Paul N, Jan 29, 2014
    #19
  20. pozz

    BartC Guest

    Perhaps then a compiler can issue a warning when a range involving
    alphabetic characters is used (and which can be turned off). Because a lot
    of the time the compiler and/or the programmer will know when these
    sequences are going to be consecutive.

    And even when there is uncertainty, it wouldn't be out of the question for a
    compiler to convert a range such as 'A' to 'F', to the explicit sequence
    'A', 'B', 'C', 'D', 'E', 'F'.

    (I'm talking about an extension where switch cases can specify a range.)
     
    BartC, Jan 29, 2014
    #20
    1. Advertisements

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments (here). After that, you can post your question and our members will help you out.