A very simple parser with scanf & C

Discussion in 'C Programming' started by pozz, Sep 23, 2011.

  1. pozz

    pozz Guest

    I want to write a very simple (at least, I thought it was very simple)
    parser of a string. This string has the following format:
    - a string ID that matches one of a known list (no whitespaces
    before), case insensitive
    - one or more whitespaces (spaces, tabs, newlines...)
    - the word "door" (without quotes), case insensitive
    - an optional number of whitespaces, even zero
    - a number in the range 0..10
    - no other characters after the number
    Of course, I want to write a good parser, so avoiding any possible
    memory leak, buffer overrun, segmentation fault...

    What do you think about the following code? Is there a better way? Did
    I forget something? Is it portable (I don't know about strcasecmp()
    and strncasecmp())?

    #include <stdio.h>
    #include <string.h>
    #include <stdlib.h>
    #include <ctype.h>

    const char DOOR[] = "DOOR";

    const char *ids[] = {
    "OPEN",
    "CLOSE",
    "LOCK"
    };
    #define ID_NUM (sizeof(ids) / sizeof(ids[0]))

    const int DOOR_MIN = 0;
    const int DOOR_MAX = 10;

    struct {
    char *s;
    int idx; /* -1 if the parsing shouldn't pass */
    int num;
    } test[] = {
    { "Open Door 1", 0, 1 },
    { "close DOor 3", 1, 3 },
    { "LoCk \t doOR7", 2, 7 },
    { " Open Door 1", -1 }, /* spaces at the beginning */
    { "close window 1", -1 }, /* window instead of door */
    { "lockk door 1", -1 }, /* lockk instead of lock */
    { "open doorr 1", -1 }, /* doorr instead of door */
    { "lock door 34", -1 }, /* number outside the range 1..10 */
    { "open door main", -1 }, /* number absent */
    { "lock door", -1 }, /* number absent */
    { "close door 1a", -1 } /* the string continues after the
    number */
    };
    #define TEST_NUM (sizeof(test) / sizeof(test[0]))

    int
    parse(const char *s, int *id_idx, int *num)
    {
    int i;

    if (!isspace(*s)) {
    *id_idx = -1;
    for (i = 0; i < ID_NUM; i++) {
    if (!strncasecmp(ids, s, strlen(ids))) {
    *id_idx = i;
    break;
    }
    }
    if (*id_idx >= 0) {
    s += strlen(ids[*id_idx]);
    while (isspace(*s)) {
    s++;
    }
    if (!strncasecmp(DOOR, s, strlen(DOOR))) {
    char *ss;
    s += strlen(DOOR);
    *num = strtol(s, &ss, 10);
    if ((ss != s) && (*ss == '\0') && (*num >= DOOR_MIN) && (*num <=
    DOOR_MAX)) {
    return 1;
    }
    }
    }
    }
    return -1;
    }


    int
    main(void)
    {
    int i;

    for (i = 0; i < TEST_NUM; i++) {
    int ret;
    int num;
    int id_idx;

    printf("Test %2d ", i);
    ret = parse(test.s, &id_idx, &num);
    if (ret < 0) {
    if (test.idx == -1) {
    printf("ok\n");
    } else {
    printf("ERR\n");
    }
    } else {
    if ((test.idx == id_idx) && (test.num == num)) {
    printf("ok\n");
    } else {
    printf("ERR\n");
    }
    }
    }
    return 0;
    }
     
    pozz, Sep 23, 2011
    #1
    1. Advertising

  2. pozz

    Nobody Guest

    On Fri, 23 Sep 2011 07:03:06 -0700, pozz wrote:

    > I want to write a very simple (at least, I thought it was very simple)
    > parser of a string. This string has the following format:


    > What do you think about the following code? Is there a better way?


    I would recommend learning how "real" parsers (e.g. those built
    using lex and yacc) work. Even though it's not really worthwhile using
    such tools for such a simple grammar, familiarity with state machines
    would probably be useful.

    If you ever need to parse a non-trivial grammar, you should definitely
    learn how real parsers work before constructing a "Rube Goldberg" parser
    out of dozens of "if"s and strcmp()s and even (and I've seen this in
    production code) fseek()s.

    > Did I forget something? Is it portable (I don't know about strcasecmp()
    > and strncasecmp())?


    str[n]casecmp() aren't part of the C standard; they're POSIX-specific.
    Windows has _stricmp() and _strnicmp().
     
    Nobody, Sep 23, 2011
    #2
    1. Advertising

  3. pozz <> writes:

    > I want to write a very simple (at least, I thought it was very simple)
    > parser of a string. This string has the following format:
    > - a string ID that matches one of a known list (no whitespaces
    > before), case insensitive
    > - one or more whitespaces (spaces, tabs, newlines...)
    > - the word "door" (without quotes), case insensitive
    > - an optional number of whitespaces, even zero
    > - a number in the range 0..10
    > - no other characters after the number
    > Of course, I want to write a good parser, so avoiding any possible
    > memory leak, buffer overrun, segmentation fault...
    >
    > What do you think about the following code? Is there a better way? Did
    > I forget something? Is it portable (I don't know about strcasecmp()
    > and strncasecmp())?


    They are not standard C. You either have to (a) write your own, (b)
    modify the string to make is one case or the other, or (c) accept that
    your code will work only on POSIX systems.

    > #include <stdio.h>
    > #include <string.h>
    > #include <stdlib.h>
    > #include <ctype.h>
    >
    > const char DOOR[] = "DOOR";
    >
    > const char *ids[] = {
    > "OPEN",
    > "CLOSE",
    > "LOCK"
    > };
    > #define ID_NUM (sizeof(ids) / sizeof(ids[0]))
    >
    > const int DOOR_MIN = 0;
    > const int DOOR_MAX = 10;


    <snip test cases>

    > int
    > parse(const char *s, int *id_idx, int *num)
    > {
    > int i;
    >
    > if (!isspace(*s)) {


    I don't think you need this test. Any string that starts with a space
    will fail the following test.

    > *id_idx = -1;
    > for (i = 0; i < ID_NUM; i++) {
    > if (!strncasecmp(ids, s, strlen(ids))) {
    > *id_idx = i;
    > break;
    > }
    > }
    > if (*id_idx >= 0) {
    > s += strlen(ids[*id_idx]);
    > while (isspace(*s)) {
    > s++;
    > }
    > if (!strncasecmp(DOOR, s, strlen(DOOR))) {
    > char *ss;
    > s += strlen(DOOR);
    > *num = strtol(s, &ss, 10);
    > if ((ss != s) && (*ss == '\0') && (*num >= DOOR_MIN) && (*num <=
    > DOOR_MAX)) {


    The only trouble I can see is that strtoul accepts +/- and you don't
    permit that in your description.

    > return 1;
    > }
    > }
    > }
    > }
    > return -1;
    > }

    <snip driver>

    --
    Ben.
     
    Ben Bacarisse, Sep 23, 2011
    #3
  4. pozz

    BartC Guest

    "pozz" <> wrote in message
    news:...

    > What do you think about the following code? Is there a better way?


    It's fine for a first attempt, other than it doesn't seem to work! All your
    test cases came out as 'OK'.

    > Did
    > I forget something? Is it portable (I don't know about strcasecmp()
    > and strncasecmp())?


    They didn't work on my main compiler, I had to use gcc.

    > if (!strncasecmp(ids, s, strlen(ids))) {


    You're testing a string in situ, against a constant string of certain
    length.

    This will match a string which is longer, but has the same common characters
    (although the next stage may or may not pick up the extra characters).

    If the string in the source is much shorter, it will be comparing unknown
    data (for example off the end of the source string).

    The usual way is to read and extract a token string from the source, into
    it's own string, and work with that. If case insensitive, this step will
    convert to upper and lower case, depending on how your expected keywords are
    stored. (And usually there will be a better lookup method than a linear
    search, although that is not critical at this stage, or if the list will
    stay small.)

    --
    Bartc
     
    BartC, Sep 23, 2011
    #4
  5. "BartC" <> writes:
    <snip>
    >> if (!strncasecmp(ids, s, strlen(ids))) {

    >
    > You're testing a string in situ, against a constant string of certain
    > length.
    >
    > This will match a string which is longer, but has the same common
    > characters (although the next stage may or may not pick up the extra
    > characters).


    There's no problem here that I can see. The spec (grammar if you like)
    says that some specific characters are required -- looking to see if
    they are there is exactly what you must do. The next stage *must* check
    the rest of the string.

    > If the string in the source is much shorter, it will be comparing
    > unknown data (for example off the end of the source string).


    No. Everything is a string so if either string "runs out" before the
    other, the compare stops on a mismatch. The size kicks in only when
    neither string runs out. It's not a request to always compare that
    number of bytes.

    <snip>
    --
    Ben.
     
    Ben Bacarisse, Sep 23, 2011
    #5
  6. pozz

    BartC Guest

    "pozz" <> wrote in message
    news:j5itcg$8tf$...
    > Il 23/09/2011 22:24, BartC ha scritto:


    >> It's fine for a first attempt, other than it doesn't seem to work! All
    >> your test cases came out as 'OK'.

    >
    > Indeed all the tests should pass for correct and malformed strings, if the
    > algorithm correctly works. As you can see, in the test array I set idx=-1
    > for malformed strings. If the parse() function correctly detects the
    > malformed string, the test is classified with "OK".


    I get how it works now. But I used a slightly different main() function so I
    could see what was happening; see below. (I also put in some spacing;
    indentation seems to have got lost in your post.)

    Now, the only error I see is that it doesn't detect a missing space between
    the ID and the rest (your spec suggested there should be at least one
    space).

    int
    main(void)
    {
    int i;

    for (i = 0; i < TEST_NUM; i++) {
    int ret;
    int num;
    int id_idx;

    printf("Test %2d <%s>\t", i,test.s);
    ret = parse(test.s, &id_idx, &num);
    if (ret < 0) {
    if (test.idx == -1) {
    printf("OK: failed as expected\n");
    }
    else {
    printf("ERR\n");
    }
    }
    else {
    if ((test.idx == id_idx) && (test.num == num)) {
    printf("OK: passed as expected\n");
    }
    else {
    printf("ERR\n");
    }
    }
    }
    return 0;
    }

    --
    Bartc
     
    BartC, Sep 23, 2011
    #6
  7. pozz

    BartC Guest

    "Ben Bacarisse" <> wrote in message
    news:...
    > "BartC" <> writes:


    >> This will match a string which is longer, but has the same common
    >> characters (although the next stage may or may not pick up the extra
    >> characters).

    >
    > There's no problem here that I can see. The spec (grammar if you like)
    > says that some specific characters are required -- looking to see if
    > they are there is exactly what you must do. The next stage *must* check
    > the rest of the string.
    >
    >> If the string in the source is much shorter, it will be comparing
    >> unknown data (for example off the end of the source string).

    >
    > No. Everything is a string so if either string "runs out" before the
    > other, the compare stops on a mismatch. The size kicks in only when
    > neither string runs out. It's not a request to always compare that
    > number of bytes.


    OK, that's fine then, if zero-terminators are taken account of.

    But I would still be uneasy using a compare like this on a test string that
    isn't properly delimited and that might, effectively, contain garbage near
    the end.

    For example, looking for "ABC" in the incorrect test string "ABCDEF" would
    pass; but DEF might just match what's expected next! Also both "ABC" and
    "ABCDEF" might be valid keywords, and the wrong one will be picked. Probably
    the error will be picked up eventually, but it's no longer clear where it
    occurred.

    --
    Bartc
     
    BartC, Sep 24, 2011
    #7
  8. pozz <> writes:

    > Il 23/09/2011 19:29, Ben Bacarisse ha scritto:
    >>> int
    >>> parse(const char *s, int *id_idx, int *num)
    >>> {
    >>> int i;
    >>>
    >>> if (!isspace(*s)) {

    >>
    >> I don't think you need this test. Any string that starts with a space
    >> will fail the following test.

    >
    > Oh yes, you are right.
    >
    >
    >>> *id_idx = -1;
    >>> for (i = 0; i< ID_NUM; i++) {
    >>> if (!strncasecmp(ids, s, strlen(ids))) {
    >>> *id_idx = i;
    >>> break;
    >>> }
    >>> }
    >>> if (*id_idx>= 0) {
    >>> s += strlen(ids[*id_idx]);
    >>> while (isspace(*s)) {
    >>> s++;
    >>> }
    >>> if (!strncasecmp(DOOR, s, strlen(DOOR))) {
    >>> char *ss;
    >>> s += strlen(DOOR);
    >>> *num = strtol(s,&ss, 10);
    >>> if ((ss != s)&& (*ss == '\0')&& (*num>= DOOR_MIN)&& (*num<=
    >>> DOOR_MAX)) {

    >>
    >> The only trouble I can see is that strtoul accepts +/- and you don't
    >> permit that in your description.

    >
    > If the number starts with minus sign, it will not pass the "range
    > test", because DOOR_MIN is 0.


    -0 is 0 and 0 passes the range test.

    > Moreover, I could admit positive
    > numbers in the range DOOR_MIN...DOOR_MAX starting with a plus sign.
    >
    > However, how could I avoid to accept numbers starting with a sign
    > character?


    You'd end up scanning the string prefix yourself, I think: skip space
    then and then ensure that there was no following + or -. But, as you
    say, why not why not just accept +1 -0 and so on?

    > At first, I started writing the parser with sscanf(), but I found it
    > is tricky to make it works well. "%s" format converter is dangerous,
    > because it has no limits. Is there a simpler way to write the parser
    > using sscanf()?


    I don't think so.

    --
    Ben.
     
    Ben Bacarisse, Sep 24, 2011
    #8
  9. "BartC" <> writes:

    > "Ben Bacarisse" <> wrote in message
    > news:...
    >> "BartC" <> writes:

    >
    >>> This will match a string which is longer, but has the same common
    >>> characters (although the next stage may or may not pick up the extra
    >>> characters).

    >>
    >> There's no problem here that I can see. The spec (grammar if you like)
    >> says that some specific characters are required -- looking to see if
    >> they are there is exactly what you must do. The next stage *must* check
    >> the rest of the string.
    >>
    >>> If the string in the source is much shorter, it will be comparing
    >>> unknown data (for example off the end of the source string).

    >>
    >> No. Everything is a string so if either string "runs out" before the
    >> other, the compare stops on a mismatch. The size kicks in only when
    >> neither string runs out. It's not a request to always compare that
    >> number of bytes.

    >
    > OK, that's fine then, if zero-terminators are taken account of.
    >
    > But I would still be uneasy using a compare like this on a test string
    > that isn't properly delimited and that might, effectively, contain
    > garbage near the end.
    >
    > For example, looking for "ABC" in the incorrect test string "ABCDEF"
    > would pass; but DEF might just match what's expected next!


    Then why should the string not match?

    > Also both
    > "ABC" and "ABCDEF" might be valid keywords, and the wrong one will be
    > picked. Probably the error will be picked up eventually, but it's no
    > longer clear where it occurred.


    I am not sure I follow. I think you are saying that you should not
    match "ABC" followed by "DEF" when the longer string "ABCDEF" should be
    a single (preferred) match. That's like saying "you need to code the
    matching correctly". Sure, don't accept "ABC" early if you might need
    to accept something longer. That does not seem to be the fault of using
    a length-limited compare -- it's the fault of getting the algorithm
    wrong. I'd be happy to match both using a length-limited compare, I'd
    just make sure to match from longest token to shortest.

    --
    Ben.
     
    Ben Bacarisse, Sep 24, 2011
    #9
  10. pozz

    BartC Guest

    "Ben Bacarisse" <> wrote in message
    news:...
    > "BartC" <> writes:


    >> Also both
    >> "ABC" and "ABCDEF" might be valid keywords, and the wrong one will be
    >> picked. Probably the error will be picked up eventually, but it's no
    >> longer clear where it occurred.

    >
    > I am not sure I follow. I think you are saying that you should not
    > match "ABC" followed by "DEF" when the longer string "ABCDEF" should be
    > a single (preferred) match. That's like saying "you need to code the
    > matching correctly". Sure, don't accept "ABC" early if you might need
    > to accept something longer. That does not seem to be the fault of using
    > a length-limited compare -- it's the fault of getting the algorithm
    > wrong. I'd be happy to match both using a length-limited compare, I'd
    > just make sure to match from longest token to shortest.


    Clearly this can be made to work, *if* you're aware of the implications and
    can work around any potential issues.

    But as I said, I'd be a little uneasy using it, if the test string is not
    well defined. Because you're trying to match, say, "ABC" again, with a test
    string that might be A*****... or one that might be ABC*****..., where *
    represents any possible character, and which might not be intended to form
    part of your test token.

    I would rather find the extent of the test token, then you know the lengths
    of both strings being compared. Now, if the lengths are unequal, you can
    move on to the next test. And if they the strings are in fact identical, you
    know for sure that first token is correct.

    (Of course some programming languages (famously Fortran and Basic) had
    keywords that could run on to a following identifier without whitespace, and
    they somehow worked. But I would have done it differently...)

    --
    Bartc
     
    BartC, Sep 24, 2011
    #10
  11. "BartC" <> writes:

    > "Ben Bacarisse" <> wrote in message
    > news:...
    >> "BartC" <> writes:

    >
    >>> Also both
    >>> "ABC" and "ABCDEF" might be valid keywords, and the wrong one will be
    >>> picked. Probably the error will be picked up eventually, but it's no
    >>> longer clear where it occurred.

    >>
    >> I am not sure I follow. I think you are saying that you should not
    >> match "ABC" followed by "DEF" when the longer string "ABCDEF" should be
    >> a single (preferred) match. That's like saying "you need to code the
    >> matching correctly". Sure, don't accept "ABC" early if you might need
    >> to accept something longer. That does not seem to be the fault of using
    >> a length-limited compare -- it's the fault of getting the algorithm
    >> wrong. I'd be happy to match both using a length-limited compare, I'd
    >> just make sure to match from longest token to shortest.

    >
    > Clearly this can be made to work, *if* you're aware of the implications and
    > can work around any potential issues.
    >
    > But as I said, I'd be a little uneasy using it, if the test string is not
    > well defined. Because you're trying to match, say, "ABC" again, with a test
    > string that might be A*****... or one that might be ABC*****..., where *
    > represents any possible character, and which might not be intended to form
    > part of your test token.


    I just don't follow. If you are looking for ABC and you find ABC you
    have found what you are looking for. You seem to be saying that the BC
    might be there somehow "by accident" but that can't be what you mean.
    If you can't rely on the string contents being "intended", how can you
    parse it using any method?. Maybe you could give a concrete example --
    a simple set of tokens and some example strings which justifies your
    unease.

    > I would rather find the extent of the test token, then you know the lengths
    > of both strings being compared. Now, if the lengths are unequal, you can
    > move on to the next test. And if they the strings are in fact identical, you
    > know for sure that first token is correct.


    How does this help with characters that are there but not "intended to
    form part of" the token? If you are looking for a C operator and > is
    followed by =, say, you have to take it. You can't "find the extent" of
    the token without looking at the characters that might make it up.
    "1>=+2" is no different to "1ABC+2". If BC are there, and ABC is a
    valid token, you've found one (always presuming >=+ and ABC+ are not
    also, longer, valid tokens).

    <snip>
    --
    Ben.
     
    Ben Bacarisse, Sep 24, 2011
    #11
  12. pozz

    BartC Guest

    "Ben Bacarisse" <> wrote in message
    news:...
    > "BartC" <> writes:


    >> But as I said, I'd be a little uneasy using it, if the test string is not
    >> well defined. Because you're trying to match, say, "ABC" again, with a
    >> test
    >> string that might be A*****... or one that might be ABC*****..., where *
    >> represents any possible character, and which might not be intended to
    >> form
    >> part of your test token.

    >


    > Maybe you could give a concrete example --
    > a simple set of tokens and some example strings which justifies your
    > unease.


    Token: "proc" (define a procedure).

    Input: processno=(a+b) (an assignment)

    The parser would match the "proc" in "processno" and tell me it's a
    procedure definition, or the start of one:

    'proc essno=(a+b)'

    If the input is *supposed* to be an assignment, I'd be rather cross with my
    parser if it told me the next token was a 'proc' keyword! It would likely
    generate a syntax error at some point, but that's not right because there's
    nothing wrong with the input.

    It depends also on how it deals with expected whitespace or separator after
    the token; the parser could check for this and only recognise a 'proc' token
    if that separator was present. But this is little different from my
    suggestion to find the extent of the token in the first place.

    The very fact that you have to think about this stuff at all, suggests that
    there is a better way! Which I consider to be simpler and more efficient
    than blindly testing every possible length of input sequence, for example is
    there really much point in trying to see if "unsigned" matches
    "C->sd=AE..."? Instead you get the token "C", and while you still have to do
    a lookup, you only need an actual string compare against single-letter
    keywords, if there are any..




    Bartc
     
    BartC, Sep 24, 2011
    #12
  13. pozz

    BartC Guest

    "BartC" <> wrote in message news:j5ke3c$dp5$...
    > "Ben Bacarisse" <> wrote in message


    >> Maybe you could give a concrete example --


    >for example is
    > there really much point in trying to see if "unsigned" matches
    > "C->sd=AE..."? Instead you get the token "C", and while you still have to
    > do
    > a lookup, you only need an actual string compare against single-letter
    > keywords, if there are any..


    Which brings up another point (before I clicked the wrong button..); exactly
    how *do* you lookup a string S in a table, if you don't even know it's
    length?

    Some lookup methods will work (the linear one the OP used, for a start), and
    perhaps some hairy ones which navigate a tree by indexing by each letter in
    turn or something (and can still turn up false positives). But the hash
    table method I normally use would have difficulty.

    --
    Bartc
     
    BartC, Sep 24, 2011
    #13
  14. "BartC" <> writes:

    > "Ben Bacarisse" <> wrote in message
    > news:...
    >> "BartC" <> writes:

    >
    >>> But as I said, I'd be a little uneasy using it, if the test string is not
    >>> well defined. Because you're trying to match, say, "ABC" again, with a
    >>> test
    >>> string that might be A*****... or one that might be ABC*****..., where *
    >>> represents any possible character, and which might not be intended to
    >>> form
    >>> part of your test token.

    >>

    >
    >> Maybe you could give a concrete example --
    >> a simple set of tokens and some example strings which justifies your
    >> unease.

    >
    > Token: "proc" (define a procedure).
    >
    > Input: processno=(a+b) (an assignment)
    >
    > The parser would match the "proc" in "processno" and tell me it's a
    > procedure definition, or the start of one:
    >
    > 'proc essno=(a+b)'


    And it would be right to do so. You haven't given the full example
    whereas the OP did. The full example includes other tokens which can be
    longer that "proc" and can have proc as a prefix. If the OP had
    specified that arbitrary identifiers have to be matched as well, and
    that some of them may start with "DOOR", do you think I've had said his
    parser was OK? Of course not, but he didn't. The specification was
    admirably clear which made me not at all uneasy about using a
    length-limited compare.

    Of course, you may be right. Maybe this the start of larger project
    that will one day include arbitrary identifiers, or keywords like "door"
    and "doors", when the OP will need some full-blown lexer and parser.
    But maybe not; sometimes the simple solution is the right one.

    > If the input is *supposed* to be an assignment, I'd be rather cross with my
    > parser if it told me the next token was a 'proc' keyword! It would likely
    > generate a syntax error at some point, but that's not right because there's
    > nothing wrong with the input.
    >
    > It depends also on how it deals with expected whitespace or separator after
    > the token; the parser could check for this and only recognise a 'proc' token
    > if that separator was present. But this is little different from my
    > suggestion to find the extent of the token in the first place.
    >
    > The very fact that you have to think about this stuff at all, suggests that
    > there is a better way! Which I consider to be simpler and more efficient
    > than blindly testing every possible length of input sequence,


    There possibly is. Please code it up and we can compare. The OP's code
    was about 16 lines, not counting }s and the unnecessary if. Also I
    think it was probably reasonably efficient, though that's hard to
    measure.

    > for example is
    > there really much point in trying to see if "unsigned" matches
    > "C->sd=AE..."? Instead you get the token "C", and while you still have to do
    > a lookup, you only need an actual string compare against single-letter
    > keywords, if there are any..


    We may be talking at cross purposes. You are saying, I think, that
    there is a better way to write a full-blown lexer and parser. I was
    curious about your unease at using a specific kind of test in a specific
    case where I felt no such unease. Maybe what I should have got from
    your comment is that you think that this is one of the cases that is
    just bound to get more complex, and therefore OP should be thinking in
    more general terms right form the start. You might be right about that,
    and if I'd got that that was your point, I may not have commented at
    all.

    --
    Ben.
     
    Ben Bacarisse, Sep 24, 2011
    #14
  15. "BartC" <> writes:

    > "BartC" <> wrote in message news:j5ke3c$dp5$...
    >> "Ben Bacarisse" <> wrote in message

    >
    >>> Maybe you could give a concrete example --

    >
    >>for example is
    >> there really much point in trying to see if "unsigned" matches
    >> "C->sd=AE..."? Instead you get the token "C", and while you still
    >> have to do
    >> a lookup, you only need an actual string compare against single-letter
    >> keywords, if there are any..

    >
    > Which brings up another point (before I clicked the wrong button..);
    > exactly how *do* you lookup a string S in a table, if you don't even
    > know it's length?
    >
    > Some lookup methods will work (the linear one the OP used, for a
    > start), and perhaps some hairy ones which navigate a tree by indexing
    > by each letter in turn or something (and can still turn up false
    > positives). But the hash table method I normally use would have
    > difficulty.


    You answered your own question allbeit with a delightfully biased
    description of a trie ("hairy" and prone to "false positives").

    There is something a little circular about your desire to delimit the
    tokens before trying to match them. This is usually the correct way to
    do it because most programming languages have been designed with this
    sort of lexical analysis in mind. But if the keywords include 2d, 3d
    (and 47d, etc), and things like +more and <p> then it may be hard to
    delimit possible candidate strings before matching them.

    --
    Ben.
     
    Ben Bacarisse, Sep 24, 2011
    #15
  16. pozz

    BartC Guest

    "Ben Bacarisse" <> wrote in message
    news:...
    > "BartC" <> writes:



    > We may be talking at cross purposes. You are saying, I think, that
    > there is a better way to write a full-blown lexer and parser. I was
    > curious about your unease at using a specific kind of test in a specific
    > case where I felt no such unease. Maybe what I should have got from
    > your comment is that you think that this is one of the cases that is
    > just bound to get more complex, and therefore OP should be thinking in
    > more general terms right form the start. You might be right about that,
    > and if I'd got that that was your point, I may not have commented at
    > all.


    My original reply just pointed out some possible issues with the method that
    was used. And suggested that it could also be done another way. I don't
    think it was too critical.

    I made no comment about the overall approach (the parsing bit rather than
    the lexing).

    > You answered your own question allbeit with a delightfully biased
    > description of a trie ("hairy" and prone to "false positives").


    I'm not sure I was was biased. I think that trie thing is hairier than some
    other methods. And the false positives will apply to any lookup method when
    the word you're looking up doesn't have a well-defined length.

    --
    Bartc
     
    BartC, Sep 24, 2011
    #16
  17. "BartC" <> writes:

    > "Ben Bacarisse" <> wrote in message
    > news:...

    <snip>
    >> You answered your own question allbeit with a delightfully biased
    >> description of a trie ("hairy" and prone to "false positives").

    >
    > I'm not sure I was was biased. I think that trie thing is hairier than
    > some other methods. And the false positives will apply to any lookup
    > method when the word you're looking up doesn't have a well-defined
    > length.


    We've done this to death. I know that was what you meant by false
    positives, but it's a deliberately polemical way of putting it! If
    matching "abc", right here, right now (no matter what follows) is
    correct then it's not a false positive -- it's a true positive. If you
    can't match "abc", right here, right now, you have a bug. Your code is
    wrong. Calling it a "false positive" (rather like your original word
    "uneasy") suggest some sort of statistical correctness or, worse, some
    sort of uncertainty about is right at this point of the match. The same
    sense of uncertainty is carried by the notion of not having "a
    well-defined length". The word we are looking up (all of them in the
    case of a trie) has a perfectly well defined length and so does every
    part of the sub-string we may be matching it against.

    I doubt we disagree on any technical matter in the debate anymore. It's
    mostly down to writing style I suspect. At least I think it is. How
    would you write the OP matcher?

    --
    Ben.
     
    Ben Bacarisse, Sep 24, 2011
    #17
  18. pozz

    BartC Guest

    "Ben Bacarisse" <> wrote in message

    > How would you write the OP matcher?


    I posted the code below. I hope this does all the things I suggested!

    There is a function lex() which extracts the next token from the input in a
    general manner (and reports, to some extent, what type of token it is).

    And there is a function parsetest(), which takes a given string, and checks
    it corresponds to the OP's syntax spec.

    Here, the details of the character-level handling, and the syntax parsing,
    are largely separate.

    (I haven't bothered with a formal table lookup; lex() uses file globals for
    simplicity; and overflow of the lextoken[] array isn't checked. Some checks,
    eg. lex()!='A', could probably be dispensed with. And there's supposed to be
    something funny about using atoi(), but I can't remember what..)

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <ctype.h>

    /* globals set by lex() */
    int lextype; /* One of 'A' 'N' 'P' 'L' 'E' '?' means Alpha, Number,
    Punctuation, EOL, End of string, Error */
    int lexleadsp; /* Set to 1 when leading whitespace seen (requirement
    of OP) */
    char lextoken[1000]; /* Contains token for 'A', 'N' and 'P' lexical types */
    char *lexptr; /* Points to next character of input */

    int lex(void) {
    char c;
    int i;

    lextoken[0]=0;
    lexleadsp=0;

    while (*lexptr!='\n' && isspace(*lexptr)) {
    lexleadsp=1;
    ++lexptr;
    }

    c=*lexptr;
    i=0;

    if (c==0)
    lextype='E';
    else if (isalpha(c)) {
    do
    lextoken[i++]=toupper(*lexptr++);
    while (isalpha(*lexptr));
    lextype='A';
    }
    else if (isdigit(c)) {
    do
    lextoken[i++]=toupper(*lexptr++);
    while (isdigit(*lexptr));
    lextype='N';
    }
    else if (c=='\n') {
    lextype='L';
    ++lexptr;
    }
    else if (ispunct(c)) {
    lextoken[i++]=*lexptr++;
    lextype='P';
    }
    else {
    lextype='?';
    ++lexptr;
    }

    lextoken=0;
    return lextype;
    }

    /* Make use of lex() function to test an input string in this syntax:
    OPEN|CLOSE|LOCK DOOR [+]n where n must be 1 to 10, and without
    leading or trailing white space.
    Returns 1 (passed) or 0 (failed)
    */
    int parsetest(char *input) {
    int n,t;

    lexptr=input;

    if (lex()!='A' || lexleadsp) return 0;
    if (strcmp(lextoken,"OPEN")!=0 &&
    strcmp(lextoken,"CLOSE")!=0 &&
    strcmp(lextoken,"LOCK")!=0)
    return 0;

    if (lex()!='A' || strcmp(lextoken,"DOOR")!=0) return 0;

    t=lex();
    if (t=='P' && lextoken[0]=='+') t=lex();

    if (t!='N') return 0;
    n=atoi(lextoken);
    if (n<1 || n>10) return 0;

    switch (lex()) {
    case 'E': case 'L':
    if (lexleadsp) return 0;
    break;
    default:
    return 0;
    }
    return 1;
    }

    int main(void) {
    char* teststr = "Lock DOOR 10";
    int status;

    printf("Testing: \"%s\"\n",teststr);

    status=parsetest(teststr);

    printf("Result: %s\n",status?"Passed":"Failed");
    }

    --
    bartc
     
    BartC, Sep 25, 2011
    #18
  19. "BartC" <> writes:

    > "Ben Bacarisse" <> wrote in message
    >
    >> How would you write the OP matcher?

    >
    > I posted the code below. I hope this does all the things I suggested!
    >
    > There is a function lex() which extracts the next token from the input in a
    > general manner (and reports, to some extent, what type of token it is).
    >
    > And there is a function parsetest(), which takes a given string, and checks
    > it corresponds to the OP's syntax spec.
    >
    > Here, the details of the character-level handling, and the syntax parsing,
    > are largely separate.
    >
    > (I haven't bothered with a formal table lookup; lex() uses file
    > globals for simplicity; and overflow of the lextoken[] array isn't
    > checked. Some checks, eg. lex()!='A', could probably be dispensed
    > with.


    I think the extra globals are a problem. To avoid the length-limited
    compare, you have to store the tokens somewhere, and in C that means
    either a lot of ugliness (passed-in buffers, globals arrays, what have
    you) or a lot of machinery. Even taking the "no machinery" approach, 16
    lines with no globals has turned into about 50 with four of them.
    Upthread you said this approach was simpler and more efficient. OK, the
    "more efficient" probably referred to using hashing to check keywords,
    but then there would be a lot more then 50 lines. Is it really simpler?
    More extensible, yes, but not, I'd say, simpler.

    > And there's supposed to be
    > something funny about using atoi(), but I can't remember what..)


    It's undefined on integer overflow.

    <snip code>
    Thanks for posting that.

    --
    Ben.
     
    Ben Bacarisse, Sep 25, 2011
    #19
  20. pozz

    BartC Guest

    "Ben Bacarisse" <> wrote in message
    news:...
    > "BartC" <> writes:


    >> I posted the code below. I hope this does all the things I suggested!


    > Upthread you said this approach was simpler and more efficient. OK, the
    > "more efficient" probably referred to using hashing to check keywords,
    > but then there would be a lot more then 50 lines. Is it really simpler?
    > More extensible, yes, but not, I'd say, simpler.


    Simplicity is not just about line-count. It's also knowing exactly what is
    being compared and being confident about the results.

    Also, the actual syntax parsing *is* simpler, once the lexical stuff has
    been encapsulated.

    The efficiency will come from opportunities to do more streamlined lookups.
    As it is, with an extra step to tokenise input before doing anything with it
    (and using all those isxxx() functions), it's about half the speed of the
    OP's original.

    (Tokenising can also leave the token strings in-place, especially if the
    source string can be modified, but it's bit more awkward.)

    --
    Bartc
     
    BartC, Sep 25, 2011
    #20
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. Raymond Arthur St. Marie II of III

    very Very VERY dumb Question About The new Set( ) 's

    Raymond Arthur St. Marie II of III, Jul 23, 2003, in forum: Python
    Replies:
    4
    Views:
    499
    Raymond Hettinger
    Jul 27, 2003
  2. shanx__=|;-

    very very very long integer

    shanx__=|;-, Oct 16, 2004, in forum: C Programming
    Replies:
    19
    Views:
    1,679
    Merrill & Michele
    Oct 19, 2004
  3. =?ISO-8859-1?Q?Martin_J=F8rgensen?=

    scanf (yes/no) - doesn't work + deprecation errors scanf, fopen etc.

    =?ISO-8859-1?Q?Martin_J=F8rgensen?=, Feb 16, 2006, in forum: C Programming
    Replies:
    185
    Views:
    3,449
    those who know me have no need of my name
    Apr 3, 2006
  4. =?ISO-8859-1?Q?Martin_J=F8rgensen?=

    difference between scanf("%i") and scanf("%d") ??? perhaps bug inVS2005?

    =?ISO-8859-1?Q?Martin_J=F8rgensen?=, Apr 26, 2006, in forum: C Programming
    Replies:
    18
    Views:
    689
    Richard Bos
    May 2, 2006
  5. olivier.melcher

    Help running a very very very simple code

    olivier.melcher, May 12, 2008, in forum: Java
    Replies:
    8
    Views:
    2,329
Loading...

Share This Page