A very simple parser with scanf & C

P

pozz

I want to write a very simple (at least, I thought it was very simple)
parser of a string. This string has the following format:
- a string ID that matches one of a known list (no whitespaces
before), case insensitive
- one or more whitespaces (spaces, tabs, newlines...)
- the word "door" (without quotes), case insensitive
- an optional number of whitespaces, even zero
- a number in the range 0..10
- no other characters after the number
Of course, I want to write a good parser, so avoiding any possible
memory leak, buffer overrun, segmentation fault...

What do you think about the following code? Is there a better way? Did
I forget something? Is it portable (I don't know about strcasecmp()
and strncasecmp())?

#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <ctype.h>

const char DOOR[] = "DOOR";

const char *ids[] = {
"OPEN",
"CLOSE",
"LOCK"
};
#define ID_NUM (sizeof(ids) / sizeof(ids[0]))

const int DOOR_MIN = 0;
const int DOOR_MAX = 10;

struct {
char *s;
int idx; /* -1 if the parsing shouldn't pass */
int num;
} test[] = {
{ "Open Door 1", 0, 1 },
{ "close DOor 3", 1, 3 },
{ "LoCk \t doOR7", 2, 7 },
{ " Open Door 1", -1 }, /* spaces at the beginning */
{ "close window 1", -1 }, /* window instead of door */
{ "lockk door 1", -1 }, /* lockk instead of lock */
{ "open doorr 1", -1 }, /* doorr instead of door */
{ "lock door 34", -1 }, /* number outside the range 1..10 */
{ "open door main", -1 }, /* number absent */
{ "lock door", -1 }, /* number absent */
{ "close door 1a", -1 } /* the string continues after the
number */
};
#define TEST_NUM (sizeof(test) / sizeof(test[0]))

int
parse(const char *s, int *id_idx, int *num)
{
int i;

if (!isspace(*s)) {
*id_idx = -1;
for (i = 0; i < ID_NUM; i++) {
if (!strncasecmp(ids, s, strlen(ids))) {
*id_idx = i;
break;
}
}
if (*id_idx >= 0) {
s += strlen(ids[*id_idx]);
while (isspace(*s)) {
s++;
}
if (!strncasecmp(DOOR, s, strlen(DOOR))) {
char *ss;
s += strlen(DOOR);
*num = strtol(s, &ss, 10);
if ((ss != s) && (*ss == '\0') && (*num >= DOOR_MIN) && (*num <=
DOOR_MAX)) {
return 1;
}
}
}
}
return -1;
}


int
main(void)
{
int i;

for (i = 0; i < TEST_NUM; i++) {
int ret;
int num;
int id_idx;

printf("Test %2d ", i);
ret = parse(test.s, &id_idx, &num);
if (ret < 0) {
if (test.idx == -1) {
printf("ok\n");
} else {
printf("ERR\n");
}
} else {
if ((test.idx == id_idx) && (test.num == num)) {
printf("ok\n");
} else {
printf("ERR\n");
}
}
}
return 0;
}
 
N

Nobody

I want to write a very simple (at least, I thought it was very simple)
parser of a string. This string has the following format:
What do you think about the following code? Is there a better way?

I would recommend learning how "real" parsers (e.g. those built
using lex and yacc) work. Even though it's not really worthwhile using
such tools for such a simple grammar, familiarity with state machines
would probably be useful.

If you ever need to parse a non-trivial grammar, you should definitely
learn how real parsers work before constructing a "Rube Goldberg" parser
out of dozens of "if"s and strcmp()s and even (and I've seen this in
production code) fseek()s.
Did I forget something? Is it portable (I don't know about strcasecmp()
and strncasecmp())?

str[n]casecmp() aren't part of the C standard; they're POSIX-specific.
Windows has _stricmp() and _strnicmp().
 
B

Ben Bacarisse

pozz said:
I want to write a very simple (at least, I thought it was very simple)
parser of a string. This string has the following format:
- a string ID that matches one of a known list (no whitespaces
before), case insensitive
- one or more whitespaces (spaces, tabs, newlines...)
- the word "door" (without quotes), case insensitive
- an optional number of whitespaces, even zero
- a number in the range 0..10
- no other characters after the number
Of course, I want to write a good parser, so avoiding any possible
memory leak, buffer overrun, segmentation fault...

What do you think about the following code? Is there a better way? Did
I forget something? Is it portable (I don't know about strcasecmp()
and strncasecmp())?

They are not standard C. You either have to (a) write your own, (b)
modify the string to make is one case or the other, or (c) accept that
your code will work only on POSIX systems.
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <ctype.h>

const char DOOR[] = "DOOR";

const char *ids[] = {
"OPEN",
"CLOSE",
"LOCK"
};
#define ID_NUM (sizeof(ids) / sizeof(ids[0]))

const int DOOR_MIN = 0;
const int DOOR_MAX = 10;

int
parse(const char *s, int *id_idx, int *num)
{
int i;

if (!isspace(*s)) {

I don't think you need this test. Any string that starts with a space
will fail the following test.
*id_idx = -1;
for (i = 0; i < ID_NUM; i++) {
if (!strncasecmp(ids, s, strlen(ids))) {
*id_idx = i;
break;
}
}
if (*id_idx >= 0) {
s += strlen(ids[*id_idx]);
while (isspace(*s)) {
s++;
}
if (!strncasecmp(DOOR, s, strlen(DOOR))) {
char *ss;
s += strlen(DOOR);
*num = strtol(s, &ss, 10);
if ((ss != s) && (*ss == '\0') && (*num >= DOOR_MIN) && (*num <=
DOOR_MAX)) {


The only trouble I can see is that strtoul accepts +/- and you don't
permit that in your description.
return 1;
}
}
}
}
return -1;
}
<snip driver>
 
B

BartC

What do you think about the following code? Is there a better way?

It's fine for a first attempt, other than it doesn't seem to work! All your
test cases came out as 'OK'.
Did
I forget something? Is it portable (I don't know about strcasecmp()
and strncasecmp())?

They didn't work on my main compiler, I had to use gcc.
if (!strncasecmp(ids, s, strlen(ids))) {


You're testing a string in situ, against a constant string of certain
length.

This will match a string which is longer, but has the same common characters
(although the next stage may or may not pick up the extra characters).

If the string in the source is much shorter, it will be comparing unknown
data (for example off the end of the source string).

The usual way is to read and extract a token string from the source, into
it's own string, and work with that. If case insensitive, this step will
convert to upper and lower case, depending on how your expected keywords are
stored. (And usually there will be a better lookup method than a linear
search, although that is not critical at this stage, or if the list will
stay small.)
 
B

Ben Bacarisse

BartC said:
if (!strncasecmp(ids, s, strlen(ids))) {


You're testing a string in situ, against a constant string of certain
length.

This will match a string which is longer, but has the same common
characters (although the next stage may or may not pick up the extra
characters).


There's no problem here that I can see. The spec (grammar if you like)
says that some specific characters are required -- looking to see if
they are there is exactly what you must do. The next stage *must* check
the rest of the string.
If the string in the source is much shorter, it will be comparing
unknown data (for example off the end of the source string).

No. Everything is a string so if either string "runs out" before the
other, the compare stops on a mismatch. The size kicks in only when
neither string runs out. It's not a request to always compare that
number of bytes.

<snip>
 
B

BartC

pozz said:
Il 23/09/2011 22:24, BartC ha scritto:

Indeed all the tests should pass for correct and malformed strings, if the
algorithm correctly works. As you can see, in the test array I set idx=-1
for malformed strings. If the parse() function correctly detects the
malformed string, the test is classified with "OK".

I get how it works now. But I used a slightly different main() function so I
could see what was happening; see below. (I also put in some spacing;
indentation seems to have got lost in your post.)

Now, the only error I see is that it doesn't detect a missing space between
the ID and the rest (your spec suggested there should be at least one
space).

int
main(void)
{
int i;

for (i = 0; i < TEST_NUM; i++) {
int ret;
int num;
int id_idx;

printf("Test %2d <%s>\t", i,test.s);
ret = parse(test.s, &id_idx, &num);
if (ret < 0) {
if (test.idx == -1) {
printf("OK: failed as expected\n");
}
else {
printf("ERR\n");
}
}
else {
if ((test.idx == id_idx) && (test.num == num)) {
printf("OK: passed as expected\n");
}
else {
printf("ERR\n");
}
}
}
return 0;
}
 
B

BartC

Ben Bacarisse said:
There's no problem here that I can see. The spec (grammar if you like)
says that some specific characters are required -- looking to see if
they are there is exactly what you must do. The next stage *must* check
the rest of the string.


No. Everything is a string so if either string "runs out" before the
other, the compare stops on a mismatch. The size kicks in only when
neither string runs out. It's not a request to always compare that
number of bytes.

OK, that's fine then, if zero-terminators are taken account of.

But I would still be uneasy using a compare like this on a test string that
isn't properly delimited and that might, effectively, contain garbage near
the end.

For example, looking for "ABC" in the incorrect test string "ABCDEF" would
pass; but DEF might just match what's expected next! Also both "ABC" and
"ABCDEF" might be valid keywords, and the wrong one will be picked. Probably
the error will be picked up eventually, but it's no longer clear where it
occurred.
 
B

Ben Bacarisse

pozz said:
Il 23/09/2011 19:29, Ben Bacarisse ha scritto:
I don't think you need this test. Any string that starts with a space
will fail the following test.

Oh yes, you are right.

*id_idx = -1;
for (i = 0; i< ID_NUM; i++) {
if (!strncasecmp(ids, s, strlen(ids))) {
*id_idx = i;
break;
}
}
if (*id_idx>= 0) {
s += strlen(ids[*id_idx]);
while (isspace(*s)) {
s++;
}
if (!strncasecmp(DOOR, s, strlen(DOOR))) {
char *ss;
s += strlen(DOOR);
*num = strtol(s,&ss, 10);
if ((ss != s)&& (*ss == '\0')&& (*num>= DOOR_MIN)&& (*num<=
DOOR_MAX)) {


The only trouble I can see is that strtoul accepts +/- and you don't
permit that in your description.


If the number starts with minus sign, it will not pass the "range
test", because DOOR_MIN is 0.


-0 is 0 and 0 passes the range test.
Moreover, I could admit positive
numbers in the range DOOR_MIN...DOOR_MAX starting with a plus sign.

However, how could I avoid to accept numbers starting with a sign
character?

You'd end up scanning the string prefix yourself, I think: skip space
then and then ensure that there was no following + or -. But, as you
say, why not why not just accept +1 -0 and so on?
At first, I started writing the parser with sscanf(), but I found it
is tricky to make it works well. "%s" format converter is dangerous,
because it has no limits. Is there a simpler way to write the parser
using sscanf()?

I don't think so.
 
B

Ben Bacarisse

BartC said:
OK, that's fine then, if zero-terminators are taken account of.

But I would still be uneasy using a compare like this on a test string
that isn't properly delimited and that might, effectively, contain
garbage near the end.

For example, looking for "ABC" in the incorrect test string "ABCDEF"
would pass; but DEF might just match what's expected next!

Then why should the string not match?
Also both
"ABC" and "ABCDEF" might be valid keywords, and the wrong one will be
picked. Probably the error will be picked up eventually, but it's no
longer clear where it occurred.

I am not sure I follow. I think you are saying that you should not
match "ABC" followed by "DEF" when the longer string "ABCDEF" should be
a single (preferred) match. That's like saying "you need to code the
matching correctly". Sure, don't accept "ABC" early if you might need
to accept something longer. That does not seem to be the fault of using
a length-limited compare -- it's the fault of getting the algorithm
wrong. I'd be happy to match both using a length-limited compare, I'd
just make sure to match from longest token to shortest.
 
B

BartC

Ben Bacarisse said:
I am not sure I follow. I think you are saying that you should not
match "ABC" followed by "DEF" when the longer string "ABCDEF" should be
a single (preferred) match. That's like saying "you need to code the
matching correctly". Sure, don't accept "ABC" early if you might need
to accept something longer. That does not seem to be the fault of using
a length-limited compare -- it's the fault of getting the algorithm
wrong. I'd be happy to match both using a length-limited compare, I'd
just make sure to match from longest token to shortest.

Clearly this can be made to work, *if* you're aware of the implications and
can work around any potential issues.

But as I said, I'd be a little uneasy using it, if the test string is not
well defined. Because you're trying to match, say, "ABC" again, with a test
string that might be A*****... or one that might be ABC*****..., where *
represents any possible character, and which might not be intended to form
part of your test token.

I would rather find the extent of the test token, then you know the lengths
of both strings being compared. Now, if the lengths are unequal, you can
move on to the next test. And if they the strings are in fact identical, you
know for sure that first token is correct.

(Of course some programming languages (famously Fortran and Basic) had
keywords that could run on to a following identifier without whitespace, and
they somehow worked. But I would have done it differently...)
 
B

Ben Bacarisse

BartC said:
Clearly this can be made to work, *if* you're aware of the implications and
can work around any potential issues.

But as I said, I'd be a little uneasy using it, if the test string is not
well defined. Because you're trying to match, say, "ABC" again, with a test
string that might be A*****... or one that might be ABC*****..., where *
represents any possible character, and which might not be intended to form
part of your test token.

I just don't follow. If you are looking for ABC and you find ABC you
have found what you are looking for. You seem to be saying that the BC
might be there somehow "by accident" but that can't be what you mean.
If you can't rely on the string contents being "intended", how can you
parse it using any method?. Maybe you could give a concrete example --
a simple set of tokens and some example strings which justifies your
unease.
I would rather find the extent of the test token, then you know the lengths
of both strings being compared. Now, if the lengths are unequal, you can
move on to the next test. And if they the strings are in fact identical, you
know for sure that first token is correct.

How does this help with characters that are there but not "intended to
form part of" the token? If you are looking for a C operator and > is
followed by =, say, you have to take it. You can't "find the extent" of
the token without looking at the characters that might make it up.
"1>=+2" is no different to "1ABC+2". If BC are there, and ABC is a
valid token, you've found one (always presuming >=+ and ABC+ are not
also, longer, valid tokens).

<snip>
 
B

BartC

Ben Bacarisse said:
Maybe you could give a concrete example --
a simple set of tokens and some example strings which justifies your
unease.

Token: "proc" (define a procedure).

Input: processno=(a+b) (an assignment)

The parser would match the "proc" in "processno" and tell me it's a
procedure definition, or the start of one:

'proc essno=(a+b)'

If the input is *supposed* to be an assignment, I'd be rather cross with my
parser if it told me the next token was a 'proc' keyword! It would likely
generate a syntax error at some point, but that's not right because there's
nothing wrong with the input.

It depends also on how it deals with expected whitespace or separator after
the token; the parser could check for this and only recognise a 'proc' token
if that separator was present. But this is little different from my
suggestion to find the extent of the token in the first place.

The very fact that you have to think about this stuff at all, suggests that
there is a better way! Which I consider to be simpler and more efficient
than blindly testing every possible length of input sequence, for example is
there really much point in trying to see if "unsigned" matches
"C->sd=AE..."? Instead you get the token "C", and while you still have to do
a lookup, you only need an actual string compare against single-letter
keywords, if there are any..




Bartc
 
B

BartC

for example is
there really much point in trying to see if "unsigned" matches
"C->sd=AE..."? Instead you get the token "C", and while you still have to
do
a lookup, you only need an actual string compare against single-letter
keywords, if there are any..

Which brings up another point (before I clicked the wrong button..); exactly
how *do* you lookup a string S in a table, if you don't even know it's
length?

Some lookup methods will work (the linear one the OP used, for a start), and
perhaps some hairy ones which navigate a tree by indexing by each letter in
turn or something (and can still turn up false positives). But the hash
table method I normally use would have difficulty.
 
B

Ben Bacarisse

BartC said:
Token: "proc" (define a procedure).

Input: processno=(a+b) (an assignment)

The parser would match the "proc" in "processno" and tell me it's a
procedure definition, or the start of one:

'proc essno=(a+b)'

And it would be right to do so. You haven't given the full example
whereas the OP did. The full example includes other tokens which can be
longer that "proc" and can have proc as a prefix. If the OP had
specified that arbitrary identifiers have to be matched as well, and
that some of them may start with "DOOR", do you think I've had said his
parser was OK? Of course not, but he didn't. The specification was
admirably clear which made me not at all uneasy about using a
length-limited compare.

Of course, you may be right. Maybe this the start of larger project
that will one day include arbitrary identifiers, or keywords like "door"
and "doors", when the OP will need some full-blown lexer and parser.
But maybe not; sometimes the simple solution is the right one.
If the input is *supposed* to be an assignment, I'd be rather cross with my
parser if it told me the next token was a 'proc' keyword! It would likely
generate a syntax error at some point, but that's not right because there's
nothing wrong with the input.

It depends also on how it deals with expected whitespace or separator after
the token; the parser could check for this and only recognise a 'proc' token
if that separator was present. But this is little different from my
suggestion to find the extent of the token in the first place.

The very fact that you have to think about this stuff at all, suggests that
there is a better way! Which I consider to be simpler and more efficient
than blindly testing every possible length of input sequence,

There possibly is. Please code it up and we can compare. The OP's code
was about 16 lines, not counting }s and the unnecessary if. Also I
think it was probably reasonably efficient, though that's hard to
measure.
for example is
there really much point in trying to see if "unsigned" matches
"C->sd=AE..."? Instead you get the token "C", and while you still have to do
a lookup, you only need an actual string compare against single-letter
keywords, if there are any..

We may be talking at cross purposes. You are saying, I think, that
there is a better way to write a full-blown lexer and parser. I was
curious about your unease at using a specific kind of test in a specific
case where I felt no such unease. Maybe what I should have got from
your comment is that you think that this is one of the cases that is
just bound to get more complex, and therefore OP should be thinking in
more general terms right form the start. You might be right about that,
and if I'd got that that was your point, I may not have commented at
all.
 
B

Ben Bacarisse

BartC said:
Which brings up another point (before I clicked the wrong button..);
exactly how *do* you lookup a string S in a table, if you don't even
know it's length?

Some lookup methods will work (the linear one the OP used, for a
start), and perhaps some hairy ones which navigate a tree by indexing
by each letter in turn or something (and can still turn up false
positives). But the hash table method I normally use would have
difficulty.

You answered your own question allbeit with a delightfully biased
description of a trie ("hairy" and prone to "false positives").

There is something a little circular about your desire to delimit the
tokens before trying to match them. This is usually the correct way to
do it because most programming languages have been designed with this
sort of lexical analysis in mind. But if the keywords include 2d, 3d
(and 47d, etc), and things like +more and <p> then it may be hard to
delimit possible candidate strings before matching them.
 
B

BartC

Ben Bacarisse said:
We may be talking at cross purposes. You are saying, I think, that
there is a better way to write a full-blown lexer and parser. I was
curious about your unease at using a specific kind of test in a specific
case where I felt no such unease. Maybe what I should have got from
your comment is that you think that this is one of the cases that is
just bound to get more complex, and therefore OP should be thinking in
more general terms right form the start. You might be right about that,
and if I'd got that that was your point, I may not have commented at
all.

My original reply just pointed out some possible issues with the method that
was used. And suggested that it could also be done another way. I don't
think it was too critical.

I made no comment about the overall approach (the parsing bit rather than
the lexing).
You answered your own question allbeit with a delightfully biased
description of a trie ("hairy" and prone to "false positives").

I'm not sure I was was biased. I think that trie thing is hairier than some
other methods. And the false positives will apply to any lookup method when
the word you're looking up doesn't have a well-defined length.
 
B

Ben Bacarisse

BartC said:
I'm not sure I was was biased. I think that trie thing is hairier than
some other methods. And the false positives will apply to any lookup
method when the word you're looking up doesn't have a well-defined
length.

We've done this to death. I know that was what you meant by false
positives, but it's a deliberately polemical way of putting it! If
matching "abc", right here, right now (no matter what follows) is
correct then it's not a false positive -- it's a true positive. If you
can't match "abc", right here, right now, you have a bug. Your code is
wrong. Calling it a "false positive" (rather like your original word
"uneasy") suggest some sort of statistical correctness or, worse, some
sort of uncertainty about is right at this point of the match. The same
sense of uncertainty is carried by the notion of not having "a
well-defined length". The word we are looking up (all of them in the
case of a trie) has a perfectly well defined length and so does every
part of the sub-string we may be matching it against.

I doubt we disagree on any technical matter in the debate anymore. It's
mostly down to writing style I suspect. At least I think it is. How
would you write the OP matcher?
 
B

BartC

Ben Bacarisse said:
How would you write the OP matcher?

I posted the code below. I hope this does all the things I suggested!

There is a function lex() which extracts the next token from the input in a
general manner (and reports, to some extent, what type of token it is).

And there is a function parsetest(), which takes a given string, and checks
it corresponds to the OP's syntax spec.

Here, the details of the character-level handling, and the syntax parsing,
are largely separate.

(I haven't bothered with a formal table lookup; lex() uses file globals for
simplicity; and overflow of the lextoken[] array isn't checked. Some checks,
eg. lex()!='A', could probably be dispensed with. And there's supposed to be
something funny about using atoi(), but I can't remember what..)

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <ctype.h>

/* globals set by lex() */
int lextype; /* One of 'A' 'N' 'P' 'L' 'E' '?' means Alpha, Number,
Punctuation, EOL, End of string, Error */
int lexleadsp; /* Set to 1 when leading whitespace seen (requirement
of OP) */
char lextoken[1000]; /* Contains token for 'A', 'N' and 'P' lexical types */
char *lexptr; /* Points to next character of input */

int lex(void) {
char c;
int i;

lextoken[0]=0;
lexleadsp=0;

while (*lexptr!='\n' && isspace(*lexptr)) {
lexleadsp=1;
++lexptr;
}

c=*lexptr;
i=0;

if (c==0)
lextype='E';
else if (isalpha(c)) {
do
lextoken[i++]=toupper(*lexptr++);
while (isalpha(*lexptr));
lextype='A';
}
else if (isdigit(c)) {
do
lextoken[i++]=toupper(*lexptr++);
while (isdigit(*lexptr));
lextype='N';
}
else if (c=='\n') {
lextype='L';
++lexptr;
}
else if (ispunct(c)) {
lextoken[i++]=*lexptr++;
lextype='P';
}
else {
lextype='?';
++lexptr;
}

lextoken=0;
return lextype;
}

/* Make use of lex() function to test an input string in this syntax:
OPEN|CLOSE|LOCK DOOR [+]n where n must be 1 to 10, and without
leading or trailing white space.
Returns 1 (passed) or 0 (failed)
*/
int parsetest(char *input) {
int n,t;

lexptr=input;

if (lex()!='A' || lexleadsp) return 0;
if (strcmp(lextoken,"OPEN")!=0 &&
strcmp(lextoken,"CLOSE")!=0 &&
strcmp(lextoken,"LOCK")!=0)
return 0;

if (lex()!='A' || strcmp(lextoken,"DOOR")!=0) return 0;

t=lex();
if (t=='P' && lextoken[0]=='+') t=lex();

if (t!='N') return 0;
n=atoi(lextoken);
if (n<1 || n>10) return 0;

switch (lex()) {
case 'E': case 'L':
if (lexleadsp) return 0;
break;
default:
return 0;
}
return 1;
}

int main(void) {
char* teststr = "Lock DOOR 10";
int status;

printf("Testing: \"%s\"\n",teststr);

status=parsetest(teststr);

printf("Result: %s\n",status?"Passed":"Failed");
}
 
B

Ben Bacarisse

BartC said:
Ben Bacarisse said:
How would you write the OP matcher?

I posted the code below. I hope this does all the things I suggested!

There is a function lex() which extracts the next token from the input in a
general manner (and reports, to some extent, what type of token it is).

And there is a function parsetest(), which takes a given string, and checks
it corresponds to the OP's syntax spec.

Here, the details of the character-level handling, and the syntax parsing,
are largely separate.

(I haven't bothered with a formal table lookup; lex() uses file
globals for simplicity; and overflow of the lextoken[] array isn't
checked. Some checks, eg. lex()!='A', could probably be dispensed
with.

I think the extra globals are a problem. To avoid the length-limited
compare, you have to store the tokens somewhere, and in C that means
either a lot of ugliness (passed-in buffers, globals arrays, what have
you) or a lot of machinery. Even taking the "no machinery" approach, 16
lines with no globals has turned into about 50 with four of them.
Upthread you said this approach was simpler and more efficient. OK, the
"more efficient" probably referred to using hashing to check keywords,
but then there would be a lot more then 50 lines. Is it really simpler?
More extensible, yes, but not, I'd say, simpler.
And there's supposed to be
something funny about using atoi(), but I can't remember what..)

It's undefined on integer overflow.

<snip code>
Thanks for posting that.
 
B

BartC

Upthread you said this approach was simpler and more efficient. OK, the
"more efficient" probably referred to using hashing to check keywords,
but then there would be a lot more then 50 lines. Is it really simpler?
More extensible, yes, but not, I'd say, simpler.

Simplicity is not just about line-count. It's also knowing exactly what is
being compared and being confident about the results.

Also, the actual syntax parsing *is* simpler, once the lexical stuff has
been encapsulated.

The efficiency will come from opportunities to do more streamlined lookups.
As it is, with an extra step to tokenise input before doing anything with it
(and using all those isxxx() functions), it's about half the speed of the
OP's original.

(Tokenising can also leave the token strings in-place, especially if the
source string can be modified, but it's bit more awkward.)
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,755
Messages
2,569,537
Members
45,022
Latest member
MaybelleMa

Latest Threads

Top