A very simple parser with scanf & C

G

Gene

Ben Bacarisse said:
 How would you write the OP matcher?

I posted the code below. I hope this does all the things I suggested!

There is a function lex() which extracts the next token from the input ina
general manner (and reports, to some extent, what type of token it is).

And there is a function parsetest(), which takes a given string, and checks
it corresponds to the OP's syntax spec.

Here, the details of the character-level handling, and the syntax parsing,
are largely separate.

(I haven't bothered with a formal table lookup; lex() uses file globals for
simplicity; and overflow of the lextoken[] array isn't checked. Some checks,
eg. lex()!='A', could probably be dispensed with. And there's supposed to be
something funny about using atoi(), but I can't remember what..)

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <ctype.h>

/* globals set by lex() */
int lextype;         /* One of 'A' 'N' 'P' 'L' 'E' '?' means Alpha, Number,
                        Punctuation, EOL, End of string, Error */
int lexleadsp;       /* Set to 1 when leading whitespace seen (requirement
of OP) */
char lextoken[1000]; /* Contains token for 'A', 'N' and 'P' lexical types*/
char *lexptr;        /* Points to next character of input */

int lex(void) {
  char c;
  int i;

  lextoken[0]=0;
  lexleadsp=0;

  while (*lexptr!='\n' && isspace(*lexptr)) {
    lexleadsp=1;
    ++lexptr;
  }

  c=*lexptr;
  i=0;

  if (c==0)
    lextype='E';
  else if (isalpha(c)) {
    do
      lextoken[i++]=toupper(*lexptr++);
    while (isalpha(*lexptr));
    lextype='A';
  }
  else if (isdigit(c)) {
    do
      lextoken[i++]=toupper(*lexptr++);
    while (isdigit(*lexptr));
    lextype='N';
  }
  else if (c=='\n') {
    lextype='L';
    ++lexptr;
  }
  else if (ispunct(c)) {
    lextoken[i++]=*lexptr++;
    lextype='P';
  }
  else {
    lextype='?';
    ++lexptr;
  }

  lextoken=0;
  return lextype;

}

/* Make use of lex() function to test an input string in this syntax:
  OPEN|CLOSE|LOCK DOOR [+]n where n must be 1 to 10, and without
  leading or trailing white space.
  Returns 1 (passed) or 0 (failed)
*/
int parsetest(char *input) {
  int n,t;

  lexptr=input;

  if (lex()!='A' || lexleadsp) return 0;
  if (strcmp(lextoken,"OPEN")!=0 &&
      strcmp(lextoken,"CLOSE")!=0 &&
      strcmp(lextoken,"LOCK")!=0)
    return 0;

  if (lex()!='A' || strcmp(lextoken,"DOOR")!=0) return 0;

  t=lex();
  if (t=='P' && lextoken[0]=='+') t=lex();

  if (t!='N') return 0;
  n=atoi(lextoken);
  if (n<1 || n>10) return 0;

  switch (lex()) {
  case 'E': case 'L':
    if (lexleadsp) return 0;
    break;
  default:
   return 0;
  }
 return 1;

}

int main(void) {
  char* teststr = "Lock DOOR 10";
  int status;

  printf("Testing: \"%s\"\n",teststr);

  status=parsetest(teststr);

  printf("Result: %s\n",status?"Passed":"Failed");

}


If there is a realistic chance the language will expand in the future
to a branching syntax, then tokenizing now is probably a good idea.
But if the language is such that the parser will always know what to
expect based on what it's already read (essentially the grammar is
LR(0)), then a separate tokenizing step is overkill. It adds error
opportunities and slows things down.

What the OP did is fine, but I'll bet his tests didn't work the first
time. Character-by-character approaches like this are fiddly.

A middle ground is to develop a set of prefix matchers that can be
reused. I've tried many different APIs and idioms and settled on the
following over the years. With this approach the code of the parser
can look just like the spec. With a good inlining compiler, the
result will be as good or better than OP's code for performance.

typedef enum token_e {
OPEN_TOKEN,
CLOSE_TOKEN,
LOCK_TOKEN,
DOOR_TOKEN,
} TOKEN;

/*
* Return true iff the given token is a proper directive.
*/
int is_directive(TOKEN token)
{
return OPEN_TOKEN <= token && token <= LOCK_TOKEN;
}

/* Must be in alpha order, lower case */
KEYWORD_PAIR keywords[] = {
{ "close", CLOSE_TOKEN },
{ "door", DOOR_TOKEN }, // Could handle door specially....
{ "lock", LOCK_TOKEN },
{ "open", OPEN_TOKEN },
};

/*
* Return true iff the given integer is a proper door number.
*/
int door_number_ok(unsigned n)
{
return n <= 10;
}

/*
* Parse a string with the following format:
* - a directive token, case insensitive
* - one or more whitespaces (spaces, tabs, newlines...)
* - the word "door" (without quotes), case insensitive
* - an optional number of whitespaces, even zero
* - a door number in the range 0..10
* - no other characters after the number
* Return the directive as a token and the door number.
* Function return is error code or 0 for success.
*/
int parse(char *s, TOKEN *directive, unsigned *door_number)
{
int index;

// Probably overkill to demand no whitespace.
if (match_whitespace(s, &s))
return -1;

if (!match_keyword(s, &s,
keywords, TABLE_SIZE(keywords),
keyword_cmp_ignoring_case,
&index))
return -2; // Missing token.

*directive = (TOKEN)keywords[index].token;

if (!is_directive(*directive))
return -3; // Token not a directive.

if (!match_token(s, &s,
keywords, TABLE_SIZE(keywords),
keyword_cmp_ignoring_case,
DOOR_TOKEN))
return -4; // Door keyword missing or misspelled

if (!match_unsigned(s, &s, door_number))
return -5; // Door number missing

if (!door_number_ok(*door_number))
return -6; // Door number out of range

if (*s)
return -7; // Junk at end check.

return 0; // 0 is more conventional for success
}

.... and the reusable code looks essentially like this. I've de-
parameterized some of my actual code to shorten it.

/*
* Set *end one char past whitespace prefix of s.
* Return true iff any characters matched.
*/
int match_whitespace(char *s, char **end)
{
char *p = s;
while (isspace(*p))
++p;
*end = p;
return p > s;
}

/*
* Set *end one char past alpha prefix of s.
* Return true iff any characters matched.
*/
int match_alpha(char *s, char **end)
{
char *p = s;
while (isalpha(*p))
++p;
*end = p;
return p > s;
}

/*
* Skip whitespace and then scan as many decimal digits
* as possible. Convert to an unsigned value with no
* overflow check. Return true iff any digits were
* matched. Set *end one past the scanned chars.
* FIXME: Add overflow check.
*/
int match_unsigned(char *s, char **end, unsigned *val_rtn)
{
char *p;
unsigned val;

match_whitespace(s, &s);
for (val = 0, p = s; isdigit(*p); p++)
val = 10 * val + (*p - '0');
*end = p;
*val_rtn = val;
return p > s;
}

/*
* Convert a character to lower case.
*/
char lowercase(char c)
{
return ('A' <= c && c <= 'Z') ? c - 'A' + 'a' : c;
}

/*
* Key comparison function for keyword lookups.
*/
typedef int (*KEYWORD_CMP)(char *s, unsigned n, char *kw);

/*
* Case insensitive keyword comparision. This compares
* a prefix of s of up to n characters with kw.
*/
int keyword_cmp_ignoring_case(char *s, size_t n, char *kw)
{
while (n > 0 && *s && *kw && lowercase(*s) == *kw) {
--n; ++s; ++kw;
}
return (n == 0) ? -*kw : lowercase(*s) - *kw;
}

/*
* Keyword entries and tables. An empty key string in
* a table produces undefined results.
*/
typedef struct keyword_pair_s {
char *key;
int token;
} KEYWORD_PAIR, KEYWORD_TABLE[];

#define TABLE_SIZE(A) (sizeof A / sizeof A[0])

/*
* Match a maximal prefix of alphabetic characters of
* string s to keywords in a table. Binary search is
* used, so the keys must be sorted. The keyword table
* index is returned (not the token value).
* On success return true and set *end one past the keyword.
* If none found, set *end one past any whitespace and return 0
* with index unchanged.
*/
int match_keyword(char *s, char **end,
KEYWORD_TABLE tbl, unsigned tbl_size,
KEYWORD_CMP cmp,
int *index)
{
int diff, lo, hi, mid;

match_whitespace(s, &s);
if (match_alpha(s, end)) {
lo = 0;
hi = tbl_size - 1;
while (lo <= hi) {
mid = (lo + hi) / 2;
diff = cmp(s, *end - s, tbl[mid].key);
if (diff < 0)
hi = mid - 1;
else if (diff > 0)
lo = mid + 1;
else {
*index = mid;
return 1;
}
}
}
*end = s;
return 0;
}

/*
* Match a maximal prefix of string s to keywords in a table.
* If found, verify its token matches the given value.
* On success return true and set *end to character past
* scanned keyword. On any failure, set *end one past any
* leading whitespace.
*/
int match_token(char *s, char **end,
KEYWORD_TABLE tbl, unsigned tbl_size,
KEYWORD_CMP cmp,
int token)
{
int index;

match_whitespace(s, &s);
if (!match_keyword(s, end, tbl, tbl_size, cmp, &index) ||
tbl[index].token != token)
return 0;
return 1;
}
 
B

Ben Bacarisse

BartC said:
Simplicity is not just about line-count. It's also knowing exactly
what is being compared and being confident about the results.

That's true. Part of the trouble was that I though the original
approach was clear enough to have confidence in the results. Given
that, the actual amount of code needed does then measure complexity.

<snip>
 
R

Rui Maciel

pozz said:
I want to write a very simple (at least, I thought it was very simple)
parser of a string. This string has the following format:
- a string ID that matches one of a known list (no whitespaces
before), case insensitive
- one or more whitespaces (spaces, tabs, newlines...)
- the word "door" (without quotes), case insensitive
- an optional number of whitespaces, even zero
- a number in the range 0..10
- no other characters after the number
Of course, I want to write a good parser, so avoiding any possible
memory leak, buffer overrun, segmentation fault...

What do you think about the following code? Is there a better way? Did
I forget something? Is it portable (I don't know about strcasecmp()
and strncasecmp())?
<snip/>

Developing a proper parser is a better way to handle this task. As the
language you want to parse is a simple one, I suggest you try to develop a
LL parser by hand. Besides being able to develop a bullet-proof piece of
code, you will find it to be a very rewarding experience.


Rui Maciel
 
G

Gene

Gene said:
/*
 * Convert a character to lower case.
 */
char lowercase(char c)
{
  return ('A' <= c && c <= 'Z') ? c - 'A' + 'a' : c;
}

That's not 100% portable.
At least it won't work with EBCDIC.http://en.wikipedia.org/wiki/Extended_Binary_Coded_Decimal_Interchang...

#include <limits.h>
#include <string.h>

#define UPPER   "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
#define LOWER   "abcdefghijklmnopqrstuvwxyz"

int to_lower(int c)
{
    const char *lower;
    const char *const upper = UPPER;

    lower = CHAR_MAX >= c && c > '\0' ? strchr(upper, c) : NULL;
    return lower != NULL ? LOWER[lower - upper] : c;

}

Got that. Cut my teeth on an IBM 1130. In production code I'd rather
do something like

#if 'Z' - 'A' + 1 != 26 || 'z' - 'a' + 1 != 26
#error tolower() assumption violated
#endif

than replace 2 instructions with a few dozen executed once per input
character. Could also use ctype.h's tolower, but probably give up
inlining. And I've run into a couple of ctype.h's without it, though
not for a while.
 
K

Keith Thompson

BartC said:
What's the problem?

isalpha() expects an int argument, which either is representable
as an unsigned char or is equal to EOF. If plain char is signed,
and *lexptr happens to be negative (and not equal to EOF), the
behavior is undefined.
 
B

BartC

Keith Thompson said:
isalpha() expects an int argument, which either is representable
as an unsigned char or is equal to EOF. If plain char is signed,
and *lexptr happens to be negative (and not equal to EOF), the
behavior is undefined.

I still don't get it.

Are you saying that while an implementation's char type can be signed or
unsigned, it's isalpha() function assumes unsigned?

Apart from being ludicrous, how are you supposed to use it then? With an
(unsigned char) cast? Suppose the char value really is EOF?

(Quite a few threads in the past about the pointlessness of having signed
char at all, now I know it is totally crazy, when one part of the same
implementation can't agree with another about whether chars are supposed to
be signed or signed!)

Anyway I only used isalpha() because with my usual >=A and <=Z tests,
someone would have called me out on that too!

(BTW if I was implementing isalpha(), after checking for EOF, any negative
char values would just be converted to unsigned ones; end of story.)
 
J

James Kuyper

I still don't get it.

Are you saying that while an implementation's char type can be signed or
unsigned, it's isalpha() function assumes unsigned?

7.4p1 describes the character classification functions: "In all cases
the argument is an int, the value of which shall be representable as an
unsigned char or shall equal the value of the macro EOF. If the argument
has any other value, the behavior is undefined."
Apart from being ludicrous, how are you supposed to use it then? With an
(unsigned char) cast? Suppose the char value really is EOF?

That's covered - see above.
Keep in mind that the standard specifies that fgetc() "obtains that
character as an unsigned char converted to an int", unless it returns
EOF to indicate an error. Those specifications make fgetc() and the
character classification macros good matches for each other.
 
R

Richard Damon

I still don't get it.

Are you saying that while an implementation's char type can be signed or
unsigned, it's isalpha() function assumes unsigned?

Apart from being ludicrous, how are you supposed to use it then? With an
(unsigned char) cast? Suppose the char value really is EOF?

(Quite a few threads in the past about the pointlessness of having
signed char at all, now I know it is totally crazy, when one part of the
same implementation can't agree with another about whether chars are
supposed to be signed or signed!)

Anyway I only used isalpha() because with my usual >=A and <=Z tests,
someone would have called me out on that too!

(BTW if I was implementing isalpha(), after checking for EOF, any
negative char values would just be converted to unsigned ones; end of
story.)

This is one "wart" in the language. The isxxx functions are based on the
int returned by fgetc and family, not on array of chars. The undefined
signness of char I suspect is based on my assumption that there was a
machine in at least moderate use at the time that handled signed chars
better than unsigned, maybe the "Load Byte" instruction automatically
sign extended, so unsigned chars needed an explicit mask instruction.

One way around this problem is to define a new set of character
functions (either as macros or inline functions) that automatically
perform the cast. Ie something like

inline int isalphac(char c){ return isalpha((unsigned char)c);}

Not that a very common implementation of the isxxx functions is a table
lookup/masking operation. Often isalpha(c) is something like

table[c+1] & ALPHAFLAGS

This allows an easy change for changing locales, which might change
which of the upper 128 characters are considered to be alpha.
 
K

Keith Thompson

BartC said:
I still don't get it.

Are you saying that while an implementation's char type can be signed or
unsigned, it's isalpha() function assumes unsigned?

Yes. C99 7.4p1:

In all cases the argument is an int, the value of which shall
be representable as an unsigned char or shall equal the value
of the macro EOF. If the argument has any other value, the
behavior is undefined.
Apart from being ludicrous, how are you supposed to use it then? With an
(unsigned char) cast? Suppose the char value really is EOF?

Yes, an (unsigned char) cast is the correct way to use it. You can omit
the cast if you happen to know that the argument is positive -- which
would probably have been the case when these functions were first
defined. Note that all characters in the basic character set are
required to be non-negative -- but you can't assume that characters from
an outside source are non-negative.

An *unsigned char* value cannot be EOF, since EOF is negative.

Typically EOF is -1. A (signed) char value of -1 should be passed as
255 (assuming CHAR_BIT==8). The cast to unsigned char accomplishes
that.

[...]
 
B

BartC

James Kuyper said:
On 09/25/2011 03:43 PM, BartC wrote:
Keep in mind that the standard specifies that fgetc() "obtains that
character as an unsigned char converted to an int", unless it returns
EOF to indicate an error. Those specifications make fgetc() and the
character classification macros good matches for each other.

OK, so everywhere you look, it's clear that having unsigned char types is a
good idea.

So, even if signed chars are not eliminated completely, why are they so
popular, given that it is (presumably) the implementation's free choice to
choose between signed and unsigned? (At least, the three compilers I have,
all have char as signed.)

Then signed chars would only ever be used by someone needing a small signed
integer.
 
R

Richard Damon

OK, so everywhere you look, it's clear that having unsigned char types is a
good idea.

So, even if signed chars are not eliminated completely, why are they so
popular, given that it is (presumably) the implementation's free choice to
choose between signed and unsigned? (At least, the three compilers I have,
all have char as signed.)

Then signed chars would only ever be used by someone needing a small
signed integer.

My guess is that there must be a body of code that expects (incorrectly)
that "plain" char will be signed, and implementations are choosing chars
to be signed to meet customers need.

Those who know better, add the cast to unsigned, those that don't
probably more expect it to be signed.

Note that the isxxx functions work properly with signed characters as
long as the character set uses just base ASCII. Where it starts to fail
is when you get into extended character sets, and you need to do more
than just make char unsigned to handle those, so other parts of the
program will break as well, and the "undefined behavior" that comes out
of putting negative values into them is normally "just" wrong answers,
not "crashes", and if you don't know enough to set the locale, the
answers are likely unspecified anyway. If you know enough to set the
locale, you probably know enough to add the cast, or to change the
setting on the compiler to make char unsigned for your program.
 
K

Keith Thompson

BartC said:
OK, so everywhere you look, it's clear that having unsigned char types is a
good idea.

So, even if signed chars are not eliminated completely, why are they so
popular, given that it is (presumably) the implementation's free choice to
choose between signed and unsigned? (At least, the three compilers I have,
all have char as signed.)

Then signed chars would only ever be used by someone needing a small signed
integer.

Expressions of type char are promoted to int in most contexts. If plain
char is signed, this promotion can be done with a sign extension, which
is typically quite cheap. If plain char is unsigned, it might be more
expensive. (I don't have more specific information, but it's what I was
told when I asked a similar question some time ago.)

I agree that making plain char unsigned would be conceptually simpler.
 
K

Keith Thompson

Richard Damon said:
My guess is that there must be a body of code that expects (incorrectly)
that "plain" char will be signed, and implementations are choosing chars
to be signed to meet customers need.

Those who know better, add the cast to unsigned, those that don't
probably more expect it to be signed.

That's a good guess -- but I don't think it's correct. I've worked on
systems that weren't particularly exotic that had plain char unsigned:
IBM AIX (a Unix system) on PowerPC, and SGI IRIX (another Unix system)
on MIPS. I'm not aware of any code that had problems because of this.

As I said in my other followup, I think it's more a matter of code
efficiency.

[...]
 
A

Alan Curry

That's a good guess -- but I don't think it's correct. I've worked on
systems that weren't particularly exotic that had plain char unsigned:
IBM AIX (a Unix system) on PowerPC, and SGI IRIX (another Unix system)
on MIPS. I'm not aware of any code that had problems because of this.

You were lucky. I've seen those problems. Two that I remember:

char c;
while((c=getopt(...))!=-1)
...

(The bug report on that one is http://bugs.debian.org/327327 and it was
fixed pretty quickly after I reported it.)

The other was shortly after the DOOM source code release I tried to set
up a multiplayer game where one player was on a PowerPC and the others
were i386's. The game aborted as soon as anybody moved, because player
movements were being sent over the network as small signed integers,
stored in a plain char.

This was a game that was praised for having all of its core logic in
portable C with the system-dependent parts neatly isolated, and its
portable core assumed that plain char was signed. On the bright side,
the use of chars rather than shorts prevented any possible endianness
mismatches.

Oh, and a third example: the Jacob Navia "containers" library had a
similar kind of failure when I first ran it on a plain-char-is-unsigned
machine. I posted about that and he fixed it, in a thread here a few
months ago.

If any compiler changes from "plain char is signed" to "plain char is
unsigned", its users will experience bugs of this type. Some of them
will be hard to track down, and the users will grumble about the
compiler vendor breaking code that's always worked before.

That's enough to keep signed chars around forever.
 
B

Ben Bacarisse

Richard Damon said:
My guess is that there must be a body of code that expects
(incorrectly) that "plain" char will be signed, and implementations
are choosing chars to be signed to meet customers need.

Just a historical note. Of the four machines implementing C that are
documented in K&R, only one has signed chars: the PDP-11. It was
perhaps the most influential, so that might be part of the reason for
the common assumption that chars are signed.

K&R C also had no signed keyword, so if you had unsigned char, you could
not get a signed one. Also, the "unsigned" adjective was not allowed to
qualify char[1], so of you had signed chars you could not get unsigned
ones. This is significant, because the cast to unsigned char is
required in modern C. I don't know the sequence, but there may have
been a time when the isxxxx functions existed with no way to fix the
problem.

[1] I'd also forgotten that unsigned was only allowed to used with int.
unsigned long and unsigned short were not permitted.
 
S

Seebs

So, even if signed chars are not eliminated completely, why are they so
popular, given that it is (presumably) the implementation's free choice to
choose between signed and unsigned? (At least, the three compilers I have,
all have char as signed.)
Then signed chars would only ever be used by someone needing a small signed
integer.

I suspect consistency: If you don't *say* signed or unsigned, for any other
normal* integer type the default is signed. Basically, it's so that 'char'
is in the same bin as 'short', 'int', and 'long'.

-s
[*] Shut up, _Bool, we don't *care* about you.
 
K

Keith Thompson

You were lucky. I've seen those problems. Two that I remember:

char c;
while((c=getopt(...))!=-1)
...

(The bug report on that one is http://bugs.debian.org/327327 and it was
fixed pretty quickly after I reported it.)

That doesn't just assume that plain char is signed (though it might
happen to work if it is); it assumes, incorrectly, that getopt() returns
a char (in fact it returns an int). Even if plain char is unsigned, it
will misbehave if getopt() returns an actual 0xff character.

[...]
 
P

Phil Carmody

Keith Thompson said:
Expressions of type char are promoted to int in most contexts. If plain
char is signed, this promotion can be done with a sign extension, which
is typically quite cheap. If plain char is unsigned, it might be more
expensive. (I don't have more specific information, but it's what I was
told when I asked a similar question some time ago.)

I suspect those architectures may well live only in Bernsteinian la-la
land. Quite why the other 100% of the world should care one flying
ferret about expensiveness of operations on such architectures, I'm
not entirely sure. (And I say that as someone who has not used a
mainstream architecture as his primary machine for over a decade. (7
happy years on alpha, 4 happy years on POWER. And most development in
that time done for ARM.))
I agree that making plain char unsigned would be conceptually simpler.

I'm pretty sure you disagreed with me when I made that suggestion
a while back.

Phil
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,755
Messages
2,569,536
Members
45,014
Latest member
BiancaFix3

Latest Threads

Top