Token Parser

S

Simon Morgan

I'm trying to write a function to parse a Reverse Polish Notation string
from stdin and return 1 token at a time. For those of you who are unaware
an RPN string looks like this:

1 2 + 4 * 3 +

With each number being read and pushed onto a stack and, when an operator
is encountered, each number on the stack is popped off and calculated
using that operator and the result pushed onto the stack.

The problem I'm having is differentiating between for example, the +
character and the number 43. Here's what I have so far:

#include <stdio.h>

int read_token(void) {
int token, next_token;

while ((token = getchar()) != EOF) {
if (token == ' ' || token == '\n') {
continue;
}
if (token == '+' || token == '-' || token == '*' || token == '/') {
break;
} else {
while ((next_token = getchar()) != ' ' && next_token != EOF) {
token *= 10;
token += next_token;
}
break;
}
}

return token;
}

What's the conventional wisdom on overcoming the ambiguity between
characters and numbers in C?

Thanks.
 
F

Flash Gordon

Simon said:
I'm trying to write a function to parse a Reverse Polish Notation string
from stdin and return 1 token at a time. For those of you who are unaware
an RPN string looks like this:

1 2 + 4 * 3 +

With each number being read and pushed onto a stack and, when an operator
is encountered, each number on the stack is popped off and calculated
using that operator and the result pushed onto the stack.

The problem I'm having is differentiating between for example, the +
character and the number 43. Here's what I have so far:

A man who provides a problem description and the problem code. A rarity
that it is pleasant to encounter :)
#include <stdio.h>

int read_token(void) {
int token, next_token;

while ((token = getchar()) != EOF) {
if (token == ' ' || token == '\n') {
continue;
}
if (token == '+' || token == '-' || token == '*' || token == '/') {
break;
} else {

You probably want to check that it was a digit here as opposed to a
letter or some other punctuation and error if it was not. The standard
isdigit function could be of use!
while ((next_token = getchar()) != ' ' && next_token != EOF) {

Again here.
token *= 10;
token += next_token;
}
break;
}
}

return token;
}

What about explicit positive and negative numbers? e.g.
+43 -3 *
What's the conventional wisdom on overcoming the ambiguity between
characters and numbers in C?

I would say that returning a token say the best option is to return a
token value indicating a number (possibly separate tokens for integer,
real etc) and return the value separately. You could either return a
structure with both items (not unreasonable) or return one of them be
passing a pointer to where it should be stored.
 
C

Chris Croughton

What's the conventional wisdom on overcoming the ambiguity between
characters and numbers in C?

Well, it's obvious that you aren't returning enough information. You
need to return the type of the token (operator or number) as well as its
value (type of operator, value of the number). Generally, either return
the type and vakue separately, or decide that certain values are "out of
bounds" (negative numbers, for instance) and assign the operators in
that range.

In your case you seem to handle only positive integers, so you could say
for instance:

#define PLUS (-'+')
#define MINUS (-'-')
#define TIMES (-'*')

etc. then when you get an operator set token = -token.

Your code is also not at all safe. You are saying that anything not one
of your 'known' tokens or space is a digit, that will give you very
strange results with "abc def +" or even "123.567". You really need to
decide whether the token is:

An operator
A number
Something else

and possibly give an error if it's "something else".

Chris C
 
T

T.M. Sommers

Simon said:
I'm trying to write a function to parse a Reverse Polish Notation string
from stdin and return 1 token at a time. For those of you who are unaware
an RPN string looks like this:

1 2 + 4 * 3 +

With each number being read and pushed onto a stack and, when an operator
is encountered, each number on the stack is popped off and calculated
using that operator and the result pushed onto the stack.

The problem I'm having is differentiating between for example, the +
character and the number 43. Here's what I have so far:

#include <stdio.h>

int read_token(void) {
int token, next_token;

while ((token = getchar()) != EOF) {
if (token == ' ' || token == '\n') {
continue;
}
if (token == '+' || token == '-' || token == '*' || token == '/') {
break;
} else {
while ((next_token = getchar()) != ' ' && next_token != EOF) {
token *= 10;
token += next_token;
}
break;
}
}

return token;
}

There are several problems here. First, you are confusing tokens
with input characters. While a token can consist of a single
character, it is really a higher-level lexical object that has
both a type and a value. Your code tries to encode both type and
value in a single integer, but since one of your token types is
integer, you are bound to have problems. The usual solution is
to handle the token type and its value separately.

You are also confusing the characters '0'-'9' with the numbers 0-9.

You should take a look at lex or flex to get some ideas on
lexing. You should also stop by comp.compilers.

Here is a sample to give you the idea. It is hardly robust, but
should get you started.

/* --------------- snip --------------- */
#include <stdio.h>
#include <ctype.h>

#define UNKNOWN 1
#define NUMBER 2
#define OP 3

union val {
int num;
char op;
char ch;
};

int token(union val *val)
{
int tok = 0;
int c;

while ( (c = getchar()) != EOF ) {
if ( isspace(c) ) {
/* do nothing */
}
else if ( isdigit(c) ) {
tok = NUMBER;
val->num = c - '0';
while ( isdigit(c = getchar()) ) {
val->num *= 10;
val->num += c - '0';
}
if ( !isdigit(c) ) {
ungetc(c, stdin);
}
break;
}
else if ( c == '+' || c == '*' || c == '-' || c == '/' ) {
tok = OP;
val->op = c;
break;
}
else {
tok = UNKNOWN;
val->ch = c;
break;
}
}

return tok;
}

int main(int argc, char *argv[])
{
union val val;
int tok;

while ( (tok = token(&val)) ) {
switch (tok) {
case NUMBER:
printf("NUMBER = %d\n", val.num);
break;
case OP:
printf("OP = %c\n", val.op);
break;
case UNKNOWN:
printf("unexpected character: %c\n", val.ch);
break;
default:
printf("unknown token type: %d\n", tok);
}
}

return 0;
}
/* --------------- snip --------------- */
 
C

CBFalconer

T.M. Sommers said:
.... snip ...

There are several problems here. First, you are confusing tokens
with input characters. While a token can consist of a single
character, it is really a higher-level lexical object that has
both a type and a value. Your code tries to encode both type and
value in a single integer, but since one of your token types is
integer, you are bound to have problems. The usual solution is
to handle the token type and its value separately.

You are also confusing the characters '0'-'9' with the numbers 0-9.

You should take a look at lex or flex to get some ideas on
lexing. You should also stop by comp.compilers.

Here is a sample to give you the idea. It is hardly robust, but
should get you started.

/* --------------- snip --------------- */

Here is the beginnings of a simplified lexer I am designing. When
done I expect it will be usable for various languages (if I ever
get around to finishing it). The accompanying code is around 350
lines so far, so I am refraining from inflicting it on the
newsgroup. I could be persuaded to publish the code if someone
really wants it. Besides it doesn't work to my satisfaction yet.
When done it will return reals as strings with an appropriate
identifier, and I am not yet anywhere near satisfied about this.
So far it also doesn't handle escaped characters, trigraphs, and
such. The lexer will be the sole communication between the input
file and the program.

It is intended to work with languages that can resolve things on
the basis of one symbol and one char lookahead.

/* File lexer.h
This is intended to be a language adjustable module that will
open a specified file on initialization, and then return tokens
read from that file. The tokens may be identifiers, reserved
words, punctuation or EOF. They are accompanied by a
'spelling'.

The initialization phase specifies the language, which in turn
specifies the form of comments (to be ignored).

Users of the lexer can detect identifiers that are actually
reserved words, and take appropriate action.
*/

/* opaque type for internal use */
struct lexcontrol;

/* The user needs to free this and the contained ident */
typedef struct lextoken {
int nextch;
int sym; /* Also defines string delimiter */
int idlgh; /* A non-zero in these signals */
int numlgh; /* that an id, number, or string */
int stglgh; /* is present in ident */
char *ident; /* always return fresh in a new token */
} lextoken, *lextokenp;

typedef enum lang {unknown, C89, C99, Pascal} lexlang;

/* Initialize the system with an opened text file */
struct lexcontrol* lexinit(FILE *f, lexlang lang);

/* Return the next token, which is a pointer to a record */
lextokenp lexnext(struct lexcontrol* lx);
 
N

Neil Ferguson

Simon Morgan said:
I'm trying to write a function to parse a Reverse Polish Notation string
from stdin and return 1 token at a time. For those of you who are unaware
an RPN string looks like this:

1 2 + 4 * 3 +

With each number being read and pushed onto a stack and, when an operator
is encountered, each number on the stack is popped off and calculated
using that operator and the result pushed onto the stack.

The problem I'm having is differentiating between for example, the +
character and the number 43. Here's what I have so far:

#include <stdio.h>

int read_token(void) {
int token, next_token;

while ((token = getchar()) != EOF) {
if (token == ' ' || token == '\n') {
continue;
}
if (token == '+' || token == '-' || token == '*' || token == '/') {
break;
} else {
while ((next_token = getchar()) != ' ' && next_token != EOF) {
token *= 10;
token += next_token;
}
break;
}
}

return token;
}

What's the conventional wisdom on overcoming the ambiguity between
characters and numbers in C?

Thanks.
I suggest that as an approach to implementing this problem you look into
using a state machine. It's an elegant way to solve problems that are
combinatorily restricted.
 
S

Simon Morgan

What's the conventional wisdom on overcoming the ambiguity between
characters and numbers in C?

Thanks to everybody for their help, some of it is a bit above me at the
moment because I'm still learning the language and programming in general
(who isn't?) but it's food for thought.

I decided to go with the passing in a pointer and setting it to indicate
the type approach because structures haven't been covered in the book I'm
working from yet.

The code I have is far from perfect but obviously works a lot better than
what I had before. For those who are interested:

#include <stdio.h>

int read_token(char *type) {
int token, next_token;

while ((token = getchar()) != EOF) {
if (token == '+' || token == '-' || token == '*' || token == '/') {
*type = 'c';
break;
}
if (isdigit(token)) {
*type = 'n';
token -= 48;
while (isdigit(next_token = getchar())) {
token *= 10;
token += next_token - 48;
}
break;
}
}

return token;
}
 
F

Flash Gordon

Simon said:
Thanks to everybody for their help, some of it is a bit above me at the
moment because I'm still learning the language and programming in general
(who isn't?) but it's food for thought.

I decided to go with the passing in a pointer and setting it to indicate
the type approach because structures haven't been covered in the book I'm
working from yet.

The code I have is far from perfect but obviously works a lot better than
what I had before. For those who are interested:

#include <stdio.h>

int read_token(char *type) {
int token, next_token;

while ((token = getchar()) != EOF) {
if (token == '+' || token == '-' || token == '*' || token == '/') {
*type = 'c';
break;
}
if (isdigit(token)) {
*type = 'n';
token -= 48;

token -= '0'; /* ASCII is not the only character set */
while (isdigit(next_token = getchar())) {
token *= 10;
token += next_token - 48;

token += next_token - '0'; /* ASCII is not the only
character set */
}
break;
}

else {
/* Put some code here to handle bad input */
}
 
C

CBFalconer

Simon said:
.... snip ...

The code I have is far from perfect but obviously works a lot
better than what I had before. For those who are interested:

#include <stdio.h>

int read_token(char *type) {
int token, next_token;

while ((token = getchar()) != EOF) {
if (token == '+' || token == '-' || token == '*' || token == '/') {
*type = 'c';
break;
}
if (isdigit(token)) {
*type = 'n';
token -= 48;
while (isdigit(next_token = getchar())) {
token *= 10;
token += next_token - 48;
}
break;
}
}
return token;
}

That is pretty clean newbie code - nice job. I have two comments.
First, get rid of the magic number 48. You should use '0' here,
which will not vary with the underlying char set. Second is exit
conditions.

You are basically ignoring all input except the stuff that
interests you. When you get a symbol such as '+' you return that,
and whatever follows is available for a further call. That means
you can resolve input such as "+123" correctly. But what about
"123+4", for example. You have discarded the char that follows 123
forever.

C has a handy mechanism for handling this, called ungetc. You use
it to put a char back for future reading. It is limited to a
single level, i.e. you can't push back more than one char between
gets. So add the line:

ungetc(next_token, stdin);

just before the break in the number accumulation routine. Remember
that getchar is really a call on getc(stdin) without bothering to
specify stdin file.

Other comments:

Later you can worry about detecting overflows in numeric input.

You may find that interchanging the methods of returning type and
value is more useful later. You will generally base the next
decision on the returned type, not the value. So the prototype may
work better as:

int next_token_type(int *token_read);
 
T

T.M. Sommers

Simon said:
I decided to go with the passing in a pointer and setting it to indicate
the type approach because structures haven't been covered in the book I'm
working from yet.

You might find it worthwhile to read ahead a bit.
The code I have is far from perfect but obviously works a lot better than
what I had before. For those who are interested:

#include <stdio.h>

Don't forget:

#include said:
int read_token(char *type) {
int token, next_token;

while ((token = getchar()) != EOF) {
if (token == '+' || token == '-' || token == '*' || token == '/') {
*type = 'c';
break;
}
if (isdigit(token)) {
*type = 'n';
token -= 48;
while (isdigit(next_token = getchar())) {

Use this instead:

while ( (c = getchar()) != EOF && isdigit(c) ) {

This was an error in the code I posted earlier. You don't want
to isdigit(EOF).
 
C

CBFalconer

T.M. Sommers said:
Use this instead:

while ( (c = getchar()) != EOF && isdigit(c) ) {

This was an error in the code I posted earlier. You don't want
to isdigit(EOF).

This is an error. He wants to find the non-digit operators also.
There is no problem passing EOF to isdigit, it will reply 'no'.
 
T

T.M. Sommers

CBFalconer said:
This is an error. He wants to find the non-digit operators also.
There is no problem passing EOF to isdigit, it will reply 'no'.

You are quite right. I was under the impression that the
argument had to be an unsigned char, but the standard also allows
EOF.
 
C

CBFalconer

T.M. Sommers said:
You are quite right. I was under the impression that the
argument had to be an unsigned char, but the standard also allows
EOF.

The error part was that you changed the logic so that it could no
longer return the various operators, skipblanks, etc. The other is
unnecessary, rather than an error.
 
T

T.M. Sommers

CBFalconer said:
The error part was that you changed the logic so that it could no
longer return the various operators, skipblanks, etc. The other is
unnecessary, rather than an error.

I cut and pasted from my own code, which used a different
variable for the input character ('c' vice 'next_token'), which
was an error, but other than that I don't see any error in the
above line (except that one should perhaps guarantee that c can
be cast to unsigned char). The loop controlled by the while is
over digits after the first, and will stop when and only when c
is not a digit. The only thing the EOF test changes is that
isdigit() will not be called for EOF, which, as you pointed out,
is unnecessary.
 
S

Simon Morgan

That is pretty clean newbie code - nice job.
Thanks.

I have two comments.

<snip>

Thanks for the heads up on the bug and ungetc and I really should know
better than hard coding the ASCII value!
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
474,432
Messages
2,571,680
Members
48,796
Latest member
Greg L.

Latest Threads

Top