Token Parser

Simon Morgan · Jul 23, 2005

I'm trying to write a function to parse a Reverse Polish Notation string
from stdin and return 1 token at a time. For those of you who are unaware
an RPN string looks like this:

1 2 + 4 * 3 +

With each number being read and pushed onto a stack and, when an operator
is encountered, each number on the stack is popped off and calculated
using that operator and the result pushed onto the stack.

The problem I'm having is differentiating between for example, the +
character and the number 43. Here's what I have so far:

#include <stdio.h>

int read_token(void) {
int token, next_token;

while ((token = getchar()) != EOF) {
if (token == ' ' || token == '\n') {
continue;
}
if (token == '+' || token == '-' || token == '*' || token == '/') {
break;
} else {
while ((next_token = getchar()) != ' ' && next_token != EOF) {
token *= 10;
token += next_token;
}
break;
}
}

return token;
}

What's the conventional wisdom on overcoming the ambiguity between
characters and numbers in C?

Thanks.

Flash Gordon · Jul 23, 2005

Simon said:
I'm trying to write a function to parse a Reverse Polish Notation string
from stdin and return 1 token at a time. For those of you who are unaware
an RPN string looks like this:

1 2 + 4 * 3 +

With each number being read and pushed onto a stack and, when an operator
is encountered, each number on the stack is popped off and calculated
using that operator and the result pushed onto the stack.

The problem I'm having is differentiating between for example, the +
character and the number 43. Here's what I have so far:

A man who provides a problem description and the problem code. A rarity
that it is pleasant to encounter

#include <stdio.h>

int read_token(void) {
int token, next_token;

while ((token = getchar()) != EOF) {
if (token == ' ' || token == '\n') {
continue;
}
if (token == '+' || token == '-' || token == '*' || token == '/') {
break;
} else {

You probably want to check that it was a digit here as opposed to a
letter or some other punctuation and error if it was not. The standard
isdigit function could be of use!

while ((next_token = getchar()) != ' ' && next_token != EOF) {

Again here.

token *= 10;
token += next_token;
}
break;
}
}

return token;
}

What about explicit positive and negative numbers? e.g.
+43 -3 *

What's the conventional wisdom on overcoming the ambiguity between
characters and numbers in C?

I would say that returning a token say the best option is to return a
token value indicating a number (possibly separate tokens for integer,
real etc) and return the value separately. You could either return a
structure with both items (not unreasonable) or return one of them be
passing a pointer to where it should be stored.

Chris Croughton · Jul 23, 2005

What's the conventional wisdom on overcoming the ambiguity between
characters and numbers in C?

Well, it's obvious that you aren't returning enough information. You
need to return the type of the token (operator or number) as well as its
value (type of operator, value of the number). Generally, either return
the type and vakue separately, or decide that certain values are "out of
bounds" (negative numbers, for instance) and assign the operators in
that range.

In your case you seem to handle only positive integers, so you could say
for instance:

#define PLUS (-'+')
#define MINUS (-'-')
#define TIMES (-'*')

etc. then when you get an operator set token = -token.

Your code is also not at all safe. You are saying that anything not one
of your 'known' tokens or space is a digit, that will give you very
strange results with "abc def +" or even "123.567". You really need to
decide whether the token is:

An operator
A number
Something else

and possibly give an error if it's "something else".

Chris C

T.M. Sommers · Jul 23, 2005

Simon said:
I'm trying to write a function to parse a Reverse Polish Notation string
from stdin and return 1 token at a time. For those of you who are unaware
an RPN string looks like this:

1 2 + 4 * 3 +

With each number being read and pushed onto a stack and, when an operator
is encountered, each number on the stack is popped off and calculated
using that operator and the result pushed onto the stack.

The problem I'm having is differentiating between for example, the +
character and the number 43. Here's what I have so far:

#include <stdio.h>

int read_token(void) {
int token, next_token;

while ((token = getchar()) != EOF) {
if (token == ' ' || token == '\n') {
continue;
}
if (token == '+' || token == '-' || token == '*' || token == '/') {
break;
} else {
while ((next_token = getchar()) != ' ' && next_token != EOF) {
token *= 10;
token += next_token;
}
break;
}
}

return token;
}

There are several problems here. First, you are confusing tokens
with input characters. While a token can consist of a single
character, it is really a higher-level lexical object that has
both a type and a value. Your code tries to encode both type and
value in a single integer, but since one of your token types is
integer, you are bound to have problems. The usual solution is
to handle the token type and its value separately.

You are also confusing the characters '0'-'9' with the numbers 0-9.

You should take a look at lex or flex to get some ideas on
lexing. You should also stop by comp.compilers.

Here is a sample to give you the idea. It is hardly robust, but
should get you started.

/* --------------- snip --------------- */
#include <stdio.h>
#include <ctype.h>

#define UNKNOWN 1
#define NUMBER 2
#define OP 3

union val {
int num;
char op;
char ch;
};

int token(union val *val)
{
int tok = 0;
int c;

while ( (c = getchar()) != EOF ) {
if ( isspace(c) ) {
/* do nothing */
}
else if ( isdigit(c) ) {
tok = NUMBER;
val->num = c - '0';
while ( isdigit(c = getchar()) ) {
val->num *= 10;
val->num += c - '0';
}
if ( !isdigit(c) ) {
ungetc(c, stdin);
}
break;
}
else if ( c == '+' || c == '*' || c == '-' || c == '/' ) {
tok = OP;
val->op = c;
break;
}
else {
tok = UNKNOWN;
val->ch = c;
break;
}
}

return tok;
}

int main(int argc, char *argv[])
{
union val val;
int tok;

while ( (tok = token(&val)) ) {
switch (tok) {
case NUMBER:
printf("NUMBER = %d\n", val.num);
break;
case OP:
printf("OP = %c\n", val.op);
break;
case UNKNOWN:
printf("unexpected character: %c\n", val.ch);
break;
default:
printf("unknown token type: %d\n", tok);
}
}

return 0;
}
/* --------------- snip --------------- */

CBFalconer · Jul 24, 2005

T.M. Sommers said:
.... snip ...

There are several problems here. First, you are confusing tokens
with input characters. While a token can consist of a single
character, it is really a higher-level lexical object that has
both a type and a value. Your code tries to encode both type and
value in a single integer, but since one of your token types is
integer, you are bound to have problems. The usual solution is
to handle the token type and its value separately.

You are also confusing the characters '0'-'9' with the numbers 0-9.

You should take a look at lex or flex to get some ideas on
lexing. You should also stop by comp.compilers.

Here is a sample to give you the idea. It is hardly robust, but
should get you started.

/* --------------- snip --------------- */

Here is the beginnings of a simplified lexer I am designing. When
done I expect it will be usable for various languages (if I ever
get around to finishing it). The accompanying code is around 350
lines so far, so I am refraining from inflicting it on the
newsgroup. I could be persuaded to publish the code if someone
really wants it. Besides it doesn't work to my satisfaction yet.
When done it will return reals as strings with an appropriate
identifier, and I am not yet anywhere near satisfied about this.
So far it also doesn't handle escaped characters, trigraphs, and
such. The lexer will be the sole communication between the input
file and the program.

It is intended to work with languages that can resolve things on
the basis of one symbol and one char lookahead.

/* File lexer.h
This is intended to be a language adjustable module that will
open a specified file on initialization, and then return tokens
read from that file. The tokens may be identifiers, reserved
words, punctuation or EOF. They are accompanied by a
'spelling'.

The initialization phase specifies the language, which in turn
specifies the form of comments (to be ignored).

Users of the lexer can detect identifiers that are actually
reserved words, and take appropriate action.
*/

/* opaque type for internal use */
struct lexcontrol;

/* The user needs to free this and the contained ident */
typedef struct lextoken {
int nextch;
int sym; /* Also defines string delimiter */
int idlgh; /* A non-zero in these signals */
int numlgh; /* that an id, number, or string */
int stglgh; /* is present in ident */
char *ident; /* always return fresh in a new token */
} lextoken, *lextokenp;

typedef enum lang {unknown, C89, C99, Pascal} lexlang;

/* Initialize the system with an opened text file */
struct lexcontrol* lexinit(FILE *f, lexlang lang);

/* Return the next token, which is a pointer to a record */
lextokenp lexnext(struct lexcontrol* lx);

Neil Ferguson · Jul 24, 2005

Simon Morgan said:
I'm trying to write a function to parse a Reverse Polish Notation string
from stdin and return 1 token at a time. For those of you who are unaware
an RPN string looks like this:

1 2 + 4 * 3 +

With each number being read and pushed onto a stack and, when an operator
is encountered, each number on the stack is popped off and calculated
using that operator and the result pushed onto the stack.

The problem I'm having is differentiating between for example, the +
character and the number 43. Here's what I have so far:

#include <stdio.h>

int read_token(void) {
int token, next_token;

while ((token = getchar()) != EOF) {
if (token == ' ' || token == '\n') {
continue;
}
if (token == '+' || token == '-' || token == '*' || token == '/') {
break;
} else {
while ((next_token = getchar()) != ' ' && next_token != EOF) {
token *= 10;
token += next_token;
}
break;
}
}

return token;
}

What's the conventional wisdom on overcoming the ambiguity between
characters and numbers in C?

Thanks.

I suggest that as an approach to implementing this problem you look into
using a state machine. It's an elegant way to solve problems that are
combinatorily restricted.

Simon Morgan · Jul 25, 2005

What's the conventional wisdom on overcoming the ambiguity between
characters and numbers in C?

Thanks to everybody for their help, some of it is a bit above me at the
moment because I'm still learning the language and programming in general
(who isn't?) but it's food for thought.

I decided to go with the passing in a pointer and setting it to indicate
the type approach because structures haven't been covered in the book I'm
working from yet.

The code I have is far from perfect but obviously works a lot better than
what I had before. For those who are interested:

#include <stdio.h>

int read_token(char *type) {
int token, next_token;

while ((token = getchar()) != EOF) {
if (token == '+' || token == '-' || token == '*' || token == '/') {
*type = 'c';
break;
}
if (isdigit(token)) {
*type = 'n';
token -= 48;
while (isdigit(next_token = getchar())) {
token *= 10;
token += next_token - 48;
}
break;
}
}

return token;
}

Flash Gordon · Jul 25, 2005

Simon said:
Thanks to everybody for their help, some of it is a bit above me at the
moment because I'm still learning the language and programming in general
(who isn't?) but it's food for thought.

I decided to go with the passing in a pointer and setting it to indicate
the type approach because structures haven't been covered in the book I'm
working from yet.

The code I have is far from perfect but obviously works a lot better than
what I had before. For those who are interested:

#include <stdio.h>

int read_token(char *type) {
int token, next_token;

while ((token = getchar()) != EOF) {
if (token == '+' || token == '-' || token == '*' || token == '/') {
*type = 'c';
break;
}
if (isdigit(token)) {
*type = 'n';
token -= 48;

token -= '0'; /* ASCII is not the only character set */

while (isdigit(next_token = getchar())) {
token *= 10;
token += next_token - 48;

token += next_token - '0'; /* ASCII is not the only
character set */

}
break;
}

else {
/* Put some code here to handle bad input */
}

CBFalconer · Jul 25, 2005

Simon said:
.... snip ...

The code I have is far from perfect but obviously works a lot
better than what I had before. For those who are interested:

#include <stdio.h>

int read_token(char *type) {
int token, next_token;

while ((token = getchar()) != EOF) {
if (token == '+' || token == '-' || token == '*' || token == '/') {
*type = 'c';
break;
}
if (isdigit(token)) {
*type = 'n';
token -= 48;
while (isdigit(next_token = getchar())) {
token *= 10;
token += next_token - 48;
}
break;
}
}
return token;
}

That is pretty clean newbie code - nice job. I have two comments.
First, get rid of the magic number 48. You should use '0' here,
which will not vary with the underlying char set. Second is exit
conditions.

You are basically ignoring all input except the stuff that
interests you. When you get a symbol such as '+' you return that,
and whatever follows is available for a further call. That means
you can resolve input such as "+123" correctly. But what about
"123+4", for example. You have discarded the char that follows 123
forever.

C has a handy mechanism for handling this, called ungetc. You use
it to put a char back for future reading. It is limited to a
single level, i.e. you can't push back more than one char between
gets. So add the line:

ungetc(next_token, stdin);

just before the break in the number accumulation routine. Remember
that getchar is really a call on getc(stdin) without bothering to
specify stdin file.

Other comments:

Later you can worry about detecting overflows in numeric input.

You may find that interchanging the methods of returning type and
value is more useful later. You will generally base the next
decision on the returned type, not the value. So the prototype may
work better as:

int next_token_type(int *token_read);

T.M. Sommers · Jul 25, 2005

Simon said:
I decided to go with the passing in a pointer and setting it to indicate
the type approach because structures haven't been covered in the book I'm
working from yet.

You might find it worthwhile to read ahead a bit.

The code I have is far from perfect but obviously works a lot better than
what I had before. For those who are interested:

#include <stdio.h>

Don't forget:

#include said:
int read_token(char *type) {
int token, next_token;

while ((token = getchar()) != EOF) {
if (token == '+' || token == '-' || token == '*' || token == '/') {
*type = 'c';
break;
}
if (isdigit(token)) {
*type = 'n';
token -= 48;
while (isdigit(next_token = getchar())) {

Use this instead:

while ( (c = getchar()) != EOF && isdigit(c) ) {

This was an error in the code I posted earlier. You don't want
to isdigit(EOF).

CBFalconer · Jul 25, 2005

T.M. Sommers said:
Use this instead:

while ( (c = getchar()) != EOF && isdigit(c) ) {

This was an error in the code I posted earlier. You don't want
to isdigit(EOF).

This is an error. He wants to find the non-digit operators also.
There is no problem passing EOF to isdigit, it will reply 'no'.

T.M. Sommers · Jul 25, 2005

CBFalconer said:
This is an error. He wants to find the non-digit operators also.
There is no problem passing EOF to isdigit, it will reply 'no'.

You are quite right. I was under the impression that the
argument had to be an unsigned char, but the standard also allows
EOF.

CBFalconer · Jul 26, 2005

T.M. Sommers said:
You are quite right. I was under the impression that the
argument had to be an unsigned char, but the standard also allows
EOF.

The error part was that you changed the logic so that it could no
longer return the various operators, skipblanks, etc. The other is
unnecessary, rather than an error.

T.M. Sommers · Jul 26, 2005

CBFalconer said:
The error part was that you changed the logic so that it could no
longer return the various operators, skipblanks, etc. The other is
unnecessary, rather than an error.

I cut and pasted from my own code, which used a different
variable for the input character ('c' vice 'next_token'), which
was an error, but other than that I don't see any error in the
above line (except that one should perhaps guarantee that c can
be cast to unsigned char). The loop controlled by the while is
over digits after the first, and will stop when and only when c
is not a digit. The only thing the EOF test changes is that
isdigit() will not be called for EOF, which, as you pointed out,
is unnecessary.

Simon Morgan · Jul 26, 2005

That is pretty clean newbie code - nice job.
Thanks.

I have two comments.

<snip>

Thanks for the heads up on the bug and ungetc and I really should know
better than hard coding the ASCII value!

Lexical Analysis on C++	1	Oct 31, 2023
Compiling error: expected identifier or ‘(’ before ‘{’ token	17	Mar 10, 2010
Database Manager: A C++ Console Application	14	May 12, 2025
Expected Token Density in Random Stream	0	Dec 7, 2011
Hello I am learning how to code and I tried making a calculator with HTML and js with some CSS I am stuck at thing, Like the screen value is	0	Mar 13, 2025
Help in fixing this code	8	May 8, 2022
embarrassing spaghetti code needs stylistic advice	72	Mar 19, 2009
Rich Text Format (RTF) Document Builder in C++: Code and Features	0	Sep 28, 2025

Token Parser

Simon Morgan

Flash Gordon

Chris Croughton

T.M. Sommers

CBFalconer

Neil Ferguson

Simon Morgan

Flash Gordon

CBFalconer

T.M. Sommers

CBFalconer

T.M. Sommers

CBFalconer

T.M. Sommers

Simon Morgan

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads