getc and ungetc

Bill Cunningham

Would getc and ungetc be the best and simplest way to parse expressions for a parser?

Bill
 
Trent Buck

Quoth Bill Cunningham on or about 2004-11-18:
Would getc and ungetc be the best and simplest way to parse expressions for a parser?

ungetc is only guaranteed to push back one character, so it can't portably be applied more than once `in a row' (i.e. sequentially). I suspect that makes it a rather unsuitable function for a parser.

-t
 
Chris Barts

Bill Cunningham wrote:
| Would getc and ungetc be the best and simplest way to parse
| expressions for a parser?
|

It's usually easier to do what K&R did in "The C Programming Language"
(when they designed the RPN calculator): read in a block of text, then
define your own getchar()/ungetchar() functions to push and pop
characters on and off that buffer. You can read in more text (a block
at a time, for efficiency and convenience) when your getchar() function
tries to read beyond the end of the buffer.

You need to write a bit more code yourself, but it's fairly trivial code
and you can reuse it in any other parser you write.
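
A minimal sketch of that idea, along the lines of K&R's getch()/ungetch()
pair (the names, the pushback depth, and reading a character at a time
from stdin rather than a whole block are my own simplifications):

#include <stdio.h>

#define PUSHBACK_MAX 64                /* arbitrary pushback depth */

static char pushback[PUSHBACK_MAX];    /* characters pushed back */
static int  npushed = 0;

/* Get a (possibly pushed-back) character. */
int getch(void)
{
    return (npushed > 0) ? pushback[--npushed] : getchar();
}

/* Push a character back onto the input. */
void ungetch(int c)
{
    if (npushed >= PUSHBACK_MAX)
        fprintf(stderr, "ungetch: too many characters\n");
    else
        pushback[npushed++] = (char)c;
}

The parser then calls getch()/ungetch() everywhere it would have called
getc()/ungetc(), and is no longer limited to a single character of pushback.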
 
Malcolm

Trent Buck said:
ungetc is only guaranteed to push back one character, so it can't portably be applied more than once `in a row' (i.e. sequentially). I suspect that makes it a rather unsuitable function for a parser.
It depends on your parser design.
Most simple parsers used for things like computer languages divide the input
stream into tokens, and then parse from left to right with one token of
"lookahead". So if your tokens are single characters then getc() and ungetc()
may be adequate.
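
As a small illustration of that single-character-lookahead style (my own
example, nothing beyond getc/ungetc from the standard library):

#include <stdio.h>
#include <ctype.h>

/* Read an unsigned integer with getc(), then push the first non-digit
   back with ungetc() so the caller sees it again: one character of
   lookahead is all this needs. */
long read_number(FILE *fp)
{
    long value = 0;
    int c;

    while ((c = getc(fp)) != EOF && isdigit(c))
        value = value * 10 + (c - '0');

    if (c != EOF)
        ungetc(c, fp);

    return value;
}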
Of course if you have ambitions to build a natural language parser, or even
some simple grammars with unusual characteristics, then this scheme won't
work, and you will need some method of scanning up and down many tokens on
input.
 
SM Ryan

# Of course if you have ambitions to build a natural language parser, or even
# some simple grammars with unusual characteristics, then this scheme won't
# work, and you will need some method of scanning up and down many tokens on
# input.

Such a parser runs the risk of exponential running time, while a tabular
parser doesn't need to back up and has cubic time worst case. ungetc has
at most marginal usability for some types of lexical scanners.
 
Bill Cunningham

SM Ryan said:
# Of course if you have ambitions to build a natural language parser, or even
# some simple grammars with unusual characteristics, then this scheme won't
# work, and you will need some method of scanning up and down many tokens on
# input.

Such a parser runs the risk of exponential running time, while a tabular
parser doesn't need to back up and has cubic time worst case. ungetc has
at most marginal usability for some types of lexical scanners.
I have K&R 2. What about pointing fgetc at stdin and using it instead of
getc?

Bill
 
Bill Cunningham

Bill Cunningham said:
tokens
I have K&R 2. What about pointing fgetc at stdin and using it instead of
getc?

Bill

I've heard of top down recursive parsers.
Bill
 
Herbert Rosenau

Would getc and ungetc be the best and simplest way to parse expressions for a parser?

Yes. But be aware that ungetc can only push back ONE char at a time.

On the other hand, you can easily extend getc and ungetc, with a macro or
a function, so that you can push back more than one char. Your UNGETC()
would accept a number of chars to give back, and your GETC() would return
the chars UNGETC() received, in reverse order, before it reads new chars
from the stream itself.
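
A rough sketch of such wrappers (the names GETC/UNGETC come from the
description above; the stack size and the one-char-per-call interface are
my own choices):

#include <stdio.h>

#define UNGET_MAX 32                   /* arbitrary pushback depth */

static int unget_stack[UNGET_MAX];
static int unget_count = 0;

/* Push back a character; repeated calls stack up, and GETC() hands
   them back in reverse order. EOF is silently refused. */
void UNGETC(int c)
{
    if (c != EOF && unget_count < UNGET_MAX)
        unget_stack[unget_count++] = c;
}

/* Return pushed-back characters first, then read the stream itself. */
int GETC(FILE *fp)
{
    return (unget_count > 0) ? unget_stack[--unget_count] : getc(fp);
}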

A slightly tricky trick is to unget a char that you never actually read -
which is legal, since nothing forbids it and ungetc takes the char to push
back as an argument. Just be sure you never try to unget EOF; that won't
work. This can be very useful when you have a long list of keyword
delimiters with the same meaning for a long list of similar keywords.

Your parser may convert keywords or key characters into tokens and save the
tokens until all or part of the input stream has been read, and then work
on the generated tokens - or it may work in whatever other way you think
matches your requirements.

getc() (in conjunction with ungetc()) gives you the strongest control over
the stream you could ever need. When needed, you can simply count the
number of chars, words, lines... read in as a side effect, and reset those
counters as needed. You avoid babysitting buffers - you don't even need one.

Build your parser as a state machine and you can reuse the same code again
and again, apart from the small number of statements needed to handle each
specific (sub)state. You get high flexibility, since you can easily extend
the parser's functionality by creating a new (sub)state. That makes
maintenance easy.
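
A bare-bones skeleton of that state-machine shape (the states and
transitions are placeholders for a real grammar, not a working tokenizer):

#include <stdio.h>
#include <ctype.h>

enum state { IN_SPACE, IN_WORD, IN_NUMBER };

void scan(FILE *fp)
{
    enum state st = IN_SPACE;
    int c;

    while ((c = getc(fp)) != EOF) {
        switch (st) {
        case IN_SPACE:                      /* between tokens */
            if (isdigit(c))       st = IN_NUMBER;
            else if (!isspace(c)) st = IN_WORD;
            break;
        case IN_WORD:                       /* inside a word */
            if (isspace(c))       st = IN_SPACE;
            break;
        case IN_NUMBER:                     /* inside a number */
            if (!isdigit(c)) {
                ungetc(c, fp);              /* re-dispatch this char */
                st = IN_SPACE;
            }
            break;
        }
        /* each (sub)state handles only its own few statements */
    }
}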
 
Malcolm

Bill Cunningham said:
I've heard of top down recursive parsers.
What is important, in terms of the tokenizer, is that the parser be
left-to-right with only one token of lookahead. Most practical parsers are
in this class.

Unfortunately, using getc / ungetc means that the tokens are constrained to
be single characters. This may be OK for a very simple application, but if
the tokens are naturally several characters long it will be a nuisance. You
can do it: for instance, if you have a mathematical function tan() you could
build it up from the letters 't', 'a' and 'n' rather than reading it in as a
single token TAN. But it is generally a lot easier to do the lexical
analysis before the parse proper.
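
A sketch of what doing the lexical analysis first might look like for that
tan() example (the token names and buffer size are my own):

#include <stdio.h>
#include <ctype.h>
#include <string.h>

enum token { TOK_TAN, TOK_IDENT, TOK_NUMBER, TOK_CHAR, TOK_EOF };

static char lexeme[64];                /* text of the current token */

/* Return the next multi-character token; getc/ungetc supply the one
   character of lookahead needed to find where a token ends. */
enum token next_token(FILE *fp)
{
    int c;
    size_t n = 0;

    while ((c = getc(fp)) != EOF && isspace(c))
        ;                              /* skip whitespace */
    if (c == EOF)
        return TOK_EOF;

    if (isalpha(c)) {                  /* identifier or keyword */
        do {
            if (n < sizeof lexeme - 1)
                lexeme[n++] = (char)c;
        } while ((c = getc(fp)) != EOF && isalpha(c));
        if (c != EOF)
            ungetc(c, fp);             /* push back the lookahead char */
        lexeme[n] = '\0';
        return strcmp(lexeme, "tan") == 0 ? TOK_TAN : TOK_IDENT;
    }

    if (isdigit(c)) {                  /* number */
        do {
            if (n < sizeof lexeme - 1)
                lexeme[n++] = (char)c;
        } while ((c = getc(fp)) != EOF && isdigit(c));
        if (c != EOF)
            ungetc(c, fp);
        lexeme[n] = '\0';
        return TOK_NUMBER;
    }

    lexeme[0] = (char)c;               /* single-character token, e.g. '(' */
    lexeme[1] = '\0';
    return TOK_CHAR;
}

The parser proper then works on TOK_TAN, TOK_NUMBER and so on, and never
has to worry about individual letters.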
 
