I'm not really a big fan of your approach. I would instead do it like
this:
int IterateTokens(const char *string,
                  int (*cb)(void *parm, size_t offset, size_t len),
                  void *parm);
With the idea that the callback function cb would be called on each
successive token. (The purpose of void * parm is to let you use a
universal callback function, while retaining specific context to a
particular string that you are tokenizing. I.e., IterateTokens just
passes on the parm it was given to the cb function.) This allows you
to be flexible about how you consume your tokens.
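A minimal sketch of what such an iterator might look like, assuming whitespace-separated tokens; the skip/scan rules, the halt-on-negative-return convention, and the example callback are guesses at the intended contract, not anyone's actual code:

```c
#include <ctype.h>
#include <stddef.h>

/* Sketch: split on whitespace and invoke cb with the offset and length
   of each token.  Returns the number of tokens passed to cb, or a
   negative value if cb requests an early halt. */
int IterateTokens(const char *string,
                  int (*cb)(void *parm, size_t offset, size_t len),
                  void *parm)
{
    size_t i = 0;
    int count = 0;
    while (string[i] != '\0') {
        while (string[i] != '\0' && isspace((unsigned char)string[i]))
            i++;                           /* skip separators */
        if (string[i] == '\0')
            break;
        size_t start = i;
        while (string[i] != '\0' && !isspace((unsigned char)string[i]))
            i++;                           /* scan one token */
        if (cb(parm, start, i - start) < 0)
            return -1;                     /* callback halted the iteration */
        count++;
    }
    return count;
}

/* Example callback: sums the token lengths into *(size_t *)parm. */
int SumLengthsCb(void *parm, size_t offset, size_t len)
{
    (void)offset;
    *(size_t *)parm += len;
    return 0;  /* keep iterating */
}
```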
But see this expression:
$AF+MyVar+0x1234==0
"$AF" is a token in my parsing context, so all is fine. So is "+".
"MyVar", though, is not a token: on it, my ExtractToken() function
returns 0 (synonym for "no [significant] token"), and Symbol becomes
"".
At this point, my parsing loop decides that what follows the "+" must
be a variable (which it treats like a "free form" string), and so it
calls ExtractFreeform(), which extracts everything up to the next
valid token it encounters ("+" in this case).
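As described, ExtractFreeform() behaves roughly like the sketch below; the token context is simplified here to a NULL-terminated array of token strings, and all names and types are guesses, not the real interface:

```c
#include <string.h>

/* Sketch of ExtractFreeform(): copy characters from *Expression into
   Symbol until a position where one of the known tokens begins, then
   advance *Expression so parsing resumes at that token. */
void ExtractFreeform(const char **Tokens, const char **Expression, char *Symbol)
{
    const char *p = *Expression;
    size_t n = 0;
    while (*p != '\0') {
        int at_token = 0;
        for (size_t t = 0; Tokens[t] != NULL; t++)
            if (strncmp(p, Tokens[t], strlen(Tokens[t])) == 0) {
                at_token = 1;
                break;
            }
        if (at_token)
            break;          /* a known token starts here: stop */
        Symbol[n++] = *p++; /* otherwise it's free-form text */
    }
    Symbol[n] = '\0';
    *Expression = p;        /* parsing resumes at the token */
}
```

For example, on "MyVar+0x1234" with "+" among the known tokens, this would extract "MyVar" and leave parsing positioned at "+".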
Now, I could make variables like "MyVar" tokens, to avoid embarrassing
ExtractToken() and having to use this ExtractFreeform() function.
But besides not working well with my program, this approach couldn't
handle the next part of the expression: "0x1234" is not a token that
ExtractToken() will ever be able to extract, so I *have* to call
ExtractFreeform() here.
As I see it, your approach would choke on "non-tokens" that are
currently handled by a call to ExtractFreeform() following a failed
(return value == 0, Symbol is "") call to ExtractToken().
For example, if cb returns a negative number, or some other "fail
code", then you can use that to halt the iterator early (for example,
you might be able to detect that a syntax error has occurred before
tokenizing the whole string).
Or I could use it to simulate the behavior of ExtractFreeform().
But this would force the calling loop (which, with your approach,
probably wouldn't be a loop at all) to:
1) do ExtractFreeform()'s job in a way not consistent with how
IterateTokens() works (it would even have to compute where in the
string parsing has stopped, unless IterateTokens returns precisely
this information)
2) after parsing the free-form part of the expression, "resurrect"
IterateTokens to parse from after the location of the free-form stuff
This looks awfully complicated...
You can also make the return value of IterateTokens be the number of
tokens successfully passed to the callback, or a negative number to
indicate some kind of failure.
It also might not be necessary to '\0'-terminate the tokens as you
encounter them. Certainly if you were using bstrlib
(http://bstring.sf.net) you wouldn't want to. This gives you a way to
perform "tokenization by reference" if it's possible for you to avoid
copying the tokens for further processing (for example, if you have a
totally incremental parsing algorithm).
I *do* already "tokenize by reference", in a sense.
Most of the time, ExtractToken() is called with Symbol==NULL, i.e. it
doesn't copy the token anywhere. All that is needed is the int value
ExtractToken() returns: this value uniquely identifies the token most
of the time, unless it is 0, in which case the parser does have to
check Symbol to see whether it encountered "" (end of string /
unparsable token) or " " (a space, or other separator, whose token id
is 0 as well).
Look at this code:
Operator = ExtractToken(OpTokens, Expression, NULL);
if (Operator == OP_NONE) {
    if (strlen(*Expression) == 0)
        Operator = OP_END;
    else {
        Register = ExtractToken(RegTokens, Expression, NULL);
        if (Register == REG_NONE) {
            char *Symbol = malloc(strlen(*Expression) + 1);
            ExtractFreeform(OpTokens, Expression, Symbol);
            /* ... */

/* This is taken from a part of my program, but things not relevant
   to the discussion have been stripped out. */
OpTokens and RegTokens are handles to different tokenization contexts;
for simplicity, I omitted this part in the declaration I posted.
OpTokens' tokens are operators like "+" and "-", while RegTokens'
tokens are register specifiers like "$AF" and "$HL".
The algorithm basically goes:
1) Extract next token. If it's an operator, fine, otherwise
2) See if it's a register specification. If it's not even that, then
3) It must be a variable (numeric literals omitted from the code
above)
You can see that, in 1) and 2), the extracted lexeme (Symbol) isn't
copied anywhere, because such a copy is not needed to identify the
token.
Only when a token can *not* be extracted do we need to analyze an
actual string (which is extracted by ExtractFreeform(), not even
ExtractToken()).
This is not the only place or the only way in my program where I use
the tokenization function, but it should give a picture.
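For concreteness, a much simplified sketch of an ExtractToken() with the behaviour described; the types and the id scheme (a token's id is its index in a NULL-terminated table, plus one, with 0 meaning "no token") are illustrative guesses, not my actual implementation:

```c
#include <string.h>

/* Sketch: on a match, *Expression is advanced past the lexeme; the
   lexeme is copied into Symbol only when Symbol is not NULL, so most
   calls can "tokenize by reference" and pass NULL. */
int ExtractToken(const char **Tokens, const char **Expression, char *Symbol)
{
    for (size_t t = 0; Tokens[t] != NULL; t++) {
        size_t n = strlen(Tokens[t]);
        if (strncmp(*Expression, Tokens[t], n) == 0) {
            if (Symbol != NULL) {
                memcpy(Symbol, *Expression, n);
                Symbol[n] = '\0';
            }
            *Expression += n;
            return (int)(t + 1);   /* token id */
        }
    }
    if (Symbol != NULL)
        Symbol[0] = '\0';          /* "" signals no extractable token */
    return 0;                      /* id 0: no [significant] token */
}
```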
If you actually need '\0'
terminated strings without damaging the input string, then doing a
precisely sized malloc, memcpy, and storing a terminating '\0' in your
callback function is trivial and will do the trick
Not if IterateTokens() doesn't *know* where the token ends.
-- it remains for
you to then design a stack or list or vector or whatever you like to
retain all these strings (all the work being done in the callback
function).
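As a sketch of that policy, assuming the iterator reports (offset, len) pairs as proposed, a callback could make precisely sized, '\0'-terminated copies into a growable vector; all names here are illustrative, not from either of our programs:

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical context: the source string plus a growable vector of
   '\0'-terminated token copies, filled in by the callback. */
typedef struct {
    const char *source;   /* the string being tokenized */
    char      **items;
    size_t      count;
} TokenVector;

int CollectCb(void *parm, size_t offset, size_t len)
{
    TokenVector *v = parm;
    char *copy = malloc(len + 1);
    if (copy == NULL)
        return -1;                       /* fail code halts the iterator */
    memcpy(copy, v->source + offset, len);
    copy[len] = '\0';                    /* precisely sized, terminated copy */
    char **grown = realloc(v->items, (v->count + 1) * sizeof *grown);
    if (grown == NULL) {
        free(copy);
        return -1;
    }
    grown[v->count++] = copy;
    v->items = grown;
    return 0;
}
```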
The point is, that with a mechanism like this, you are flexible enough
to employ any policy of how you want your results from the process of
tokenization. It also separates the process of actual parsing from
the process of dealing with the results.
Yours is an interesting approach that can, I think, be most useful in
situations where the lexeme (that is, the token in its string form)
must often be copied, because an integral value (my "token id") cannot
easily identify it uniquely. It is perhaps even more useful when all
or most of the parsed tokens must be kept (in string form) in a list
somewhere.
In the latter case, your function would behave similarly to Morris
Dovey's function, except that the actual work of putting the extracted
tokens in a vector is done by the callback function in your approach,
instead of being embedded in the iterating token extractor.
But you haven't convinced me that your approach is superior in my
case. Using a callback function would be nice to eliminate the outer
parsing loop and let the code concentrate on analysing each token
individually;
however, the outer loop would eventually still be there, together with
other strange creatures, to handle the equivalent of
ExtractFreeform().
In addition, I also use my tokenizer to parse simple interactive
commands like "run" or "info stack", besides interpreting more complex
expressions like the example I made above. For this task, it seems to
me that the (now fairly simple) structure of my interactive command
interpreter would be sort of obfuscated by introducing a callback
function and somewhere to keep parsed tokens (or, alternatively, a
*complicated* callback function without having to keep parsed tokens
somewhere).
P.S. A question: for such a short signature as the one below, would
you find it necessary to use a standard signature marker?
by LjL
(e-mail address removed)