I'm not really a big fan of your approach. I would instead do it like
this:
int IterateTokens(const char *string,
                  int (*cb)(void *parm, size_t offset, size_t len),
                  void *parm);
With the idea that the callback function cb would be called on each
successive token. (The purpose of void * parm is to let you use a
universal callback function, while retaining specific context to a
particular string that you are tokenizing. I.e., IterateTokens just
passes on the parm it was given to the cb function.) This allows you
to be flexible about how you consume your tokens.
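A minimal sketch of what such an iterator might look like, assuming whitespace-separated tokens; the skip/scan rules, the halt-on-negative-return convention, and the example callback are guesses at the intended contract, not anyone's actual code:

```c
#include <ctype.h>
#include <stddef.h>

/* Sketch: split on whitespace and invoke cb with the offset and length
   of each token.  Returns the number of tokens passed to cb, or a
   negative value if cb requests an early halt. */
int IterateTokens(const char *string,
                  int (*cb)(void *parm, size_t offset, size_t len),
                  void *parm)
{
    size_t i = 0;
    int count = 0;
    while (string[i] != '\0') {
        while (string[i] != '\0' && isspace((unsigned char)string[i]))
            i++;                           /* skip separators */
        if (string[i] == '\0')
            break;
        size_t start = i;
        while (string[i] != '\0' && !isspace((unsigned char)string[i]))
            i++;                           /* scan one token */
        if (cb(parm, start, i - start) < 0)
            return -1;                     /* callback halted the iteration */
        count++;
    }
    return count;
}

/* Example callback: sums the token lengths into *(size_t *)parm. */
int SumLengthsCb(void *parm, size_t offset, size_t len)
{
    (void)offset;
    *(size_t *)parm += len;
    return 0;  /* keep iterating */
}
```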
But see this expression:
$AF+MyVar+0x1234==0
"$AF" is a token in my parsing context, so all is fine. So is "+".
"MyVar", though, is not a token: on it, my ExtractToken() function
returns 0 (synonym for "no [significant] token"), and Symbol becomes
"".
At this point, my parsing loop decides that what follows the "+" must
be a variable (which it treats like a "free form" string), and so it
calls ExtractFreeform(), which extracts everything up to the next
valid token it encounters ("+" in this case).
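As described, ExtractFreeform() behaves roughly like the sketch below; the token context is simplified here to a NULL-terminated array of token strings, and all names and types are guesses, not the real interface:

```c
#include <string.h>

/* Sketch of ExtractFreeform(): copy characters from *Expression into
   Symbol until a position where one of the known tokens begins, then
   advance *Expression so parsing resumes at that token. */
void ExtractFreeform(const char **Tokens, const char **Expression, char *Symbol)
{
    const char *p = *Expression;
    size_t n = 0;
    while (*p != '\0') {
        int at_token = 0;
        for (size_t t = 0; Tokens[t] != NULL; t++)
            if (strncmp(p, Tokens[t], strlen(Tokens[t])) == 0) {
                at_token = 1;
                break;
            }
        if (at_token)
            break;          /* a known token starts here: stop */
        Symbol[n++] = *p++; /* otherwise it's free-form text */
    }
    Symbol[n] = '\0';
    *Expression = p;        /* parsing resumes at the token */
}
```

For example, on "MyVar+0x1234" with "+" among the known tokens, this would extract "MyVar" and leave parsing positioned at "+".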
Now, I could make variables like "MyVar" tokens, to avoid embarrassing
ExtractToken() and having to use this ExtractFreeform() function.
But besides not working well with my program, this approach couldn't
handle the next part of the expression: "0x1234" is not a token that
ExtractToken() will ever be able to extract, so I *have* to call
ExtractFreeform() here.
As I see it, your approach would choke on "non-tokens" that are
currently handled by a call to ExtractFreeform() following a failed
(return value == 0, Symbol is "") call to ExtractToken().
For example, if cb returns a negative number, or some other "fail
code", then you can use that to halt the iterator early (for example,
you might be able to detect that a syntax error has occurred before
tokenizing the whole string).
Or I could use it to simulate the behavior of ExtractFreeform().
But this would force the calling loop (which, with your approach,
probably wouldn't be a loop at all) to:
1) do ExtractFreeform()'s job in a way not consistent with how
IterateTokens() works (it would even have to compute where in the
string parsing has stopped, unless IterateTokens returns precisely
this information)
2) after parsing the free-form part of the expression, "resurrect"
IterateTokens to parse from after the location of the free-form stuff
This looks awfully complicated...
You can also make the return value of IterateTokens be the number of
tokens successfully passed to the callback, or a negative number to
indicate some kind of failure.
It also might not be necessary to '\0'-terminate the tokens as you
encounter them. Certainly if you were using bstrlib
(http://bstring.sf.net) you wouldn't want to. This gives you a way to
perform "tokenization by reference" if it's possible for you to avoid
copying the tokens for further processing (for example, if you have a
totally incremental parsing algorithm).
I *do* already "tokenize by reference", in a sense.
Most of the time, ExtractToken() is called with Symbol==NULL, i.e. it
doesn't copy the token anywhere. All that is needed is the int value
ExtractToken() returns: this value uniquely identifies the token most
of the time, unless it is 0, in which case the parser does have to
check Symbol to see whether it encountered "" (end of string /
unparsable token) or " " (a space, or other separator, whose token id
is 0 as well).
Look at this code:
Operator = ExtractToken(OpTokens, Expression, NULL);
if (Operator == OP_NONE) {
    if (strlen(*Expression) == 0)
        Operator = OP_END;
    else {
        Register = ExtractToken(RegTokens, Expression, NULL);
        if (Register == REG_NONE) {
            char *Symbol = malloc(strlen(*Expression) + 1);
            ExtractFreeform(OpTokens, Expression, Symbol);
            /* ... */

/* This is taken from a part of my program, but things not relevant
   to the discussion have been stripped out. */
OpTokens and RegTokens are handles to different tokenization contexts;
for simplicity, I omitted this part in the declaration I posted.
OpTokens' tokens are operators like "+" and "-", while RegTokens'
tokens are register specifiers like "$AF" and "$HL".
The algorithm basically goes:
1) Extract next token. If it's an operator, fine, otherwise
2) See if it's a register specification. If it's not even that, then
3) It must be a variable (numeric literals omitted from the code
above)
You can see that, in 1) and 2), the extracted lexeme (Symbol) isn't
copied anywhere, because such a copy is not needed to identify the
token.
Only when a token can *not* be extracted do we need to analyze an
actual string (which is extracted by ExtractFreeform(), not even
ExtractToken()).
This is not the only place or the only way in my program where I use
the tokenization function, but it should give a picture.
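For concreteness, a much simplified sketch of an ExtractToken() with the behaviour described; the types and the id scheme (a token's id is its index in a NULL-terminated table, plus one, with 0 meaning "no token") are illustrative guesses, not my actual implementation:

```c
#include <string.h>

/* Sketch: on a match, *Expression is advanced past the lexeme; the
   lexeme is copied into Symbol only when Symbol is not NULL, so most
   calls can "tokenize by reference" and pass NULL. */
int ExtractToken(const char **Tokens, const char **Expression, char *Symbol)
{
    for (size_t t = 0; Tokens[t] != NULL; t++) {
        size_t n = strlen(Tokens[t]);
        if (strncmp(*Expression, Tokens[t], n) == 0) {
            if (Symbol != NULL) {
                memcpy(Symbol, *Expression, n);
                Symbol[n] = '\0';
            }
            *Expression += n;
            return (int)(t + 1);   /* token id */
        }
    }
    if (Symbol != NULL)
        Symbol[0] = '\0';          /* "" signals no extractable token */
    return 0;                      /* id 0: no [significant] token */
}
```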
If you actually need '\0'
terminated strings without damaging the input string, then doing a
precisely sized malloc, memcpy, and storing a terminating '\0' in your
callback function is trivial and will do the trick
Not if IterateTokens() doesn't *know* where the token ends.
-- it remains for
you to then design a stack or list or vector or whatever you like to
retain all these strings (all the work being done in the callback
function).
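As a sketch of that policy, assuming the iterator reports (offset, len) pairs as proposed, a callback could make precisely sized, '\0'-terminated copies into a growable vector; all names here are illustrative, not from either of our programs:

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical context: the source string plus a growable vector of
   '\0'-terminated token copies, filled in by the callback. */
typedef struct {
    const char *source;   /* the string being tokenized */
    char      **items;
    size_t      count;
} TokenVector;

int CollectCb(void *parm, size_t offset, size_t len)
{
    TokenVector *v = parm;
    char *copy = malloc(len + 1);
    if (copy == NULL)
        return -1;                       /* fail code halts the iterator */
    memcpy(copy, v->source + offset, len);
    copy[len] = '\0';                    /* precisely sized, terminated copy */
    char **grown = realloc(v->items, (v->count + 1) * sizeof *grown);
    if (grown == NULL) {
        free(copy);
        return -1;
    }
    grown[v->count++] = copy;
    v->items = grown;
    return 0;
}
```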
The point is, that with a mechanism like this, you are flexible enough
to employ any policy of how you want your results from the process of
tokenization. It also separates the process of actual parsing from
the process of dealing with the results.
Yours is an interesting approach that can, I think, be most useful in
situations where the lexeme (that is, the token in its string form)
must often be copied, because an integral value (my "token id") cannot
easily identify it uniquely. It is perhaps even more useful when all
or most of the parsed tokens must be kept (in string form) in a list
somewhere.
In the latter case, your function would behave similarly to Morris
Dovey's function, except that the actual work of putting the extracted
tokens in a vector is done by the callback function in your approach,
instead of being embedded in the iterating token extractor.
But you haven't convinced me that your approach is superior in my
case. Using a callback function would be nice to eliminate the outer
parsing loop and let the code concentrate on analysing each token
individually;
however, the outer loop would eventually still be there, together with
other strange creatures, to handle the equivalent of
ExtractFreeform().
In addition, I also use my tokenizer to parse simple interactive
commands like "run" or "info stack", besides interpreting more complex
expressions like the example I made above. For this task, it seems to
me that the (now fairly simple) structure of my interactive command
interpreter would be sort of obfuscated by introducing a callback
function and somewhere to keep parsed tokens (or, alternatively, a
*complicated* callback function without having to keep parsed tokens
somewhere).
P.S. A question: for such a short signature as the one below, would
you find it necessary to use a standard signature marker?
by LjL
(e-mail address removed)