Michael said:
[snip 115 lines quoted without comment]
Learn to trim your posts, please.
*What* isn't necessarily *what*? Part of providing context is making
sure it's specific.
lexers are sometimes quite complex
And sometimes quite simple.
and for something
as simplistic as this a simplistic solution would work just the same.
Obviously it wouldn't "work just the same" unless it was the same. A
simplistic solution might well work, though since we don't have a good
description of the extent of the OP's actual needs we can't determine
how simple a correct solution might be.
Using an existing implementation would also work. So would writing the
program using a language with high-level constructs for parsing. The
fact that a solution would work is no recommendation.
[snip example of brute-force parsing]
This too is easy. Our subject would be the easiest to parse, however at
the same time it will be somewhat difficult due to reasons mentioned by
others in this thread. This is where we will manipulate strtok() to our
needs. We have three "parameters" in the above line and we need to
separate them. To keep things simple (we all like KISS) we will now
I prefer correct, robust, and maintainable over simple, thank you anyway.
That's why I recommended a proper solution rather than an ad hoc brute
force one.
remove yet another parameter to make our task a little less cumbersome.
Depending on the implementation, recipient will either be in the form
joeuser or (e-mail address removed).
That was not specified in the OP's request. It might be implied by,
say, the SMTP RFC, but we have no way of knowing whether that's
applicable.
Since there will never be any spaces in a recipient,
This is precisely the sort of assumption that makes for a fragile
implementation.
we can remove that part as well. Now we are left with the
following:
"body" "subject"
Much easier to handle now. The subject will be shorter than the body in
most cases, so let's take it out of the equation as well. Presuming that
all double quotes are escaped, this will be somewhat easy.
Not with strtok. You claim you're going to "manipulate strtok() to
[your] needs", but in your sample code you don't use it - because, as
I noted, it's the wrong approach.
Once we are
done with that we simply take of the first " and the last " and we have
our body. Below is some sample code to help illustrate this.
Congratulations. You've just created a awkward, domain-specific,
inflexible, fragile parser.
I know that using fixed sizes on the strings is ugly - I would never do
this in production code, but for illustrative purposes:
void send_message(char *data)
If you've already determined that the command is SEND, why don't you
know where its arguments start? Why are you passing in a pointer to
the command at all?
{
char *tokenptr; /* strtok() */
Which you never call...
unsigned int loc; /* placeholder/counter */
char recipient[512];
char subject[512];
/* trim off the command "SEND" */
for(loc = 0; data[loc] != ' '; loc++) /* get to our destination */;
If there is no space, you've just invoked UB. Sometimes input isn't
well-formed, and one job of a parser is to handle that.
loc++; /* move to the end of the space */
memmove(data, &data[loc], strlen(&data[loc]) + 1);
Why bother shifting the data? Keeping a pointer or offset to the
arguments suffices.
/* now we have: recipient "body" "subject" */
memset(recipient, '\0', 512);
Whatever for? You're going to overwrite those bytes in a moment
anyway. And if you must clear it, what's wrong with just assigning
{0} to it when you define it? And if you must use memset, why are
you passing in the length as a magic number?
Yes, yes, this is "example" code. Examples have a way of migrating
into production; those who take the time to write them well have
fewer problems to solve down the road.
(I also don't see much point in using '\0' rather than 0 for the
second argument, but that's a matter of style.)
for(loc = 0; data[loc] != ' '; loc++) /* get to our destination */;
As above, plus you've now reified your assumption about no spaces in
the recipient argument. If that were part of the grammar, you could
either support a quoted spacy string for recipient, or reject it as
malformed data, for free - it'd be handled automatically by the lexer
and parser.
That's the difference between production-quality code and ad hoc
messes.
loc++;
strncpy(recipient, data, loc);
Ugh. strncpy is rarely the right solution, either; it's nearly as
broken as strtok.
First problem: if loc > sizeof recipient, you've just invoked UB.
Second problem: if loc == sizeof recipient, you now have an
unterminated string in recipient. (You only have a terminated one
in there if loc < sizeof recipient because of the wasteful earlier
memset; better to just do "recipient[loc] = 0;" after copying the
data in.)
Third problem: if your input is well-formed, then there is no null
character in the first loc bytes of data, so strncpy buys you
nothing. It's just memcpy with a pointless extra test.
In general, strncpy is the Wrong Thing anyway. Heathfield's
Observation applies: most programs need to know if they're truncating
the source, so they need to know whether it was larger than the
destination anyway, so they might as well call strlen and then memcpy
the appropriate length. And strncpy's behavior when source is smaller
than destination is infelicitous in most circumstances, as it wastes
cycles and potentially spoils locality of reference and otherwise
messes with good memory management.
memmove(data, &data[loc], strlen(&data[loc]) + 1);
Now we've shifted the entire body twice. No doubt we could find a
less efficient approach, but it'd take some thought.
/* now we have: "body" "subject" */
for(loc = strlen(data) - 1; data[loc] != '"' && data[loc - 1] != '\\';
loc--) /* get to the first unescaped quote in the subject */ ;
Now your assumption about how quotation marks are escaped is reified
in an obscure line of code in the middle of a function. Changing or
extending it is a maintenance nightmare. And if you're supporting
multiple commands that have to handle this syntax, you have to make
those changes in multiple functions, because you've pushed your parsing
down into the handling of each specific function - duplicating effort
and errors.
memset(subject, '\0', 512);
strncpy(subject, &data[loc], strlen(&data[loc]) - 1);
data[loc] = '\0';
Here the memset is completely superfluous - it's not even serving as
an inefficient way of terminating the string.
/* all we have left now is the body. remove the first and last " */
data[strlen(data) - 1] = '\0';
memmove(data, &data[1], strlen(&data[1]) + 1);
Ouch! The body is moved for a third time. For this one simple task
you've managed to do three potentially very large buffer copies. (I
would hate to see this try to process a bunch of the huge MIME
multipart email messages people like to throw around these days, with
whopping binary attachments in Base64 or uuencode.)
And a couple of unnecessary calls to strlen, while we're at it. The
function should already know how long the body is - it should have
known it as soon as it parsed it out. There's never any reason to
determine it again except sheer laziness.
/* subject now contains the subject, recipient now contains the
recipient, and data now contains the body. pretty simple. */
Pretty ghastly is what it is.
The code above is certainly not perfect, but it certainly works
(untested, but should work - in theory).
Motto!
It eliminates the need to use
strtok() and it should be just as fast (maybe a little slower, but not
enough to matter for something of this size) as strtok().
Since strtok() isn't applicable to this problem, that's a meaningless
comparison. And considering the number of buffer copies and scans
this code does, I wouldn't be entering it in any speed contests if I
were you.
Ad hoc brute-force parsing has its place, I suppose, in trivial q&d
one-offs and the like. I took this to be a serious question, deserving
a serious answer, and the serious answer is to write a real parser.
Anything else is just a disaster waiting to happen - and by the time it
satisfies all the requirements, very likely more work than doing it
correctly in the first place.