Does strtok require a non-null token?

ryampolsky

I'm using strtok to break apart a colon-delimited string. It basically
works, but it looks like strtok skips over empty sections. In other
words, if the string has 2 colons in a row, it doesn't treat that as a
null token, it just treats the 2 colons as a single delimiter.

Is that the intended behavior?
 
Al Balmer

Yes. This is one of the drawbacks of strtok. From the current
position, it skips ahead to the first character *not* in the delimiter
set and uses that position as the return pointer, then searches for the
first character that *is* in the delimiter set and overwrites it with a
null character.

(Individual implementations may be different, but that's the way it's
required to behave.)

For your application, it's probably easier to scan the string
yourself.
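
A minimal sketch of what scanning the string yourself might look like, assuming a single delimiter character and a buffer that may be modified. The name split_keep_empty and its parameters are hypothetical, not from any library. Unlike strtok, every delimiter ends a field, so two colons in a row produce an empty field between them.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not a library function): split s in place on the
   single delimiter character delim, keeping empty fields.
   Stores pointers to the first max_fields fields in fields[] and returns
   the total number of fields found, which may exceed max_fields.
   Every delimiter in s is overwritten with '\0'. */
size_t split_keep_empty(char *s, char delim, char **fields, size_t max_fields)
{
    size_t n = 0;

    assert(s != NULL && fields != NULL);
    assert(delim != '\0');   /* '\0' cannot act as a delimiter here */

    for (;;) {
        char *end = strchr(s, delim);   /* next delimiter, or NULL */

        if (n < max_fields)
            fields[n] = s;              /* field starts here, may be empty */
        n++;

        if (end == NULL)
            break;                      /* last field runs to end of string */

        *end = '\0';                    /* terminate this field */
        s = end + 1;                    /* next field starts after delimiter */
    }
    return n;
}
```

Called on a writable copy of "a::b:" with ':' as the delimiter, this reports four fields: "a", "", "b" and "".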
 
William Hughes

Yes. Just one more reason to avoid strtok().

- William Hughes
 
Default User

William said:
Yes. Just one more reason to avoid strtok().

Unless that's the behavior you want. For example, breaking lines into
words on white space. You don't want a bunch of "null" words.

Brian
 
CBFalconer

Yes. If that is a problem, consider using my toksplit routine, the
code for which has been published here before. I think googling
for "toksplit" will bring it up, so I won't burden the newsgroup
with YAC (yet another copy).

--
Some informative links:
<http://www.geocities.com/nnqweb/>
<http://www.catb.org/~esr/faqs/smart-questions.html>
<http://www.caliburn.nl/topposting.html>
<http://www.netmeister.org/news/learn2quote.html>
<http://cfaj.freeshell.org/google/>
 
Ben Pfaff

strtok() has at least these problems:

* It merges adjacent delimiters. If you use a comma as your
delimiter, then "a,,b,c" will be divided into three tokens,
not four. This is often the wrong thing to do. In fact, it
is only the right thing to do, in my experience, when the
delimiter set contains white space (for dividing a string
into "words") or it is known in advance that there will be
no adjacent delimiters.

* The identity of the delimiter is lost, because it is
changed to a null terminator.

* It modifies the string that it tokenizes. This is bad
because it forces you to make a copy of the string if
you want to use it later. It also means that you can't
tokenize a string literal with it; this is not
necessarily something you'd want to do all the time but
it is surprising.

* It can only be used once at a time. If a sequence of
strtok() calls is ongoing and another one is started,
the state of the first one is lost. This isn't a
problem for small programs but it is easy to lose track
of such things in hierarchies of nested functions in
large programs. In other words, strtok() breaks
encapsulation.
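
The last point can be illustrated with a small sketch. The helper names count_items and count_records_nested are hypothetical. The outer loop walks semicolon-separated records while the helper it calls tokenizes each record on commas; both share strtok's single hidden state, so the outer loop loses its place.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: counts comma-separated items in one record.
   It uses strtok() internally, which overwrites strtok's hidden state. */
static size_t count_items(char *record)
{
    size_t n = 0;
    char *item;

    for (item = strtok(record, ","); item != NULL; item = strtok(NULL, ","))
        n++;
    return n;
}

/* Walks semicolon-separated records with strtok() and calls count_items()
   on each one. Returns the number of records the outer loop actually saw,
   which is fewer than the text contains because of the nested calls. */
size_t count_records_nested(char *text)
{
    size_t records = 0;
    char *rec;

    assert(text != NULL);

    for (rec = strtok(text, ";"); rec != NULL; rec = strtok(NULL, ";")) {
        records++;
        (void)count_items(rec);   /* clobbers the outer loop's position */
    }
    return records;
}
```

Given a writable buffer holding "a,b;c,d;e,f", the outer loop sees only the first record, because the inner loop has run strtok to the end of that record and left it with nothing more to return. Where it is available, POSIX strtok_r avoids this by taking an explicit state pointer.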
 
William Hughes

Default said:
Unless that's the behavior you want. Example, breaking lines into words
with white space. You don't want a bunch of "null" words.

The point is not that the function's behaviour is not sometimes
what you want. The point is

-the default behaviour is surprising

-the default behaviour is not even
usually what you want

-the default behaviour throws information away

-if you don't like the default behaviour, see
figure 1.

Personally I'm with the Linux man pages on this one. Under Bugs
is the advice "Never use this function".

-William Hughes
 
Default User

William said:
The point is not that the function's behaviour is not sometimes
what you want. The point is

-the default behaviour is surprising

Only if one fails to read the documentation. A number of functions are
funny that way.

-the default behaviour is not even
usually what you want

How do you know? Even if true, so what?

-the default behaviour throws information away

Again, if you know that and it fits the problem, so what?

-if you don't like the default behaviour, see
figure 1.

I don't understand this statement. I have no idea what "figure 1" is.

Personally I'm with the Linux man pages on this one. Under Bugs
is the advice "Never use this function".

Well, that's stupid advice. The function may be tricky, but sometimes
it's just the right thing. In those cases, it should be used. If not,
it shouldn't.



Brian
 
Al Balmer

The point is not that the function's behaviour is not sometimes
what you want. The point is

-the default behaviour is surprising

The behavior of many functions might be surprising if you don't read
the documentation.

-the default behaviour is not even
usually what you want

Like any other function in the library, it's used where appropriate.
Sometimes it *is* what I want.

-the default behaviour throws information away

I don't really know what information you're referring to. You could
just as easily say it adds information. If there's information that
you need to protect, it's trivial.

-if you don't like the default behaviour, see
figure 1.

? Did you copy this from a book with pictures? That would explain the
odd indentation, I suppose.

Personally I'm with the Linux man pages on this one. Under Bugs
is the advice "Never use this function".

That's silly. Like any other function, it should be used when
appropriate, and not used when not appropriate.
 
William Hughes

Al said:
The behavior of many functions might be surprising if you don't read
the documentation.
Like any other function in the library, it's used where appropriate.
Sometimes it *is* what I want.

I don't really know what information you're referring to.

The number of delimiters. (strtok() also discards the identity
of these delimiters, but that has not been previously mentioned in
this subthread.)

You could
just as easily say it adds information. If there's information that
you need to protect, it's trivial.

? Did you copy this from a book with pictures? That would explain the
odd indentation, I suppose.

Figure 1 is a picture of a hand with a single digit extended (guess
which one). It comes from an old piece of Xerox lore, a parody of DEC (?)
documentation in which an oft-repeated phrase is "see figure 1."
I guess the reference was a little too obscure.

That's silly. Like any other function, it should be used when
appropriate, and not used when not appropriate.

Well, never is probably too strong. However, strtok() is dominated by
a good general purpose parsing method. Since you need a good
general purpose parsing method, why not use that instead of
strtok()?

- William Hughes
 
CBFalconer

Ben said:
strtok() has at least these problems:

[...]

Whence sprang toksplit, which returns a pointer to the src string
just past the delimiting char, except at end of string. The only
possible nuisance IMO is that it handles only one possible token
delimiter char (apart from '\0').

const char *toksplit(const char *src, /* Source of tokens */
                     char tokchar,    /* token delimiting char */
                     char *token,     /* receiver of parsed token */
                     size_t lgh);     /* length token can receive */
                                      /* not including final '\0' */
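
A rough sketch of a routine with that shape, written only from the description above. It is not CBFalconer's published code, and the name toksplit_sketch is hypothetical. It assumes token points to at least lgh + 1 bytes, silently truncates tokens that are too long, and never modifies src, so it can be used on a string literal.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch, not the published toksplit: copy the next token
   from src into token (at most lgh characters plus a terminating '\0').
   A token ends at tokchar or at the end of the string.
   Returns a pointer just past the delimiter, or a pointer to the
   terminating '\0' when the end of src has been reached. */
const char *toksplit_sketch(const char *src, char tokchar,
                            char *token, size_t lgh)
{
    assert(src != NULL && token != NULL);

    while (*src != '\0' && *src != tokchar) {
        if (lgh > 0) {          /* room left: copy, otherwise truncate */
            *token++ = *src;
            lgh--;
        }
        src++;
    }
    *token = '\0';

    if (*src != '\0')           /* stopped on the delimiter: step past it */
        src++;
    return src;
}
```

Walking "a::b" with ':' as the delimiter yields "a", then an empty token, then "b", after which the returned pointer refers to the terminating '\0'.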

 
Al Balmer

Well, never is probably too strong. However, strtok() is dominated by
a good general purpose parsing method. Since you need a good
general purpose parsing method, why not use that instead of
strtok()?

Where one is needed, I do, and am obligated to supply it with the rest
of the code. Anyone maintaining that code is then obligated to read
and understand it.

Where it's not needed, strtok is already there, and the maintainer
already knows what it does.

It's the same reason I don't supply my own version of other parts of
the standard library.
 
William Hughes

Al said:
Where one is needed, I do, and am obligated to supply it with the rest
of the code. Anyone maintaining that code is then obligated to read
and understand it.

Where it's not needed, strtok is already there, and the maintainer
already knows what it does.

It's the same reason I don't supply my own version of other parts of
the standard library.

Ok. I can see why, if you expect the code to be maintained by
others (a common setup), you would want to use standard functions.
And, usually, a bad standard is better than no standard. But there
are limits!

In any case, I don't think your average maintenance drone would
know how strtok() works, or that said drone would be better
at reading the documentation for strtok() than any documentation
you supply with a good general purpose routine.

One reason for my opinion is that personally I don't find
strtok() very useful (indeed, outside of a couple of exercises that
mandated its use, I don't think I have ever used it). This is in
part because, if possible, I don't use C for string manipulation.
But even when I do use C I don't use strtok(). Clearly, my
situation may not be the most usual one (or even common).


- William Hughes
 
Clever Monkey

William said:
The point is not that the function's behaviour is not sometimes
what you want. The point is

-the default behaviour is surprising

Perhaps. About the only thing surprising to me was that the argument
you pass it is affected.

-the default behaviour is not even
usually what you want

So far I've only ever needed the default behaviour with respect to
collapsing adjacent delimiters. In fact, I *expected* this! That is, for
the majority of the reasons I need to tokenize a string, this default
behaviour is exactly what I want.

-the default behaviour throws information away

Not sure what you mean here, but I assume you are referring to how it
munges its argument. I guess I just never care about this because we
always store strings in a struct that is passed around, or make copies
of things we tokenized and care about.

-if you don't like the default behaviour, see
figure 1.

I assume figure 1 is a picture of your own implementation that has
non-default requirements :)

Personally I'm with the Linux man pages on this one. Under Bugs
is the advice "Never use this function".

Well, I'll ignore this advice. For the trivial case of needing
to tokenize a string to store in my own array of buffers, it works just fine.

For those requirements that strtok() does not fit we have our own
internal tokenizing routines. If all I need is to parse out (say) a
bunch of email addresses passed as a list and store them in a char**
[which was the last time I used strtok()] then it fits perfectly. In
this case I don't even care if the calling code screwed up the list. I
either get one or more valid strings or I don't. I return success or
failure and let them howl!

Of course, if I'd been bitten by the function in the past, I'd be
arguing differently.

Many of the str_ routines in the Standard have some legacy use that
explains design decisions [e.g., strncpy() and database column width].
I wonder if strtok() also has history that explains why the defaults
cause so much consternation?
 
William Hughes

Clever said:
Perhaps. About the only thing surprising to me was that the argument
you pass it is affected.

So far I've only ever needed the default behaviour with respect to
collapsing adjacent delimiters. In fact, I *expected* this! That is, for
the majority of the reasons I need to tokenize a string, this default
behaviour is exactly what I want.

Not sure what you mean here, but I assume you are referring to how it
munges its argument.

No, it also throws away the number [and identity] of the tokens.

I guess I just never care about this because we
always store strings in a struct that is passed around, or make copies
of things we tokenized and care about.

I assume figure 1 is a picture of your own implementation that has
non-default requirements :)

Nope. See the jargon file.

Well, I'll ignore this advice.

Chacun à son goût (to each his own).

For the trivial case of needing
to tokenize a string to store in my own array of buffers, it works just fine.

For those requirements that strtok() does not fit we have our own
internal tokenizing routines.

And your reason for not using them in preference to strtok()?

If all I need is to parse out (say) a
bunch of email addresses passed as a list and store them in a char**
[which was the last time I used strtok()] then it fits perfectly. In
this case I don't even care if the calling code screwed up the list. I
either get one or more valid strings or I don't. I return success or
failure and let them howl!

Of course, if I'd been bitten by the function in the past, I'd be
arguing differently.

Many of the str_ routines in the Standard have some legacy use that
explains design decisions [e.g., strncpy() and database column width].
I wonder if strtok() also has history that explains why the defaults
cause so much consternation?

I am sure that the defaults were chosen for what was at the
time a good reason (maybe because the immediate need was
removing whitespace). The fact remains that they are not a
good choice for a general purpose routine (and the fact that
they are "mandatory defaults" makes things even worse).

- William Hughes
 
Clever Monkey

William said:
-the default behaviour throws information away
Not sure what you mean here, but I assume you are referring to how it
munges its argument.

No, it also throws away the number [and identity] of the tokens.

Ah. I always just keep track of them myself, usually as an index into
the array of strings I'm building. I think every book on the standard
library has similar example code.

And your reason for not using them in preference to strtok()?

A few reasons come to mind. It might be too heavy-weight for my
purpose, or too specific for the simplest case of "get 0 or more things
from this delimited string", which strtok() fits perfectly. I have a
chunk of code I use that is almost cliched that I use to walk the
string, get the pieces and exit with a count.
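
The kind of loop described here might look roughly like the following sketch. The function name collect_tokens and its parameters are hypothetical. It assumes the caller passes a writable string and an array with room for max_out pointers, and that the stored pointers refer into the caller's buffer, which strtok modifies.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: walk list with strtok(), store a pointer to each
   token in out[], and return how many were stored (at most max_out).
   Empty fields between adjacent delimiters are skipped by strtok(). */
size_t collect_tokens(char *list, const char *delims,
                      char **out, size_t max_out)
{
    size_t count = 0;
    char *tok;

    assert(list != NULL && delims != NULL && out != NULL);

    for (tok = strtok(list, delims);
         tok != NULL && count < max_out;
         tok = strtok(NULL, delims)) {
        out[count++] = tok;     /* points into the caller's buffer */
    }
    return count;
}
```

For a list such as "a@x.org, b@y.org,,c@z.org" with ", " as the delimiter set, this stores three addresses. The doubled comma is silently skipped, which is the behaviour this thread is about.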

That is to say, we've never found the need for a better_strtok(), as the
standard implementation satisfies all the necessary requirements.

At this time it has not been obvious that we need to factor this out to
a general-purpose string tokenization routine. Add to that that the
code I maintain is well-established, and I can't simply refactor for the
purpose of refactoring. Adding this much risk to a stable codebase this
late in the day is actually worse than living with standard functions
with warts.

I actually just counted the number of times we invoke strtok() in a
major part of our product, and I found 6 discrete instances. Some of
that is dead code that has been deprecated. Two of them are places I've
added new functionality.

We _could_ have factored that out to our own function, but, quite
frankly, we never saw the point (except maybe to replace 3-5 lines of
cliche code with a single function call [which is nothing to sneeze at],
but this is usually the last thing to drive maintenance in my experience).

Anyway, I understand why strtok() is not recommended. But I also think
that once you understand the limitations and caveats that go along with
it, there is no reason not to use it for those cases where it is a good fit.

I actually look forward to the time when I'm bitten by strtok(). It
seems like the main sin it commits is being useful to some and
completely useless for others.
 
Al Balmer

In any case, I don't think your average maintenance drone would
know how strtok() works,

Huh! I resent that ;-)

This maintenance drone has known about strtok for many years, as well
as all the other functions in the standard library. "Maintenance
drones" not only need to be capable of writing good, solid,
maintainable code, but have the added burden of needing to figure out
what some cowboy coder really meant to do.
 
Keith Thompson

Al Balmer said:
Huh! I resent that ;-)

This maintenance drone has known about strtok for many years, as well
as all the other functions in the standard library.
[...]

Perhaps you're not average? 8-)}
 
William Hughes

Al said:
Huh! I resent that ;-)

My apologies for an unintended insult.

This maintenance drone has known about strtok for many years, as well
as all the other functions in the standard library. "Maintenance
drones" not only need to be capable of writing good, solid,
maintainable code, but have the added burden of needing to figure out
what some cowboy coder really meant to do.


I agree. Maintenance is difficult work and the best programmers
should be placed on maintenance, not new development.
In my experience, however, this is not the case. The term
"Maintenance drone" is all too often appropriate.

-William Hughes
 
Al Balmer

Al Balmer said:
Huh! I resent that ;-)

This maintenance drone has known about strtok for many years, as well
as all the other functions in the standard library.
[...]

Perhaps you're not average? 8-)}

Well, I'd like to think that ;-)

Truthfully, though I've worked for large companies, and done much
maintenance, I've never worked with many other maintenance
programmers, so I'm not at all qualified to know what the average is.
Indeed, I've seen many developers of new programs that were not
qualified. It might be more appropriate just to remove the word
"maintenance" from William's claim.
 
