strtok behavior with multiple consecutive delimiters

Geometer · May 6, 2006

Hello, and good whatever daytime is at your place..

please can somebody tell me, what the standard behavior of strtok shall be,
if it encounters two or more consecutive delimiters like in
(checks omitted)

char tst[] = "this\nis\n\nan\nempty\n\n\nline";
                      ^^^^         ^^^^^^
char *tok = strtok(tst, "\n");
tok = strtok(NULL, "\n");
and so on..

will the groups of '\n' marked above be consumed one by one or the whole
group together?

Thank you very much

Phlip · May 6, 2006

Geometer said:
please can somebody tell me, what the standard behavior of strtok shall
be, if it encounters two or more consecutive delimiters like in
(checks omitted)

char tst[] = "this\nis\n\nan\nempty\n\n\nline";
                      ^^^^         ^^^^^^
char *tok = strtok(tst, "\n");
tok = strtok(NULL, "\n");
and so on..

will the groups of '\n' marked above be consumed one by one or the whole
group together?

Yes.

But why didn't you just write a test case and see?

Going forward, don't use strtok(). Google for a replacement, possibly
including a Regex system. Then you can control such details.

Geometer · May 6, 2006

--
Geometer
Dipl.Ing. Erwin Lebloch

Hauptplatz 39
2130 Mistelbach - NÖ
Tel.: 02572/4300

www.lebloch.at
(e-mail address removed)

Phlip said:
Geometer said:

please can somebody tell me, what the standard behavior of strtok shall
be, if it encounters two or more consecutive delimiters like in
(checks omitted)

char tst[] = "this\nis\n\nan\nempty\n\n\nline";
                      ^^^^         ^^^^^^
char *tok = strtok(tst, "\n");
tok = strtok(NULL, "\n");
and so on..

will the groups of '\n' marked above be consumed one by one or the whole
group together?

Yes.

But why didn't you just write a test case and see?

I did. I just wanted to know if this is the behavior required by the
standard and whether there is a difference between C and C++.
Thanks for your response.

Robert

Peter Jansson · May 6, 2006

Geometer said:
Hello, and good whatever daytime is at your place..

please can somebody tell me, what the standard behavior of strtok shall be,
if it encounters two or more consecutive delimiters like in
(checks omitted)

char tst[] = "this\nis\n\nan\nempty\n\n\nline";
                      ^^^^         ^^^^^^
char *tok = strtok(tst, "\n");
tok = strtok(NULL, "\n");
and so on..

will the groups of '\n' marked above be consumed one by one or the whole
group together?

Thank you very much

<quote src="A man-page for strok.">

Never use these functions. If you do, note that:
These functions modify their first argument.
These functions cannot be used on constant strings.
The identity of the delimiting character is lost.
The strtok() function uses a static buffer while parsing,
so it’s not thread safe.

</quote>

Regards,

Peter Jansson
http://www.p-jansson.com/
http://www.jansson.net/

CBFalconer · May 6, 2006

Peter said:
Geometer said:

please can somebody tell me, what the standard behavior of
strtok shall be, if it encounters two or more consecutive
delimiters like in (checks omitted)

char tst[] = "this\nis\n\nan\nempty\n\n\nline";
                      ^^^^         ^^^^^^
char *tok = strtok(tst, "\n");
tok = strtok(NULL, "\n");
and so on..

will the groups of '\n' marked above be consumed one by one or
the whole group together?

<quote src="A man-page for strok.">

Never use these functions. If you do, note that:
These functions modify their first argument.
These functions cannot be used on constant strings.
The identity of the delimiting character is lost.
The strtok() function uses a static buffer while parsing,
so it’s not thread safe.

</quote>

The OP can simply use the following replacement function, which
does not have those objectionable features. The testing code is
longer than the function.

/* ------- file toksplit.c ----------*/
#include "toksplit.h"

/* copy over the next token from an input string, after
skipping leading blanks (or other whitespace?). The
token is terminated by the first appearance of tokchar,
or by the end of the source string.

The caller must supply sufficient space in token to
receive any token. Otherwise tokens will be truncated.

Returns: a pointer past the terminating tokchar.

This will happily return an infinity of empty tokens if
called with src pointing to the end of a string. Tokens
will never include a copy of tokchar.

A better name would be "strtkn", except that is reserved
for the system namespace. Change to that at your risk.

released to Public Domain, by C.B. Falconer.
Published 2006-02-20. Attribution appreciated.
*/

const char *toksplit(const char *src, /* Source of tokens */
                     char tokchar,    /* token delimiting char */
                     char *token,     /* receiver of parsed token */
                     size_t lgh)      /* length token can receive */
                                      /* not including final '\0' */
{
    if (src) {
        while (' ' == *src) src++;   /* skip leading blanks */

        while (*src && (tokchar != *src)) {
            if (lgh) {
                *token++ = *src;
                --lgh;
            }
            src++;
        }
        if (*src && (tokchar == *src)) src++;
    }
    *token = '\0';
    return src;
} /* toksplit */

#ifdef TESTING
#include <stdio.h>

#define ABRsize 6 /* length of acceptable token abbreviations */

int main(void)
{
    char teststring[] = "This is a test, ,, abbrev, more";

    const char *t, *s = teststring;
    int i;
    char token[ABRsize + 1];

    puts(teststring);
    t = s;
    for (i = 0; i < 4; i++) {
        t = toksplit(t, ',', token, ABRsize);
        putchar(i + '1'); putchar(':');
        puts(token);
    }

    puts("\nHow to detect 'no more tokens'");
    t = s; i = 0;
    while (*t) {
        t = toksplit(t, ',', token, 3);
        putchar(i + '1'); putchar(':');
        puts(token);
        i++;
    }

    puts("\nUsing blanks as token delimiters");
    t = s; i = 0;
    while (*t) {
        t = toksplit(t, ' ', token, ABRsize);
        putchar(i + '1'); putchar(':');
        puts(token);
        i++;
    }
    return 0;
} /* main */

#endif
/* ------- end file toksplit.c ----------*/

I have set follow-ups to exclude c.l.c++. Although the above code
is usable there, it is seldom a good idea to mix the two
languages. I have not provided a header file with a C++ linkage
provision.

--
"If you want to post a followup via groups.google.com, don't use
the broken "Reply" link at the bottom of the article. Click on
"show options" at the top of the article, then click on the
"Reply" at the bottom of the article headers." - Keith Thompson
More details at: <http://cfaj.freeshell.org/google/>
Also see <http://www.safalra.com/special/googlegroupsreply/>

Jerry Coffin · May 6, 2006

[ ... ]

The strtok() function uses a static buffer while parsing,
so it’s not thread safe.

More accurately, it uses a static pointer while parsing,
so the vendor has to go to extra work to make it thread
safe. The same is true with a number of other functions
as well, though -- much of what's defined in time.h, to
give only one obvious example.

Ben Pfaff · May 6, 2006

Geometer said:
please can somebody tell me, what the standard behavior of strtok shall be,
if it encounters two or more consecutive delimiters like in

strtok() has at least these problems:

* It merges adjacent delimiters. If you use a comma as your
delimiter, then "a,,b,c" will be divided into three tokens,
not four. This is often the wrong thing to do. In fact, it
is only the right thing to do, in my experience, when the
delimiter set contains white space (for dividing a string
into "words") or it is known in advance that there will be
no adjacent delimiters.

* The identity of the delimiter is lost, because it is
changed to a null terminator.

* It modifies the string that it tokenizes. This is bad
because it forces you to make a copy of the string if
you want to use it later. It also means that you can't
tokenize a string literal with it; this is not
necessarily something you'd want to do all the time but
it is surprising.

* It can only be used once at a time. If a sequence of
strtok() calls is ongoing and another one is started,
the state of the first one is lost. This isn't a
problem for small programs but it is easy to lose track
of such things in hierarchies of nested functions in
large programs. In other words, strtok() breaks
encapsulation.

andy · May 6, 2006

CBFalconer said:
The OP can simply use the following replacement function, which
does not have those objectionable features. The testing code is
longer than the function.

OTOH By using C++ life becomes more productive, less error prone,
less complicated and more elegant:

#include <sstream>
#include <string>
#include <vector>
#include <iostream>

int main()
{
    char tst[] = "this\nis\n\nan\nempty\n\n\nline";

    std::stringstream s;
    s << tst;

    std::vector<std::string> tokens;
    while (! s.eof() ){
        std::string str;
        getline(s,str,'\n');
        tokens.push_back(str);
    }

    for (std::vector<std::string>::const_iterator iter
            = tokens.begin();
         iter !=tokens.end();
         ++iter){
        std::cout << "token: \""<< *iter <<"\"\n";
    }
}

regards
Andy Little

Pete Becker · May 6, 2006

Peter said:
<quote src="A man-page for strok.">

The name of the function is strtok.

Never use these functions. If you do, note that:
These functions modify their first argument.
These functions cannot be used on constant strings.

These two say the same thing. Sounds like someone is trying too hard.

The identity of the delimiting character is lost.

Which has nothing to do with the claim that you should never use it. You
shouldn't use it if you need to know which of the delimiters was
actually encountered.

The strtok() function uses a static buffer while parsing,

No, it uses a static variable to hold its result BETWEEN calls.

so it’s not thread safe.

Non sequitur. It's easy enough to implement with a per-thread static
pointer, which is thread safe.

Yup, definitely trying too hard. strtok is well suited for what it does.
If you need something more elaborate, go for it.

jacob navia · May 6, 2006

You forgot toksplit.h Chuck

Can you post it too?

jacob

jacob navia · May 6, 2006

(e-mail address removed) wrote:

CBFalconer wrote:

The OP can simply use the following replacement function, which
does not have those objectionable features. The testing code is
longer than the function.

OTOH By using C++ life becomes more productive, less error prone,
less complicated and more elegant:

#include <sstream>
#include <string>
#include <vector>
#include <iostream>

int main()
{

char tst[] = "this\nis\n\nan\nempty\n\n\nline";

std::stringstream s;
s << tst;

std::vector<std::string> tokens;
while (! s.eof() ){
std::string str;
getline(s,str,'\n');
tokens.push_back(str);
}

for (std::vector<std::string>::const_iterator iter
= tokens.begin();
iter !=tokens.end();
++iter){
std::cout << "token: \""<< *iter <<"\"\n";
}

}

regards
Andy Little

I compiled your program in C++ using the VS 2005 compiler. The
executable size of that stuff was 180 224 bytes.

Then I compiled Chuck's version using his strtok function using the
lcc-win32 compiler (a C compiler, not a C++ one). The size was 14 645 bytes.

Then I eliminated output from both programs. Compiled them without any
optimizations and inserted a loop of 1 million times.

C++ took 1.234 seconds
C took 0.375 seconds

Then I compiled both programs using VS 2005 (64 bits) with full
optimization:

C++ took 0.234 seconds
C took 0.156 seconds

I do not say that these measurements are important for everybody. But
maybe they are important for *some* people.

jacob

andy · May 6, 2006

jacob said:
I compiled your program in C++ using the VS 2005 compiler. The
executable size of that stuff was 180 224 bytes.

It comes out at around 112 K for me. What were your command line
options?

Then I compiled Chuck's version using his strtok function using the
lcc-win32 compiler (a C compiler, not a C++ one). The size was 14 645 bytes.

Then I eliminated output from both programs. Compiled them without any
optimizations and inserted a loop of 1 million times.

C++ took 1.234 seconds
C took 0.375 seconds

Then I compiled both programs using VS 2005 (64 bits) with full
optimization:

C++ took 0.234 seconds
C took 0.156 seconds

(It would be nice to see the full source code that you were testing
FWIW). C++ version did rather better than I would expect, good
optimiser! ...;-)

I do not say that these measurements are important for everybody. But
maybe they are important for *some* people.

Sure, C++ will handle the C-style code as well if necessary, but the
amount of time you need to spend writing, testing and debugging is a
major factor to some people too.

And of course ... In what real situation are you going to be spending a
long time tokenising string literals?

regards
Andy Little

jacob navia · May 6, 2006

Command line for non optimized version:
cl /EHsc toksplit.cpp
lc toksplit.c

Command line for the optimized version:
cl /Ox /EHsc toksplit.cpp
cl /OX toksplit.c

Here is the code
--------------------------------------------------toksplit.h
#ifndef H_toksplit_h
# define H_toksplit_h

# ifdef __cplusplus
extern "C" {
# endif

#include <stddef.h>

/* copy over the next token from an input string, after
skipping leading blanks (or other whitespace?). The
token is terminated by the first appearance of tokchar,
or by the end of the source string.

The caller must supply sufficient space in token to
receive any token. Otherwise tokens will be truncated.

Returns: a pointer past the terminating tokchar.

This will happily return an infinity of empty tokens if
called with src pointing to the end of a string. Tokens
will never include a copy of tokchar.

released to Public Domain, by C.B. Falconer.
Published 2006-02-20. Attribution appreciated.
*/

const char *toksplit(const char *src, /* Source of tokens */
char tokchar, /* token delimiting char */
char *token, /* receiver of parsed token */
size_t lgh); /* length token can receive */
/* not including final '\0' */

# ifdef __cplusplus
}
# endif
#endif
--------------------------------------------end of toksplit.h
Now toksplit.c
/* ------- file toksplit.c ----------*/
#include "toksplit.h"

/* copy over the next token from an input string, after
skipping leading blanks (or other whitespace?). The
token is terminated by the first appearance of tokchar,
or by the end of the source string.

The caller must supply sufficient space in token to
receive any token. Otherwise tokens will be truncated.

Returns: a pointer past the terminating tokchar.

This will happily return an infinity of empty tokens if
called with src pointing to the end of a string. Tokens
will never include a copy of tokchar.

A better name would be "strtkn", except that is reserved
for the system namespace. Change to that at your risk.

released to Public Domain, by C.B. Falconer.
Published 2006-02-20. Attribution appreciated.
*/

const char *toksplit(const char *src, /* Source of tokens */
                     char tokchar,    /* token delimiting char */
                     char *token,     /* receiver of parsed token */
                     size_t lgh)      /* length token can receive */
                                      /* not including final '\0' */
{
    if (src) {
        while (' ' == *src) src++;   /* skip leading blanks */

        while (*src && (tokchar != *src)) {
            if (lgh) {
                *token++ = *src;
                --lgh;
            }
            src++;
        }
        if (*src && (tokchar == *src)) src++;
    }
    *token = '\0';
    return src;
} /* toksplit */

#include <stdio.h>

#define ABRsize 64 /* length of acceptable token abbreviations */

int main(void)
{
    char teststring[] = "this\nis\n\nan\nempty\n\n\nline";

    const char *t, *s = teststring;
    int i;
    char token[ABRsize + 1];
    int count;

    count=0;
    do {
        t = s; i = 0;
        while (*t) {
            t = toksplit(t, '\n', token, 64);
            //putchar(i + '1'); putchar(':');
            //puts(token);
            i++;
        }
        count++;
    } while (count < 1000000);
    return 0;
} /* main */

--------------------------------------------------------------end of toksplit.c

Now the C++ version:
--------------------------------------------------------------toksplit.cpp
#include <sstream>
#include <string>
#include <vector>
#include <iostream>

int main()
{
    char tst[] = "this\nis\n\nan\nempty\n\n\nline";

    std::stringstream s;
    s << tst;

    std::vector<std::string> tokens;
    int count=0;
    do {
        s << tst;
        while (! s.eof() ){
            std::string str;
            getline(s,str,'\n');
            tokens.push_back(str);
        }

        for (std::vector<std::string>::const_iterator iter
                = tokens.begin();
             iter !=tokens.end();
             ++iter){
            //std::cout << "token: \""<< *iter <<"\"\n";
        }
        count++;
    } while (count < 1000000);
}
--------------------------------------------------------------end of
toksplit.cpp

Ben C · May 6, 2006

[...]
I compiled your program in C++ using the VS 2005 compiler. The
executable size of that stuff was 180 224 bytes.

Then I compiled Chuck's version using his strtok function using the
lcc-win32 compiler (a C compiler, not a C++ one). The size was 14 645 bytes.

Then I eliminated output from both programs. Compiled them without any
optimizations and inserted a loop of 1 million times.

C++ took 1.234 seconds
C took 0.375 seconds

Then I compiled both programs using VS 2005 (64 bits) with full
optimization:

C++ took 0.234 seconds
C took 0.156 seconds

I do not say that these measurements are important for everybody. But
maybe they are important for *some* people.

Interesting. Can you do a timing of VS 2005 with full optimizations on
the C version? I think this would complete the picture.

Ian Collins · May 6, 2006

jacob said:
Now the C++ version:
--------------------------------------------------------------toksplit.cpp
#include <sstream>
#include <string>
#include <vector>
#include <iostream>

int main()
{

char tst[] = "this\nis\n\nan\nempty\n\n\nline";

std::stringstream s;
s << tst;

std::vector<std::string> tokens;
int count=0;
do {
s << tst;
while (! s.eof() ){
std::string str;
getline(s,str,'\n');
tokens.push_back(str);
}

for (std::vector<std::string>::const_iterator iter
= tokens.begin();
iter !=tokens.end();
++iter){
//std::cout << "token: \""<< *iter <<"\"\n";
}
count++;
} while (count < 1000000);

}
--------------------------------------------------------------end of
toksplit.cpp

If you are going to eliminate output for comparison, you should comment
out the entire last for loop as the C version outputs inline.

Also, to make things more equal, remove the vector, as this is only used
to store tokens for output.

Christopher Benson-Manica · May 7, 2006

andy said:
OTOH By using C++ life becomes more productive, less error prone,
less complicated and more elegant:

Not always... (digression warning)

std::cout << "token: \""<< *iter <<"\"\n";

IMHO this is harder for the programmer to read than

printf( "token: \"%s\"\n", str );

To a certain extent this is a question of religion, but the difference
between the prevailing styles becomes more pronounced with heavily
formatted output:

printf( "%6s %2.2f %-18s:%u\n", val1, val2, val3, val4 );

Accomplishing the same thing with std::cout would be messy.

Christopher Benson-Manica · May 7, 2006

In comp.lang.c Ben Pfaff said:
* It can only be used once at a time. If a sequence of
strtok() calls is ongoing and another one is started,
the state of the first one is lost.

<ot>For OP, if this is a problem for you, strtok_r() may be
available, depending on your system and portability constraints.</ot>

For all the pitfalls of strtok(), it is still possible to use it
correctly for fun and profit, a point which I think has not been
emphasized in this thread. It may well be the appropriate function
for the OP, but of course given his question it might also be
unusable.

Phlip · May 7, 2006

Geometer said:
I did. I just wanted to know if this is the behavior required by the
standard and whether there is a difference between C and C++.

Can we add to the FAQ "Please don't ask about strtok(), because everyone
here is ready to complain about it in endless ways"?

andy · May 7, 2006

jacob said:
Command line for non optimized version:
cl /EHsc toksplit.cpp

(Assuming this is my original as above)
Using these switches comes out at 120 kb on my system

lc toksplit.c

Command line for the optimized version:
cl /Ox /EHsc toksplit.cpp

Using these switches, comes out at 112 kb on my system

regards
Andy Little

CBFalconer · May 7, 2006

jacob said:
.... snip ...

I compiled your program in C++ using the VS 2005 compiler. The
executable size of that stuff was 180 224 bytes.

Then I compiled Chuck's version using his strtok function using the
lcc-win32 compiler (a C compiler, not a C++ one). The size was
14 645 bytes.

I compiled toksplit without the testing code, using gcc -Os, and
the generated object code was 0x5b bytes long. That's less than
100 bytes of object code.

The point is: measure the routine, not the testing program.

