strtok behavior with multiple consecutive delimiters

andy · May 7, 2006

Ian said:
If you are going to eliminate output for comparison, you should comment
out the entire last for loop as the C version outputs inline.

Also, to make things more equal, remove the vector, as this is only used
to store tokens for output.

FWIW, below is my version of the comparison. Moving the construction of
the stringstream into the loop really kills the performance of the
stringstream version. However, this is IMO a more realistic *simple*
usage. I also modified the other code into C++ style, but that's by the
way. With this approach the C code is an order of magnitude faster (I
had to decrease the number of loops to avoid waiting on the
stringstream code), but it's not really a fair comparison. The killer of
the C version for me is that you can't have arbitrary-length tokens. You
are limited to whatever the value of ABRsize is (anything longer is
truncated). If the C coders want to write a version that can handle
arbitrary-length C-style strings then it would be a fairer comparison
IMO (though my previous comments re ease of coding, testing etc. remain).
BTW I used boost timer for timing. If you haven't got the boost distro
you will just have to modify those parts. I'm too lazy to do that...

regards
Andy Little

#include <cstddef>
#include <sstream>
#include <string>
#include <vector>
#include <iostream>
#include <boost/timer.hpp>

int const ABRsize = 64;
int const NLOOPS = 100000;

const char *
toksplit(
    const char *src,
    char tokchar,
    char *token,
    std::size_t lgh
);

int main()
{
    char tst[] = "this\nis\n\nan\nempty\n\n\nline";

    std::cout << "Timing stringstream version: ";
    boost::timer t0;
    for (int count = 0; count < NLOOPS; ++count) {
        /* the stringstream is deliberately constructed inside the loop */
        std::stringstream ss;
        ss << tst;
        while (!ss.eof()) {
            std::string str;
            std::getline(ss, str, '\n');
        }
    }
    std::cout << t0.elapsed() << "s\n";

    std::cout << "Timing toksplit version: ";
    boost::timer t1;
    for (int count = 0; count < NLOOPS; ++count) {
        char token[ABRsize + 1];
        const char *t = tst;
        while (*t) {
            t = toksplit(t, '\n', token, ABRsize);
        }
    }
    std::cout << t1.elapsed() << "s\n";

}

const char *toksplit(
    const char *src,  /* Source of tokens */
    char tokchar,     /* token delimiting char */
    char *token,      /* receiver of parsed token */
    std::size_t lgh)  /* length token can receive */
                      /* not including final '\0' */
{
    if (src) {
        while (' ' == *src) ++src;   /* skip leading blanks */
        while (*src && (tokchar != *src)) {
            if (lgh) {               /* characters beyond lgh are dropped */
                *token++ = *src;
                --lgh;
            }
            src++;
        }
        if (*src && (tokchar == *src)) src++;
    }
    *token = '\0';
    return src;
} /* toksplit */

andy · May 7, 2006

Christopher said:
Not always... (digression warning)

IMHO this is harder for the programmer to read than

printf( "token: \"%s\"\n", str );

In this case I think it depends what you are used to!

To a certain extent this is a question of religion, but the difference
between the prevailing styles becomes more pronounced with heavily
formatted output:

printf( "%6s %2.2f %-18s:%u\n", val1, val2, val3, val4 );

Accomplishing the same thing with std::cout would be messy.

FWIW I think it might look like this:

std::cout
    << std::setw(6) << val1
    << ' ' << std::fixed << std::setw(2)
    << std::setprecision(2) << val2
    << ' ' << std::left << std::setw(18) << val3
    << ":" << val4 << '\n';

regards
Andy Little

Charles Richmond · May 7, 2006

Phlip said:
Can we add to the FAQ "Please don't ask about strtok(), because everyone
here is ready to complain about it in endless ways"?

I *like* strtok(). It does what it does, and it does it very well.
You just have to understand how it works. Like so much in C, use
the right tool for the job.
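
For reference, the behaviour in the thread title can be sketched as follows. This is only an illustration, not code from any poster: the helper name strtok_tokens is made up, and it copies the input because strtok() writes into the string it scans.

```cpp
#include <cstring>
#include <string>
#include <vector>

// Hypothetical helper: returns the tokens strtok() reports for `text`.
// strtok() overwrites delimiters with '\0', so it works on a private copy.
std::vector<std::string> strtok_tokens(const std::string& text, const char* delims)
{
    std::vector<char> buf(text.begin(), text.end());
    buf.push_back('\0');

    std::vector<std::string> out;
    for (char* tok = std::strtok(buf.data(), delims); tok != nullptr;
         tok = std::strtok(nullptr, delims)) {
        out.push_back(tok);   // runs of delimiters never produce an empty token
    }
    return out;
}
```

Fed the test string used elsewhere in this thread, "this\nis\n\nan\nempty\n\n\nline", with "\n" as the delimiter set, this yields five tokens. The empty lines vanish because strtok() treats a run of consecutive delimiters as a single separator, whereas toksplit reports an empty token for each of them.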

Ian Collins · May 7, 2006

FWIW I think it might look like this:

std::cout
    << std::setw(6) << val1
    << ' ' << std::fixed << std::setw(2)
    << std::setprecision(2) << val2
    << ' ' << std::left << std::setw(18) << val3
    << ":" << val4 << '\n';

Which I'm sure you will admit, is a bit of an abomination!

Thank goodness C++ retains the C standard library for cases like this.

Phlip · May 7, 2006

Charles said:
I *like* strtok(). It does what it does, and it does it very well.
You just have to understand how it works. Like so much in C, use
the right tool for the job.

I would agree with you except for this:

Ben said:
* It can only be used once at a time. If a sequence of
strtok() calls is ongoing and another one is started,
the state of the first one is lost. This isn't a
problem for small programs but it is easy to lose track
of such things in hierarchies of nested functions in
large programs. In other words, strtok() breaks
encapsulation.

Another term for "breaks encapsulation" is "refactor-hostile".

I want to refactor freely without worrying that moving a method call into
another one will cause two strtoks to step on each other's toes.

Other than that, yes it's just a tool, and I too will occasionally use it
where it works.
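
A minimal sketch of the problem Ben describes, assuming records separated by ';' and fields separated by ',' (the separators and both function names are invented for illustration):

```cpp
#include <cstring>

// Inner sequence: counts the ','-separated fields of one record.
// Looks harmless on its own, but it resets strtok()'s hidden state.
int count_fields(char* record)
{
    int n = 0;
    for (char* f = std::strtok(record, ","); f != nullptr;
         f = std::strtok(nullptr, ",")) {
        ++n;
    }
    return n;
}

// Outer sequence: walks the ';'-separated records and calls the helper
// above from inside the loop, i.e. two strtok() sequences are interleaved.
int count_records_nested(char* text)
{
    int records = 0;
    for (char* r = std::strtok(text, ";"); r != nullptr;
         r = std::strtok(nullptr, ";")) {
        ++records;
        count_fields(r);   // clobbers the position saved for the outer loop
    }
    return records;
}
```

Called on a writable buffer holding "a,b;c,d;e,f", the outer loop visits one record rather than three: once the inner sequence has run to the end of "a,b", the next strtok(NULL, ";") has nothing left to continue from. Neither function looks wrong in isolation, which is what makes the refactoring hazard easy to miss.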

Phlip · May 7, 2006

Ian said:
Which I'm sure you will admit, is a bit of an abomination!

Thank goodness C++ retains the C standard library for cases like this.

Thank goodness C++ permits you to write custom IO manipulators, to bottle
all that up into something legible.

Try extending printf to accept your own % tags some time...

Ian Collins · May 7, 2006

Phlip said:
Thank goodness C++ permits you to write custom IO manipulators, to bottle
all that up into something legible.

Thank goodness the C standard library saves you the trouble of writing
custom IO manipulators, to bottle all that up into something legible.

Phlip · May 7, 2006

Ian said:
Thank goodness the C standard library saves you the trouble of writing
custom IO manipulators, to bottle all that up into something legible.

%*.*e
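
For readers who have not met that specifier: each '*' tells printf to take the field width and the precision from int arguments that precede the value. A small sketch, with a made-up wrapper name (format_sci) and an arbitrary 64-byte buffer:

```cpp
#include <cstdio>
#include <string>

// Hypothetical wrapper around "%*.*e": width and precision are supplied at
// run time as int arguments, in that order, ahead of the double itself.
std::string format_sci(double value, int width, int precision)
{
    char buf[64];   // assumed large enough for the demo; snprintf truncates otherwise
    std::snprintf(buf, sizeof buf, "%*.*e", width, precision, value);
    return buf;
}
```

For instance, format_sci(1234.0, 12, 3) produces the value in scientific notation with three digits after the decimal point, right-justified in a field twelve characters wide.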

jacob navia · May 7, 2006

CBFalconer wrote:

jacob navia wrote:

... snip ...

I compiled toksplit without the testing code, using gcc -Os, and
the generated object code was 0x5b bytes long. That's less than
100 bytes of object code.

The point is: measure the routine, not the testing program.

OK. With lcc-win32 the size of the toksplit routine is 84 bytes,
or 0x54 if you prefer hex.

Malcolm · May 7, 2006

CBFalconer said:

The OP can simply use the following replacement function, which
does not have those objectionable features. The testing code is
longer than the function.

OTOH By using C++ life becomes more productive, less error prone,
less complicated and more elegant:

#include <sstream>
#include <string>
#include <vector>
#include <iostream>

int main()
{

char tst[] = "this\nis\n\nan\nempty\n\n\nline";

std::stringstream s;
s << tst;

Now I get that this is an insertion operator. But why is it aliased with the
bit shift operator? Isn't that a bit counterintuitive?

std::vector<std::string> tokens;

Then we get a line of pure gibberish. All it is doing is declaring an array
of strings.

while (! s.eof() ){
std::string str;
getline(s,str,'\n');
tokens.push_back(str);
}

Now in C, the function feof() will not return true until there has been a
read failure. So is this code correct or not? I suspect that it will fall
over if you terminate the input string with an "\n". But the whole thing is
so encapsulated and wrapped up that it is difficult to tell whether the code
is bugged or not.

for (std::vector<std::string>::const_iterator iter
= tokens.begin();
iter !=tokens.end();
++iter){

Total gibberish. Four lines in the body of a for loop?

std::cout << "token: \""<< *iter <<"\"\n";

And this line isn't too pretty. Strings aren't too bad, but have you ever
tried to format a floating point number using these fancy stream interfaces?

}

}

regards
Andy Little

This is the reason I've given up using C++. The other reason is that
nowadays much of my code has to run on a parallel computer, and have you
tried passing objects to other processes using the MPI interface?
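
On the eof() question raised above, a small sketch makes the difference visible. Both function names are made up for illustration; the first mirrors the quoted loop, the second tests the result of getline itself.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Mirrors the quoted loop: eof() only becomes true after a read has hit the
// end, so a trailing '\n' leads to one extra, failed getline.
std::vector<std::string> split_eof_loop(const std::string& text)
{
    std::stringstream s(text);
    std::vector<std::string> tokens;
    while (!s.eof()) {
        std::string str;
        std::getline(s, str, '\n');
        tokens.push_back(str);   // pushed even if the read failed
    }
    return tokens;
}

// Alternative: use the stream state returned by getline as the condition,
// so a token is stored only when a read actually succeeded.
std::vector<std::string> split_checked(const std::string& text)
{
    std::stringstream s(text);
    std::vector<std::string> tokens;
    std::string str;
    while (std::getline(s, str, '\n')) {
        tokens.push_back(str);
    }
    return tokens;
}
```

With "a\nb" both versions return two tokens. With "a\nb\n" the eof() loop returns three, the last one an empty string from the failed read, while the checked loop still returns two. So the quoted code does not crash on a trailing newline, but it does report a token that is not there.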

Malcolm · May 7, 2006

Phlip said:
Thank goodness C++ permits you to write custom IO manipulators, to bottle
all that up into something legible.

Try extending printf to accept your own % tags some time...

It can't be done. Probably a good thing, because functions should always
behave the same way. If a standards committee decided, however, it would be
trivial to add an addprintftag() function.

/*
add a custom field to printf
Params: fieldcode - the letter we are using for the field (existing codes
can be overwritten).
format - function to perform formatting. Returns a pointer to static data.
field - the field the user entered (eg " %[myfancyspecifier]d")
obj - pointer user passed to printf.
*/
void addprintftag( char fieldcode, const char *(*format)(const char *field,
const void *obj));

The slight nuisance is that the user must always pass extended arguments by
address, because the printf() variable argument code needs to know what
objects to get off the stack.

What you can do, of course, is write your own vsprintf_extendable()
function. Then you have to pass the results to an output function.

jacob navia · May 7, 2006

(e-mail address removed) wrote:

(Assuming this is my original as above)
Using these switches comes out at 120 kb on my system

Yes, I am using the 64-bit compiler under Windows Server 2003 64-bit.
The native code is 64-bit too.

Using these switches, comes out at 112 kb on my system

Yes, in 32 bits it's smaller. Still, nothing like 15K...

Charles Richmond · May 9, 2006

Phlip said:
I would agree with you except for this:

Another term for "breaks encapsulation" is "refactor-hostile".

I want to refactor freely without worrying that moving a method call into
another one will cause two strtoks to step on each other's toes.

Okay, so write your own function to do the work. There is *no* prohibition
in the C standard for writing your own functions. Since you are familiar
with the limitations of strtok(), you will *not* use it. As Mr. Pfaff said,
you *never* know when somewhere down in nested function calls, you might
be re-using strtok() again.

Other than that, yes it's just a tool, and I too will occasionally use it
where it works.

I had an occasion to use strtok() in a nested loop construct, which would
cause the strtok() calls to become nested. I disconnected the loops. The
first loop then just saved the pointers returned in an array of pointers.
The second loop (which was the inner loop before) processed the strings
it needed to process...by using the array of pointers.
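
The disconnected-loop arrangement can be sketched roughly like this, assuming ';' between records and ',' between fields (separators and function names are invented for the example, and a std::vector stands in for the array of pointers):

```cpp
#include <cstring>
#include <vector>

// Second-stage work: counts the ','-separated fields of one record.
int count_fields(char* record)
{
    int n = 0;
    for (char* f = std::strtok(record, ","); f != nullptr;
         f = std::strtok(nullptr, ",")) {
        ++n;
    }
    return n;
}

int count_all_fields(char* text)
{
    // Pass 1: run the outer strtok() sequence to completion and only
    // remember the token pointers.
    std::vector<char*> records;
    for (char* r = std::strtok(text, ";"); r != nullptr;
         r = std::strtok(nullptr, ";")) {
        records.push_back(r);
    }

    // Pass 2: the outer sequence is finished, so starting new strtok()
    // sequences on each saved record no longer interferes with it.
    int total = 0;
    for (char* r : records) {
        total += count_fields(r);
    }
    return total;
}
```

On a writable buffer holding "a,b;c,d;e,f" this counts six fields. The price is the extra storage for the saved pointers.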

If I had thought about how easy it is to write a strtok() with multiple levels,
I would have written my own strtok() clone instead.
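
Such a clone might look something like the sketch below. The name tok_next and its signature are assumptions made for this example; the only real change from strtok() is that the scan position lives in a pointer owned by the caller. POSIX strtok_r has a similar shape, for those who have it available.

```cpp
#include <cstring>

// Hypothetical strtok() clone: state is kept in *save, not in hidden
// static storage, so independent sequences can be interleaved freely.
char* tok_next(char* s, const char* delims, char** save)
{
    if (s == nullptr) s = *save;          // continue an earlier sequence
    if (s == nullptr) return nullptr;     // that sequence already finished

    s += std::strspn(s, delims);          // skip leading delimiters
    if (*s == '\0') { *save = nullptr; return nullptr; }

    char* end = s + std::strcspn(s, delims);
    if (*end != '\0') {                   // terminate token, remember the rest
        *end = '\0';
        *save = end + 1;
    } else {
        *save = nullptr;                  // token ran to the end of the string
    }
    return s;
}

// Nested use: one state variable per level.
int count_all_fields(char* text)
{
    int total = 0;
    char* outer = nullptr;
    for (char* rec = tok_next(text, ";", &outer); rec != nullptr;
         rec = tok_next(nullptr, ";", &outer)) {
        char* inner = nullptr;
        for (char* f = tok_next(rec, ",", &inner); f != nullptr;
             f = tok_next(nullptr, ",", &inner)) {
            ++total;
        }
    }
    return total;
}
```

With "a,b;c,d;e,f" in a writable buffer, the nested loops count six fields. Like strtok(), this version still collapses runs of delimiters and still modifies its input.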

andy · May 9, 2006

jacob said:
(e-mail address removed) wrote:

Yes, I am using the 64-bit compiler under Windows Server 2003 64-bit.
The native code is 64-bit too.

Yes, in 32 bits it's smaller. Still, nothing like 15K...

It's a shame that compiler can't handle C++ code, else it might be more
interesting to me.

I'd really like to see what the C version for arbitrary length strings
would look like though.

regards
Andy Little
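
One possible shape for an arbitrary-length version, offered only as a sketch and not as anyone's posted code: instead of copying into a fixed buffer, report each token as a pointer and a length into the source string. The Token struct and next_token are made-up names, and the code is deliberately C-style even though it is compiled as C++ here.

```cpp
#include <cstddef>

// A token is described by where it starts in the source and how long it
// is. Nothing is copied, so there is no ABRsize-style limit.
struct Token {
    const char* start;
    std::size_t length;
};

// Scans one token from src, delimited by tokchar. Returns a pointer just
// past the token and its delimiter, or to the terminating '\0'.
// Unlike toksplit, this sketch does not skip leading blanks.
const char* next_token(const char* src, char tokchar, Token* out)
{
    out->start = src;
    while (*src != '\0' && *src != tokchar) {
        ++src;
    }
    out->length = static_cast<std::size_t>(src - out->start);
    if (*src != '\0') {
        ++src;   // step over the delimiter
    }
    return src;
}
```

Driven by the same while (*t) loop as the toksplit benchmark, this reports consecutive delimiters as zero-length tokens. The catch is that the tokens are not NUL-terminated, so a caller has to print them with something like "%.*s" or copy them out.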

Richard Herring · May 9, 2006

Ian Collins said:
Thank goodness the C standard library saves you the trouble of writing
custom IO manipulators, to bottle all that up into something legible.

If val1, val2, val3, val4 are so intimately related, surely they should
have been encapsulated into an appropriate class or struct in the first
place.

Then the above reduces to

MyClass x;
std::cout << x;

Isn't that abominable?
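
A sketch of what that could look like. The member types are guesses taken from the printf format earlier in the thread ("%6s %2.2f %-18s:%u\n"), and formatting into a scratch stream is just one way to avoid leaving std::fixed and std::left set on the caller's stream.

```cpp
#include <iomanip>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical class bundling the four values; member types are assumed
// from the printf conversion specifiers.
struct MyClass {
    std::string val1;
    double      val2;
    std::string val3;
    unsigned    val4;
};

std::ostream& operator<<(std::ostream& os, const MyClass& x)
{
    std::ostringstream line;   // scratch stream: caller's flags stay untouched
    line << std::right << std::setw(6) << x.val1
         << ' ' << std::fixed << std::setprecision(2) << std::setw(2) << x.val2
         << ' ' << std::left << std::setw(18) << x.val3
         << ':' << x.val4 << '\n';
    return os << line.str();
}
```

The manipulator chain is still there, but it is written once, next to the type it formats, and every call site is just std::cout << x.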
