Tokenizer Function (plus rant on strtok documentation)

R

Robbie Hatley

A couple of days ago I dedecided to force myself to really learn
exactly what "strtok" does, and how to use it. I figured I'd
just look it up in some book and that would be that.

I figured wrong!

Firstly, Bjarne Stroustrup's "The C++ Programming Language" said:

(nothing)

Ok, how about a C book? Steven Prata's "C Primer Plus" said:

(nothing)

Aaarrrggg. Ok, how about good old Randy Schildt and his book
"C++: Complete Reference"? It said:

#include <cstring>
char *strtok(char *str1, const char *str2);
The strtok() function returns a pointer to the next token in
the string pointed to by str1. The characters making up the
string pointed to by str2 are the delimiters that determine
the token. A null pointer is returned when there is no token
to return.To tokenize a string, the first call to strtok()
must have str1 point to the string being tokenized. Subsequent
calls must use a null pointer for str1. In this way, the entire
string can be reduced to its tokens. It is possible to use a
different set of delimiters for each call to strtok() .

Ok. But when I tried using the function, it didn't do what I
expected at all. For one thing, it severely alters the contents
of its first argument. Randy Schildt's book doesn't mention that
little factoid at all. :-( Bad Randy!

I had to google this function and find info on it on the web in
order to find out how it really works. Turns out, there's lots
of things missing is Schildt's description. (But hey, at least
he tried. Most other C/C++ authors chicken out and won't even
touch strtok in their books.) This is how this function REALLY
works:

http://www.opengroup.org/onlinepubs/007908799/xsh/strtok_r.html

I wish more authors would cover this useful function in their
books. After all, it IS a part of both the C and C++ standard
libraries. Ok, I'm done ranting now.


For your amusement, here is a function I wrote to break a string
into tokens, given a string of "separator" characters, and put
the tokens in a std::vector<std::string> . I'm sure there's
various ways this could be improved. Comments? Slings? Arrows?


void
Tokenize
(
std::string const & RawText,
std::string const & Delimiters,
std::vector<std::string> & Tokens
)
{
// Load raw text into an appropriately-sized dynamic char array:
size_t StrSize = RawText.size();
size_t ArraySize = StrSize + 5;
char* Ptr = new char[ArraySize];
memset(Ptr, 0, ArraySize);
strncpy(Ptr, RawText.c_str(), StrSize);

// Clear the Tokens vector:
Tokens.clear();

// Get the tokens from the array and put them in the vector:
char* TokenPtr = NULL;
char* TempPtr = Ptr;
while (NULL != (TokenPtr = strtok(TempPtr, Delimiters.c_str())))
{
Tokens.push_back(std::string(TokenPtr));
TempPtr = NULL;
}

// Free memory and scram:
delete[] Ptr;
return;
}


--
Cheers,
Robbie Hatley
East Tustin, CA, USA
lone wolf intj at pac bell dot net
(put "[usenet]" in subject to bypass spam filter)
http://home.pacbell.net/earnur/
 
J

jmoy

Robbie said:
A couple of days ago I dedecided to force myself to really learn
exactly what "strtok" does, and how to use it.
This is how this function REALLY
works:

http://www.opengroup.org/onlinepubs/007908799/xsh/strtok_r.html

I wish more authors would cover this useful function in their
books. After all, it IS a part of both the C and C++ standard
libraries. Ok, I'm done ranting now.

strtok is one of the weird functions that maintain internal state, so
that you cannot tokenize two strings in an interleaved manner or use it
in a multithreaded program. POSIX offers a strtok_r which is somewhat
saner.
For your amusement, here is a function I wrote to break a string
into tokens, given a string of "separator" characters, and put
the tokens in a std::vector<std::string> . I'm sure there's
various ways this could be improved. Comments? Slings? Arrows?


void
Tokenize
(
std::string const & RawText,
std::string const & Delimiters,
std::vector<std::string> & Tokens
)
{
// Load raw text into an appropriately-sized dynamic char array:
size_t StrSize = RawText.size();
size_t ArraySize = StrSize + 5;
char* Ptr = new char[ArraySize];
memset(Ptr, 0, ArraySize);
strncpy(Ptr, RawText.c_str(), StrSize);

// Clear the Tokens vector:
Tokens.clear();

// Get the tokens from the array and put them in the vector:
char* TokenPtr = NULL;
char* TempPtr = Ptr;
while (NULL != (TokenPtr = strtok(TempPtr, Delimiters.c_str())))
{
Tokens.push_back(std::string(TokenPtr));
TempPtr = NULL;
}

// Free memory and scram:
delete[] Ptr;
return;
}

I guess tying the tokenizer to vector<string> is not a good idea. If it
took an output iterator it could be used with any container or even
with things like ostream_iterators. Here is my attempt, which also gets
rid of strtok:

#include <string>
using namespace std;
template <class OIter> void tokenize( const string &str,
const string &delim,
OIter oi)
{
typedef string::size_type Sz;

Sz begin=0;
while(begin<str.size()){
Sz end=str.find_first_of(delim,begin);
*oi++=str.substr(begin,end-begin);
begin=str.find_first_not_of(delim,end);
}
}

I use find_first_not_of in order to be compatible with strtok's
behaviour of treating multiple adjacent delimiters as a single
delimiter. I have not measured the performance of this version against
the strtok version.
 
S

sonison.james

jmoy said:
Robbie said:
A couple of days ago I dedecided to force myself to really learn
exactly what "strtok" does, and how to use it.
This is how this function REALLY
works:

http://www.opengroup.org/onlinepubs/007908799/xsh/strtok_r.html

I wish more authors would cover this useful function in their
books. After all, it IS a part of both the C and C++ standard
libraries. Ok, I'm done ranting now.

strtok is one of the weird functions that maintain internal state, so
that you cannot tokenize two strings in an interleaved manner or use it
in a multithreaded program. POSIX offers a strtok_r which is somewhat
saner.
For your amusement, here is a function I wrote to break a string
into tokens, given a string of "separator" characters, and put
the tokens in a std::vector<std::string> . I'm sure there's
various ways this could be improved. Comments? Slings? Arrows?


void
Tokenize
(
std::string const & RawText,
std::string const & Delimiters,
std::vector<std::string> & Tokens
)
{
// Load raw text into an appropriately-sized dynamic char array:
size_t StrSize = RawText.size();
size_t ArraySize = StrSize + 5;
char* Ptr = new char[ArraySize];
memset(Ptr, 0, ArraySize);
strncpy(Ptr, RawText.c_str(), StrSize);

// Clear the Tokens vector:
Tokens.clear();

// Get the tokens from the array and put them in the vector:
char* TokenPtr = NULL;
char* TempPtr = Ptr;
while (NULL != (TokenPtr = strtok(TempPtr, Delimiters.c_str())))
{
Tokens.push_back(std::string(TokenPtr));
TempPtr = NULL;
}

// Free memory and scram:
delete[] Ptr;
return;
}

I guess tying the tokenizer to vector<string> is not a good idea. If it
took an output iterator it could be used with any container or even
with things like ostream_iterators. Here is my attempt, which also gets
rid of strtok:

#include <string>
using namespace std;
template <class OIter> void tokenize( const string &str,
const string &delim,
OIter oi)
{
typedef string::size_type Sz;

Sz begin=0;
while(begin<str.size()){
Sz end=str.find_first_of(delim,begin);
*oi++=str.substr(begin,end-begin);
begin=str.find_first_not_of(delim,end);
}
}

I use find_first_not_of in order to be compatible with strtok's
behaviour of treating multiple adjacent delimiters as a single
delimiter. I have not measured the performance of this version against
the strtok version.

check out http://www.boost.org/libs/tokenizer/index.html
One cool thing about the boost tokenizer is that you can get NULL
tokens if you have adjacent separators, which I believe can't be
handled by strtok.

Thanks and regards
SJ
 
M

Mehturt

jmoy said:
Robbie said:
A couple of days ago I dedecided to force myself to really learn
exactly what "strtok" does, and how to use it.
This is how this function REALLY
works:

http://www.opengroup.org/onlinepubs/007908799/xsh/strtok_r.html

I wish more authors would cover this useful function in their
books. After all, it IS a part of both the C and C++ standard
libraries. Ok, I'm done ranting now.

strtok is one of the weird functions that maintain internal state, so
that you cannot tokenize two strings in an interleaved manner or use it
in a multithreaded program. POSIX offers a strtok_r which is somewhat
saner.
For your amusement, here is a function I wrote to break a string
into tokens, given a string of "separator" characters, and put
the tokens in a std::vector<std::string> . I'm sure there's
various ways this could be improved. Comments? Slings? Arrows?


void
Tokenize
(
std::string const & RawText,
std::string const & Delimiters,
std::vector<std::string> & Tokens
)
{
// Load raw text into an appropriately-sized dynamic char array:
size_t StrSize = RawText.size();
size_t ArraySize = StrSize + 5;
char* Ptr = new char[ArraySize];
memset(Ptr, 0, ArraySize);
strncpy(Ptr, RawText.c_str(), StrSize);

// Clear the Tokens vector:
Tokens.clear();

// Get the tokens from the array and put them in the vector:
char* TokenPtr = NULL;
char* TempPtr = Ptr;
while (NULL != (TokenPtr = strtok(TempPtr, Delimiters.c_str())))
{
Tokens.push_back(std::string(TokenPtr));
TempPtr = NULL;
}

// Free memory and scram:
delete[] Ptr;
return;
}

I guess tying the tokenizer to vector<string> is not a good idea. If it
took an output iterator it could be used with any container or even
with things like ostream_iterators. Here is my attempt, which also gets
rid of strtok:

#include <string>
using namespace std;
template <class OIter> void tokenize( const string &str,
const string &delim,
OIter oi)
{
typedef string::size_type Sz;

Sz begin=0;
while(begin<str.size()){
Sz end=str.find_first_of(delim,begin);
*oi++=str.substr(begin,end-begin);
begin=str.find_first_not_of(delim,end);
}
}

I like this implementation, but don't you assume the space for data
(tokens) is already pre-allocated?
If I use your fn with something like this, I get segmentation fault..

std::vector<std::string> v;
std::vector<std::string>::iterator it = v.begin();
 
R

Robbie Hatley

jmoy said:
strtok is one of the weird functions that maintain internal state, so
that you cannot tokenize two strings in an interleaved manner or use it
in a multithreaded program. POSIX offers a strtok_r which is somewhat
saner.

Ah, sort of like the code my ex-boss left me to maintain after he
got fired. Hundreds of global variables, which he uses to pass
data from function to function, like a dumbass. Of course, since
the program is a complex windows app with timers and interrupts,
the data often gets over-written on its way from one place to
another. ::sigh:: Global variables are the work of Sauron.
I guess tying the tokenizer to vector<string> is not a good idea.

It does limit the user to a std::vector<std::string>, yes. However,
that construct is pretty good for this app. I find it hard to
think of cases which couldn't use that to hold a bunch of tokens.
If it took an output iterator it could be used with any container
or even with things like ostream_iterators.

Provided that the output container was big enough. If you start
with an empty conainer and try writing to it using output
iterators, you'll get an "illegal memory access" or "general
protection fault" or some such thing. So you'd have to make sure
that the container was huge. I don't like that approach.
#include <string>
using namespace std;
template <class OIter> void tokenize( const string &str,
const string &delim,
OIter oi)
{
typedef string::size_type Sz;

Sz begin=0;
while(begin<str.size()){
Sz end=str.find_first_of(delim,begin);
*oi++=str.substr(begin,end-begin);
begin=str.find_first_not_of(delim,end);
}
}

I use find_first_not_of in order to be compatible with strtok's
behaviour of treating multiple adjacent delimiters as a single
delimiter. I have not measured the performance of this version against
the strtok version.

Alluring in its simplicity, yes. But has two major bugs:

1. Memory corruption danger if used to write to a small container.
2. You don't take into account the fact that the string might START
with one or more delimiters.

Maybe something like THIS might be better:

#include <string>
// using namespace std; // Ewww.
template <class Container>
void
tokenize
(
const std::string & str,
const std::string & delim,
Container & C
)
{
typedef std::string::size_type Sz;
Sz begin = 0;
Sz end = 0;
while (begin < str.size())
{
begin = str.find_first_not_of (delim, begin);
end = str.find_first_of (delim, begin);
Container.push_back(str.substr(begin, end-begin));
}
}

I haven't tested that, but I think something like that would work
better. It does require that the container for the tokens have
the push_back() method defined. Other than that, it's pretty
generic.

Note that to take care of the "starts with delimiters" case,
I simply moved your "first_not_of" up to the top of the loop.
That should work nicely.


--
Cheers,
Robbie Hatley
East Tustin, CA, USA
lone wolf intj at pac bell dot net
(put "[usenet]" in subject to bypass spam filter)
http://home.pacbell.net/earnur/
 
J

Jerry Coffin

[ ... ]
Maybe something like THIS might be better:

#include <string>
// using namespace std; // Ewww.
template <class Container>
void
tokenize
(
const std::string & str,
const std::string & delim,
Container & C
)
{
typedef std::string::size_type Sz;
Sz begin = 0;
Sz end = 0;
while (begin < str.size())
{
begin = str.find_first_not_of (delim, begin);
end = str.find_first_of (delim, begin);
Container.push_back(str.substr(begin, end-begin));
}
}

IMO, this is a poor idea. Take an iterator for the output. If the
user wants the data pushed onto the back, they can use back_inserter
to get that. If they want it inserted into something like a set, they
can use inserter to get that.
 
R

Robbie Hatley

Jerry Coffin said:
IMO, this is a poor idea. Take an iterator for the output.

Puts extreme burden on the user to provide the right kind of
container and iterator. Such a function would often get mis-used
and cause memory corruption and program crashes.
If the user wants the data pushed onto the back, they can
use back_inserter to get that.

If they know any better.
If they want it inserted into something like a set, they
can use inserter to get that.

If they know that they should, and if they know how.

So it really depends on which kind of function one wants to write:

1. Something efficient but dangerous, that requires having and
reading and understanding some external documentation to use it
correctly.

or

2. Something safe and easy and self-documenting, but a bit limited.

I can see use for both, actually. But the iterator version will
always be the more dangerous one.

--
Cheers,
Robbie Hatley
East Tustin, CA, USA
lone wolf intj at pac bell dot net
(put "[usenet]" in subject to bypass spam filter)
http://home.pacbell.net/earnur/
 
J

Jerry Coffin

Puts extreme burden on the user to provide the right kind of
container and iterator.

IMO, it's not extreme at all. They're going to have to provide the
right kind of container in any case -- but the code you provided will
often _prevent_ them from using the right container. Just for
example, putting the output into a set might well make sense -- but
your code simply won't work with it at all.

[ ... ]
I can see use for both, actually. But the iterator version will
always be the more dangerous one.

The iterator version is the only one that really works. In any case,
for a programmer to become at all proficient in using C++, they need
to learn how to do this anyway -- look through most of the algorithms
in the standard library, and note that they also take an iterator to
tell them where to put the results -- with precisely the same result.
 
A

Alex Vinokur

Alex Vinokur wrote:
[slip]
--------------------------------------------------
Instead of
should be
http://groups.google.com/group/perfo/msg/f3c775cf7e3cdcf0
Sorry
--------------------------------------------------
[snip]

Alex Vinokur
email: alex DOT vinokur AT gmail DOT com
http://mathforum.org/library/view/10978.html
http://sourceforge.net/users/alexvn
 
J

jmoy

Robbie said:
Alluring in its simplicity, yes. But has two major bugs:

1. Memory corruption danger if used to write to a small container.

No. As mentioned by other posters, if you are adding tokens to a
container the right thing is to call the function with something like a
back_inserter in which case there is no memory corruption
2. You don't take into account the fact that the string might START
with one or more delimiters.

You are right. My mistake.
Maybe something like THIS might be better:

#include <string>
// using namespace std; // Ewww.
template <class Container>
void
tokenize
(
const std::string & str,
const std::string & delim,
Container & C
)
{
typedef std::string::size_type Sz;
Sz begin = 0;
Sz end = 0;
while (begin < str.size())
{
begin = str.find_first_not_of (delim, begin);
end = str.find_first_of (delim, begin);
Container.push_back(str.substr(begin, end-begin));
}
}

The problem with this is that it fails for the reverse case of a string
ending with a delimiter. Also, I don't like the idea of tying
algorithms with containers---iterators are a much more general concept.
Here is a corrected version of my function:

template <class OIter> void tokenize( const string &str,
const string &delim,
OIter oi)
{
typedef string::size_type Sz;

Sz end=0;
for(;;){
Sz begin=str.find_first_not_of(delim,end);
if (begin==string::npos)
break;
end=str.find_first_of(delim,begin);
*oi++=str.substr(begin,end-begin);
}
}
 
J

Jerry Coffin

[ ... ]
template <class OIter> void tokenize( const string &str,
const string &delim,
OIter oi)
{
typedef string::size_type Sz;

Sz end=0;
for(;;){
Sz begin=str.find_first_not_of(delim,end);
if (begin==string::npos)
break;
end=str.find_first_of(delim,begin);
*oi++=str.substr(begin,end-begin);
}
}

I think I'd also make the character type a template parameter:

template class<charT, class OIter>
void tokenize( basic_string<charT> input,
basic_string<charT> delim,
OIter oi)
{
typedef basic_string<charT> str;
typedef str::size_type Sz;

Sz end = 0;
for (;;) {
Sz begin = input.find_first_not_of(delim, end);
if (str::npos == begin)
break;
end = input.find_first_of(delim, begin);
*oi++ = input.substr(begin, end-begin);
}
}

This way, the code can work with strings of either narrow or wide
characters.
 
N

Noah Roberts

Robbie said:
A couple of days ago I dedecided to force myself to really learn
exactly what "strtok" does, and how to use it.

Every once in a while I also have this massocistic urge to torture
myself for no reason. Eventually though I tire of it and move on.

The stroke function is useless. It is insecure, works in insecure
types, and is a general pita to use. There are better options that are
much easier to use and have type safety. Look into string streams as a
much better alternative to stroke.
 
R

Roland Pibinger

So it really depends on which kind of function one wants to write:

1. Something efficient but dangerous, that requires having and
reading and understanding some external documentation to use it
correctly.

or

2. Something safe and easy and self-documenting, but a bit limited.

The following tokenize function, derived from John Potter's
implementation
(http://groups.google.com/group/comp.lang.c++.moderated/browse_frm/thread/aa4daafacd01ce26),
is IMO safe (no output iterators) and efficient (no dynamic
allocation, no substr). Usable for all STL-like containers (except
map) with bidirectional iterators:

#include <algorithm>

template <typename StringT, typename ContainerT>
size_t tokenize (const StringT& text, const StringT& delim,
ContainerT& result) {
size_t num = 0;
typename StringT::size_type b = text.find_first_not_of(delim);
while (b != StringT::npos) {
typename StringT::size_type e(text.find_first_of(delim, b));
StringT s (text.c_str() + b, e - b);
result.insert (result.end(), StringT());
typename ContainerT::iterator iter = result.end();
(*--iter).swap (s);
++num;
b = text.find_first_not_of(delim, std::min(e, text.size()));
}
return num;
}

For std::vector as result container efficency can be increased with
reserve().

Best wishes,
Roland Pibinger
 
M

Mehturt

Roland said:
The following tokenize function, derived from John Potter's
implementation
(http://groups.google.com/group/comp.lang.c++.moderated/browse_frm/thread/aa4daafacd01ce26),
is IMO safe (no output iterators) and efficient (no dynamic
allocation, no substr). Usable for all STL-like containers (except
map) with bidirectional iterators:

#include <algorithm>

template <typename StringT, typename ContainerT>
size_t tokenize (const StringT& text, const StringT& delim,
ContainerT& result) {
size_t num = 0;
typename StringT::size_type b = text.find_first_not_of(delim);
while (b != StringT::npos) {
typename StringT::size_type e(text.find_first_of(delim, b));

I think e can be StringT::npos here and that would cause problems in
the line below..
 
R

Ron Natalie

jmoy said:
strtok is one of the weird functions that maintain internal state, so
that you cannot tokenize two strings in an interleaved manner or use it
in a multithreaded program. POSIX offers a strtok_r which is somewhat
saner.
Strtok should have never been allowed to enter the standard. It is
an abysmal function an a relic from the days when C programmers
were pretty lousy software engineers (it's even worse than the
stupid-assed "portable" I/O library that should have been allowed
to be made into the stdio.
 
R

Roland Pibinger

I think e can be StringT::npos here and that would cause problems in
the line below..

You are right. The while loop should rather be:

while (b != StringT::npos) {
typename StringT::size_type
e(std::min (text.find_first_of(delim, b), text.size()));
StringT s (text.c_str() + b, e - b);
result.insert (result.end(), StringT());
typename ContainerT::iterator iter = result.end();
(*--iter).swap (s);
++num;
b = text.find_first_not_of(delim, std::min(e, text.size()));
}

Thank you,
Roland Pibinger
 
M

Mehturt

Roland said:
You are right. The while loop should rather be:

while (b != StringT::npos) {
typename StringT::size_type
e(std::min (text.find_first_of(delim, b), text.size()));
StringT s (text.c_str() + b, e - b);
result.insert (result.end(), StringT());
typename ContainerT::iterator iter = result.end();
(*--iter).swap (s);
++num;
b = text.find_first_not_of(delim, std::min(e, text.size()));
}

Thank you,
Roland Pibinger

I have also found this impl is not usable with set, you only mentioned
it does not work with map..
Is that intentional?
 
R

Roland Pibinger

I have also found this impl is not usable with set, you only mentioned
it does not work with map. Is that intentional?

Let me add background information to the above implementation. The
performance characteristics for copying a std::string (the
std::basic_string template) are not specified by the C++ Standard.
Copying may be expensive ('deep') or cheap because of an optimization
(ref-counting or small-string optimization) - implementation specific.
The above code tries to avoid copying (and assignment) of non-empty
strings (assuming that copying an empty string is always cheap for
reasonable basic_sting implementations). As shown in the mentioned
thread with John Potter's original code an optimized version of the
tokenize function can considerably reduce dynamic memory allocation
for some string implementations and some usage scenarios.
I realize now that it's probably not a good idea to make the container
type a template parameter.That kind of optimization cannot be applied
to all Standard containers in the same way. Elements in a set e.g. are
sorted after insert and immutable (otherwise the sort order would be
compromised). Separate non-template implementations for each container
and each string type are preferable.

Best regards,
Roland Pibinger
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Similar Threads

C++ strtok 5
Can't solve problems! please Help 0
tokenizer class 1
strtok() 13
Why does strcat mess up the tokens in strtok (and strtok_r)? 92
tokenizer 6
Lexical Analysis on C++ 1
Array of structs function pointer 10

Members online

No members online now.

Forum statistics

Threads
473,770
Messages
2,569,583
Members
45,073
Latest member
DarinCeden

Latest Threads

Top