string tokenizing


David Rubin

I looked on google for an answer, but I didn't find anything short of
using boost which sufficiently answers my question: what is a good way
of doing string tokenization (note: I cannot use boost). For example, I
have tried this:

#include <algorithm>
#include <cctype>
#include <climits>
#include <deque>
#include <iostream>
#include <iterator>
#include <string>

using namespace std;

int
main()
{
    string delim;
    int c;

    /* fill delim */
    for(c=0; c < CHAR_MAX; c++){ // I tried #include <limits>, but failed...
        if((isspace(c) || ispunct(c))
           && !(c == '_' || c == '#'))
            delim += c;
    }

    string buf;
    string::size_type op, np;
    deque<string> tok;

    while(std::getline(cin, buf) && !cin.fail()){
        op = 0;
        while((np=buf.find_first_of(delim, op)) != buf.npos){
            tok.push_back(string(&buf[op], np-op));
            if((op=buf.find_first_not_of(delim, np)) == buf.npos)
                break;
        }
        tok.push_back(string(&buf[op]));

        cout << buf << endl;
        copy(tok.begin(), tok.end(), ostream_iterator<string>(cout, "\n"));
        cout << endl;
        tok.clear();
    }
    return 0;
}

The inner loop basically finds tokens delimited by any character in
delim where multiple delimiters may appear between tokens (algorithm
follows some advice found on clc++). However, the method seems a little
clumsy, especially with respect to temporary objects. (Also, it does not
seem to work correctly. For example, the last token gets corrupted in
the second outer loop iteration.)

Also, it would be very nice to have a function like

int tokenize(const string& s, container<string>& c);

which returns the number of tokens inserted into the container.
However, how do you write this so c is any container model? I'm not sure
you can since they don't share a base class. Is there any better way?

Certainly, this is easy to do with a mix of C and C++:

for(char *t=strtok(buf, delim); t != 0; t=strtok(0, delim))
    tok.push_back(t);

where buf and delim are essentially char*'s. However, this seems
unsatisfactory as well.

/david
 

Mike Wahler

David Rubin said:
I looked on google for an answer, but I didn't find anything short of
using boost which sufficiently answers my question: what is a good way
of doing string tokenization (note: I cannot use boost). For example, I
have tried this:

Remarks below.
#include <algorithm>
#include <cctype>
#include <climits>
#include <deque>
#include <iostream>
#include <iterator>
#include <string>

using namespace std;

int
main()
{
    string delim;
    int c;

    /* fill delim */
    for(c=0; c < CHAR_MAX; c++)


Is there a particular reason you're excluding the
value 'CHAR_MAX' from the loop? (using < instead of <=)
{ // I tried #include <limits>, but failed...

What happened?

#include <limits>

std::numeric_limits<char>::max();

should work.

More below.
        if((isspace(c) || ispunct(c))
           && !(c == '_' || c == '#'))
            delim += c;
    }

    string buf;
    string::size_type op, np;
    deque<string> tok;

    while(std::getline(cin, buf) && !cin.fail()){
        op = 0;
        while((np=buf.find_first_of(delim, op)) != buf.npos){
            tok.push_back(string(&buf[op], np-op));
            if((op=buf.find_first_not_of(delim, np)) == buf.npos)
                break;
        }
        tok.push_back(string(&buf[op]));

        cout << buf << endl;
        copy(tok.begin(), tok.end(), ostream_iterator<string>(cout, "\n"));
        cout << endl;
        tok.clear();
    }
    return 0;
}

The inner loop basically finds tokens delimited by any character in
delim where multiple delimiters may appear between tokens (algorithm
follows some advice found on clc++). However, the method seems a little
clumsy, especially with respect to temporary objects. (Also, it does not
seem to work correctly. For example, the last token gets corrupted in
the second outer loop iteration.)

Also, it would be very nice to have a function like

int tokenize(const string& s, container<string>& c);

which returns the number of tokens inserted into the container.
However, how do you write this so c is any container model? I'm not sure
you can since they don't share a base class. Is there any better way?

I find your code interesting, so I'll probably play around
with it for a bit, and let you know if I have any ideas.

But here's some food for thought: one way to 'generalize'
container access is with iterators, as do the functions in
<algorithm>.

template <typename T>
typename T::size_type tokenize(const std::string& s,
                               typename T::iterator beg,
                               typename T::iterator end)
{
}

For inserting new elements, you can use an iterator
adapter, e.g. std::insert_iterator. You could even
use an output stream as a 'container' using
ostream_iterator.
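
For instance, a minimal sketch (the 'emit' function is hypothetical,
just a stand-in for whatever produces the tokens):

#include <deque>
#include <iostream>
#include <iterator>
#include <string>

// 'emit' is a made-up producer that writes tokens through
// whatever output iterator it is given
template <typename OutIter>
void emit(OutIter out)
{
    *out++ = std::string("one");
    *out++ = std::string("two");
}

int main()
{
    std::deque<std::string> d;

    // insert into a container via an iterator adapter
    emit(std::inserter(d, d.begin()));

    // or treat an output stream as the 'container'
    emit(std::ostream_iterator<std::string>(std::cout, "\n"));

    return 0;
}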
Certainly, this is easy to do with a mix of C and C++:

for(char *t=strtok(buf, delim); t != 0; t=strtok(0, delim))
    tok.push_back(t);

This contradicts your parameter type of const reference to string,
since 'strtok()' modifies its argument.
where buf and delim are essentially char*'s. However, this seems
unsatisfactory as well.

Yes, 'strtok()' can be problematic, if only for the reason that
it modifies its argument, necessitating creation of a copy if
you want to keep the argument const.
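
If you must use 'strtok()', the usual workaround is to run it over
a writable copy; a minimal sketch ('tokenize_c' is just a name I
made up for it):

#include <cstring>
#include <deque>
#include <string>

// copy the const string into a writable buffer, then let
// strtok() chop up the copy, leaving the argument untouched
void tokenize_c(const std::string& s, const std::string& delim,
                std::deque<std::string>& tok)
{
    char *copy = new char[s.size() + 1];
    std::strcpy(copy, s.c_str());

    for(char *t = std::strtok(copy, delim.c_str());
        t != 0;
        t = std::strtok(0, delim.c_str()))
        tok.push_back(t);

    delete [] copy;
}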

HTH,
-Mike
 

Mike Wahler

template <typename T>
typename T::size_type tokenize(const std::string& s,
                               typename T::iterator beg,
                               typename T::iterator end)
{
}

Oops, I meant to make those iterator parameters const refs
as well

const typename T::iterator& beg, const typename T::iterator& end

-Mike
 

David Rubin

Mike Wahler wrote:

[snip]
What happened?

foo.cc:9: limits: No such file or directory

For some reason, my compiler can't find the file. Otherwise, I agree
with you...
#include <limits>

std::numeric_limits<char>::max();

should work.
[snip]
But here's some food for thought: one way to 'generalize'
container access is with iterators, as do the functions in
<algorithm>.
template <typename T>
typename T::size_type tokenize(const std::string& s,
                               typename T::iterator beg,
                               typename T::iterator end)
{
}

This is a nice idea! I knew you should be able to do this, but I
couldn't see how. Here is the refactored code:

template <typename InsertIter>
void
tokenize(const string& buf, const string& delim, InsertIter ii)
{
    string word;
    string::size_type sp, ep; // start/end position

    sp = 0;
    do{
        sp = buf.find_first_not_of(delim, sp);
        ep = buf.find_first_of(delim, sp);
        if(sp != ep){
            if(ep == buf.npos)
                ep = buf.length();
            word = buf.substr(sp, ep-sp);
            *ii++ = lc(word); // lc() is a helper from my code (not shown)
            sp = buf.find_first_not_of(delim, ep+1);
        }
    }while(sp != buf.npos);

    if(sp != buf.npos){
        word = buf.substr(sp, buf.length()-sp);
        *ii++ = lc(word);
    }
}

called as

tokenize(buf, delim, insert_iterator<deque<string> >(tokens, tokens.begin()));

The original spec returned the number of tokens parsed. Now I have to
settle for checking

if(tokens.size() > 0){ ... }
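
Alternatively, I suppose tokenize itself could count the insertions
and return the count. Untested sketch:

template <typename InsertIter>
int tokenize(const string& buf, const string& delim, InsertIter ii)
{
    int n = 0;
    string::size_type sp = 0, ep;

    while((sp = buf.find_first_not_of(delim, sp)) != buf.npos){
        ep = buf.find_first_of(delim, sp);
        if(ep == buf.npos)
            ep = buf.length();
        *ii++ = buf.substr(sp, ep-sp);
        ++n;    // count each token as it is inserted
        sp = ep;
    }
    return n;
}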

/david
 

David Rubin

David Rubin wrote:

template <typename InsertIter>
void
tokenize(const string& buf, const string& delim, InsertIter ii)
{
    string::size_type sp, ep; // start/end position

    ep = -1; // unsigned wraparound: ep+1 == 0 on the first pass

    do{
        sp = buf.find_first_not_of(delim, ep+1);
        ep = buf.find_first_of(delim, sp);
        if(sp != ep){
            if(ep == buf.npos)
                ep = buf.length();
            *ii++ = buf.substr(sp, ep-sp);
        }
    }while(sp != buf.npos);
}

That's better. The 'ep+1' is a small optimization. I'm not sure it makes
any difference, and really, starting with ep=0 makes the code a bit
clearer.

/david
 

Mike Wahler

David Rubin said:
Mike Wahler wrote:

[snip]
What happened?

foo.cc:9: limits: No such file or directory

For some reason, my compiler can't find the file.

Configuration problem, installation problem, or perhaps
simply not provided by your implementation. Which one
are you using? More below.
Otherwise, I agree
with you...
#include <limits>

std::numeric_limits<char>::max();

should work.
But here's some food for thought: one way to 'generalize'
container access is with iterators, as do the functions in
<algorithm>.
template <typename T>
typename T::size_type tokenize(const std::string& s,
                               typename T::iterator beg,
                               typename T::iterator end)
{
}

This is a nice idea!


Yes it is. But not my idea. I "stole" it from
the standard library, the design of which is imo
rich with Good Ideas.

If you don't have the Josuttis book, get it.
www.josuttis.com/libbook

I knew you should be able to do this, but I
couldn't see how. Here is the refactored code:

[snip code] (I didn't look at it very closely, so if there were
obvious errors, I didn't see them)
called as

tokenize(buf, delim, insert_iterator<deque<string> >(tokens, tokens.begin()));

The original spec returned the number of tokens parsed. Now I have to
settle for checking

if(tokens.size() > 0){ ... }

or

if(!tokens.empty())

which *might* improve performance, and imo is
more expressive.

-Mike
 

Mike Wahler

David Rubin said:
David Rubin wrote:

template <typename InsertIter>
void
tokenize(const string& buf, const string& delim, InsertIter ii)

You might squeeze out some more performance by making
that last parameter a const reference.

{
    string::size_type sp, ep; // start/end position

    ep = -1; // unsigned wraparound: ep+1 == 0 on the first pass

    do{
        sp = buf.find_first_not_of(delim, ep+1);
        ep = buf.find_first_of(delim, sp);
        if(sp != ep){
            if(ep == buf.npos)
                ep = buf.length();
            *ii++ = buf.substr(sp, ep-sp);
        }
    }while(sp != buf.npos);
}

That's better. The 'ep+1' is a small optimization. I'm not sure it makes
any difference, and really, starting with ep=0 makes the code a bit
clearer.

"I love it when a plan comes together."
-George Peppard, as "Hannibal" in "The A Team"

:)

-Mike
 

SomeDumbGuy

Mike Wahler wrote:

You might squeeze out some more performance by making
that last parameter a const reference.

Sorry to bother you. I was wondering how making it a const reference
would help performance?
 

Mike Wahler

SomeDumbGuy said:
Mike Wahler wrote:



Sorry to bother you. I was wondering how making it a const reference
would help performance?

Note the word "might". Depending upon the implementation,
an iterator might be a simple pointer, but it also could
be an elaborate large structure, in which case passing
by reference could be faster than passing a copy by
value, and would still probably be just as fast as
pass by value if the iterator is represented with a
pointer type.

Since OP's quest was to 'generalize' for any container
type (via a template), we cannot know what the actual
representation of the iterator will be.

Also the same iterator type might be implemented in different
ways among library implementations, some large and complex,
some not. A reference can prevent possible performance
degradation, without having to be concerned whether it's
actually an issue.

The 'const' part of my suggestion doesn't really have anything
to do with performance. I say 'const' reference, since the
function does not need to modify the iterator's value. This
is the recommendation for any parameter of non-built-in type
which a function does not modify, especially when the parameter
type is a template argument, since you can't know how large and
complex it might be.

The "traditional wisdom" concerning parameter types is
essentially:

For non-built-in types or "unknown" types (specified by
a template argument), pass by const reference by default.

If the function needs to modify the caller's argument,
regardless of type, pass by nonconst reference.

If the parameter is always a built-in type and the function
need not modify the caller's argument, pass by value or
const reference.

Or something like that. :)
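
By way of illustration (function names are made up, just to make
the three cases concrete):

#include <string>

// non-built-in type, not modified: pass by const reference
std::string::size_type observe(const std::string& s)
{
    return s.size();
}

// the caller's argument is modified: nonconst reference
void modify(std::string& s)
{
    s += '!';
}

// built-in type, not modified: plain pass by value is fine
int scale(int n)
{
    return n * 2;
}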

Does that help?

-Mike
 

Jerry Coffin

I looked on google for an answer, but I didn't find anything short of
using boost which sufficiently answers my question: what is a good way
of doing string tokenization (note: I cannot use boost). For example, I
have tried this:

This may look a bit odd at first, but it works quite nicely. Its basic
idea is to hijack the tokenizer built into the standard iostream
classes, and put it to our purposes. The main thing necessary to do
that is to create a facet that classifies our delimiters as "space" and
everything else as not-"space". Once we have a stream using our facet,
we can simply read tokens from the stream and use them as we please.

Here's the code:

#include <iostream>
#include <deque>
#include <sstream>
#include <iterator>
#include <cctype>

#include "facet"

template <class T>
class delims : public ctype_table<T>
{
public:
    delims(size_t refs = 0)
        : ctype_table<T>(refs, ctype_table<T>::empty)
    {
        for (int i=0; i<this->table_size; i++)
            if((isspace(i) || ispunct(i)) && !(i == '_' || i == '#'))
                this->table()[this->widen(i)] = std::ctype_base::space;
    }
};

int main() {
    std::string buf;
    std::locale d(std::locale::classic(), new delims<char>);

    while (std::getline(std::cin, buf)) {
        // deque to hold tokens.
        std::deque<std::string> tok;

        // create istringstream from the input string and have it use our facet.
        std::istringstream is(buf);
        is.imbue(d);

        std::istream_iterator<std::string> in(is), end;

        // copy tokens from our stream into a deque
        std::copy(in, end,
                  std::back_inserter<std::deque<std::string> >(tok));

        // show tokens, one per line.
        std::copy(tok.begin(), tok.end(),
                  std::ostream_iterator<std::string>(std::cout, "\n"));
    }

    return 0;
}

Note that if we knew we were going to use std::cin for this, we could
just imbue std::cin with the locale we created, and read directly from
it into the deque:

// same facet as above.

int main() {
    std::locale d(std::locale::classic(), new delims<char>);
    std::cin.imbue(d);

    std::istream_iterator<std::string> in(std::cin), end;
    std::ostream_iterator<std::string> out(std::cout, "\n");

    std::copy(in, end, out);
    return 0;
}

Of course, even if you're planning to put the tokens into some
container, you can still imbue the input stream with the locale instead
of reading from stream to string, then imbuing a stringstream with the
locale.

The facet header I'm using contains some code I posted a while back --
it looks like this:

#include <locale>
#include <algorithm>

template<class T>
class table {
    typedef typename std::ctype<T>::mask tmask;

    tmask *t;
public:
    table() : t(new tmask[std::ctype<T>::table_size]) {}
    ~table() { delete [] t; }
    tmask *the_table() { return t; }
};

template<class T>
class ctype_table : table<T>, public std::ctype<T> {
protected:
    typedef typename std::ctype<T>::mask tmask;

    enum inits { empty, classic };

    ctype_table(size_t refs = 0, inits init=classic)
        : std::ctype<T>(this->the_table(), false, refs)
    {
        if (classic == init)
            std::copy(std::ctype<T>::classic_table(),
                      std::ctype<T>::classic_table() + std::ctype<T>::table_size,
                      this->the_table());
        else
            std::fill_n(this->the_table(), std::ctype<T>::table_size, tmask());
    }
public:
    tmask *table() {
        return this->the_table();
    }
};

This handles most of the dirty work of creating a facet, so about all
you're left with is specifying what characters you want treated as
delimiters, and then telling the stream to use a locale that includes
the facet.
 

SomeDumbGuy

Mike said:
Note the word "might". Depending upon the implementation,
an iterator might be a simple pointer, but it also could
be an elaborate large structure, in which case passing
by reference could be faster than passing a copy by
value, and would still probably be just as fast as
pass by value if the iterator is represented with a
pointer type.

The "traditional wisdom" concerning parameter types is
essentially:

For non-built-in types or "unknown" types (specified by
a template argument), pass by const reference by default.

If the function needs to modify the caller's argument,
regardless of type, pass by nonconst reference.

If the parameter is always a built-in type and the function
need not modify the caller's argument, pass by value or
const reference.

Or something like that. :)

Does that help?

Yes. :)
 

David Rubin

Jerry said:
This may look a bit odd at first, but it works quite nicely. Its basic
idea is to hijack the tokenizer built into the standard iostream
classes, and put it to our purposes. The main thing necessary to do
that is to create a facet that classifies our delimiters as "space" and
everything else as not-"space". Once we have a stream using our facet,
we can simply read tokens from the stream and use them as we please.

Interesting idea. It seems rather complex compared to the code I posted
though. One nice thing about your solution is that you encapsulate
delims in a class. I suppose you could extend the class a bit with
various methods to extend the delimiters (equivalent to isspace,
ispunct, etc), although this is reasonably accomplished via subclassing.
What are the other benefits of this approach compared to mine?

Also, I have a few questions...

[snip]
int main() {
std::locale d(std::locale::classic(), new delims<char>);

Can you explain the role of locales in a little more detail? Can't you
just skip this part in most cases?
std::cin.imbue(d);
std::istream_iterator<std::string> in(std::cin), end;
std::ostream_iterator<std::string> out(std::cout, "\n");
std::copy(in, end, out);

How does end function here? I've only seen copy specified with iterators
associated with a container. In this case, in and end seem to have no
association with each other.
return 0;
}

/david
 

David Rubin

Mike said:
You might squeeze out some more performance by making
that last parameter a const reference.

Actually, I tried defining this function as

template <typename InsertIter>
void tokenize(const string& buf, const string& delim, InsertIter& ii);

(and const InsertIter&) with little success. I got compiler errors both
times. For example, with the non-const reference, I got:

; g++ tokenize.cc
tokenize.cc: In function `int main()':
tokenize.cc:31: error: could not convert `
insert_iterator<std::deque<std::basic_string<char,
std::char_traits<char>,
std::allocator<char> >, std::allocator<std::basic_string<char,
std::char_traits<char>, std::allocator<char> > > > >((&tokens),
std::deque<_Tp, _Alloc>::begin() [with _Tp = std::string, _Alloc =
std::allocator<std::string>]())' to `
std::insert_iterator<std::deque<std::string,
std::allocator<std::string> > >&'
tokenize.cc:12: error: in passing argument 3 of `void tokenize(const
std::string&, const std::string&, InsertIter&) [with InsertIter =
std::insert_iterator<std::deque<std::string,
std::allocator<std::string> > >]'

when invoked (12) as

tokenize(buf, delim, insert_iterator<deque<string> >(tokens, tokens.begin()));

(g++ 3.3.1, Solaris 2.6).
 

Mike Wahler

David Rubin said:
Mike said:
You might squeeze out some more performance by making
that last parameter a const reference.

Actually, I tried defining this function as

template <typename InsertIter>
void tokenize(const string& buf, const string& delim, InsertIter& ii);

(and const InsertIter&) with little success. I got compiler errors both
times. For example, with the non-const reference, I got:

; g++ tokenize.cc
tokenize.cc: In function `int main()':
tokenize.cc:31: error: could not convert `
insert_iterator<std::deque<std::basic_string<char,
std::char_traits<char>,
std::allocator<char> >, std::allocator<std::basic_string<char,
std::char_traits<char>, std::allocator<char> > > > >((&tokens),
std::deque<_Tp, _Alloc>::begin() [with _Tp = std::string, _Alloc =
std::allocator<std::string>]())' to `
std::insert_iterator<std::deque<std::string,
std::allocator<std::string> > >&'
tokenize.cc:12: error: in passing argument 3 of `void tokenize(const
std::string&, const std::string&, InsertIter&) [with InsertIter =
std::insert_iterator<std::deque<std::string,
std::allocator<std::string> > >]'

when invoked (12) as

tokenize(buf, delim, insert_iterator<deque<string> >(tokens, tokens.begin()));

If it's really an issue for you, I'll take a look and
see if I can locate the problem.

-Mike
 

David Rubin

Mike said:
[...] I got compiler errors both
times. For example, with the non-const reference, I got:
[snip]
If it's really an issue for you, I'll take a look and
see if I can locate the problem.

I'd appreciate it. I just don't understand what the compiler is doing.
Thanks.

/david
 

Mike Wahler

David Rubin said:
Mike said:
[...] I got compiler errors both
times. For example, with the non-const reference, I got:
[snip]
If it's really an issue for you, I'll take a look and
see if I can locate the problem.

I'd appreciate it. I just don't understand what the compiler is doing.

I didn't get exactly the same diagnostics, but I did get
complaints about 'insert_iterator::operator++()', because
it modifies the iterator, so we cannot make it a const
reference, but it works for me as a nonconst reference:

#include <algorithm>
#include <deque>
#include <iterator>
#include <string>


template <typename InsertIter>
void tokenize(const std::string& buf,
              const std::string& delim,
              InsertIter& ii)
{
    std::string::size_type sp(0);  /* start position */
    std::string::size_type ep(-1); /* end position */

    do
    {
        sp = buf.find_first_not_of(delim, ep + 1);
        ep = buf.find_first_of(delim, sp);

        if(sp != ep)
        {
            if(ep == buf.npos)
                ep = buf.length();

            *ii++ = buf.substr(sp, ep-sp);
        }

    } while(sp != buf.npos);
}

int main()
{
    std::string buf("We* are/parsing [a---string");
    std::string delim(" */[-");
    std::deque<std::string> tokens;

    tokenize(buf, delim, std::inserter(tokens, tokens.begin()));

    std::copy(tokens.begin(), tokens.end(),
              std::ostream_iterator<std::string>(std::cout, "\n"));

    return 0;

}


Output:

We
are
parsing
a
string


HTH,
-Mike
 

Mike Wahler

Mike Wahler said:
std::copy(tokens.begin(), tokens.end(),
          std::ostream_iterator<std::string>(std::cout, "\n"));

I forgot the obvious #include <iostream> :)

(I suppose my implementation brought the decls in via
#include <iterator>, so didn't complain about no 'cout')

-Mike
 

Jerry Coffin

[ ... ]
Interesting idea. It seems rather complex compared to the code I posted
though.

Yes and no -- the framework for creating a facet is a bit more complex
than I'd like, but creating a facet is fairly easy and using it is
pretty nearly dead simple.
One nice thing about your solution is that you encapsulate
delims in a class. I suppose you could extend the class a bit with
various methods to extend the delimiters (equivalent to isspace,
ispunct, etc), although this is reasonably accomplished via subclassing.
What are the other benefits of this approach compared to mine?

Ease of use -- it supports normal stream extractors. Since the example
at hand was taking the tokens from a stream and putting them into a
deque, I used iterators and std::copy to copy from one to the other. It
could have been done with normal stream extraction though:

std::string token;

while (stream >> token)
    std::cout << token << std::endl;

would have worked just fine for the job at hand as well. For that
matter, to read them and put them into the deque, we could have done
something like:

std::string token;
std::deque<std::string> tok;

while (stream >> token)
    tok.push_back(token);


To make a long story short, once you've imbued a stream with the facet
that defines the delimiters between tokens, you don't have to learn
anything new at all -- you're just extracting strings from a stream.
Also, I have a few questions...

[snip]
int main() {
std::locale d(std::locale::classic(), new delims<char>);

Can you explain the role of locales in a little more detail? Can't you
just skip this part in most cases?

A locale is basically a collection of facets (one facet each of a number
of different types). A stream, however, only knows about a complete
locale, not individual facets. Therefore, we create an otherwise
normal locale, but have it use our ctype facet. The rest of it probably
doesn't matter for the job at hand, but (at least TTBOMK) there's no
function that tells a stream to use one facet at a time.
How does end function here? I've only seen copy specified with iterators
associated with a container. In this case, in and end seem to have no
association with each other.

An istream iterator that's not associated with a stream is basically the
equivalent of EOF -- another istream_iterator will become equal to it
when the end of the stream is encountered.

That's strictly related to using istream_iterator's though -- it's
completely independent of changing the facet to get the parsing we want.
Just for example, if we had a file of numbers (integers for the moment)
and we wanted to read those numbers into a vector, we could do it
similarly:

std::vector<int> v;

std::istream_iterator<int> in(std::cin), end;

std::copy(in, end, std::back_inserter<std::vector<int> >(v));

Similarly, if we have a file of floating point values that we wanted to
read into a list, we could do:

std::list<double> ld;

std::istream_iterator<double> in(std::cin), end;

std::copy(in, end, std::back_inserter<std::list<double> >(ld));

and so on.

To summarize: an istream_iterator allows us to treat the contents of a
stream like a collection, so we can apply a standard algorithm to its
contents (it's basically an input_iterator, so we can't, for example,
apply an algorithm that requires a random access iterator though).
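
For instance, to sort the tokens you'd copy them into a container
first, then sort -- a sketch:

#include <algorithm>
#include <iostream>
#include <iterator>
#include <string>
#include <vector>

int main()
{
    // istream_iterator is only an input iterator, so collect the
    // tokens into a vector before applying std::sort
    std::vector<std::string> v;
    std::istream_iterator<std::string> in(std::cin), end;
    std::copy(in, end, std::back_inserter(v));

    std::sort(v.begin(), v.end());

    std::copy(v.begin(), v.end(),
              std::ostream_iterator<std::string>(std::cout, "\n"));
    return 0;
}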
 

David Rubin

Mike Wahler wrote:

[snip]
template <typename InsertIter>
void tokenize(const std::string& buf,
const std::string& delim,
InsertIter& ii)
[snip]
int main()
{
std::string buf("We* are/parsing [a---string");
std::string delim(" */[-");
std::deque<std::string> tokens;

tokenize(buf, delim, std::inserter(tokens, tokens.begin()));

This only works with my compiler if I do

std::insert_iterator<std::deque<std::string> > ii(tokens, tokens.begin());
tokenize(buf, delim, ii);

Otherwise, I get the same errors as before. I guess this means my
compiler is broken?

; g++ --version
g++ (GCC) 3.2.2 20030222 (Red Hat Linux 3.2.2-5)

/david
 
