String tokenizer comments desired

Christopher Benson-Manica · May 12, 2004

The function in question follows:

vector<string>& tokenize(
    const string& s,
    vector<string>& v,
    char delimiter=',' )
{
    int delim_idx, begin_idx=0, len=s.length();

    for( delim_idx=s.find_first_of(delimiter,begin_idx) ;
         delim_idx >=0 && begin_idx < len ;
         delim_idx=s.find_first_of(delimiter,begin_idx) ) {

        v.push_back( s.substr(begin_idx,delim_idx-begin_idx) );
        begin_idx=delim_idx+1;
    }
    if( begin_idx < len ) {
        v.push_back( s.substr(begin_idx,len-begin_idx) );
    }
    return( v );
}

It seems to work well, but I'd appreciate suggestions regarding style,
technique, or subtle bugs I've missed. Thanks.

Andre Kostur · May 13, 2004

The function in question follows:

vector<string>& tokenize(
    const string& s,
    vector<string>& v,
    char delimiter=',' )
{
    int delim_idx, begin_idx=0, len=s.length();

These are not of a strictly "proper" type... they should be of type
std::string::size_type.

    for( delim_idx=s.find_first_of(delimiter,begin_idx) ;
         delim_idx >=0 && begin_idx < len ;
         delim_idx=s.find_first_of(delimiter,begin_idx) ) {

Why aren't you checking against std::string::npos, which is the return
value of find_first_of if the thing you're looking for isn't found?

        v.push_back( s.substr(begin_idx,delim_idx-begin_idx) );
        begin_idx=delim_idx+1;
    }
    if( begin_idx < len ) {
        v.push_back( s.substr(begin_idx,len-begin_idx) );
    }
    return( v );
}

It seems to work well, but I'd appreciate suggestions regarding style,
technique, or subtle bugs I've missed. Thanks.

I'd probably use a while loop instead of a for loop. IMHO, it's better
documentation style. To me, a for loop implies that you're going to do
something a predetermined number of times, whereas a while loop is just "do
something until some condition is met". So I'd probably do something
like this (pseudocode):

begin_idx = 0;
delim_idx = find;

while (delim_idx != npos)
{
    push string fragment onto vector
    advance the begin_idx
    delim_idx = find;
}

push last fragment onto vector
return
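
For readers who want to try the while-loop shape, the pseudocode above could be fleshed out roughly as follows. This is only a sketch, not code from the thread: the name tokenize_while is hypothetical, the indices use std::string::size_type and the loop tests against std::string::npos as suggested earlier in this reply, and skipping an empty trailing fragment is an assumption carried over from the original function.

```cpp
#include <string>
#include <vector>

// Hypothetical name; a while-loop rendering of the pseudocode above.
std::vector<std::string>& tokenize_while(const std::string& s,
                                         std::vector<std::string>& v,
                                         char delimiter = ',')
{
    std::string::size_type begin_idx = 0;
    std::string::size_type delim_idx = s.find(delimiter, begin_idx);

    while (delim_idx != std::string::npos) {
        // Push the fragment between the previous delimiter and this one.
        v.push_back(s.substr(begin_idx, delim_idx - begin_idx));
        begin_idx = delim_idx + 1;
        delim_idx = s.find(delimiter, begin_idx);
    }

    // Assumption: like the original, skip an empty trailing fragment.
    if (begin_idx < s.length()) {
        v.push_back(s.substr(begin_idx));
    }
    return v;
}
```

With this version "a,,b" still yields an empty middle token, while "a," yields only "a".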

jose luis fernandez diaz · May 13, 2004

Hi,

This is my own version:

#include <string>
#include <vector>

#include "tokenize.h"

void Tokenize(const string& buffer,
              vector<string>& tokens,
              const char delimiter)
{
    int pos = 0, pos_ant = 0;

    pos = buffer.find(delimiter, pos_ant);
    while (pos != string::npos)
    {
        string token = buffer.substr(pos_ant, pos-pos_ant);
        tokens.push_back(token);
        pos_ant = pos+1;
        pos = buffer.find(delimiter, pos_ant);
    }

    if (!buffer.empty())
    {
        tokens.push_back(buffer.substr(pos_ant, buffer.size()-1));
    }
}

Regards,
Jose Luis

Michiel Salters · May 13, 2004

Christopher Benson-Manica said:
The function in question follows:

vector<string>& tokenize(
    const string& s,
    vector<string>& v,
    char delimiter=',' )
{
    int delim_idx, begin_idx=0, len=s.length();

    for( delim_idx=s.find_first_of(delimiter,begin_idx) ;
         delim_idx >=0 && begin_idx < len ;
         delim_idx=s.find_first_of(delimiter,begin_idx) ) {

        v.push_back( s.substr(begin_idx,delim_idx-begin_idx) );
        begin_idx=delim_idx+1;
    }
    if( begin_idx < len ) {
        v.push_back( s.substr(begin_idx,len-begin_idx) );
    }
    return( v );
}

It seems to work well, but I'd appreciate suggestions regarding style,
technique, or subtle bugs I've missed. Thanks.

In addition to what Andre Kostur wrote, I'd suggest two extra
parameters. One is a template parameter; the function would
work just as well on any basic_string<CH>. The second is a
boolean parameter, whether empty strings are included in
the output. When the separator is ',', you probably want
to split "a,,b" into three strings. When it is ' ', and you are splitting
"a  b", you probably want just two strings.

Regards,
Michiel Salters
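
One way the two suggested parameters might look is sketched below. Everything here is an assumption for illustration rather than code from the thread: the name tokenize_generic, the parameter name keep_empty, and the choice to treat a trailing delimiter as introducing one more (empty) token when keep_empty is true.

```cpp
#include <string>
#include <vector>

// Hypothetical sketch: works on any std::basic_string<CharT>; keep_empty
// controls whether empty tokens are added to the output.
template <typename CharT>
void tokenize_generic(const std::basic_string<CharT>& s,
                      std::vector<std::basic_string<CharT> >& out,
                      CharT delimiter,
                      bool keep_empty)
{
    typedef typename std::basic_string<CharT>::size_type size_type;
    size_type begin = 0;

    for (;;) {
        size_type end = s.find(delimiter, begin);
        if (end == std::basic_string<CharT>::npos) {
            end = s.length();          // last token runs to the end
        }
        if (keep_empty || end != begin) {
            out.push_back(s.substr(begin, end - begin));
        }
        if (end == s.length()) {
            break;                     // no more delimiters
        }
        begin = end + 1;               // step past the delimiter
    }
}
```

Called with ',' and keep_empty set to true, "a,,b" gives three tokens; called with ' ' and keep_empty set to false, a string with consecutive spaces between "a" and "b" gives two. The arguments need to be std::basic_string objects (not string literals) for CharT to be deduced.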

David Rubin · May 13, 2004

jose said:
Hi,

This is my own version:

#include <string>
#include <vector>

#include "tokenize.h"

void Tokenize(const string& buffer,
              vector<string>& tokens,
              const char delimiter)
{
    int pos = 0, pos_ant = 0;

    pos = buffer.find(delimiter, pos_ant);
    while (pos != string::npos)
    {
        string token = buffer.substr(pos_ant, pos-pos_ant);
        tokens.push_back(token);
        pos_ant = pos+1;
        pos = buffer.find(delimiter, pos_ant);
    }

    if (!buffer.empty())
    {
        tokens.push_back(buffer.substr(pos_ant, buffer.size()-1));
    }
}

One better?

#include <string>

template <typename InsertIter>
void
tokenize(const std::string& buf,
         const std::string& delim,
         InsertIter ii)
{
    std::string::size_type sp = 0; /* start position */
    std::string::size_type ep = -1; /* end position */

    do{
        sp = buf.find_first_not_of(delim, ep+1);
        ep = buf.find_first_of(delim, sp);
        if(sp != ep){
            if(ep == buf.npos)
                ep = buf.length();
            *ii++ = buf.substr(sp, ep-sp);
        }
    }while(sp != buf.npos);
}

/david
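
For anyone unfamiliar with insert iterators, here is a rough sketch of how the function above might be called. The tokenize body is reproduced from the post so the example is self-contained; the helper split_to_vector is a hypothetical name added only for illustration. Note that delim is treated as a set of delimiter characters (it goes through find_first_of), and that runs of delimiters never produce empty tokens.

```cpp
#include <iterator>
#include <string>
#include <vector>

// Reproduced from the post above so the example is self-contained.
template <typename InsertIter>
void tokenize(const std::string& buf, const std::string& delim, InsertIter ii)
{
    std::string::size_type sp = 0;   // start position
    std::string::size_type ep = -1;  // end position; ep+1 wraps around to 0

    do {
        sp = buf.find_first_not_of(delim, ep + 1);
        ep = buf.find_first_of(delim, sp);
        if (sp != ep) {
            if (ep == buf.npos)
                ep = buf.length();
            *ii++ = buf.substr(sp, ep - sp);
        }
    } while (sp != buf.npos);
}

// Hypothetical helper, not from the thread: collect the tokens in a vector
// by passing a std::back_inserter as the output iterator.
std::vector<std::string> split_to_vector(const std::string& buf,
                                         const std::string& delim)
{
    std::vector<std::string> tokens;
    tokenize(buf, delim, std::back_inserter(tokens));
    return tokens;
}
```

Because the output is an iterator, the same function could presumably fill a std::list or write to a stream through std::ostream_iterator without any change.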

Christopher Benson-Manica · May 13, 2004

Michiel Salters said:
In addition to what Andre Kostur wrote, I'd suggest two extra
parameters. One is a template parameter; the function would
work just as well on any basic_string<CH>. The second is a
boolean parameter, whether empty strings are included in
the output. When the separator is ',', you probably want
to split "a,,b" into three strings. When it is ' ', and you are splitting
"a  b", you probably want just two strings.

The empty strings argument is a great idea - thanks!
