String tokenizer comments desired

  • Thread starter Christopher Benson-Manica
  • Start date
C

Christopher Benson-Manica

The function in question follows:

vector<string>& tokenize(
const string& s,
vector<string>& v,
char delimiter=',' )
{
int delim_idx, begin_idx=0, len=s.length();

for( delim_idx=s.find_first_of(delimiter,begin_idx) ;
delim_idx >=0 && begin_idx < len ;
delim_idx=s.find_first_of(delimiter,begin_idx) ) {

v.push_back( s.substr(begin_idx,delim_idx-begin_idx) );
begin_idx=delim_idx+1;
}
if( begin_idx < len ) {
v.push_back( s.substr(begin_idx,len-begin_idx) );
}
return( v );
}

It seems to work well, but I'd appreciate suggestions regarding style,
technique, or subtle bugs I've missed. Thanks.
 
A

Andre Kostur

The function in question follows:

vector<string>& tokenize(
const string& s,
vector<string>& v,
char delimiter=',' )
{
int delim_idx, begin_idx=0, len=s.length();

These are not of a strictly "proper" type... they should be of type
std::string::size_type.
for( delim_idx=s.find_first_of(delimiter,begin_idx) ;
delim_idx >=0 && begin_idx < len ;
delim_idx=s.find_first_of(delimiter,begin_idx) ) {

Why aren't you checking against std::string::npos, which is the return
value of find_first_of if the thing you're looking for isn't found?
v.push_back( s.substr(begin_idx,delim_idx-begin_idx) );
begin_idx=delim_idx+1;
}
if( begin_idx < len ) {
v.push_back( s.substr(begin_idx,len-begin_idx) );
}
return( v );
}

It seems to work well, but I'd appreciate suggestions regarding style,
technique, or subtle bugs I've missed. Thanks.

I'd probably use a while loop instead of a for loop. IMHO: it's a better
documentation style. To me a for loop implies that you're going to do
something a predermined number of times, where the while loop is just "do
something until some condition is met". So I'd probably do something
like (psudocode):

begin_idx = 0;
delim_idx = find;

while (delim_idx != npos)
{
push string fragment onto vector
advance the begin_idx
delim_idx = find;
}

push last fragment onto vector
return
 
J

jose luis fernandez diaz

Hi,

This is my own version:

#include <string>
#include <vector>

#include "tokenize.h"


void Tokenize(const string& buffer,
vector<string>& tokens,
const char delimiter)
{
int pos = 0, pos_ant = 0;

pos = buffer.find(delimiter, pos_ant);
while (pos != string::npos)
{
string token = buffer.substr(pos_ant, pos-pos_ant);
tokens.push_back(token);
pos_ant = pos+1;
pos = buffer.find(delimiter, pos_ant);
}

if (!buffer.empty())
{
tokens.push_back(buffer.substr(pos_ant, buffer.size()-1));
}
}

Regards,
Jose Luis
 
M

Michiel Salters

Christopher Benson-Manica said:
The function in question follows:

vector<string>& tokenize(
const string& s,
vector<string>& v,
char delimiter=',' )
{
int delim_idx, begin_idx=0, len=s.length();

for( delim_idx=s.find_first_of(delimiter,begin_idx) ;
delim_idx >=0 && begin_idx < len ;
delim_idx=s.find_first_of(delimiter,begin_idx) ) {

v.push_back( s.substr(begin_idx,delim_idx-begin_idx) );
begin_idx=delim_idx+1;
}
if( begin_idx < len ) {
v.push_back( s.substr(begin_idx,len-begin_idx) );
}
return( v );
}

It seems to work well, but I'd appreciate suggestions regarding style,
technique, or subtle bugs I've missed. Thanks.

In addition to what Andre Kostur wrote, I'd suggest two extra
parameters. One is a template parameter; the function would
work just as well on any basic_string<CH>. The second is a
boolean parameter, whether empty strings are included in
the output. When the separator is ',', you probably want
to split "a,,b" in three strings. When ' ', and splitting
"a b", you probably want just two strings.

Regards,
Michiel Salters
 
D

David Rubin

jose said:
Hi,

This is my own version:

#include <string>
#include <vector>

#include "tokenize.h"


void Tokenize(const string& buffer,
vector<string>& tokens,
const char delimiter)
{
int pos = 0, pos_ant = 0;

pos = buffer.find(delimiter, pos_ant);
while (pos != string::npos)
{
string token = buffer.substr(pos_ant, pos-pos_ant);
tokens.push_back(token);
pos_ant = pos+1;
pos = buffer.find(delimiter, pos_ant);
}

if (!buffer.empty())
{
tokens.push_back(buffer.substr(pos_ant, buffer.size()-1));
}
}

One better?

#include <string>

template <typename InsertIter>
void
tokenize(const std::string& buf,
const std::string& delim,
InsertIter ii)
{
std::string::size_type sp = 0; /* start position */
std::string::size_type ep = -1; /* end position */

do{
sp = buf.find_first_not_of(delim, ep+1);
ep = buf.find_first_of(delim, sp);
if(sp != ep){
if(ep == buf.npos)
ep = buf.length();
*ii++ = buf.substr(sp, ep-sp);
}
}while(sp != buf.npos);
}

/david
 
C

Christopher Benson-Manica

Michiel Salters said:
In addition to what Andre Kostur wrote, I'd suggest two extra
parameters. One is a template parameter; the function would
work just as well on any basic_string<CH>. The second is a
boolean parameter, whether empty strings are included in
the output. When the separator is ',', you probably want
to split "a,,b" in three strings. When ' ', and splitting
"a b", you probably want just two strings.

The empty strings argument is a great idea - thanks!
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,769
Messages
2,569,580
Members
45,054
Latest member
TrimKetoBoost

Latest Threads

Top