efficient string tokenizer

Alex

I'm looking for a fast way to split a string into individual tokens.
The typical format of the string is token1|token2|token3|...|tokenN|
and the number of tokens varies (which is why I use a vector to hold
the tokens). My current implementation uses the following method to
split up the string:


vector<char*> tokenize(char* str, char delim)
{
    vector<char> theString;
    vector<char*> tokenList;
    for(int i=0; i<(int)strlen(str); i++)
    {
        if(str[i] == delim)
        {
            char *pChar = new char[theString.size() + 1];
            memset(pChar, 0, theString.size() + 1);

            for (int f = 0; f < (int)theString.size(); f++)
            {
                pChar[f] = theString[f];
            }
            tokenList.push_back(pChar);
            theString.clear();
        }
        else
        {
            theString.push_back(str[i]);
        }
    }
    theString.clear();
    return tokenList;
}

I am pretty sure that this function is too big to inline. Does anyone
have any suggestions that would help speed up this function?
 

Ernest Brownlee

Alex said:
[.....]


The method used currently constructs two vectors of presumably 0 size,
scans the memory block, and for each iteration a vector has an element
inserted at its end (which could result in either vector growing
multiple times in the course of this function). Also, for every
iteration of the loop, there is a call to strlen(), and for every time
the delimiter is encountered memory is dynamically allocated for 'pChar'
(which I assume is deleted later). Then there is the possibility that there
is a copy routine that immediately follows the function, as I would
suspect that tokenList has to be copied into its destination variable.

First, I would remove the call to strlen() from the comparison, and use
a variable that stores the value returned by it. Technically, you don't
even need the call to strlen() since you are scanning the string
anyways. You could also reduce the number of times the vector has to
grow by reserve()ing some memory, and you would also gain some speed
from eliminating the dynamic allocation of memory. Maybe you could pass
a reference or pointer to the destination vector<char*> to the function
to prevent invoking a copy routine.
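Taken together, those suggestions might look like the sketch below (illustrative only, not Ernest's actual code, and untested against the original): the loop tests `str[i]` directly instead of calling strlen(), the result is an out-parameter so no copy is made on return, and reserve() limits reallocation.

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

// Sketch combining the suggestions above. Tokens are still allocated with
// new[], so the caller is responsible for delete[]-ing each one.
void tokenize(const char* str, char delim, std::vector<char*>& tokenList)
{
    tokenList.reserve(8);  // guess at a typical token count
    std::size_t start = 0;
    for (std::size_t i = 0; str[i] != '\0'; ++i) {
        if (str[i] == delim) {
            char* token = new char[i - start + 1];
            std::memcpy(token, str + start, i - start);  // copy straight from the source
            token[i - start] = '\0';
            tokenList.push_back(token);
            start = i + 1;  // next token begins after the delimiter
        }
    }
}
```

As in the original, any text after the last delimiter is dropped, which matches the token1|...|tokenN| format.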

I hope I was of help, a lot of my C++ knowledge seems to hide itself in
the depths of my memory when I am not currently working with it.

Good luck,
Ernest
 

David Hilsee

Alex said:
[.....]


I'd remove the memset() call since it's setting all of the memory to zero
when you really only need the last element of the array set to zero. Also,
I'd move the call to strlen() so it isn't called at every iteration of the
loop. If you wanted to go over the top, you might write code like this:

inline char * toarray( const vector<char>& v ) {
    char * ret = new char[ v.size()+1 ];
    if ( !v.empty() ) {
        memcpy( ret, &v[0], v.size() );
    }
    ret[ v.size() ] = '\0';
    return ret;
}

vector<char *> tokenize ( char * in, char delim ) {
    vector<char *> tokens;
    vector<char> currToken;

    size_t len = strlen( in );
    // some reasonable, arbitrary guesses for sizes
    tokens.reserve( len );
    currToken.reserve( 8 );

    for( size_t i = 0; i < len; ++i ) {
        if ( in[i] == delim ) {
            tokens.push_back( toarray(currToken) );
            currToken.clear();
        }
        else {
            currToken.push_back( in[i] );
        }
    }

    tokens.push_back( toarray(currToken) );
    return tokens;
}

But I probably wouldn't bother with all of that. Also, I think it would be
simpler if the code used std::string or std::vector<char> instead of C-style
strings, because C-style strings can be a headache and code that uses
std::vector<char*> could accidentally leak memory when an exception occurs.
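A std::string-based version along the lines David describes might look like this (a sketch, not his code): std::string::find locates each delimiter, there is no manual new[]/delete[], and nothing leaks if push_back throws.

```cpp
#include <string>
#include <vector>

// Sketch: std::string tokenizer. Like the original poster's code, text after
// the last delimiter is dropped, matching the token1|...|tokenN| format.
std::vector<std::string> tokenize(const std::string& str, char delim)
{
    std::vector<std::string> tokens;
    std::string::size_type start = 0, pos;
    while ((pos = str.find(delim, start)) != std::string::npos) {
        tokens.push_back(str.substr(start, pos - start));  // token before the delimiter
        start = pos + 1;                                   // skip past the delimiter
    }
    return tokens;
}
```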
 

John Harrison

[.....]


Quite a few.

1) Don't call strlen every time around the loop (in fact don't call strlen
at all).

2) Don't use a temporary vector (theString).

3) Don't use memset, when you are simply overwriting those same characters
later on.

4) Don't return vectors by value, pass a reference instead.

5) Consider using vector<string> instead of vector<char*>, it's easier
to handle and likely to be more efficient as well if you have an
implementation that uses the short string optimisation.

Overall this should be a lot more efficient

void tokenize(const char* str, char delim, vector<string>& tokenList)
{
    tokenList.clear();
    int start = -1;
    for (int i = 0; str[i]; ++i)
    {
        if (str[i] == delim)
        {
            tokenList.push_back(string(str + start + 1, str + i));
            start = i;
        }
    }
}

This is untested code.

john
 

rossum

[.....]


Here is yet another alternative to go alongside all the other
suggestions:


// Tokenising an input string using a stringstream

#include <iostream>
#include <string>
#include <vector>
#include <sstream>
#include <cstdlib>


std::vector<std::string> tokenize(const std::string& tok_str,
                                  char delim) {
    std::vector<std::string> all_toks;
    std::stringstream my_sstream(tok_str);
    std::string temp;

    while (getline(my_sstream, temp, delim)) {
        all_toks.push_back(temp);
    } // end while

    return all_toks;
} // end tokenize()


int main() {
    const char SEPARATOR = '|';
    const std::string THE_STRING =
        "tokenA|tokenB|tokenC|tokenD|tokenE";
    std::vector<std::string> my_tokens;

    my_tokens = tokenize(THE_STRING, SEPARATOR);

    for (std::size_t i = 0; i < my_tokens.size(); ++i) {
        std::cout << "Token " << i << " = \""
                  << my_tokens[i] << "\"\n";
    } // end for

    system("pause");
    return EXIT_SUCCESS;
} // end main()



Points:
1 Using getline() avoids reinventing the wheel.
2 This option makes a copy of the input string into the stringstream,
to be avoided if the string is large.
3 Better to have the vector of parsed tokens as a reference
parameter, avoids yet another copy of the string.
4 IIRC a function with declarations cannot usually be inlined.
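Applying point 3 to the stringstream version gives something like the sketch below (illustrative only): the caller's vector is filled in place, so no copy of the token vector is made on return.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Sketch: same stringstream/getline approach, but the result vector is a
// reference parameter instead of the return value.
void tokenize(const std::string& tok_str, char delim,
              std::vector<std::string>& all_toks)
{
    all_toks.clear();
    std::stringstream my_sstream(tok_str);
    std::string temp;

    while (std::getline(my_sstream, temp, delim)) {
        all_toks.push_back(temp);
    }
}
```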


rossum
 
J

John Harrison

[.....]


It also has different behaviour from the original code.

tokens = tokenize("abc|123", '|');

is only one token by the original code and it is two tokens in your code.

I wouldn't be surprised if the original specification was mistaken however.

john
 

tom_usenet

I'm looking for a fast way to split a string into individual tokens.
The typical format of the string is token1|token2|token3|...|tokenN|
and the number of tokens varies (which is why i use a vector to hold
the tokens). My current implementation uses the following method to
split up the string:


vector<char*> tokenize(char* str, char delim)
{
vector<char> theString;
vector<char*> tokenList;
for(int i=0; i<(int)strlen(str); i++)
{
if(str[i] == delim)
{
char *pChar = new char[theString.size() + 1];


That line above is likely to be the main performance bottleneck -
allocating new memory for each token. There are approaches that avoid
this.
memset(pChar, 0, theString.size() + 1);

for (int f = 0; f < (int)theString.size(); f++)
{
pChar[f] = theString[f];
}
tokenList.push_back(pChar);
theString.clear();

The use of theString is another bottleneck - likely it causes further
memory allocation each time around.
}
else
{
theString.push_back(str[i]);
}
}
theString.clear();
return tokenList;
}

I am pretty sure that this function is too big to inline. Does anyone
have any suggestions that would help speed up this function?


inlining wouldn't help in any case - you'd save a few instructions vs
the thousands that the method might have to execute (memory allocation
is generally expensive). Here's a high performance variant:

#include <vector>
#include <string>
#include <cstring>

//NB this modifies the passed string!
std::vector<char*> tokenize(char* str, char delim)
{
    std::vector<char*> tokenList;
    tokenList.reserve(10);
    typedef std::string::size_type size_t;
    size_t const length = std::strlen(str);
    size_t start = 0;
    for(size_t i=0; i!=length; ++i)
    {
        if(str[i] == delim)
        {
            str[i] = '\0';
            tokenList.push_back(str + start);
            start = i + 1;
        }
    }
    tokenList.push_back(str + start);
    return tokenList;
}

#include <iostream>
#include <algorithm>
#include <iterator>

int main()
{
    char const s[] = "1|2|3|10|| ";
    char* p = new char[sizeof s];
    std::strcpy(p, s);
    std::vector<char*> tokens = tokenize(p, '|');
    std::copy(
        tokens.begin(),
        tokens.end(),
        std::ostream_iterator<char const*>(std::cout, "*")
    );
    std::cout << '\n';
}

Less error prone implementations are possible if you write a class to
represent a token, but overall I like the above solution as long as
sufficient care is taken. Another point is that if the string ends in
a token, you get an empty token at the end of the vector, which may or
may not be desirable. You might also improve performance on compilers
that don't implement the named return value optimization by passing
the tokenList vector in by reference rather than returning it.
Alternatively, you could use an output iterator as the last parameter.

Finally, it might be worth changing the signature to:

std::vector<char*> tokenize(char*& str, char delim);

since this will disable the biggest danger of accidentally passing a
string literal directly to the function.

Tom
 

rossum

[.....]


It also has different behaviour from the original code.

tokens = tokenize("abc|123", '|');

is only one token by the original code and it is two tokens in your code.

I wouldn't be surprised if the original specification was mistaken however.

john


Well spotted John, I hadn't noticed that in my brief testing.

Thanks,

rossum
 

Mark

On Mon, 02 Aug 2004 13:12:21 +0100, tom_usenet

[.....]

You might also improve performance on compilers
that don't implement the named return value optimization by passing
the tokenList vector in by reference rather than returning it.
Alternatively, you could use an output iterator as the last parameter.

Could you elaborate on 'named return value optimization'?

Mark
 

Alex

Thanks to everyone who responded. All your responses proved to be
very helpful. The new tokenizer I am using is a hybrid of all
suggestions and now takes roughly 0.76 ms, whereas the original took
3.69 ms. The final method I am using is the following:

void tokenizer(const char* str, char delim, vector<char*>& tokenList)
{
    tokenList.clear();
    int start = 0;
    for (int i = 0; str[i]; i++)
    {
        if (str[i] == delim)
        {
            char* token = new char[i-start+1];
            memcpy(token, &str[start], i-start);
            token[i-start] = '\0';
            tokenList.push_back(token);
            start = i + 1;
        }
    }
}
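One thing worth noting about this final version: every token is new[]-allocated, so the caller owns them and must delete[] each one. A small cleanup helper (illustrative, not part of the thread) might look like:

```cpp
#include <cstddef>
#include <vector>

// Illustrative helper: free every token produced by a vector<char*> tokenizer
// and leave the vector empty. Each element must have come from new char[].
void free_tokens(std::vector<char*>& tokenList)
{
    for (std::size_t i = 0; i < tokenList.size(); ++i) {
        delete[] tokenList[i];
    }
    tokenList.clear();
}
```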
 

tom_usenet

On Mon, 02 Aug 2004 13:12:21 +0100, tom_usenet

[.....]

Could you elaborate on 'named return value optimization'?

Read all about it
http://groups.google.co.uk/groups?q="named+return+value+optimization"

(or if you have the standard, read 12.8/15, or read the draft version:
the last paragraph on this page
http://www.csci.csusb.edu/dick/c++std/cd2/special.html)
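In brief (my summary, not Tom's wording): the named return value optimization lets the compiler construct a named local that is returned along every path directly in the caller's result object, eliding the copy. A sketch of the shape that qualifies:

```cpp
#include <string>
#include <vector>

// Sketch: 'tokens' is a single named local returned on every path, which is
// the pattern NRVO applies to -- the compiler may build it directly in the
// caller's storage, so returning by value need not copy.
std::vector<std::string> make_tokens()
{
    std::vector<std::string> tokens;   // named return value
    tokens.push_back("a");
    tokens.push_back("b");
    return tokens;                     // copy may be elided
}
```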

Tom
 
