efficient string tokenizer

Alex

I'm looking for a fast way to split a string into individual tokens.
The typical format of the string is token1|token2|token3|...|tokenN|
and the number of tokens varies (which is why I use a vector to hold
the tokens). My current implementation uses the following method to
split up the string:


vector<char*> tokenize(char* str, char delim)
{
    vector<char> theString;
    vector<char*> tokenList;
    for(int i=0; i<(int)strlen(str); i++)
    {
        if(str[i] == delim)
        {
            char *pChar = new char[theString.size() + 1];
            memset(pChar, 0, theString.size() + 1);

            for (int f = 0; f < (int)theString.size(); f++)
            {
                pChar[f] = theString[f];
            }
            tokenList.push_back(pChar);
            theString.clear();
        }
        else
        {
            theString.push_back(str[i]);
        }
    }
    theString.clear();
    return tokenList;
}

I am pretty sure that this function is too big to inline. Does anyone
have any suggestions that would help speed up this function?
 

Ernest Brownlee

Alex said:
[.....]


The method used currently constructs two vectors of presumably 0 size,
scans the memory block, and for each iteration a vector has an element
inserted at its end (which could result in either vector growing
multiple times in the course of this function). Also, for every
iteration of the loop, there is a call to strlen(), and for every time
the delimiter is encountered memory is dynamically allocated for 'pChar'
(which I assume is deleted later). Then there is the possibility that there
is a copy routine that immediately follows the function, as I would
suspect that tokenList has to be copied into its destination variable.

First, I would remove the call to strlen() from the comparison, and use
a variable that stores the value returned by it. Technically, you don't
even need the call to strlen() since you are scanning the string
anyways. You could also reduce the number of times the vector has to
grow by reserve()ing some memory, and you would also gain some speed
from eliminating the dynamic allocation of memory. Maybe you could pass
a reference or pointer to the destination vector<char*> to the function
to prevent invoking a copy routine.
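Taken together, those suggestions might look like the sketch below (illustrative only, not Ernest's actual code, and untested against the original): the loop tests `str[i]` directly instead of calling strlen(), the result is an out-parameter so no copy is made on return, and reserve() limits reallocation.

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

// Sketch combining the suggestions above. Tokens are still allocated with
// new[], so the caller is responsible for delete[]-ing each one.
void tokenize(const char* str, char delim, std::vector<char*>& tokenList)
{
    tokenList.reserve(8);  // guess at a typical token count
    std::size_t start = 0;
    for (std::size_t i = 0; str[i] != '\0'; ++i) {
        if (str[i] == delim) {
            char* token = new char[i - start + 1];
            std::memcpy(token, str + start, i - start);  // copy straight from the source
            token[i - start] = '\0';
            tokenList.push_back(token);
            start = i + 1;  // next token begins after the delimiter
        }
    }
}
```

As in the original, any text after the last delimiter is dropped, which matches the token1|...|tokenN| format.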

I hope I was of help, a lot of my C++ knowledge seems to hide itself in
the depths of my memory when I am not currently working with it.

Good luck,
Ernest
 

David Hilsee

Alex said:
[.....]


I'd remove the memset() call since it's setting all of the memory to zero
when you really only need the last element of the array set to zero. Also,
I'd move the call to strlen() so it isn't called at every iteration of the
loop. If you wanted to go over the top, you might write code like this:

inline char * toarray( const vector<char>& v ) {
    char * ret = new char[ v.size()+1 ];
    if ( !v.empty() ) {
        memcpy( ret, &v[0], v.size() );
    }
    ret[ v.size() ] = '\0';
    return ret;
}

vector<char *> tokenize ( char * in, char delim ) {
    vector<char *> tokens;
    vector<char> currToken;

    size_t len = strlen( in );
    // some reasonable, arbitrary guesses for sizes
    tokens.reserve( len );
    currToken.reserve( 8 );

    for( size_t i = 0; i < len; ++i ) {
        if ( in[i] == delim ) {
            tokens.push_back( toarray(currToken) );
            currToken.clear();
        }
        else {
            currToken.push_back( in[i] );
        }
    }

    tokens.push_back( toarray(currToken) );
    return tokens;
}

But I probably wouldn't bother with all of that. Also, I think it would be
simpler if the code used std::string or std::vector<char> instead of C-style
strings, because C-style strings can be a headache and code that uses
std::vector<char*> could accidentally leak memory when an exception occurs.
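A std::string-based version along the lines David describes might look like this (a sketch, not his code): std::string::find locates each delimiter, there is no manual new[]/delete[], and nothing leaks if push_back throws.

```cpp
#include <string>
#include <vector>

// Sketch: std::string tokenizer. Like the original poster's code, text after
// the last delimiter is dropped, matching the token1|...|tokenN| format.
std::vector<std::string> tokenize(const std::string& str, char delim)
{
    std::vector<std::string> tokens;
    std::string::size_type start = 0, pos;
    while ((pos = str.find(delim, start)) != std::string::npos) {
        tokens.push_back(str.substr(start, pos - start));  // token before the delimiter
        start = pos + 1;                                   // skip past the delimiter
    }
    return tokens;
}
```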
 

John Harrison

[.....]


Quite a few.

1) Don't call strlen every time around the loop (in fact don't call strlen
at all).

2) Don't use a temporary vector (theString).

3) Don't use memset, when you are simply overwriting those same characters
later on.

4) Don't return vectors by value, pass a reference instead.

5) Consider using vector<string> instead of vector<char*>, it's easier
to handle and likely to be more efficient as well if you have an
implementation that uses the short string optimisation.

Overall this should be a lot more efficient

void tokenize(const char* str, char delim, vector<string>& tokenList)
{
    tokenList.clear();
    int start = -1;
    for (int i = 0; str[i]; ++i)
    {
        if (str[i] == delim)
        {
            tokenList.push_back(string(str + start + 1, str + i));
            start = i;
        }
    }
}

This is untested code.

john
 

rossum

[.....]


Here is yet another alternative to go alongside all the other
suggestions:


// Tokenising an input string using a stringstream

#include <iostream>
#include <string>
#include <vector>
#include <sstream>
#include <cstdlib>


std::vector<std::string> tokenize(const std::string& tok_str,
                                  char delim) {
    std::vector<std::string> all_toks;
    std::stringstream my_sstream(tok_str);
    std::string temp;

    while (getline(my_sstream, temp, delim)) {
        all_toks.push_back(temp);
    } // end while

    return all_toks;
} // end tokenize()


int main() {
    const char SEPARATOR = '|';
    const std::string THE_STRING =
        "tokenA|tokenB|tokenC|tokenD|tokenE";
    std::vector<std::string> my_tokens;

    my_tokens = tokenize(THE_STRING, SEPARATOR);

    for (std::size_t i = 0; i < my_tokens.size(); ++i) {
        std::cout << "Token " << i << " = \""
                  << my_tokens[i] << "\"\n";
    } // end for

    system("pause");
    return EXIT_SUCCESS;
} // end main()



Points:
1 Using getline() avoids reinventing the wheel.
2 This option makes a copy of the input string into the stringstream,
to be avoided if the string is large.
3 Better to have the vector of parsed tokens as a reference
parameter, avoids yet another copy of the string.
4 IIRC a function with declarations cannot usually be inlined.
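Applying point 3 to the stringstream version gives something like the sketch below (illustrative only): the caller's vector is filled in place, so no copy of the token vector is made on return.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Sketch: same stringstream/getline approach, but the result vector is a
// reference parameter instead of the return value.
void tokenize(const std::string& tok_str, char delim,
              std::vector<std::string>& all_toks)
{
    all_toks.clear();
    std::stringstream my_sstream(tok_str);
    std::string temp;

    while (std::getline(my_sstream, temp, delim)) {
        all_toks.push_back(temp);
    }
}
```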


rossum
 
J

John Harrison

[.....]


It also has different behaviour from the original code.

tokens = tokenize("abc|123", '|');

is only one token by the original code and it is two tokens in your code.

I wouldn't be surprised if the original specification was mistaken however.

john
 

tom_usenet

I'm looking for a fast way to split a string into individual tokens.
The typical format of the string is token1|token2|token3|...|tokenN|
and the number of tokens varies (which is why i use a vector to hold
the tokens). My current implementation uses the following method to
split up the string:


vector<char*> tokenize(char* str, char delim)
{
vector<char> theString;
vector<char*> tokenList;
for(int i=0; i<(int)strlen(str); i++)
{
if(str[i] == delim)
{
char *pChar = new char[theString.size() + 1];


That line above is likely to be the main performance bottleneck -
allocating new memory for each token. There are approaches that avoid
this.
memset(pChar, 0, theString.size() + 1);

for (int f = 0; f < (int)theString.size(); f++)
{
pChar[f] = theString[f];
}
tokenList.push_back(pChar);
theString.clear();

The use of theString is another bottleneck - likely it causes further
memory allocation each time around.
}
else
{
theString.push_back(str[i]);
}
}
theString.clear();
return tokenList;
}

I am pretty sure that this function is too big to inline. Does anyone
have any suggestions that would help speed up this function?


inlining wouldn't help in any case - you'd save a few instructions vs
the thousands that the method might have to execute (memory allocation
is generally expensive). Here's a high performance variant:

#include <vector>
#include <string>
#include <cstring>

//NB this modifies the passed string!
std::vector<char*> tokenize(char* str, char delim)
{
    std::vector<char*> tokenList;
    tokenList.reserve(10);
    typedef std::string::size_type size_t;
    size_t const length = std::strlen(str);
    size_t start = 0;
    for(size_t i=0; i!=length; ++i)
    {
        if(str[i] == delim)
        {
            str[i] = '\0';
            tokenList.push_back(str + start);
            start = i + 1;
        }
    }
    tokenList.push_back(str + start);
    return tokenList;
}

#include <iostream>
#include <algorithm>
#include <iterator>

int main()
{
    char const s[] = "1|2|3|10|| ";
    char* p = new char[sizeof s];
    std::strcpy(p, s);
    std::vector<char*> tokens = tokenize(p, '|');
    std::copy(
        tokens.begin(),
        tokens.end(),
        std::ostream_iterator<char const*>(std::cout, "*")
    );
    std::cout << '\n';
}

Less error prone implementations are possible if you write a class to
represent a token, but overall I like the above solution as long as
sufficient care is taken. Another point is that if the string ends in
a token, you get an empty token at the end of the vector, which may or
may not be desirable. You might also improve performance on compilers
that don't implement the named return value optimization by passing
the tokenList vector in by reference rather than returning it.
Alternatively, you could use an output iterator as the last parameter.

Finally, it might be worth changing the signature to:

std::vector<char*> tokenize(char*& str, char delim);

since this will disable the biggest danger of accidentally passing a
string literal directly to the function.

Tom
 

rossum

[.....]


It also has different behaviour from the original code.

tokens = tokenize("abc|123", '|');

is only one token by the original code and it is two tokens in your code.

I wouldn't be surprised if the original specification was mistaken however.

john


Well spotted John, I hadn't noticed that in my brief testing.

Thanks,

rossum
 

Mark

On Mon, 02 Aug 2004 13:12:21 +0100, tom_usenet

[.....]

You might also improve performance on compilers
that don't implement the named return value optimization by passing
the tokenList vector in by reference rather than returning it.
Alternatively, you could use an output iterator as the last parameter.

Could you elaborate on 'named return value optimization'?

Mark
 

Alex

Thanks to everyone who responded. All your responses proved to be
very helpful. The new tokenizer I am using is a hybrid of all
suggestions and now takes roughly 0.76 ms, whereas the original took
3.69 ms. The final method I am using is the following:

void tokenizer(const char* str, char delim, vector<char*>& tokenList)
{
    tokenList.clear();
    int start = 0;
    for (int i = 0; str[i]; i++)
    {
        if (str[i] == delim)
        {
            char* token = new char[i-start+1];
            memcpy(token, &str[start], i-start);
            token[i-start] = '\0';
            tokenList.push_back(token);
            start = i + 1;
        }
    }
}
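One thing worth noting about this final version: every token is new[]-allocated, so the caller owns them and must delete[] each one. A small cleanup helper (illustrative, not part of the thread) might look like:

```cpp
#include <cstddef>
#include <vector>

// Illustrative helper: free every token produced by a vector<char*> tokenizer
// and leave the vector empty. Each element must have come from new char[].
void free_tokens(std::vector<char*>& tokenList)
{
    for (std::size_t i = 0; i < tokenList.size(); ++i) {
        delete[] tokenList[i];
    }
    tokenList.clear();
}
```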
 

tom_usenet

On Mon, 02 Aug 2004 13:12:21 +0100, tom_usenet

[.....]

Could you elaborate on 'named return value optimization'?

Read all about it
http://groups.google.co.uk/groups?q="named+return+value+optimization"

(or if you have the standard, read 12.8/15, or read the draft version:
the last paragraph on this page
http://www.csci.csusb.edu/dick/c++std/cd2/special.html)
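In brief (my summary, not Tom's wording): the named return value optimization lets the compiler construct a named local that is returned along every path directly in the caller's result object, eliding the copy. A sketch of the shape that qualifies:

```cpp
#include <string>
#include <vector>

// Sketch: 'tokens' is a single named local returned on every path, which is
// the pattern NRVO applies to -- the compiler may build it directly in the
// caller's storage, so returning by value need not copy.
std::vector<std::string> make_tokens()
{
    std::vector<std::string> tokens;   // named return value
    tokens.push_back("a");
    tokens.push_back("b");
    return tokens;                     // copy may be elided
}
```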

Tom
 
