Efficent use of the string class

A

Andrew Brampton

Hi,
I'm new to using std classes, but I was wondering how I could do the
following efficiently with the string class:

Say I had a string with delimited values such as:
"field1,field2,field3,field4"
and I wanted to to place those 4 fields into different variables I would do
something like:

/* Note code quickly pulled from my head - may be wrong :) */
char *tmp_ptr;
char *ptr = input;

while (ptr != null) {
tmp_ptr = strchr(ptr, ',');
strncpy(somewhere, ptr, tmp_ptr - ptr);
ptr = tmp_ptr;
}

Now that starts with a pointer to the beginning of the file, and moves it
along... and each new search done by strchr starts where the last one left
off.

Now I'm faced with the same problem, but I wanted to use C++... so my code
would look like this:

string input = "blah,blah,blah,blah";
int pos = 0;
int tmppos = 0;

while (tmp != npos) {
tmppos = input.find_first_of(",", pos);
//do some kind of copy from pos, to tmppos
pos = tmppos;
}

This again loops through the string, but I'm worried on how C++ handles this
behind the scenes... Everytime I do input.find_first_of(.., pos) will it
moving in its internal array from 0 to pos, and then starts the search?...In
my previous case it didn't have to move from 0 to pos on each iteration
because I stored pos as a pointer..

So I was wondering if there is some way I can make input remember the last
place I started from inside the string, or, should I just use c_str() and
use my own char * to do everything?

Also it should be noted that I'm only asking this so I can learn best
practices, but say for example I wasn't doing this on small data, but data
megabytes big, I imagine using a pointer to remember where to start from
will be faster.

Any feedback is welcome..
Thanks
Andrew
 
K

Karl Heinz Buchegger

Andrew said:
string input = "blah,blah,blah,blah";
int pos = 0;
int tmppos = 0;

while (tmp != npos) {
tmppos = input.find_first_of(",", pos);
//do some kind of copy from pos, to tmppos
pos = tmppos;
}

This again loops through the string, but I'm worried on how C++ handles this
behind the scenes... Everytime I do input.find_first_of(.., pos) will it
moving in its internal array from 0 to pos, and then starts the search?...In
my previous case it didn't have to move from 0 to pos on each iteration
because I stored pos as a pointer..

What do you think is the second argument to find_first_of() for?
So I was wondering if there is some way I can make input remember the last
place I started from inside the string,

You already did this by using the pos variable.
 
S

Siemel Naran

while (ptr != null) {
tmp_ptr = strchr(ptr, ',');
strncpy(somewhere, ptr, tmp_ptr - ptr);
ptr = tmp_ptr;
}
Now I'm faced with the same problem, but I wanted to use C++... so my code
would look like this:

What works in C++ also works in C++.

The direct translation would probably to C++ style algorithms would be
std::find, but what you have works fine too and seems efficient. You can
time.
 
A

Andrew Brampton

Thanks for all the answers, but maybe I wasn't clear enough.

Take for example this:
string blah = "1234567890";
for (int i=0; i<10; i++) {
printf( blah.at(i) );
}

This can be improved in C by doing:
char [] blah = "1234567890"
for (char *ptr = blah; ptr < blah + 10; ptr++) {
printf( ptr );
}

In such a trival example the speed incrase is negletable... but say the
string blah was a megabyte big, then the 2nd example will be alot faster.

Now I was wondering if there was some similar way to do the 2nd example, but
in C++. Something I hope like this:
string blah = "1234567890";
string_pointer ptr = blah; //some magical class I wish exists
for (; ptr < blah.end() ; ptr++) {
printf( ptr )
}

I know in this basic example the iterator would be used... but refer to my
orginal question since I'm sure a iterator can't be used in that case.

Thanks
Andrew

P.S I know I can keep my C code and use it in C++, but I was wondering if
there was a C++ specific way of doing it.

| Hi,
| I'm new to using std classes, but I was wondering how I could do the
| following efficiently with the string class:
|
| Say I had a string with delimited values such as:
| "field1,field2,field3,field4"
| and I wanted to to place those 4 fields into different variables I would
do
| something like:
|
| /* Note code quickly pulled from my head - may be wrong :) */
| char *tmp_ptr;
| char *ptr = input;
|
| while (ptr != null) {
| tmp_ptr = strchr(ptr, ',');
| strncpy(somewhere, ptr, tmp_ptr - ptr);
| ptr = tmp_ptr;
| }
|
| Now that starts with a pointer to the beginning of the file, and moves it
| along... and each new search done by strchr starts where the last one left
| off.
|
| Now I'm faced with the same problem, but I wanted to use C++... so my code
| would look like this:
|
| string input = "blah,blah,blah,blah";
| int pos = 0;
| int tmppos = 0;
|
| while (tmp != npos) {
| tmppos = input.find_first_of(",", pos);
| //do some kind of copy from pos, to tmppos
| pos = tmppos;
| }
|
| This again loops through the string, but I'm worried on how C++ handles
this
| behind the scenes... Everytime I do input.find_first_of(.., pos) will it
| moving in its internal array from 0 to pos, and then starts the
search?...In
| my previous case it didn't have to move from 0 to pos on each iteration
| because I stored pos as a pointer..
|
| So I was wondering if there is some way I can make input remember the last
| place I started from inside the string, or, should I just use c_str() and
| use my own char * to do everything?
|
| Also it should be noted that I'm only asking this so I can learn best
| practices, but say for example I wasn't doing this on small data, but data
| megabytes big, I imagine using a pointer to remember where to start from
| will be faster.
|
| Any feedback is welcome..
| Thanks
| Andrew
|
|
|
|
 
J

John Harrison

Andrew Brampton said:
Thanks for all the answers, but maybe I wasn't clear enough.

Take for example this:
string blah = "1234567890";
for (int i=0; i<10; i++) {
printf( blah.at(i) );
}

This can be improved in C by doing:
char [] blah = "1234567890"
for (char *ptr = blah; ptr < blah + 10; ptr++) {
printf( ptr );

printf( *ptr );
}

In such a trival example the speed incrase is negletable... but say the
string blah was a megabyte big, then the 2nd example will be alot faster.

No it wouldn't. You seem to have some misconception about how integer
offsets and strings work together. Now sure the second example might be 10%
faster on a small string, but if so it would be 10% faster on any length of
string.
Now I was wondering if there was some similar way to do the 2nd example, but
in C++. Something I hope like this:
string blah = "1234567890";
string_pointer ptr = blah; //some magical class I wish exists
for (; ptr < blah.end() ; ptr++) {
printf( ptr )
}

I know in this basic example the iterator would be used... but refer to my
orginal question since I'm sure a iterator can't be used in that case.

You can use an iterator, use it in combination with std::find, instead of
std::string::find_first_of

But I really don't see what you are worred about, all these methods are
essentially equivalent.

john
 
A

Andrew Brampton

Hi,
thanks for trying to help, but I've just wrote a little benchmarking
program, which proves that its ALOT more than 10%
Here is the code (which works on windows, because it uses GetTickCount()):

DWORD now;
int limit = 1000000; //size of string
char temp; //used later to hold each char in the string
std::string blah; //create a string
blah.reserve(limit); //reserve limit size of memory

//File the string up with Xs, not used in the benchmark
for (int i = 0; i<limit; i++) {
blah.append("X");
}

now = GetTickCount(); //Used for timing

//Now we have a file full of Xs..
//In this example we won't print the X out, instead
//we will just set some varible it...kinda pointless :)
//but printf is slow and takes away what we are actually benchmarking
for (int i=0; i<limit; i++) {
temp = blah.at(i) ;
}

printf("Done 1 in %dms", GetTickCount() - now);
now = GetTickCount(); //Used for timing

//Now we do the same thing, but with c char pointers
char *blah_c = (char *)blah.c_str();
for (char *ptr = blah_c; ptr < blah_c + limit; ptr++) {
temp = *ptr;
}

printf("Done 2 in %dms\n", GetTickCount() - now);

Here are the results run 10 times:
Done 1 in 235ms Done 2 in 0ms
Done 1 in 235ms Done 2 in 15ms
Done 1 in 234ms Done 2 in 16ms
Done 1 in 234ms Done 2 in 0ms
Done 1 in 235ms Done 2 in 0ms
Done 1 in 265ms Done 2 in 0ms
Done 1 in 250ms Done 2 in 16ms
Done 1 in 234ms Done 2 in 16ms
Done 1 in 265ms Done 2 in 0ms
Done 1 in 234ms Done 2 in 0ms

You are welcome to point out any mistakes I've made, BUT the reason for
example 1 is slow is because its doing this:
blah.at(0)
//internal c array is currently pointed at 0
//work out 0 + 0
//set a c pointer to 0 + 0
//return memory that pointer is pointing at;
blah.at(1)
//internal c array is again pointed at 0
//work out 0 + 1
//set a c pointer to 0 + 1
//return memory that pointer is pointing at;
blah.at(2)
//internal c array is yet again pointed at 0
//work out 0 + 2
//set a c pointer to 0 + 2
//return memory that pointer is pointing at;

Now the code using the char pointer does this
*ptr
//return memory that ptr is pointing at;
ptr++;
//increment ptr;
*ptr
//return memory that ptr is pointing at;
ptr++
//increment ptr;
*ptr
//return memory that ptr is pointing at;
ptr++
//increment ptr;

If I was any good at asm I could most likly write that down in opcodes..

I can imagine that Iterators were designed to stop this being slow, because
the iterator keeps a internal pointer, just like my ptr did.

You suggested using find with iterators, this might be the solution to my
problem, but its getting late now and I have some other stuff to do now,
I'll try that out tommorow.

Thanks for the feedback
Andrew

|
| | > Thanks for all the answers, but maybe I wasn't clear enough.
| >
| > Take for example this:
| > string blah = "1234567890";
| > for (int i=0; i<10; i++) {
| > printf( blah.at(i) );
| > }
| >
| > This can be improved in C by doing:
| > char [] blah = "1234567890"
| > for (char *ptr = blah; ptr < blah + 10; ptr++) {
| > printf( ptr );
|
| printf( *ptr );
|
| > }
| >
| > In such a trival example the speed incrase is negletable... but say the
| > string blah was a megabyte big, then the 2nd example will be alot
faster.
|
| No it wouldn't. You seem to have some misconception about how integer
| offsets and strings work together. Now sure the second example might be
10%
| faster on a small string, but if so it would be 10% faster on any length
of
| string.
|
| >
| > Now I was wondering if there was some similar way to do the 2nd example,
| but
| > in C++. Something I hope like this:
| > string blah = "1234567890";
| > string_pointer ptr = blah; //some magical class I wish exists
| > for (; ptr < blah.end() ; ptr++) {
| > printf( ptr )
| > }
| >
| > I know in this basic example the iterator would be used... but refer to
my
| > orginal question since I'm sure a iterator can't be used in that case.
| >
|
| You can use an iterator, use it in combination with std::find, instead of
| std::string::find_first_of
|
| But I really don't see what you are worred about, all these methods are
| essentially equivalent.
|
| john
|
|
 
J

JKop

Andrew Brampton posted:
while (ptr != null)


Do you not find that degrading?


while (ptr)


It reminds me of people who are learning a language, for instance English,
and intentionally say "which" instead "that".


-JKop
 
S

Siemel Naran

JKop said:
Andrew Brampton posted:

Do you not find that degrading?

while (ptr)

It reminds me of people who are learning a language, for instance English,
and intentionally say "which" instead "that".

Neither (ptr != NULL) nor (ptr) should be considered better than the other.
Similarly, between putting braces on the same line and next line, neither
one is better. The thing to pay attention to is to follow a consistent
coding standard throughout your code.
 
S

Siemel Naran

You can use an iterator, use it in combination with std::find, instead of
std::string::find_first_of

On some implementations, std::string::const_iterator is just const char *;
so using std::find with iterators should be very fast.
 
J

John Harrison

Andrew Brampton said:
Hi,
thanks for trying to help, but I've just wrote a little benchmarking
program, which proves that its ALOT more than 10%
Here is the code (which works on windows, because it uses GetTickCount()):

DWORD now;
int limit = 1000000; //size of string
char temp; //used later to hold each char in the string
std::string blah; //create a string
blah.reserve(limit); //reserve limit size of memory

//File the string up with Xs, not used in the benchmark
for (int i = 0; i<limit; i++) {
blah.append("X");
}

now = GetTickCount(); //Used for timing

//Now we have a file full of Xs..
//In this example we won't print the X out, instead
//we will just set some varible it...kinda pointless :)
//but printf is slow and takes away what we are actually benchmarking
for (int i=0; i<limit; i++) {
temp = blah.at(i) ;
}

printf("Done 1 in %dms", GetTickCount() - now);
now = GetTickCount(); //Used for timing

//Now we do the same thing, but with c char pointers
char *blah_c = (char *)blah.c_str();
for (char *ptr = blah_c; ptr < blah_c + limit; ptr++) {
temp = *ptr;
}

printf("Done 2 in %dms\n", GetTickCount() - now);

Here are the results run 10 times:
Done 1 in 235ms Done 2 in 0ms
Done 1 in 235ms Done 2 in 15ms
Done 1 in 234ms Done 2 in 16ms
Done 1 in 234ms Done 2 in 0ms
Done 1 in 235ms Done 2 in 0ms
Done 1 in 265ms Done 2 in 0ms
Done 1 in 250ms Done 2 in 16ms
Done 1 in 234ms Done 2 in 16ms
Done 1 in 265ms Done 2 in 0ms
Done 1 in 234ms Done 2 in 0ms

You are welcome to point out any mistakes I've made, BUT the reason for
example 1 is slow is because its doing this:

Obviously I didn't express my self very well.

I made some changes to your code to express the point I'm making better and
to make the test fairer.

1) I compiled in release mode, so the compile could optimise the code.
blah.at(i) is a function call, dereferencing a pointer is not. But in
release mode the compiler may optimise away the function call.

2) I made temp a global variable, without this change my compiler is smart
enough to realise that both sections of code do nothing, and to optimise
away both sections entirely.

3) I replaced blah.at(i) with blah. This is a fairer comparison, at(i)
does range checking and throws an exception on an out of range index,
something neither blah or *ptr do.

4) I repeated each loop 10000 times to get better timings.

5) This is the point I was trying to make, I repeated the test for different
length strings

Here's the results

length 10000, string method 359, pointer method 156
length 100000, string method 3453, pointer method 1704
length 1000000, string method 36438, pointer method 26578

No I was obviously wrong about 10%, its more like 100%. But my point is that
it is 100% slower irrespective of the length of the string (although for the
longest string its less than that, probably some virtual memory effect is
kicking in).

I thought your were implying that things got worse as the string got longer,
but maybe you didn't mean that. My point is that you might see a percentage
difference between all these methods, and I doubt you'll see a method more
efficient than using a pointer, but its a constant proportion. Its not
dependent on the length of the string.

john
 
S

Siemel Naran

I thought your were implying that things got worse as the string got longer,
but maybe you didn't mean that. My point is that you might see a percentage
difference between all these methods, and I doubt you'll see a method more
efficient than using a pointer, but its a constant proportion. Its not
dependent on the length of the string.

It does, kind of. If the optimized process is 1 sec and the unefficient way
is 100% slower, than it is 2 secs. If the process is 1 hours, then the
unefficient way is 2 hours. Percentage way, no difference. But amount
wise, big difference.
 
U

Uwe Schnitker

Andrew Brampton said:
Hi,
thanks for trying to help, but I've just wrote a little benchmarking
program, which proves that its ALOT more than 10%
Here is the code (which works on windows, because it uses GetTickCount()):

DWORD now;
int limit = 1000000; //size of string
char temp; //used later to hold each char in the string
std::string blah; //create a string
blah.reserve(limit); //reserve limit size of memory

//File the string up with Xs, not used in the benchmark
for (int i = 0; i<limit; i++) {
blah.append("X");
}

now = GetTickCount(); //Used for timing

//Now we have a file full of Xs..
//In this example we won't print the X out, instead
//we will just set some varible it...kinda pointless :)
//but printf is slow and takes away what we are actually benchmarking
for (int i=0; i<limit; i++) {
temp = blah.at(i) ;

Please try to use blah instead of blah.at(i).

The at function performs range-checking, but the operator[] function
typically does not - like the C-style code.

Using at is like using
if(ptr < blah_c + limit) temp = *ptr;
in the C-style example (so just for fun you might want to benchmark
that, too).

Hopefully you did a release build (with optimization on) for your benchmarks?

The standard library depends a lot on inline functions, if those are not
optimized, a C++ program using it will easily be an - or several - order(s)
of magnitude slower than a C-style program.

Of course, since you throw away what is assigned to temp in every loop
iteration, the compiler might just decide to take out the whole loop
anyway :-<

You are welcome to point out any mistakes I've made, BUT the reason for
example 1 is slow is because its doing this:
blah.at(0)
//internal c array is currently pointed at 0
//work out 0 + 0
//set a c pointer to 0 + 0
//return memory that pointer is pointing at;
blah.at(1)
//internal c array is again pointed at 0
//work out 0 + 1
//set a c pointer to 0 + 1
//return memory that pointer is pointing at;
blah.at(2)
//internal c array is yet again pointed at 0
//work out 0 + 2
//set a c pointer to 0 + 2
//return memory that pointer is pointing at;

Now the code using the char pointer does this
*ptr
//return memory that ptr is pointing at;
ptr++;
//increment ptr;
*ptr
//return memory that ptr is pointing at;
ptr++
//increment ptr;
*ptr
//return memory that ptr is pointing at;
ptr++
//increment ptr;

If I was any good at asm I could most likly write that down in opcodes..

I can imagine that Iterators were designed to stop this being slow, because
the iterator keeps a internal pointer, just like my ptr did.

You suggested using find with iterators, this might be the solution to my
problem, but its getting late now and I have some other stuff to do now,
I'll try that out tommorow.

Your analysis is impressively straightforward, but your compiler can optimize
away that overhead in most cases. It might just transform the code from the
first example to the second one behind the scenes, and create (almost) the
same executable for both cases. (After replacing at with [])
Thanks for the feedback
Andrew

Uwe

<SNIP>
 
A

Andrew Brampton

No, sorry I was just implying, that say example 1 was going to take a
minute, example 2 would take half that time, and 30seconds is a lot of time
:)

Thanks
Andrew

|
| | > Hi,
| > thanks for trying to help, but I've just wrote a little benchmarking
| > program, which proves that its ALOT more than 10%
| > Here is the code (which works on windows, because it uses
GetTickCount()):
| >
| > DWORD now;
| > int limit = 1000000; //size of string
| > char temp; //used later to hold each char in the string
| > std::string blah; //create a string
| > blah.reserve(limit); //reserve limit size of memory
| >
| > //File the string up with Xs, not used in the benchmark
| > for (int i = 0; i<limit; i++) {
| > blah.append("X");
| > }
| >
| > now = GetTickCount(); //Used for timing
| >
| > //Now we have a file full of Xs..
| > //In this example we won't print the X out, instead
| > //we will just set some varible it...kinda pointless :)
| > //but printf is slow and takes away what we are actually benchmarking
| > for (int i=0; i<limit; i++) {
| > temp = blah.at(i) ;
| > }
| >
| > printf("Done 1 in %dms", GetTickCount() - now);
| > now = GetTickCount(); //Used for timing
| >
| > //Now we do the same thing, but with c char pointers
| > char *blah_c = (char *)blah.c_str();
| > for (char *ptr = blah_c; ptr < blah_c + limit; ptr++) {
| > temp = *ptr;
| > }
| >
| > printf("Done 2 in %dms\n", GetTickCount() - now);
| >
| > Here are the results run 10 times:
| > Done 1 in 235ms Done 2 in 0ms
| > Done 1 in 235ms Done 2 in 15ms
| > Done 1 in 234ms Done 2 in 16ms
| > Done 1 in 234ms Done 2 in 0ms
| > Done 1 in 235ms Done 2 in 0ms
| > Done 1 in 265ms Done 2 in 0ms
| > Done 1 in 250ms Done 2 in 16ms
| > Done 1 in 234ms Done 2 in 16ms
| > Done 1 in 265ms Done 2 in 0ms
| > Done 1 in 234ms Done 2 in 0ms
| >
| > You are welcome to point out any mistakes I've made, BUT the reason for
| > example 1 is slow is because its doing this:
|
| Obviously I didn't express my self very well.
|
| I made some changes to your code to express the point I'm making better
and
| to make the test fairer.
|
| 1) I compiled in release mode, so the compile could optimise the code.
| blah.at(i) is a function call, dereferencing a pointer is not. But in
| release mode the compiler may optimise away the function call.
|
| 2) I made temp a global variable, without this change my compiler is smart
| enough to realise that both sections of code do nothing, and to optimise
| away both sections entirely.
|
| 3) I replaced blah.at(i) with blah. This is a fairer comparison, at(i)
| does range checking and throws an exception on an out of range index,
| something neither blah or *ptr do.
|
| 4) I repeated each loop 10000 times to get better timings.
|
| 5) This is the point I was trying to make, I repeated the test for
different
| length strings
|
| Here's the results
|
| length 10000, string method 359, pointer method 156
| length 100000, string method 3453, pointer method 1704
| length 1000000, string method 36438, pointer method 26578
|
| No I was obviously wrong about 10%, its more like 100%. But my point is
that
| it is 100% slower irrespective of the length of the string (although for
the
| longest string its less than that, probably some virtual memory effect is
| kicking in).
|
| I thought your were implying that things got worse as the string got
longer,
| but maybe you didn't mean that. My point is that you might see a
percentage
| difference between all these methods, and I doubt you'll see a method more
| efficient than using a pointer, but its a constant proportion. Its not
| dependent on the length of the string.
|
| john
|
|
 
A

Andrew Brampton

I've found the answer to my problems now.

Three people mentioned the answer, but only as a side note. I'm starting to
see this trend in Newsgroups, you ask one question, and get the answer to 10
others which you didn't ask :)

But anyway thanks to all that answered, my eventual solution was to use
iterators with std::find... Before now I didn't know std::find existed
:).... and as a quick estimate the code runs nice and fast, faster than my
previous find_first_of, but not as fast as my C... However the point of this
exercise was to learn the most C++'s ish way to do something :)

Thanks to all
Andrew

| Hi,
| I'm new to using std classes, but I was wondering how I could do the
| following efficiently with the string class:
|
| Say I had a string with delimited values such as:
| "field1,field2,field3,field4"
| and I wanted to to place those 4 fields into different variables I would
do
| something like:
|
| /* Note code quickly pulled from my head - may be wrong :) */
| char *tmp_ptr;
| char *ptr = input;
|
| while (ptr != null) {
| tmp_ptr = strchr(ptr, ',');
| strncpy(somewhere, ptr, tmp_ptr - ptr);
| ptr = tmp_ptr;
| }
|
| Now that starts with a pointer to the beginning of the file, and moves it
| along... and each new search done by strchr starts where the last one left
| off.
|
| Now I'm faced with the same problem, but I wanted to use C++... so my code
| would look like this:
|
| string input = "blah,blah,blah,blah";
| int pos = 0;
| int tmppos = 0;
|
| while (tmp != npos) {
| tmppos = input.find_first_of(",", pos);
| //do some kind of copy from pos, to tmppos
| pos = tmppos;
| }
|
| This again loops through the string, but I'm worried on how C++ handles
this
| behind the scenes... Everytime I do input.find_first_of(.., pos) will it
| moving in its internal array from 0 to pos, and then starts the
search?...In
| my previous case it didn't have to move from 0 to pos on each iteration
| because I stored pos as a pointer..
|
| So I was wondering if there is some way I can make input remember the last
| place I started from inside the string, or, should I just use c_str() and
| use my own char * to do everything?
|
| Also it should be noted that I'm only asking this so I can learn best
| practices, but say for example I wasn't doing this on small data, but data
| megabytes big, I imagine using a pointer to remember where to start from
| will be faster.
|
| Any feedback is welcome..
| Thanks
| Andrew
|
|
|
|
 
D

Default User

Andrew said:
I've found the answer to my problems now.

Three people mentioned the answer, but only as a side note. I'm starting to
see this trend in Newsgroups, you ask one question, and get the answer to 10
others which you didn't ask :)

That's a feature, not a bug.

Another trend you should note on this newsgroup is the strong preference
for not using top-posting. Your replies belong following or interspersed
with properly trimmed quotes.



Brian Rodenborn
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,769
Messages
2,569,580
Members
45,054
Latest member
TrimKetoBoost

Latest Threads

Top