Efficiently Building Strings


Travis Parks

In most other languages, you have to be careful when building large strings. I never really worried about this in C++. Is there overhead in using the += operator over and over again? I am wondering if the class std::stringstream should be used to reduce overhead, or is it just implemented in terms of a std::string?
 

Ian Collins

On 05/22/12 03:55 PM, Travis Parks wrote:

Please wrap lines to something sensible!
In most other languages, you have to be careful when building large strings.. I never really worried about this in C++. Is there overhead in using the += operator over and over again? I am wondering if the class std::stringstream should be used to reduce overhead, or is it just implemented in terms of a std::string?

I don't think it's a matter of efficiency; ostringstream is appending to
a string under the hood. It is more a matter of what fits. If you are
adding text, then you may as well just use a string. If you are adding
something that requires formatting, use an ostringstream.
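
For example (just a sketch of the distinction):

#include <sstream>
#include <string>

int main()
{
    std::string greeting = "Hello, ";
    greeting += "world";                     // plain text: string and += are enough

    std::ostringstream out;
    out << "pi is roughly " << 3.14159;      // formatting a number: a stream fits better
    std::string formatted = out.str();
}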
 

Alf P. Steinbach

In most other languages, you have to be careful when building large strings.
I never really worried about this in C++. Is there overhead in using the +=
operator over and over again?

Not particularly. It's O(n) on average, where n is the length of the
final string.

In contrast, building a string by using infix "+" can be rather
inefficient, and if it's built up of many small pieces, typically O(n^2).
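
For instance (a sketch; the exact constants depend on the implementation):

#include <string>

int main()
{
    // Appending in place: each += is amortized O(1) per character,
    // so the whole loop is O(n) in the length of the final string.
    std::string a;
    for (int i = 0; i < 10000; ++i)
        a += "x";

    // Rebuilding with infix "+": every iteration copies everything built
    // so far into a fresh temporary, which is O(n^2) overall.
    std::string b;
    for (int i = 0; i < 10000; ++i)
        b = b + "x";
}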


I am wondering if the class std::stringstream should be used to reduce
overhead, or is it just implemented in terms of a std::string?

std::stringstream is convenient but usually, due to silly locale support
etc., it's quite inefficient.

In other words, use it as a convenient default converter, but be aware
of the performance issues.

Consider instead using a class like


<code>
#include <sstream>
#include <string>
using namespace std;

class S
{
private:
    string text_;

    // Convert any streamable value to a string via a temporary ostringstream.
    template< class Type >
    static string generalStringFrom( Type const& v )
    {
        ostringstream stream;
        stream << v;
        return stream.str();
    }

public:
    template< class Type >
    S& operator<<( Type const& v )
    {
        text_ += generalStringFrom( v );
        return *this;
    }

    operator char const* () const { return text_.c_str(); }
    operator string const& () const { return text_; }
};
</code>


It's easy to optimize by specializing `generalStringFrom` or a function
that it calls.

Anyway then you can write e.g.

void foo( char const s[] );

void bar()
{
    foo( S() << "The answer is " << 6*7 << "!" );
}

and the expression evaluation is O(n) in the size of the resulting string.


Cheers & hth.,

- Alf
 

SG

Not particularly. It's O(n) on average, where n is the length of the
final string.

I expect a quality implementation to have this behaviour. But I just
looked up whether += makes any complexity guarantees. It turns out
that += is defined in terms of .append, where many append overloads in
turn are defined in terms of the following member function

basic_string& append(const charT* s, size_type n);

(Effects: The function replaces the string controlled by *this with a
string of length size() + n whose first size() elements are a copy of
the original string controlled by *this and whose remaining elements
are a copy of the initial n elements of s)

But there's actually no sign of a complexity guarantee.

In contrast, building a string by using infix "+" can be rather
inefficient, and if it's built up of many small pieces, typically O(n^2).

We can expect this to change (C++2011) due to move semantics and
operator+ being also defined in terms of append -- at least with a
decent append implementation. ;)
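
For example, a sketch of what C++11 permits (actual performance still
depends on the implementation):

#include <string>

int main()
{
    std::string a = "alpha", b = "beta", c = "gamma", d = "delta";

    // Each intermediate result of operator+ is an rvalue, so the C++11
    // overload operator+(string&&, const string&) can append into the
    // existing buffer and move it along instead of copying it again.
    std::string s = a + b + c + d;
}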

Cheers!
 

Jorgen Grahn

In most other languages, you have to be careful when building large
strings. I never really worried about this in C++.

I have two partly conflicting comments on that:

- You probably shouldn't worry in any language until you can measure a
problem. On modern hardware and outside critical sections, you are
unlikely to notice.

- Building large strings is IME usually a sign of a design weakness.
There's often an alternative if you look more closely. E.g. using
POSIX writev() on a socket (sketched below) instead of assembling the
data yourself and sending it as one buffer to the socket. Or doing
needed post-processing on the fly while writing to a file. And so on.
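
A rough sketch of the writev() idea, assuming a POSIX system and an
already connected socket descriptor (the function name send_reply is
made up for illustration):

#include <sys/types.h>
#include <sys/uio.h>   // writev, struct iovec
#include <cstring>

// Hand the kernel the header and the body as separate pieces; writev
// gathers them on the way out, so we never build one big string ourselves.
ssize_t send_reply(int sock, const char* header, const char* body)
{
    iovec pieces[2];
    pieces[0].iov_base = const_cast<char*>(header);
    pieces[0].iov_len  = std::strlen(header);
    pieces[1].iov_base = const_cast<char*>(body);
    pieces[1].iov_len  = std::strlen(body);
    return writev(sock, pieces, 2);   // may write fewer bytes; check in real code
}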

/Jorgen
 

Travis Parks

I have two partly conflicting comments on that:

- You probably shouldn't worry in any language until you can measure a
problem. On modern hardware and outside critical sections, you are
unlikely to notice.

I only brought it up because this time through "The C++ Programming Language"
it occurred to me I never considered string building performance. I had a
professor once tell me that it can be really inefficient to read an entire
file into a single string using noskipws. Instead, he suggested using a
vector<char> and an ostream_iterator, passing the result to the string ctor.
My guess is this had to do with how strings were represented internally. I also
know that, unlike in most of the other languages I deal with, strings are mutable,
so they probably handle string building a little better.

- Building large strings is IME usually a sign of a design weakness.
There's often an alternative if you look more closely. E.g. using
POSIX writev() on a socket instead of assembling the data yourself and
sending it as one buffer to the socket. Or doing needed
post-processing on the fly while writing to file. And so on.

I usually spit out results immediately as output. I rarely build large strings,
since it is rarely necessary in C++, thanks to fewer format strings and
operator<<. Still, I was wondering if += had performance issues.
 

Luca Risolia

I only brought it up because this time through "The C++ Programming Language"
it occurred to me I never considered string building performance. I had a
professor once tell me that it can be really inefficient to read an entire
file into a single string using noskipws. Instead, he suggested using a
vector<char> and an ostream_iterator, passing the result to the string ctor.
My guess this had to do with how strings were represented internally. I also
know, unlike most of the other languages I deal with, strings are mutable, so
they probably handle string building a little better.

If you want to copy a file to a string, you don't need the vector. You
can build the string directly from stream buffer, as in the example
below. You can make it more efficient by preallocating the size of the
string and std::copy() to copy the buffer into the string (which will be
overwritten). Note that getting the size of the file can usually be done
in constant time, but there is no guarantee (it depends on the device).

#include <fstream>
#include <string>
#include <iterator>
#include <iostream>

using namespace std;

int main(int argc, char** argv) {
    ifstream f(argv[1], ifstream::in | ifstream::binary);
    istreambuf_iterator<char> i(f), eof;
    string s(i, eof);
    cout << s;
}
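
For completeness, a sketch of the preallocated variant mentioned above,
assuming the stream is seekable (a regular, non-empty file) so its size
can be obtained up front:

#include <algorithm>
#include <fstream>
#include <iterator>
#include <string>
#include <iostream>

using namespace std;

int main(int argc, char** argv) {
    ifstream f(argv[1], ifstream::in | ifstream::binary);
    f.seekg(0, ifstream::end);
    string s(static_cast<string::size_type>(f.tellg()), '\0'); // one allocation up front
    f.seekg(0);
    copy(istreambuf_iterator<char>(f), istreambuf_iterator<char>(),
         s.begin());                                           // overwrite the preallocated string
    cout << s;
}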
 

Jorgen Grahn

I don't know ... To me, part of the problem is that once you have the file
as a string, you've lost the one-line-at-a-time functionality of
getline() and friends, and have to reinvent it if you want it (or turn
the string back into a stringstream).

Also you risk failing, or suffering, or being impolite to other
processes due to excessive memory usage. You cannot know for sure all
files will be small, and users will not like "error: file too large to
be safely processed" error messages.

If you want to copy a file to a string, you don't need the vector. You
can build the string directly from stream buffer, as in the example
below. You can make it more efficient by preallocating the size of the
string and std::copy() to copy the buffer into the string (which will be
overwritten). Note that getting the size of the file can usually be done
in constant time, but there is no guarantee (it depends on the device).

What's "usual" also depends on what OS you use and which user
interface you choose. I'm on Unix, and I try to support reading from
standard input and writing to standard output if it makes any sense at
all. Not only is it impossible to tell the size then -- the size may
turn out to be infinite. (And processing an infinite file may
actually be desirable. In such scenarios you *really* don't want to
read the whole file before processing it.)
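
For instance, processing a line at a time straight from standard input
(a trivial sketch) keeps memory use bounded no matter how long the input
turns out to be:

#include <iostream>
#include <string>

int main()
{
    std::string line;
    while (std::getline(std::cin, line)) {
        // process each line as it arrives; nothing forces us to hold
        // the whole (possibly endless) input in memory at once
        std::cout << line << '\n';
    }
}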

/Jorgen
 

Juha Nieminen

Luca Risolia said:
int main(int argc, char** argv) {
ifstream f(argv[1], ifstream::in | ifstream::binary);
istreambuf_iterator<char > i(f), eof;
string s(i, eof);
cout << s;
}

Note that this is not the fastest possible way of reading a large file
because reading one byte at a time from a C++ stream is pretty slow.

If maximum reading speed is required, the best compromise between speed
and relative ease of usage is to use std::fread() instead. It would go
something like this:

std::FILE* inFile = std::fopen(argv[1], "rb");
if(!inFile) { std::perror(argv[1]); return 1; }
std::fseek(inFile, 0, SEEK_END);
std::string s(std::ftell(inFile), ' ');
std::rewind(inFile);
std::fread(&s[0], 1, s.length(), inFile);
std::fclose(inFile);

For example in my system reading a 700-megabyte file takes about 7.5 seconds
by using the istreambuf_iterator method, while with fread it takes about
1.5 seconds.

(If maximum reading speed is not imperative, or if input files are
expected to be small, then use whichever method is safest and most
convenient.)
 

Juha Nieminen

Scott Lurndal said:
std::FILE* inFile = std::fopen(argv[1], "rb");
if(!inFile) { std::perror(argv[1]); return 1; }
std::fseek(inFile, 0, SEEK_END);
std::string s(std::ftell(inFile), ' ');
std::rewind(inFile);
std::fread(&s[0], 1, s.length(), inFile);
std::fclose(inFile);


If I was really aiming for the fastest way to make an in-memory
string of a file, I'd use mmap:

The difference is that my example code above is portable while mmap
isn't.
 

Travis Parks

I only brought it up because this time through "The C++ Programming Language"
it occurred to me I never considered string building performance. I had a
professor once tell me that it can be really inefficient to read an entire
file into a single string using noskipws. Instead, he suggested using a
vector<char> and an ostream_iterator, passing the result to the string ctor.
My guess this had to do with how strings were represented internally. I also
know, unlike most of the other languages I deal with, strings are mutable, so
they probably handle string building a little better.

If you want to copy a file to a string, you don't need the vector. You
can build the string directly from stream buffer, as in the example
below. You can make it more efficient by preallocating the size of the
string and std::copy() to copy the buffer into the string (which will be
overwritten). Note that getting the size of the file can usually be done
in constant time, but there is no guarantee (it depends on the device).

#include <fstream>
#include <string>
#include <iterator>
#include <iostream>

using namespace std;

int main(int argc, char** argv) {
ifstream f(argv[1], ifstream::in | ifstream::binary);
istreambuf_iterator<char > i(f), eof;
string s(i, eof);
cout << s;
}

I was just using the file reading to explain my point about string building
taking a long time. Although, it is pretty cool to see so many potential
implementations. I was more concerned with building strings in code, like
HTML, XML or whatever. When building somewhat large strings that you can't
immediately output (say you plan on storing it in a database), is there
something better than string's += operator?
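
For concreteness, the kind of building I mean is something like this sketch
(the tag names and the reserve() estimate are made up); the question is
whether reserving up front and then appending is already about as good as
it gets:

#include <cstddef>
#include <string>
#include <vector>

std::string buildXml(const std::vector<std::string>& rows)
{
    std::string xml;
    xml.reserve(64 + rows.size() * 32);       // rough guess at the final size
    xml += "<rows>\n";
    for (std::size_t i = 0; i < rows.size(); ++i) {
        xml += "  <row>";
        xml += rows[i];
        xml += "</row>\n";
    }
    xml += "</rows>\n";
    return xml;
}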
 

Luca Risolia

Luca Risolia said:
int main(int argc, char** argv) {
ifstream f(argv[1], ifstream::in | ifstream::binary);
istreambuf_iterator<char> i(f), eof;
string s(i, eof);
cout<< s;
}

Note that this is not the fastest possible way of reading a large file
because reading one byte at a time from a C++ stream is pretty slow.

You can provide your preferred char type and use it with basic_string<>.

If maximum reading speed is required, the best compromise between speed
and relative ease of usage is to use std::fread() instead. It would go
something like this:

std::fseek(inFile, 0, SEEK_END);
std::string s(std::ftell(inFile), ' ');
std::rewind(inFile);

The seek() and rewind() pair can be done in constant time only if the
device supports random access, but they are slow (typically linear
time) with devices providing sequential access only.
 

Jorgen Grahn

If I was really aiming for the fastest way to make an in-memory
string of a file, I'd use mmap: ....

Unfortunately, there is no std::string constructor that takes a
pointer and length and makes _it_ the string (rather than copying it).

That would not be a string -- strings (like the containers) must own
their content. I agree, though, that something like that would be useful,
in some other class.
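
A minimal sketch of such a non-owning class (the name char_view is made up,
it is not a standard type): it only refers to characters owned elsewhere, so
the underlying buffer, e.g. an mmap'ed region, must outlive it.

#include <cstddef>

class char_view
{
public:
    char_view(const char* data, std::size_t size) : data_(data), size_(size) {}

    const char* data() const { return data_; }
    std::size_t size() const { return size_; }
    char operator[](std::size_t i) const { return data_[i]; }

private:
    const char* data_;   // not owned: whoever created the buffer keeps it alive
    std::size_t size_;
};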

/Jorgen
 

Luca Risolia

You can, of course, wrap it in a class and compile the windows version
or linux version of the "mapfile" class as necessary. Then it's just
a library call.

As I suspected, my fstream implementation on Linux already uses mmap()
internally, so the above mmap() optimization is pointless. I bet it's
the same on Windows. Once again, this shows that it's better to focus on
simple and portable C++ code, unless you are sure your implementation is
NOT already optimized.
 

Luca Risolia

Juha Nieminen said:
Scott Lurndal said:
std::FILE* inFile = std::fopen(argv[1], "rb");
if(!inFile) { std::perror(argv[1]); return 1; }
std::fseek(inFile, 0, SEEK_END);
std::string s(std::ftell(inFile), ' ');
std::rewind(inFile);
std::fread(&s[0], 1, s.length(), inFile);
std::fclose(inFile);


If I was really aiming for the fastest way to make an in-memory
string of a file, I'd use mmap:

The difference is that my example code above is portable while mmap
isn't.

You can, of course, wrap it in a class and compile the windows version
or linux version of the "mapfile" class as necessary. Then it's just
a library call.

This is probably the most portable and fastest way:

ifstream f(argv[1], ifstream::in | ifstream::binary);
ostringstream o;
o << f.rdbuf();
cout << o.str();
 

Juha Nieminen

Luca Risolia said:
This the most portable and fastest way probably:

ifstream f(argv[1], ifstream::in | ifstream::binary);
ostringstream o;
o << f.rdbuf();
cout << o.str();

Perhaps faster, but not fastest. In my system, with that same file, that
takes about 4.6 seconds (vs. the 1.5 seconds of the fread method).
 

Luca Risolia

Luca Risolia said:
This the most portable and fastest way probably:

ifstream f(argv[1], ifstream::in | ifstream::binary);
ostringstream o;
o<< f.rdbuf();
cout<< o.str();

Perhaps faster, but not fastest. In my system, with that same file, that
takes about 4.6 seconds (vs. the 1.5 seconds of the fread method).

What is your platform? Can you trace o << f.rdbuf() ? I am curious.
 

Juha Nieminen

Luca Risolia said:
Luca Risolia said:
This the most portable and fastest way probably:

ifstream f(argv[1], ifstream::in | ifstream::binary);
ostringstream o;
o<< f.rdbuf();
cout<< o.str();

Perhaps faster, but not fastest. In my system, with that same file, that
takes about 4.6 seconds (vs. the 1.5 seconds of the fread method).

What is your platform? Can you trace o << f.rdbuf() ? I am curious.

Linux.

I believe that one problem may be that with ostringstream's operator<<
the string is gradually resized as more data is read, while with the
fread method the size is set with one single allocation and the data is
read directly into it.

If I replace the fread code with a similar code that reads smaller chunks
and resizes the string to match, the reading time increases to over 3
seconds.
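
Roughly what I mean by that chunked variant (a sketch; the chunk size and
the function name are arbitrary):

#include <cstddef>
#include <cstdio>
#include <string>

// Read fixed-size blocks and grow the string as we go. The repeated
// growing and copying is what makes this slower than one preallocated read.
std::string readInChunks(std::FILE* in)
{
    std::string s;
    char chunk[65536];
    std::size_t n;
    while ((n = std::fread(chunk, 1, sizeof chunk, in)) > 0)
        s.append(chunk, n);
    return s;
}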
 

Pavel

Alf said:
Not particularly. It's O(n) on average, where n is the length of the final string.

In contrast, building a string by using infix "+" can be rather inefficient, and
if it's built up of many small pieces, typically O(n^2).

I wonder why the 26.3.1-3 relaxation for return types of valarray operations is
not extended to std::string's '+' operation. This would allow an intelligent
implementation to achieve the same complexity for + as for += in expressions
with multiple '+'; and my guess is that std::string's + syntax is used more
often than valarray's.
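
To illustrate the kind of thing an intelligent implementation could do behind
the scenes, here is a hand-written sketch (the concat helper is made up and
assumes C++11): compute the total length first, reserve once, then append, so
that gluing many pieces together stays O(n).

#include <initializer_list>
#include <string>

std::string concat(std::initializer_list<std::string> pieces)
{
    std::string::size_type total = 0;
    for (const std::string& p : pieces)
        total += p.size();

    std::string result;
    result.reserve(total);               // one allocation instead of one per '+'
    for (const std::string& p : pieces)
        result += p;
    return result;
}

// e.g. std::string s = concat({"The answer is ", "42", "!"});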


-Pavel
 

Pavel

Juha said:
Luca Risolia said:
int main(int argc, char** argv) {
ifstream f(argv[1], ifstream::in | ifstream::binary);
istreambuf_iterator<char> i(f), eof;
string s(i, eof);
cout<< s;
}

Note that this is not the fastest possible way of reading a large file
because reading one byte at a time from a C++ stream is pretty slow.

If maximum reading speed is required, the best compromise between speed
and relative ease of usage is to use std::fread() instead. It would go
something like this:

std::FILE* inFile = std::fopen(argv[1], "rb");
if(!inFile) { std::perror(argv[1]); return 1; }
std::fseek(inFile, 0, SEEK_END);
std::string s(std::ftell(inFile), ' ');
std::rewind(inFile);
std::fread(&s[0], 1, s.length(), inFile);

I know it is often assumed, but std::string's characters are not guaranteed to
be stored contiguously (despite many implementations' doing so).

std::fclose(inFile);

For example in my system reading a 700-megabyte file takes about 7.5 seconds
by using the istreambuf_iterator method, while with fread it takes about
1.5 seconds.

(If maximum reading speed is not imperative, or if input files are
expected to be small, then use whichever method is safest and most
convenient.)

Just a nit-pick
-Pavel
 
