Efficiently Building Strings


Travis Parks

In most other languages, you have to be careful when building large strings. I never really worried about this in C++. Is there overhead in using the += operator over and over again? I am wondering if the class std::stringstream should be used to reduce overhead, or is it just implemented in terms of a std::string?
 

Ian Collins

On 05/22/12 03:55 PM, Travis Parks wrote:

Please wrap lines to something sensible!
In most other languages, you have to be careful when building large strings.. I never really worried about this in C++. Is there overhead in using the += operator over and over again? I am wondering if the class std::stringstream should be used to reduce overhead, or is it just implemented in terms of a std::string?

I don't think it's a matter of efficiency; ostringstream is appending to
a string under the hood. It is more a matter of what fits. If you are
adding text, then you may as well just use a string. If you are adding
something that requires formatting, use an ostringstream.
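
For example (just a sketch of the distinction):

#include <sstream>
#include <string>

int main()
{
    std::string greeting = "Hello, ";
    greeting += "world";                     // plain text: string and += are enough

    std::ostringstream out;
    out << "pi is roughly " << 3.14159;      // formatting a number: a stream fits better
    std::string formatted = out.str();
}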
 

Alf P. Steinbach

In most other languages, you have to be careful when building large strings.
I never really worried about this in C++. Is there overhead in using the +=
operator over and over again?

Not particularly. It's O(n) on average, where n is the length of the
final string.

In contrast, building a string by using infix "+" can be rather
inefficient, and if it's built up of many small pieces, typically O(n^2).
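
For instance (a sketch; the exact constants depend on the implementation):

#include <string>

int main()
{
    // Appending in place: each += is amortized O(1) per character,
    // so the whole loop is O(n) in the length of the final string.
    std::string a;
    for (int i = 0; i < 10000; ++i)
        a += "x";

    // Rebuilding with infix "+": every iteration copies everything built
    // so far into a fresh temporary, which is O(n^2) overall.
    std::string b;
    for (int i = 0; i < 10000; ++i)
        b = b + "x";
}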


I am wondering if the class std::stringstream should be used to reduce
overhead, or is it just implemented in terms of a std::string?

std::stringstream is convenient but usually, due to silly locale support
etc., it's quite inefficient.

In other words, use it as a convenient default converter, but be aware
of the performance issues.

Consider instead using a class like


<code>
#include <sstream>
#include <string>
using namespace std;

class S
{
private:
    string text_;

    // Convert any streamable value to a string via a temporary ostringstream.
    template< class Type >
    static string generalStringFrom( Type const& v )
    {
        ostringstream stream;
        stream << v;
        return stream.str();
    }

public:
    template< class Type >
    S& operator<<( Type const& v )
    {
        text_ += generalStringFrom( v );
        return *this;
    }

    operator char const* () const { return text_.c_str(); }
    operator string const& () const { return text_; }
};
</code>


It's easy to optimize by specializing `generalStringFrom` or a function
that it calls.

Anyway then you can write e.g.

void foo( char const s[] );

void bar()
{
    foo( S() << "The answer is " << 6*7 << "!" );
}

and the expression evaluation is O(n) in the size of the resulting string.


Cheers & hth.,

- Alf
 

SG

Not particularly. It's O(n) on average, where n is the length of the
final string.

I expect a quality implementation to have this behaviour. But I just
looked up whether += makes any complexity guarantees. It turns out
that += is defined in terms of .append, where many append overloads in
turn are defined in terms of the following member function

basic_string& append(const charT* s, size_type n);

(Effects: The function replaces the string controlled by *this with a
string of length size() + n whose first size() elements are a copy of
the original string controlled by *this and whose remaining elements
are a copy of the initial n elements of s)

But there's actually no sign of a complexity guarantee.

In contrast, building a string by using infix "+" can be rather
inefficient, and if it's built up of many small pieces, typically O(n^2).

We can expect this to change (C++2011) due to move semantics and
operator+ being also defined in terms of append -- at least with a
decent append implementation. ;)
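
For example, a sketch of what C++11 permits (actual performance still
depends on the implementation):

#include <string>

int main()
{
    std::string a = "alpha", b = "beta", c = "gamma", d = "delta";

    // Each intermediate result of operator+ is an rvalue, so the C++11
    // overload operator+(string&&, const string&) can append into the
    // existing buffer and move it along instead of copying it again.
    std::string s = a + b + c + d;
}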

Cheers!
 

Jorgen Grahn

In most other languages, you have to be careful when building large
strings. I never really worried about this in C++.

I have two partly conflicting comments on that:

- You probably shouldn't worry in any language until you can measure a
problem. On modern hardware and outside critical sections, you are
unlikely to notice.

- Building large strings is IME usually a sign of a design weakness.
There's often an alternative if you look more closely. E.g. using
POSIX writev() on a socket (sketched below) instead of assembling the
data yourself and sending it as one buffer to the socket. Or doing
needed post-processing on the fly while writing to a file. And so on.
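
A rough sketch of the writev() idea, assuming a POSIX system and an
already connected socket descriptor (the function name send_reply is
made up for illustration):

#include <sys/types.h>
#include <sys/uio.h>   // writev, struct iovec
#include <cstring>

// Hand the kernel the header and the body as separate pieces; writev
// gathers them on the way out, so we never build one big string ourselves.
ssize_t send_reply(int sock, const char* header, const char* body)
{
    iovec pieces[2];
    pieces[0].iov_base = const_cast<char*>(header);
    pieces[0].iov_len  = std::strlen(header);
    pieces[1].iov_base = const_cast<char*>(body);
    pieces[1].iov_len  = std::strlen(body);
    return writev(sock, pieces, 2);   // may write fewer bytes; check in real code
}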

/Jorgen
 

Travis Parks

I have two partly conflicting comments on that:

- You probably shouldn't worry in any language until you can measure a
problem. On modern hardware and outside critical sections, you are
unlikely to notice.

I only brought it up because this time through "The C++ Programming Language"
it occurred to me I never considered string building performance. I had a
professor once tell me that it can be really inefficient to read an entire
file into a single string using noskipws. Instead, he suggested using a
vector<char> and an ostream_iterator, passing the result to the string ctor.
My guess is this had to do with how strings were represented internally. I also
know that, unlike in most of the other languages I deal with, strings are mutable,
so they probably handle string building a little better.

- Building large strings is IME usually a sign of a design weakness.
There's often an alternative if you look more closely. E.g. using
POSIX writev() on a socket instead of assembling the data yourself and
sending it as one buffer to the socket. Or doing needed
post-processing on the fly while writing to file. And so on.

I usually spit out results immediately as output. I rarely build large strings,
since it is rarely necessary in C++, thanks to fewer format strings and
operator<<. Still, I was wondering if += had performance issues.
 

Luca Risolia

I only brought it up because this time through "The C++ Programming Language"
it occurred to me I never considered string building performance. I had a
professor once tell me that it can be really inefficient to read an entire
file into a single string using noskipws. Instead, he suggested using a
vector<char> and an ostream_iterator, passing the result to the string ctor.
My guess this had to do with how strings were represented internally. I also
know, unlike most of the other languages I deal with, strings are mutable, so
they probably handle string building a little better.

If you want to copy a file to a string, you don't need the vector. You
can build the string directly from stream buffer, as in the example
below. You can make it more efficient by preallocating the size of the
string and std::copy() to copy the buffer into the string (which will be
overwritten). Note that getting the size of the file can usually be done
in constant time, but there is no guarantee (it depends on the device).

#include <fstream>
#include <string>
#include <iterator>
#include <iostream>

using namespace std;

int main(int argc, char** argv) {
    ifstream f(argv[1], ifstream::in | ifstream::binary);
    istreambuf_iterator<char> i(f), eof;
    string s(i, eof);
    cout << s;
}
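
For completeness, a sketch of the preallocated variant mentioned above,
assuming the stream is seekable (a regular, non-empty file) so its size
can be obtained up front:

#include <algorithm>
#include <fstream>
#include <iterator>
#include <string>
#include <iostream>

using namespace std;

int main(int argc, char** argv) {
    ifstream f(argv[1], ifstream::in | ifstream::binary);
    f.seekg(0, ifstream::end);
    string s(static_cast<string::size_type>(f.tellg()), '\0'); // one allocation up front
    f.seekg(0);
    copy(istreambuf_iterator<char>(f), istreambuf_iterator<char>(),
         s.begin());                                           // overwrite the preallocated string
    cout << s;
}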
 

Jorgen Grahn

I don't know ... To me, part of the problem is that once you have the file
as a string, you've lost the one-line-at-a-time functionality of
getline() and friends, and have to reinvent it if you want it (or turn
the string back into a stringstream).

Also you risk failing, or suffering, or being impolite to other
processes due to excessive memory usage. You cannot know for sure all
files will be small, and users will not like "error: file too large to
be safely processed" error messages.

If you want to copy a file to a string, you don't need the vector. You
can build the string directly from stream buffer, as in the example
below. You can make it more efficient by preallocating the size of the
string and std::copy() to copy the buffer into the string (which will be
overwritten). Note that getting the size of the file can usually be done
in constant time, but there is no guarantee (it depends on the device).

What's "usual" also depends on what OS you use and which user
interface you choose. I'm on Unix, and I try to support reading from
standard input and writing to standard output if it makes any sense at
all. Not only is it impossible to tell the size then -- the size may
turn out to be infinite. (And processing an infinite file may
actually be desirable. In such scenarios you *really* don't want to
read the whole file before processing it.)
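
For instance, processing a line at a time straight from standard input
(a trivial sketch) keeps memory use bounded no matter how long the input
turns out to be:

#include <iostream>
#include <string>

int main()
{
    std::string line;
    while (std::getline(std::cin, line)) {
        // process each line as it arrives; nothing forces us to hold
        // the whole (possibly endless) input in memory at once
        std::cout << line << '\n';
    }
}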

/Jorgen
 

Juha Nieminen

Luca Risolia said:
int main(int argc, char** argv) {
ifstream f(argv[1], ifstream::in | ifstream::binary);
istreambuf_iterator<char > i(f), eof;
string s(i, eof);
cout << s;
}

Note that this is not the fastest possible way of reading a large file
because reading one byte at a time from a C++ stream is pretty slow.

If maximum reading speed is required, the best compromise between speed
and relative ease of usage is to use std::fread() instead. It would go
something like this:

std::FILE* inFile = std::fopen(argv[1], "rb");
if(!inFile) { std::perror(argv[1]); return 1; }
std::fseek(inFile, 0, SEEK_END);
std::string s(std::ftell(inFile), ' ');
std::rewind(inFile);
std::fread(&s[0], 1, s.length(), inFile);
std::fclose(inFile);

For example in my system reading a 700-megabyte file takes about 7.5 seconds
by using the istreambuf_iterator method, while with fread it takes about
1.5 seconds.

(If maximum reading speed is not imperative, or if input files are
expected to be small, then use whichever method is safest and most
convenient.)
 

Juha Nieminen

Scott Lurndal said:
std::FILE* inFile = std::fopen(argv[1], "rb");
if(!inFile) { std::perror(argv[1]); return 1; }
std::fseek(inFile, 0, SEEK_END);
std::string s(std::ftell(inFile), ' ');
std::rewind(inFile);
std::fread(&s[0], 1, s.length(), inFile);
std::fclose(inFile);


If I was really aiming for the fastest way to make an in-memory
string of a file, I'd use mmap:

The difference is that my example code above is portable while mmap
isn't.
 

Travis Parks

I only brought it up because this time through "The C++ Programming Language"
it occurred to me I never considered string building performance. I had a
professor once tell me that it can be really inefficient to read an entire
file into a single string using noskipws. Instead, he suggested using a
vector<char> and an ostream_iterator, passing the result to the string ctor.
My guess this had to do with how strings were represented internally. I also
know, unlike most of the other languages I deal with, strings are mutable, so
they probably handle string building a little better.

If you want to copy a file to a string, you don't need the vector. You
can build the string directly from stream buffer, as in the example
below. You can make it more efficient by preallocating the size of the
string and std::copy() to copy the buffer into the string (which will be
overwritten). Note that getting the size of the file can usually be done
in constant time, but there is no guarantee (it depends on the device).

#include <fstream>
#include <string>
#include <iterator>
#include <iostream>

using namespace std;

int main(int argc, char** argv) {
ifstream f(argv[1], ifstream::in | ifstream::binary);
istreambuf_iterator<char > i(f), eof;
string s(i, eof);
cout << s;
}

I was just using the file reading to explain my point about string building
taking a long time. Although, it is pretty cool to see so many potential
implementations. I was more concerned with building strings in code, like
HTML, XML or whatever. When building somewhat large strings that you can't
immediately output (say you plan on storing it in a database), is there
something better than string's += operator?
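
For concreteness, the kind of building I mean is something like this sketch
(the tag names and the reserve() estimate are made up); the question is
whether reserving up front and then appending is already about as good as
it gets:

#include <cstddef>
#include <string>
#include <vector>

std::string buildXml(const std::vector<std::string>& rows)
{
    std::string xml;
    xml.reserve(64 + rows.size() * 32);       // rough guess at the final size
    xml += "<rows>\n";
    for (std::size_t i = 0; i < rows.size(); ++i) {
        xml += "  <row>";
        xml += rows[i];
        xml += "</row>\n";
    }
    xml += "</rows>\n";
    return xml;
}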
 

Luca Risolia

Luca Risolia said:
int main(int argc, char** argv) {
ifstream f(argv[1], ifstream::in | ifstream::binary);
istreambuf_iterator<char> i(f), eof;
string s(i, eof);
cout<< s;
}

Note that this is not the fastest possible way of reading a large file
because reading one byte at a time from a C++ stream is pretty slow.

You can provide your preferred char type and use it with basic_string<>.

If maximum reading speed is required, the best compromise between speed
and relative ease of usage is to use std::fread() instead. It would go
something like this:

std::fseek(inFile, 0, SEEK_END);
std::string s(std::ftell(inFile), ' ');
std::rewind(inFile);

The seek() and rewind() pair can be done in constant time only if the
device supports random access, but they are slow (typically linear
time) with devices providing sequential access only.
 

Jorgen Grahn

If I was really aiming for the fastest way to make an in-memory
string of a file, I'd use mmap: ....

Unfortunately, there is no std::string constructor that takes a
pointer and length and makes _it_ the string (rather than copying it).

That would not be a string -- strings (like the containers) must own
their content. I agree, though, that something like that would be useful,
in some other class.
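
A minimal sketch of such a non-owning class (the name char_view is made up,
it is not a standard type): it only refers to characters owned elsewhere, so
the underlying buffer, e.g. an mmap'ed region, must outlive it.

#include <cstddef>

class char_view
{
public:
    char_view(const char* data, std::size_t size) : data_(data), size_(size) {}

    const char* data() const { return data_; }
    std::size_t size() const { return size_; }
    char operator[](std::size_t i) const { return data_[i]; }

private:
    const char* data_;   // not owned: whoever created the buffer keeps it alive
    std::size_t size_;
};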

/Jorgen
 

Luca Risolia

You can, of course, wrap it in a class and compile the windows version
or linux version of the "mapfile" class as necessary. Then it's just
a library call.

As I suspected, my fstream implementation on Linux already uses mmap()
internally, so the above mmap() optimization is pointless. I bet it's
the same on Windows. Once again, this shows that it's better to focus on
simple and portable C++ code, unless you are sure your implementation is
NOT already optimized.
 

Luca Risolia

Juha Nieminen said:
Scott Lurndal said:
std::FILE* inFile = std::fopen(argv[1], "rb");
if(!inFile) { std::perror(argv[1]); return 1; }
std::fseek(inFile, 0, SEEK_END);
std::string s(std::ftell(inFile), ' ');
std::rewind(inFile);
std::fread(&s[0], 1, s.length(), inFile);
std::fclose(inFile);


If I was really aiming for the fastest way to make an in-memory
string of a file, I'd use mmap:

The difference is that my example code above is portable while mmap
isn't.

You can, of course, wrap it in a class and compile the windows version
or linux version of the "mapfile" class as necessary. Then it's just
a library call.

This is probably the most portable and fastest way:

ifstream f(argv[1], ifstream::in | ifstream::binary);
ostringstream o;
o << f.rdbuf();
cout << o.str();
 

Juha Nieminen

Luca Risolia said:
This the most portable and fastest way probably:

ifstream f(argv[1], ifstream::in | ifstream::binary);
ostringstream o;
o << f.rdbuf();
cout << o.str();

Perhaps faster, but not fastest. In my system, with that same file, that
takes about 4.6 seconds (vs. the 1.5 seconds of the fread method).
 

Luca Risolia

Luca Risolia said:
This the most portable and fastest way probably:

ifstream f(argv[1], ifstream::in | ifstream::binary);
ostringstream o;
o<< f.rdbuf();
cout<< o.str();

Perhaps faster, but not fastest. In my system, with that same file, that
takes about 4.6 seconds (vs. the 1.5 seconds of the fread method).

What is your platform? Can you trace o << f.rdbuf() ? I am curious.
 

Juha Nieminen

Luca Risolia said:
Luca Risolia said:
This the most portable and fastest way probably:

ifstream f(argv[1], ifstream::in | ifstream::binary);
ostringstream o;
o<< f.rdbuf();
cout<< o.str();

Perhaps faster, but not fastest. In my system, with that same file, that
takes about 4.6 seconds (vs. the 1.5 seconds of the fread method).

What is your platform? Can you trace o << f.rdbuf() ? I am curious.

Linux.

I believe that one problem may be that with ostringstream's operator<<
the string is gradually resized as more data is read, while with the
fread method the size is set with one single allocation and the data is
read directly into it.

If I replace the fread code with a similar code that reads smaller chunks
and resizes the string to match, the reading time increases to over 3
seconds.
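
Roughly what I mean by that chunked variant (a sketch; the chunk size and
the function name are arbitrary):

#include <cstddef>
#include <cstdio>
#include <string>

// Read fixed-size blocks and grow the string as we go. The repeated
// growing and copying is what makes this slower than one preallocated read.
std::string readInChunks(std::FILE* in)
{
    std::string s;
    char chunk[65536];
    std::size_t n;
    while ((n = std::fread(chunk, 1, sizeof chunk, in)) > 0)
        s.append(chunk, n);
    return s;
}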
 

Pavel

Alf said:
Not particularly. It's O(n) on average, where n is the length of the final string.

In contrast, building a string by using infix "+" can be rather inefficient, and
if it's built up of many small pieces, typically O(n^2).

I wonder why the 26.3.1-3 relaxation for return types of valarray operations is
not extended to std::string's '+' operation. This would allow an intelligent
implementation to achieve the same complexity for + as for += in expressions
with multiple '+'; and my guess is that std::string's + syntax is used more
often than valarray's.
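
To illustrate the kind of thing an intelligent implementation could do behind
the scenes, here is a hand-written sketch (the concat helper is made up and
assumes C++11): compute the total length first, reserve once, then append, so
that gluing many pieces together stays O(n).

#include <initializer_list>
#include <string>

std::string concat(std::initializer_list<std::string> pieces)
{
    std::string::size_type total = 0;
    for (const std::string& p : pieces)
        total += p.size();

    std::string result;
    result.reserve(total);               // one allocation instead of one per '+'
    for (const std::string& p : pieces)
        result += p;
    return result;
}

// e.g. std::string s = concat({"The answer is ", "42", "!"});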


-Pavel
 

Pavel

Juha said:
Luca Risolia said:
int main(int argc, char** argv) {
ifstream f(argv[1], ifstream::in | ifstream::binary);
istreambuf_iterator<char> i(f), eof;
string s(i, eof);
cout<< s;
}

Note that this is not the fastest possible way of reading a large file
because reading one byte at a time from a C++ stream is pretty slow.

If maximum reading speed is required, the best compromise between speed
and relative ease of usage is to use std::fread() instead. It would go
something like this:

std::FILE* inFile = std::fopen(argv[1], "rb");
if(!inFile) { std::perror(argv[1]); return 1; }
std::fseek(inFile, 0, SEEK_END);
std::string s(std::ftell(inFile), ' ');
std::rewind(inFile);
std::fread(&s[0], 1, s.length(), inFile);

I know it is often assumed, but std::string's characters are not guaranteed to
be stored contiguously (despite many implementations' doing so).

std::fclose(inFile);

For example in my system reading a 700-megabyte file takes about 7.5 seconds
by using the istreambuf_iterator method, while with fread it takes about
1.5 seconds.

(If maximum reading speed is not imperative, or if input files are
expected to be small, then use whichever method is safest and most
convenient.)

Just a nit-pick
-Pavel
 
