SSO


Hicham Mouline

hello
is there a document written somewhere that explains an example
implementation of small string optimization?

rds,
 

Victor Bazarov

Hicham said:
is there a document written somewhere that explains an example
implementation of small string optimization?

Yes. Have you tried using a "search engine" on "the web" to find what
you're looking for? I know, those are new concepts and not everybody is
familiar with those...

V
 

tni

Hicham said:
is there a document written somewhere that explains an example
implementation of small string optimization?

You simply put a fixed length buffer into the string object. All strings
that fit into that fixed buffer are put there. For anything larger, you
have a pointer to dynamically allocated storage.

Depending on how sophisticated you want to get, you can reuse the space
occupied by some members (e.g. the pointer to dynamic storage, a
capacity member, a size member) as part of the fixed buffer.

class string {
    // ....
    struct optional {
        size_t capacity;
        char* buffer;
    };

    size_t size;
    union {
        char fixed_buffer[fixed_buf_len];   // in-object storage for small strings
        optional opt_members;               // heap pointer + capacity for large strings
    };
};

You can use a criterion like "size <= fixed_buf_len" to determine whether
the string is a small string or has dynamic storage.
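For illustration, here is a minimal sketch of how a copy constructor and an
accessor might branch on that criterion. The layout follows the class above;
the name sso_string, the buffer size of 16, and everything beyond the layout
are assumptions for the sketch, not a definitive implementation:

#include <cstddef>
#include <cstring>

class sso_string {
    static const std::size_t fixed_buf_len = 16;  // assumed in-object buffer size

    struct optional {
        std::size_t capacity;
        char* buffer;
    };

    std::size_t size;
    union {
        char fixed_buffer[fixed_buf_len];  // active when size <= fixed_buf_len
        optional opt_members;              // active when the data lives on the heap
    };

public:
    // Build from a C string; no null terminator is stored in this sketch.
    explicit sso_string(const char* s) : size(std::strlen(s)) {
        if (size <= fixed_buf_len) {
            std::memcpy(fixed_buffer, s, size);
        } else {
            opt_members.capacity = size;
            opt_members.buffer = new char[size];
            std::memcpy(opt_members.buffer, s, size);
        }
    }

    // Copying a small string copies the object's own bytes; a large string
    // needs a fresh allocation and a deep copy of the characters.
    sso_string(const sso_string& other) : size(other.size) {
        if (size <= fixed_buf_len) {
            std::memcpy(fixed_buffer, other.fixed_buffer, size);
        } else {
            opt_members.capacity = other.opt_members.capacity;
            opt_members.buffer = new char[opt_members.capacity];
            std::memcpy(opt_members.buffer, other.opt_members.buffer, size);
        }
    }

    ~sso_string() {
        if (size > fixed_buf_len)
            delete[] opt_members.buffer;
    }

    // Pick the active storage based on the same size criterion.
    const char* data() const {
        return size <= fixed_buf_len ? fixed_buffer : opt_members.buffer;
    }

    std::size_t length() const { return size; }
};

(Assignment and most other members are omitted; the point is only where the
size check happens.)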
 

Juha Nieminen

tni said:
You simply put a fixed length buffer into the string object. All strings
that fit into that fixed buffer are put there. For anything larger, you
have a pointer to dynamically allocated storage.

Depending on how sophisticated you want to get, you can reuse the space
occupied by some members (e.g. the pointer to dynamic storage, a
capacity member, a size member) as part of the fixed buffer.

class string {
    // ...
    struct optional {
        size_t capacity;
        char* buffer;
    };

    size_t size;
    union {
        char fixed_buffer[fixed_buf_len];
        optional opt_members;
    };
};

You can use a criterion like "size <= fixed_buf_len" to determine whether
the string is a small string or has dynamic storage.

What would be the optimal copying strategy for the dynamic part, deep
copying or copy-on-write?

What advantages does a string class with SSO have over one which
always allocates the string dynamically and uses CoW?
 

tni

Juha said:
What would be the optimal copying strategy for the dynamic part, deep
copying or copy-on-write?

That's going to depend on the use case. COW will cost performance in
working with a string, but can offer much faster copying.

What advantages does a string class with SSO have over one which
always allocates the string dynamically and uses CoW?

Copying is much faster for small strings (at least on some important
platforms like x86). The memory access pattern will be much better as
well (e.g. in a vector, everything is in a contiguous block of memory,
instead of being spread around).

For COW, you generally need some form of reference counting with atomic
counters. That's going to cost a few dozen clock cycles for a counter
access on x86 and causes the bus to be locked (which is a very bad thing
if you have lots of CPUs/cores). So, in a highly parallel environment,
you probably don't want COW.

The dynamic allocation will typically have at least 8-32 bytes of
bookkeeping overhead (which can matter if you have huge numbers of small
strings). A dynamic allocation/deallocation is generally going to take
several hundred clock cycles (and may involve bus locks as well, unless
the memory allocator is quite good).

But that kind of stuff is going to be highly
platform/compiler/implementation specific...

(If you want to see a good/fast COW implementation, look at Qt.)
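For a rough picture of where that cost shows up, here is a hedged sketch of
the copy/destroy path of a reference-counted COW string. The names cow_string
and cow_data are made up, and std::atomic stands in for whatever atomic
primitive the platform provides; the atomic read-modify-write is what becomes
a locked instruction on x86:

#include <atomic>
#include <cstddef>
#include <cstring>

struct cow_data {
    std::atomic<int> refcount;  // shared between all copies of the string
    std::size_t size;
    char* text;
};

class cow_string {
    cow_data* d;

public:
    explicit cow_string(const char* s) : d(new cow_data) {
        d->refcount.store(1, std::memory_order_relaxed);
        d->size = std::strlen(s);
        d->text = new char[d->size];
        std::memcpy(d->text, s, d->size);
    }

    // Copying only bumps the shared counter -- cheap in bytes moved, but the
    // atomic increment is the expensive part on many CPUs.
    cow_string(const cow_string& other) : d(other.d) {
        d->refcount.fetch_add(1, std::memory_order_relaxed);
    }

    ~cow_string() {
        // The decrement must be atomic too; the last owner frees the data.
        if (d->refcount.fetch_sub(1, std::memory_order_acq_rel) == 1) {
            delete[] d->text;
            delete d;
        }
    }

    // Any mutating operation would first have to "detach" (deep-copy if the
    // refcount is greater than one); that part is not shown here.
};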

For some numbers (populating a vector with 8 char long strings via
push_back(), running on a Core i7 on Windows):

Small string optimization (as I posted, 8-char buffer): 780ms
Small string optimization (as I posted, 32-char buffer): 1550ms
Dynamic allocation: 15'000ms
Thread-safe COW String (Qt QByteArray): 3400ms
std::basic_string (Dinkumware, uses small string optimization): 3900ms
 

Juha Nieminen

tni said:
Copying is much faster for small strings (at least on some important
platforms like x86).

I don't see how that would be the case. With a regular CoW string a
copy entails assigning one pointer and updating one counter at the end
of that pointer. Copying a string which uses SSO entails a check whether
the string is small or dynamically allocated, and then (assuming it is
small) a copy of a member array (which is probably several times larger
than a pointer) and the size member variable.

I could even believe that with optimal cache behavior and such you
could even achieve similar speeds, but much faster? I don't see that
happening.

(And of course if the string was not small, but dynamically allocated,
the copying will be basically identical except that there will be the
additional check before copying.)

For COW, you generally need some form of reference counting with atomic
counters.

The STL, at least the implementation in gcc, has never bothered with
atomicity...

That's going to cost a few dozen clock cycles for a counter
access on x86 and causes the bus to be locked (which is a very bad thing
if you have lots of CPUs/cores). So, in a highly parallel environment,
you probably don't want COW.

OTOH always deep-copying can be detrimental for speed in many
situations, such as when sorting an array of large strings. I'd bet that
even using locking would be faster.

For some numbers (populating a vector with 8 char long strings via
push_back(), running on a Core i7 on Windows):

Small string optimization (as I posted, 8-char buffer): 780ms
Small string optimization (as I posted, 32-char buffer): 1550ms
Dynamic allocation: 15'000ms
Thread-safe COW String (Qt QByteArray): 3400ms
std::basic_string (Dinkumware, uses small string optimization): 3900ms

Naturally allocation will be significantly slower than simply
assigning to a member array. But only using 8 char long strings for such
a test feels a bit artificial.
 

James Kanze

I don't see how that would be the case. With a regular CoW
string a copy entails assigning one pointer and updating one
counter at the end of that pointer. Copying a string which
uses SSO entails a check whether the string is small or
dynamically allocated, and then (assuming it is small) a copy
of a member array (which is probably several times larger than
a pointer) and the size member variable.
I could even believe that with optimal cache behavior and such
you could even achieve similar speeds, but much faster? I
don't see that happening.

Fine-grained locality plays a very large role. Most modern
processors access memory in more or less large chunks; when you
read the size (to test whether the local buffer is used or not),
the hardware also loads the local buffer at the same time, and
it's already in the pipeline when you start to copy it.

But IIRC, the real speed gain occurs in multi-threaded
environments. The fact that no part of the implementation is shared
means that no synchronization is required. And on modern
processors, synchronization is expensive. (On the other hand...
in the applications I've worked on, most of the strings are
large enough that the small string optimization wouldn't apply,
and dynamic allocation also requires synchronization, so
COW---even using pthread_mutex for synchronization---is still an
optimization if I'm copying strings a lot.)

(And of course if the string was not small, but dynamically
allocated, the copying will be basically identical except that
there will be the additional check before copying.)
The STL, at least the implementation in gcc, has never
bothered with atomicity...

The implementation of std::string in g++ is not thread safe.

OTOH always deep-copying can be detrimental for speed in many
situations, such as when sorting an array of large strings.
I'd bet that even using locking would be faster.

I've wondered about that too. In most of my work, I'd guess
that the shortest strings are about 20 characters (more than the
SSO handles in most of the implementations I've seen), and the
average is probably double that. On the other hand, I tend to
deal with objects more complex than strings---in a lot of cases,
the strings themselves are in objects which use COW (or aren't
copiable). So I really don't know what the best solution is.

Naturally allocation will be significantly slower than simply
assigning to a member array. But only using 8 char long
strings for such a test feels a bit artificial.

It depends on what you're doing. In my case, where I'm working
now, it is artificial. But I've seen applications where most of
the strings were either numerical values (six to eight digits),
or country or currency codes (two or three characters).

This is definitely a case where one size doesn't fit all.
 

tni

Juha said:
I don't see how that would be the case. With a regular CoW string a
copy entails assigning one pointer and updating one counter at the end
of that pointer.

Not for a multi-threaded environment.

E.g. for QByteArray, you have code like this (x86 disassembly):

00401BF5 or ecx,0FFFFFFFFh
00401BF8 lock xadd dword ptr [eax],ecx
00401BFC jne 00401C0C

Note the lock prefix. That's quite expensive.

Copying a string which uses SSO entails a check whether
the string is small or dynamically allocated, and then (assuming it is
small) a copy of a member array (which is probably several times larger
than a pointer) and the size member variable.

I could even believe that with optimal cache behavior and such you
could even achieve similar speeds, but much faster? I don't see that
happening.

(And of course if the string was not small, but dynamically allocated,
the copying will be basically identical except that there will be the
additional check before copying.)


The STL, at least the implementation in gcc, has never bothered with
atomicity...

You are quite wrong. GCC 4.4, basic_string.h:

struct _Rep_base
{
    size_type _M_length;
    size_type _M_capacity;
    _Atomic_word _M_refcount;
};

Note the atomic reference counter.

Even for the oldest version of GCC 3 from 2000 (I haven't bothered to
check 2.x), it's atomic:
http://gcc.gnu.org/cgi-bin/cvsweb.c...g.h?rev=1.1&content-type=text/x-cvsweb-markup

Naturally allocation will be significantly slower than simply
assigning to a member array. But only using 8 char long strings for such
a test feels a bit artificial.

That's going to be highly application specific. I've had use cases like
that.

For 32-char strings:

Small string optimization (as I posted, 32-char buffer): 1830ms
Dynamic allocation: 17'000ms
Thread-safe COW String (Qt QByteArray): 3900ms
std::basic_string (Dinkumware, uses small string optimization): 8400ms

The previous 8-char numbers:

Small string optimization (as I posted, 8-char buffer): 780ms
Small string optimization (as I posted, 32-char buffer): 1550ms
Dynamic allocation: 15'000ms
Thread-safe COW String (Qt QByteArray): 3400ms
std::basic_string (Dinkumware, uses small string optimization): 3900ms
 

Bo Persson

Juha said:
I don't see how that would be the case. With a regular CoW string a
copy entails assigning one pointer and updating one counter at the
end of that pointer. Copying a string which uses SSO entails a
check whether the string is small or dynamically allocated, and
then (assuming it is small) a copy of a member array (which is
probably several times larger than a pointer) and the size member
variable.

This assumes that you pass large strings around by value, but do not
change them. Is this a common scenario?

As soon as you update the string, you will have to reallocate anyway.
If you know that you never update the string parameter, perhaps you
can pass it by const reference instead of by value?


The C++ std::string also has a lot of cases where you end up with
Copy-on-Potential-Write, like anywhere you have a reference or iterator
to the internals of the string.


Bo Persson
 

Juha Nieminen

Bo said:
This assumes that you pass large strings around by value, but do not
change them. Is this a common scenario?

Sorting a vector of strings is not all that uncommon.

Heck, simply having a vector of strings in the first place: When the
vector capacity grows, all the strings are copied to another place.

(The new C++ standard will probably alleviate both problems, though,
with the introduction of move semantics, which will probably allow swap
semantics as well.)

The C++ std::string also has a lot of cases where you end up with
Copy-on-Potential-Write, like anywhere you have a reference or iterator
to the internals of the string.

Well, copy-on-write has its flaws when you can create pointers to the
elements. For example, assume you have an unshared string, and you make
a pointer to point to one of its characters (no deep-copying happens, of
course, because the data is not shared). Now assign the string to
another: They will share the data. Now modifying the character through
that pointer will change *both* strings.

There's no way for the string class to know whether something is
(still) pointing to the data or not. Thus if you use CoW you are taking
the risk of such flaws.
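A small self-contained illustration of that flaw, using a deliberately naive
CoW class (naive_cow is made up; real implementations detach in operator[],
but the string is unshared at that point, so the detach is a no-op and the
outcome is the same):

#include <cstddef>
#include <cstdio>
#include <cstring>

class naive_cow {
    struct data { int refs; char text[16]; };
    data* d;
public:
    explicit naive_cow(const char* s) : d(new data) {
        d->refs = 1;
        std::strcpy(d->text, s);
    }
    naive_cow(const naive_cow& o) : d(o.d) { ++d->refs; }   // share the buffer
    ~naive_cow() { if (--d->refs == 0) delete d; }
    char& operator[](std::size_t i) { return d->text[i]; }
    const char* c_str() const { return d->text; }
};

int main() {
    naive_cow a("hello");
    char* p = &a[0];       // pointer into a's (currently unshared) data
    naive_cow b = a;       // copy: a and b now share the same buffer
    *p = 'H';              // meant to change only a...
    std::printf("%s %s\n", a.c_str(), b.c_str());  // prints "Hello Hello"
}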

OTOH, in many situations CoW is a huge performance boost over always
deep-copying (and I'm not talking exclusively about strings here).
 

Juha Nieminen

tni said:
Not for a multi-threaded environment.

If you are going to copy a string from one thread to another, I
believe you will need a locking mechanism regardless of how the string
has been implemented. I don't see how SSO would avoid the need. (Unless
member arrays are always assigned atomically, which I don't think is the
case.)
 

tni

Juha said:
If you are going to copy a string from one thread to another, I
believe you will need a locking mechanism regardless of how the string
has been implemented.

But not at the string level.

I don't see how SSO would avoid the need. (Unless
member arrays are always assigned atomically, which I don't think is the
case.)

The SSO string doesn't need any internal locking/atomicity. Once it's
copied / handed over, that's it, no connection to any previous instance.

If you make a copy of a COW string, and pass it to another thread, two
different string instances living in different threads can have shared
data (in particular a shared reference counter that requires atomic
access; for any other data, you would create a copy and change that copy).

E.g.:

Thread A:
    lock(queue)
    queue.push_back(str)
    unlock(queue)

Thread B:
    lock(queue)
    retrieve strings
    unlock(queue)

With SSO that's perfectly safe without any locking/atomicity in the
string, no shared string data between the threads. With COW, you can
have shared string data between the threads.
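A concrete version of that handoff, sketched with std::mutex and with
std::string standing in for the SSO string; the names producer, consumer and
work_queue are made up:

#include <mutex>
#include <queue>
#include <string>

std::mutex queue_mutex;
std::queue<std::string> work_queue;

// Thread A: the lock protects the queue itself. The string pushed in is an
// independent copy, so A keeps no connection to it afterwards.
void producer(const std::string& str) {
    std::lock_guard<std::mutex> lock(queue_mutex);
    work_queue.push(str);   // copy goes into the queue
}

// Thread B: once the string is popped, B owns its storage outright and can
// use it without any further synchronization.
bool consumer(std::string& out) {
    std::lock_guard<std::mutex> lock(queue_mutex);
    if (work_queue.empty())
        return false;
    out = work_queue.front();
    work_queue.pop();
    return true;
}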
 

Bo Persson

Juha said:
Sorting a vector of strings is not all that uncommon.

std::sort will use swap, which is specialized for std::string; it will
just swap the internals.

Heck, simply having a vector of strings in the first place: When
the vector capacity grows, all the strings are copied to another
place.

This is a harder problem, but can be solved for a specific
implementation using internal knowledge of the std::string. Some
implementations already use move semantics for known library types.

(The new C++ standard will probably alleviate both problems,
though, with the introduction of move semantics, which will
probably allow swap semantics as well.)

Using move semantics helps make things like vector growth more
portable. It also works for user defined types with move constructors.


Bo Persson
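To make the "swap the internals" point concrete, here is a minimal sketch of
a heap-only string whose swap and move constructor never touch the character
data; simple_string and its members are assumptions, not any particular
library's implementation:

#include <cstddef>
#include <utility>

class simple_string {
    char* buf;
    std::size_t len;
public:
    simple_string() : buf(0), len(0) {}

    // Moving steals the buffer: no allocation, no copying of characters.
    simple_string(simple_string&& other) noexcept
        : buf(other.buf), len(other.len) {
        other.buf = 0;
        other.len = 0;
    }

    ~simple_string() { delete[] buf; }

    // Swap is just a few pointer/size exchanges.
    friend void swap(simple_string& a, simple_string& b) {
        std::swap(a.buf, b.buf);
        std::swap(a.len, b.len);
    }
};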
 

Jerry Coffin

If you are going to copy a string from one thread to another, I
believe you will need a locking mechanism regardless of how the string
has been implemented. I don't see how SSO would avoid the need. (Unless
member arrays are always assigned atomically, which I don't think is the
case.)

Not really. As long as you always do a deep copy (regardless of SSO),
you avoid sharing access to the string's internals. Any particular
string can only belong to one thread at any given time. Since there's
never any shared access to any particular data, you never have to
synchronize access to the data.

With a CoW string, (or CoW whatever, but strings are the most common)
you get shared access to the string's data, so when/if you pass the
string to a different thread, you need to ensure that access is
properly synchronized.

All this, of course, assumes that you pass your string by value when
you pass it across thread boundaries. If you pass a reference, it's
going to be up to you to synchronize access to the shared object, as
well as manage the lifetime to ensure that it lives until both (or
all) threads are finished using it -- but this will be true whether
it uses CoW or not.
 

Juha Nieminen

Jerry said:
Not really. As long as you always do a deep copy (regardless of SSO),
you avoid sharing access to the string's internals. Any particular
string can only belong to one thread at any given time. Since there's
never any shared access to any particular data, you never have to
synchronize access to the data.

I don't understand how there are never any synchronization problems
when deep-copying data from one thread to another. To me it seems fairly
obvious that *during* the copying you need to lock the data: If the
source thread modifies the data or the destination thread reads it while
it's being copied, the data will be corrupted (or in the case of the
reading thread the data will not be corrupted, but it will get the wrong
data because it will be reading data which has only been partially copied).

If, for example, thread A is copying a string to thread B, it has to
say to thread B "don't read it yet, I'm not done" until the copying has
finished.

If you are talking about a "create copies for all threads *before*
creating the threads" then that's different because the copying is not
done from one thread to another in the first place.
 

Juha Nieminen

Bo said:
std::sort will use swap, which is specialized for std::string; it will
just swap the internals.

Most std::sort() implementations use introsort, which in most
situations triggers an insertion sort, which copies elements rather than
swapping them. I checked the implementation used in gcc, and it indeed uses
straightforward assignments rather than any swapping.

Also in quicksort there is a pivot value used to partition the array.
How do you keep this pivot value with swapping only? I checked the
implementation in gcc, and it seems that it uses the median-of-three
method for choosing the pivot value, and then passes it to the
partitioning function by value, hence triggering a copy.
 

Jerry Coffin

[ ... ]
If you are talking about a "create copies for all threads *before*
creating the threads" then that's different because the copying is not
done from one thread to another in the first place.

Perhaps it's best to back up for a moment and think about things. The
usual intent with a string class is to give value semantics. For the
moment, let's define value semantics as "like an int". So, let's
consider what happens with an int. If we pass an int by value from
one thread to another we don't have to synchronize access to the data
because each thread has its own copy of the data. If we pass a
reference or pointer to an int from one thread to another, we create
shared access to that data, so we have to synchronize access to the
shared data.

So, what we want with a string is to get the same semantics: if the
user passes it by value from one thread to another, code in each
thread can use the object without synchronization. If the user passes
a reference or pointer to an object, that creates a shared object and
the user is responsible for synchronizing access to that shared data.

A deep copied string provides exactly that, automatically.

A CoW string does not provide that by default -- even when the user
passes the string by value, the internal data for the string is NOT
copied. A CoW string creates shared access to the data even when it's
passed by value. Therefore, we have a couple of possible choices: 1)
live with the string not having what we've defined as "value
semantics" (i.e. it's a lot different from an int, and the user has
to synchronize access even when objects are passed by value), or 2)
enforce synchronization inside of the string itself.

To summarize: to maintain value semantics in a multi-threaded
environment, a CoW string (or CoW almost anything) has to synchronize
access to the internal representation of the string data, but a deep-
copied string does not.
 

Juha Nieminen

Jerry said:
A deep copied string provides exactly that, automatically.

Could you please elaborate more on *why*? I just can't see that happening.

Let's assume we have two threads running, A and B. Thread A owns a
string which it wants to pass to thread B. How does it do that?

I assume that thread A sees some string that is owned by B, and then
goes and assigns it to that string. Let's assume the string uses a
deep-copy.

If *during* that deep-copying thread B starts reading the string, it
will get incorrect data (because it has only been partially copied).

The other possibility I can think of is that thread B makes the copy.
In other words, it sees the string owned by thread A, and assigns it to
its own string.

Again: If thread A goes and modifies its string while this
deep-copying is happening, the data which is copied to B will become
corrupted.

I just can't see how deep-copying solves any synchronization problems.
The only way to copy the string from A to B safely is to lock access to
it for the duration of the copying (regardless of how the copying is
done). I can't see any other way.
 

tni

Juha said:
Could you please elaborate more on *why*? I just can't see that happening.

Let's assume we have two threads running, A and B. Thread A owns a
string which it wants to pass to thread B. How does it do that?

I assume that thread A sees some string that is owned by B, and then
goes and assigns it to that string. Let's assume the string uses a
deep-copy.

You have some kind of shared data structure between the threads that is
used for communication. Access to that must be locked.

So, thread A would lock the shared data structure, put a copy of the
string into it (without retaining any pointer or reference to that
copy), and unlock. Now thread B can lock that data structure and
retrieve the string from it.

If you simply have some global string object that both threads have
access to, you do need locking for that.
 

Jerry Coffin

[ ... ]
I just can't see how deep-copying solves any synchronization problems.
The only way to copy the string from A to B safely is to lock access to
it for the duration of the copying (regardless of how the copying is
done). I can't see any other way.

It's pretty simple really: regardless of the function I call to get
the data to the new thread, the calling of the function happens in
the original thread. There's a sequence point after evaluating the
arguments to the function and executing any code inside that
function, so by the time we've entered the function, the copy is
complete, and the code inside the function no longer has any access
to the original data at all.

A typical implementation of that function is going to take the
parameters from the stack, copy them into some structure (probably
dynamically allocated) and pass the address of that structure to the
new thread. This is necessary because the data needs to survive for
some length of time that's indeterminate to the calling thread, and
may easily extend until well after the function returns. Even in an
atypical implementation, however, the code in that function can't do
anything with the original object, because all it gets is a copy, not
any access to the original object.

If you were absolutely determined to shoot yourself in the foot, you
could have that function send the address of the second copy of the
string (the one that was actually passed to the new thread) _back_ to
other code in the original thread. At that point, you _would_ have
shared access to that copy of the string, and you'd have to
synchronize access accordingly. Likewise, you could write a copy ctor
that disguised a pass by reference as a pass by value (e.g. each copy
would contain a vector of addresses of objects from which it was
copied, so the copy gave access not only to the data, but also to the
original object from which the data derived).

Once the new thread is running, there's no way for the parent thread
to directly invoke any code in the new thread. Communication is only
via putting data somewhere, and the new thread picking that data up
from there. As when starting a thread, however, calling such a
function happens entirely in the parent thread, so that function will
only ever have access to a copy of anything that was passed by value.
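As a minimal illustration, here is a sketch using C++11's std::thread, which
makes its copies of the arguments in the constructing thread before the new
thread runs the function:

#include <iostream>
#include <string>
#include <thread>

void worker(std::string s) {       // s is the new thread's private copy
    s += " (changed in worker)";   // has no effect on the caller's string
    std::cout << s << '\n';
}

int main() {
    std::string msg = "hello";
    // The thread constructor copies msg into internal storage while still
    // running in the parent thread; worker() only ever sees that copy.
    std::thread t(worker, msg);
    msg += " (changed in main)";   // safe: no string data is shared
    t.join();
    std::cout << msg << '\n';
}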
 
