high traffic/availability application and gnu_cxx::hash_map problem - better to use tr1/unordered_map?

  • Thread starter Alex Wanderleit

Alex Wanderleit

I have a question regarding the behaviour of the __gnu_cxx::hash_map
data structure, part of <ext/hash_map> (using Red Hat EL 4.6),
when it comes to massive inserts/deletes and the period after
those inserts/deletes.


The application routes TCP/IP-based requests from A to B,
filling a hash_map with key-value pairs (std::string -> int) that
map concatenated IDs (std::string) to locations (int) in a file
(which serves as a "database" to match confirmation-of-delivery
requests).

During the day, about 10 million entries are inserted.

To get rid of unused entries, a second hash is built in parallel
from the entries in the file once a night. That second hash
contains only the necessary items; old ones are omitted, so that
the main hash does not grow without bound (not every insert is
matched by a delete in the course of application processing).

When the second hash is ready (i.e. contains only the required items),
the main hash is cleared ("clear") and the second hash is assigned
to the main hash, then the second hash is cleared - all in one go:

hash.clear();
hash = hashTemp;
hashTemp.clear();

This works once after 24 hours - but every second day the main
process gets slower and slower shortly after these assignments,
until it is killed by a crontab job (which detects that the process,
which normally writes an alive file every 5 seconds, hasn't done so
for more than 120 seconds).


From the strace output I can observe that the assignments above are
performed (even though they take about 15 seconds to complete). A
subsequent config file open and read (which normally takes less than
0.001 seconds) now takes 7 seconds (!). The same slowness applies to
all socket interactions (read/write) and other system calls (time,
getpeername, ...).
That's why the crontab cleaner kills the "assumed dead" process -
it is only slow, but too slow to make it to the next alive-file
write.


Currently I suspect that __gnu_cxx::hash_map performs
some action behind the scenes that triggers this behaviour.
There is no *alloc-related system call visible in the strace - so
what could cause the delay? (Is __gnu_cxx::hash_map doing
something nasty in another thread not monitored by strace?)

May be some kind of rehashing takes place?

Or excessive swapping occurs?

Is there a way to find out what causes the observed behaviour?

As far as I can see from the documentation, it is not possible
to control the internal rehashing behaviour, or memory handling in
general, of __gnu_cxx::hash_map (though I perform a "reserve" of
10,000,000 buckets at startup):

http://www.sgi.com/tech/stl/hash_map.html
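
For reference, the SGI interface linked above does expose the only two
knobs there are: a bucket-count constructor argument and a resize()
member. A minimal sketch, assuming the HASH_S2I typedef from the end of
this post (the counts are illustrative only):

// Construct with an initial bucket-count hint...
HASH_S2I hash(10000000);
// ...and/or request more buckets ahead of a bulk insert:
hash.resize(20000000);

Neither call prevents the per-insert node allocations; they only
pre-size the bucket array.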

This is actually annoying, since the process consumes more and
more memory (1.9GB at startup) over time, until a certain
"__gnu_cxx::hash_map satisfaction" level is reached (2.6GB and more,
which should still be within the limit of a 32-bit process). But you
can't influence that behaviour: it is not possible to grab the
required memory in advance, not even with "reserve".

Also, clear() should enable the hash_map implementation to
start over with the already allocated memory - there shouldn't
be a "swiss cheese" memory layout.

The straced process does not reveal any allocation-related system
calls (brk/mmap) either.

What else could be the reason for the observed slowness of the
process?

Should <tr1/unordered_map> be tried instead with a newer version
of gcc? Would there be more "control" over memory consumption or
rehashing?


Thanks,
Alex

--

Currently I am using this gcc version:

[root@localhost ~]# gcc --version
gcc (GCC) 3.4.6 20060404 (Red Hat 3.4.6-3)
Copyright (C) 2006 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

--

Should unordered_map be tried (instead of hash_map)?

#include <ext/hash_map>
__gnu_cxx::hash_map<std::string, int, stringhasher> s;

vs.

#include <tr1/unordered_map>
std::tr1::unordered_map<std::string, int> s;

(Both containers take a key and a mapped type. __gnu_cxx provides no
hash<std::string>, so the custom hasher is required there, while TR1
supplies std::tr1::hash<std::string> as a default.)
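
If TR1 is available, unordered_map does offer explicit rehash and
load-factor control that hash_map lacks. A minimal sketch (gcc 4.x with
TR1 headers assumed; the numbers are illustrative only):

#include <string>
#include <tr1/unordered_map>

int main() {
    // Constructor takes an initial bucket-count hint, like hash_map.
    std::tr1::unordered_map<std::string, int> m(10000000);

    // TR1-only knobs: rehash threshold and explicit bucket growth.
    m.max_load_factor(0.7f);   // rehash once size/buckets exceeds 0.7
    m.rehash(20000000);        // request at least this many buckets

    m["example-id"] = 42;      // hypothetical key
    return 0;
}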

--

This hash function is used for the hash_map creation (see the
last line):

class stringhasher {
public:
    /**
     * Required by hash_map as the hash function.
     * Inspired by the java.lang.String.hashCode() algorithm
     * (it's easy to understand, and somewhat processor cache-friendly).
     * @param s The string to be hashed
     * @return The hash value of s
     */
    size_t operator() (const std::string& s) const {
        size_t h = 0;
        std::string::const_iterator p, p_end;
        for (p = s.begin(), p_end = s.end(); p != p_end; ++p) {
            h = 31 * h + (*p);
        }
        return h;
    }

    /**
     * Note: hash_map's EqualKey parameter defaults to
     * std::equal_to<Key>, so this less-than comparator is not
     * actually used by the typedef below.
     * @param s1 The first string
     * @param s2 The second string
     * @return true if the first string comes before the second in
     *         lexicographical order
     */
    bool operator() (const std::string& s1, const std::string& s2) const {
        return s1 < s2;
    }
};

typedef __gnu_cxx::hash_map<std::string, int, stringhasher> HASH_S2I;
 

Marcel Müller

Alex said:
When the second hash is ready (i.e. contains only the required items),
the main hash is cleared ("clear") and the second hash is assigned
to the main hash, then the second hash is cleared - all in one go:

hash.clear();
hash = hashTemp;
hashTemp.clear();

You should avoid copying large containers, especially not within the
mutex where the main lookup table is locked.

Use the swap method that almost all STL-like containers provide
instead.
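
A minimal sketch of the swap-based replacement, using the names from
the original post (any locking around it is assumed, not shown):

// Replace the live table in O(1): swap exchanges the internal
// bucket arrays, no per-element copying takes place.
hash.swap(hashTemp);

// Release the old contents for real: clear() need not free the
// bucket array, but swapping with an empty temporary does.
HASH_S2I().swap(hashTemp);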

This works once after 24 hours - but every second day the main
process gets slower and slower shortly after these assignments,

What does the memory footprint of your application say? Do you leak
memory somewhere?

From the strace output I can observe that the assignments above are
performed (even though they take about 15 seconds to complete). A
subsequent config file open and read (which normally takes less than
0.001 seconds) now takes 7 seconds (!). The same slowness applies to
all socket interactions (read/write) and other system calls (time,
getpeername, ...).

Looks like memory pollution. Maybe the internal memory pool of the
runtime is highly fragmented. In this case you should use a custom
allocator for each hash instance that allocates from a private,
per-instance memory pool, preferably requested in large chunks
directly from the operating system. Once the hash gets replaced,
simply destroy the old hash and the entire memory pool it uses. This
cleans up gracefully.
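
A minimal sketch of such an allocator (all names hypothetical; beware
that C++03 containers are not required to honour stateful allocators,
so verify against the libstdc++ version in use before relying on this):

#include <cstddef>
#include <cstdlib>
#include <new>
#include <vector>

// Bump-pointer pool: individual frees are no-ops; destroying the pool
// releases everything at once, so no "swiss cheese" fragmentation.
class Pool {
public:
    explicit Pool(std::size_t chunk = 1 << 20)
        : chunk_(chunk), cur_(0), end_(0) {}
    ~Pool() {
        for (std::size_t i = 0; i < chunks_.size(); ++i)
            std::free(chunks_[i]);
    }
    void* allocate(std::size_t n) {
        n = (n + 7) & ~std::size_t(7);           // 8-byte alignment
        if (cur_ + n > end_) grow(n);
        char* p = cur_;
        cur_ += n;
        return p;
    }
    void deallocate(void*, std::size_t) {}       // freed with the pool
private:
    void grow(std::size_t n) {
        std::size_t size = n > chunk_ ? n : chunk_;
        char* c = static_cast<char*>(std::malloc(size));
        chunks_.push_back(c);
        cur_ = c;
        end_ = c + size;
    }
    std::size_t chunk_;
    char* cur_;
    char* end_;
    std::vector<char*> chunks_;
};

// C++03-style allocator drawing its storage from a given Pool.
template <typename T> class PoolAllocator {
public:
    typedef T value_type;
    typedef T* pointer;
    typedef const T* const_pointer;
    typedef T& reference;
    typedef const T& const_reference;
    typedef std::size_t size_type;
    typedef std::ptrdiff_t difference_type;
    template <typename U> struct rebind { typedef PoolAllocator<U> other; };

    explicit PoolAllocator(Pool* p) : pool(p) {}
    template <typename U>
    PoolAllocator(const PoolAllocator<U>& o) : pool(o.pool) {}

    pointer allocate(size_type n, const void* = 0) {
        return static_cast<pointer>(pool->allocate(n * sizeof(T)));
    }
    void deallocate(pointer p, size_type n) {
        pool->deallocate(p, n * sizeof(T));
    }
    void construct(pointer p, const T& v) { new (p) T(v); }
    void destroy(pointer p) { p->~T(); }
    size_type max_size() const { return size_type(-1) / sizeof(T); }

    Pool* pool;
};

template <typename T, typename U>
bool operator==(const PoolAllocator<T>& a, const PoolAllocator<U>& b) {
    return a.pool == b.pool;
}
template <typename T, typename U>
bool operator!=(const PoolAllocator<T>& a, const PoolAllocator<U>& b) {
    return !(a == b);
}

The hash map would then take
PoolAllocator<std::pair<const std::string, int> > as its fifth template
argument, with each instance owning a Pool that dies with it. Note that
the std::string keys still allocate from the global heap unless they
are given a pooled allocator as well.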

Currently I suspect that __gnu_cxx::hash_map performs
some action behind the scenes that triggers this behaviour.

Maybe. But maybe your code causes small memory leaks that perforate
the virtual memory space. This can lead to highly inefficient memory
access and can completely eliminate the benefit of any cache memory
in your computer, causing a slowdown of some orders of magnitude.

There is no *alloc-related system call visible in the strace - so
what could cause the delay? (Is __gnu_cxx::hash_map doing
something nasty in another thread not monitored by strace?)

May be some kind of rehashing takes place?
Or excessive swapping occurs?

Maybe. But your operating system should tell you that.

Is there a way to find out what causes the observed behaviour?

Track your memory usage.
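
A minimal, Linux-specific sketch of that: log the VmSize/VmRSS lines
from /proc/self/status next to each alive-file write, so growth (and
the moment the slowness begins) becomes visible in the logs (function
name hypothetical):

#include <fstream>
#include <iostream>
#include <string>

// Print this process's current virtual size and resident set size.
void logMemoryUsage() {
    std::ifstream status("/proc/self/status");
    std::string line;
    while (std::getline(status, line)) {
        if (line.compare(0, 6, "VmSize") == 0 ||
            line.compare(0, 5, "VmRSS") == 0)
            std::cerr << line << '\n';
    }
}
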
As far as I can see from the documentation, it is not possible
to control the internal rehashing behaviour, or memory handling in
general, of __gnu_cxx::hash_map (though I perform a "reserve" of
10,000,000 buckets at startup):

Which would not be that bad if you actually know the approximate
size. Especially in conjunction with swap instead of assignment, you
could pre-size the second hash - the one that becomes the first after
the next swap - pretty accurately.
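
A minimal sketch of that, assuming the SGI-style bucket-count
constructor shown in the documentation linked above (the count is
illustrative):

// Pre-size the replacement table so no rehashing occurs while filling.
HASH_S2I hashTemp(10000000);   // bucket-count hint, allocated up front

// ... build hashTemp from the file overnight ...

hash.swap(hashTemp);           // constant-time replacement
HASH_S2I().swap(hashTemp);     // actually release the old table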

This is actually annoying, since the process consumes more and
more memory (1.9GB at startup) over time, until a certain
"__gnu_cxx::hash_map satisfaction" level is reached (2.6GB and more,
which should still be within the limit of a 32-bit process)

No. Usually the limit is 2GB of private memory for each process plus
2GB of shared memory for /all/ 32-bit processes. Your memory is
allocated from the private arena.
Most operating systems provide a kernel setting to enlarge the private
arena at the expense of shared memory, but this is a system-wide
setting and may have other serious side effects.

So everything is clear. Memory is your problem.

You should strongly reduce the memory footprint of your application,
at least by a factor of 2. Or choose the easy way and switch to a
64-bit environment (hardware and software).

But you can't influence that behaviour: it is not possible to grab
the required memory in advance, not even with "reserve".

You can allocate a memory pool in advance from which the custom
allocator of your hash map takes its storage. However, keep in mind
that most of your storage might not be located in the hash map itself
but rather in the string IDs. This is the first place where I would
start optimizing.

- 1. Store your IDs in a binary, compressed representation rather than
in a readable string. E.g. do not use hex numbers.

- 2. Check your IDs. Are they all of similar and limited length?
=> In this case use a fixed char array as key parameter of your hash.

- 3. Check whether your IDs follow some regular pattern, e.g. all IDs
start with a limited set of values. In this case you should split your
IDs into pieces of fixed length and create a tree of lookup tables
instead of one large one. This has the benefit of not storing common
parts of the keys redundantly, some million times over, as a
relational database would. Of course, the nested lookup tables should
not contain the common parts of the keys anymore; they derive them
from their location in the object tree.

- 4. Maybe use an adaptive structure that uses simple, fixed lookup
tables for the first parts of the keys and hash tables only in deeper
elements. E.g. use an array of 256 entries which contains the subtables
for any key that starts with the byte matching the array index (a
sketch of this follows after this list). This is highly effective
because that byte of the key is no longer stored in memory at all. You
can apply this recursively as long as the fill factor stays high
enough, i.e. as long as it is likely that many of the entries in the
lookup table exist. The nested lookup objects should only be created
once at least one entry uses them.
Of course, the exponential growth of these structures limits their use
to only a few levels. But for those few levels they are extremely fast.
If your keys are well distributed, you can also collect several bytes
into one lookup table with e.g. 65536 entries. This is even more
effective because you save the pointers to the nested tables.
The break-even is easy to calculate: if you have 10 million entries
and use a 2-byte lookup table with 65536 entries, you save 20 MB for
the first two bytes of the keys and need only 256kB for the lookup
table. OK, the overhead of the nested hash maps adds approximately
another one or two MB, but no more. If your keys are not that well
distributed it is even more effective, because some of the nested hash
maps are not created at all. This is usually the main indication for
such a design.

- 5. Align the nesting levels of your data structures with the
logical structure of your IDs. E.g. if you want to build a DNS
service, the dots in the domain name could be used: the top-level
structure contains one entry for each TLD, the nested structures are
for the subdomains, and so forth. Since the value distribution in each
sub-map is limited and mostly independent of other sub-maps (except
for 'www', which should be handled specially), there is almost no
redundancy in memory.
You mentioned a concatenation of your IDs. This suggests that there is
structure in the keys.
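
A minimal sketch of point 4's first level (names hypothetical;
non-empty keys assumed; stringhasher/HASH_S2I as in the original post):

#include <cstddef>
#include <string>
#include <ext/hash_map>

struct stringhasher {
    size_t operator()(const std::string& s) const {
        size_t h = 0;
        for (std::string::const_iterator p = s.begin(); p != s.end(); ++p)
            h = 31 * h + *p;
        return h;
    }
};
typedef __gnu_cxx::hash_map<std::string, int, stringhasher> HASH_S2I;

// 256-entry first-byte table: sub-maps are created lazily, and keys
// are stored without their first byte, which the slot index encodes.
class FirstByteIndex {
public:
    FirstByteIndex() { for (int i = 0; i < 256; ++i) sub_[i] = 0; }
    ~FirstByteIndex() { for (int i = 0; i < 256; ++i) delete sub_[i]; }

    void insert(const std::string& key, int value) {
        unsigned char b = static_cast<unsigned char>(key[0]);
        if (!sub_[b]) sub_[b] = new HASH_S2I;   // created only when used
        (*sub_[b])[key.substr(1)] = value;      // first byte not stored
    }

    // Returns a pointer to the value, or 0 if the key is absent.
    int* find(const std::string& key) {
        unsigned char b = static_cast<unsigned char>(key[0]);
        if (!sub_[b]) return 0;
        HASH_S2I::iterator it = sub_[b]->find(key.substr(1));
        return it == sub_[b]->end() ? 0 : &it->second;
    }

private:
    HASH_S2I* sub_[256];   // one slot per possible first byte
};

The same pattern nests: replace the int payload with another
FirstByteIndex (or a 65536-entry variant) at the levels where the fill
factor justifies it.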

If you optimize your code well, it should be possible to process 10
million entries with a memory footprint of at most 300MB! Maybe even less.

Also, clear() should enable the hash_map implementation to
start over with the already allocated memory - there shouldn't
be a "swiss cheese" memory layout.

The clear is a design flaw anyway.

Should <tr1/unordered_map> be tried instead with a newer version
of gcc? Would there be more "control" over memory consumption or
rehashing?

This will most likely not solve your problem.


Marcel
 
