How to find out memory occupied by vector, set, or map?

L

Lambda

I'm trying to develop several interesting components
of a simple search engine follow some text books.

A book introduces some ways to compress dictionary,
allow it to stay on main memory.
In the dictionary, is a list of millions of words
with related data of several bytes.

Word / Term Freq(int) / Pointer to disk file
Hello 100

One method is:
concatenate all the words into one long contiguous string
and an array of 4-byte character pointers is used to access.

But I think STL map is a suitable data structure for this problem.
I don't know the detailed internal implementation of map.
I think it's implemented as a binary search tree.
Every node occupy just enough memory. (I'm not sure)
I also need the search feature of BST.

So I'd like to know if it is a need to build the data structure
myself.
How can I know the whole memory occupied by a map,
as well as other data structures such as vector, set.
(Or memory occupied by a map element).

BTW, should I use std::string or char[] to store the millions of
words?
How much is the string overhead?
 
D

dizzy

Lambda said:
But I think STL map is a suitable data structure for this problem.
I don't know the detailed internal implementation of map.
I think it's implemented as a binary search tree.
Every node occupy just enough memory. (I'm not sure)
I also need the search feature of BST.

Depends on implementation. g++ 4.1.2 comes with a red-black tree based
implementation.
So I'd like to know if it is a need to build the data structure
myself.
How can I know the whole memory occupied by a map,
as well as other data structures such as vector, set.
(Or memory occupied by a map element).

You could instantiate your used containers with your own allocator type that
would keep counts on the allocated/deallocated memory (as the containers
are mandated to use the allocator for all allocation needs) and call upon
the std::allocator for the actual allocation/deallocation.

Other than that there is no standard API to provide this information.

You could also aproximate the used memory by assuming some overhead for each
allocated node (for node based containers, like list, map, set, etc but NOT
vector, deque) and also considering the size of the element (with sizeof()
of course).
BTW, should I use std::string or char[] to store the millions of
words?

If by char[] you mean an char[N] array type, you must know that such a type
requires a compile time fixed N value. Which means that unless your words
do not vary greatly in length you would be wasting alot of memory by having
a big enough N to hold all possible words. And then again, if you need to
allocate the words at runtime depending on some runtime input then you
still need to perform dynamic allocation allocating static arrays somewhere
in that dynamic allocated memory which kinda beats the purpose of static
arrays. So just use std::string.
How much is the string overhead?

What kind of overhead? Let's discuss a naive implementation (non reference
counted) of std::string:
- it probably has some optimization for empty strings and/or strings with
short length (that is, when you create emtpy strings of strings of short
length there isn't any dynamic memory allocation done by the string itself)
- for all other cases, std::string will use the given allocator to allocate
memory and put the string contents (we don't know if it does it continously
or in chunks like std::deque usually does)
- when you remove data from the string it will update internal counters and
possibly copy bytes around but almost for sure it won't deallocate any
memory
- when you add data to the string (appending) it may allocate more memory to
hold the new data if it doesn't have enough space yet

So, from that perspective, what usecase will your std::string have? That is,
will you create few std::string that constanlty get updated (in contents)?
Or will you create many std::string that are basically constant? And when
you compare if something has overhead you do it relative to another thing,
if you try to imagine your own implementation of a std::string you won't
come better than what your C++ library has anyway.

Or otherwise put, make it work, profile, optimize, in that order!
 
J

James Kanze

I'm trying to develop several interesting components
of a simple search engine follow some text books.
A book introduces some ways to compress dictionary,
allow it to stay on main memory.
In the dictionary, is a list of millions of words
with related data of several bytes.
Word / Term Freq(int) / Pointer to disk file
Hello 100
One method is:
concatenate all the words into one long contiguous string
and an array of 4-byte character pointers is used to access.

A classical string pool. It's not a solution to your problem,
just a very special way to represent strings.
But I think STL map is a suitable data structure for this
problem. I don't know the detailed internal implementation of
map. I think it's implemented as a binary search tree. Every
node occupy just enough memory. (I'm not sure) I also need the
search feature of BST.

Every node is dynamically allocated, and contains several
pointers in addition to the raw data, so it can be fairly
expensive in terms of memory use. On the other hand, for
something small, like a couple of million elements, why not.

Note that access time is O(lg n), and not constant. For more
than a couple of thousand elements, a hash table will be faster
(provided you use a good hash function).
So I'd like to know if it is a need to build the data
structure myself. How can I know the whole memory occupied by
a map, as well as other data structures such as vector, set.
(Or memory occupied by a map element).

You can't really, since there are so many hidden overheads. I'd
just go ahead and implement using std::map, but carefully
encapsulating it so that I could replace it if it did cause a
problem later.
BTW, should I use std::string or char[] to store the millions
of words?

std::string, definitely.
How much is the string overhead?

Again, it depends on the implementation. A lot of
implementations today use a small string optimization which
avoids dynamic allocation for strings shorter than some cut-off
size; if your words are English, there's a good chance that most
of them will fall under this limit.

All sorts of optimizations are possible, once you have the
application working, and can see if they're necessary.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,755
Messages
2,569,536
Members
45,009
Latest member
GidgetGamb

Latest Threads

Top