Memory allocation failure in map container

  • Thread starter Saeed Amrollahi
  • Start date

Saeed Amrollahi

Dear friends
Hi

I am working on an application to manage the Tehran Securities Exchange
organization. In this program, I manipulate a large number of
institutional investors (about 1,500,000). The investors are a struct
like this:

struct Investor {
    int ID;
    std::wstring NID;
    std::wstring Name;
    std::wstring RegNum;
    std::wstring RegDate;
    std::wstring RegProvince;
    std::wstring RegCity;
    std::wstring Type;
    std::wstring HQAddr;
    // the deblanked or squeezed info
    std::wstring sq_Name;
    std::wstring sq_RegNum;
    std::wstring sq_RegProvince;
    std::wstring sq_RegCity;
};
The investors are stored in an Access database, and I read them into
memory when the program loads. Because each investor is registered in a
city, I tried to use a map container, mapping a city to the collection
of its investors:

map<wstring, vector<Investor>*> g_Investor; // map city name to the investors located in it

For the sake of efficiency, I used a pointer to vector as the mapped
value. The following code fills the map:

void InvestorSearchEngine::FillInvestors(const vector<Investor>& m)
{
    try {
        for (vector<Investor>::const_iterator cit = m.begin();
             cit != m.end(); ++cit) {
            map<wstring, vector<Investor>*>::iterator it =
                g_Investor.find(cit->RegCity);
            if (it != g_Investor.end()) {
                // so, there is a node with this reg. city already
                it->second->push_back(*cit);
            }
            else { // the registered city is new
                vector<Investor>* vp = new vector<Investor>();
                vp->push_back(*cit);
                g_Investor[cit->RegCity] = vp;
            }
        }
    }
    catch (const std::bad_alloc&) {
        wofstream LogFile("D:/log.txt", std::ios_base::app);
        LogFile << L"Memory exhausted ... " << L'\n';
        LogFile << g_Investor.max_size() << L'\n';
        LogFile.close();
    }
}

Unfortunately, sometimes the memory allocation fails and the "Memory
exhausted ..." message is written to the log file.
FYI, max_size() returns 119304647 as the largest possible map size.
What's wrong with my code?
I use a Pentium 4 CPU 3.00 GHz, 4 GB RAM, Windows XP and Visual Studio
2008.

Please throw some light.
-- Saeed Amrollahi
 
So you have 1.5 million instances of your Investor struct, plus storage for your maps and vectors and you're surprised that you've run out of memory?

Have you thought about using a proper database for this stuff?


P.S. Oops, I see you are using a database for this, but still ... loading ALL of it into memory doesn't seem like a practical solution. Your database probably provides an API offering all the functionality you need. Don't just suck the entire thing into memory; use the database API to query it or modify the data.
 

Goran

Dear friends
Hi

I am working on an application to manage Tehran Securities Exchange
organization.
In this program, I try to manipulate a lot of institutional investors
(about 1,500,000). The Investors are a struct like this:...
The investors are stored in an Access Database and I read them into
memory at the program loading time.
I use Pentium 4 CPU 3.00 GHz, 4 GB RAM and Windows XP, Visual Studio
2008.

Address space for any program in 32-bit XP is, by default, 2GB (you
can raise this to 3GB, but you really shouldn't). It does not matter
that you have 4GB of memory or however big a swap file; your program's
data must fit in 2GB. It looks like you exhausted that.

You must redesign the program not to load that much data into memory.
You already have a database, and a database is made exactly for that
(among other things): manipulating amounts of data that do not fit
into memory. So use it.

Goran.
 

James Kanze

(e-mail address removed):
Size of this struct is 584 bytes with my compiler. 1500000*584 is ca 835
MB. You are holding them twice (both in vector and map), which makes it
ca 1.7 GB. This brings you already quite close to the maximum limits for
32-bit applications (2G in Windows by default), hence the memory problems
are expected.

Independently of sizeof(Investor), all those strings are likely
to require additional memory (unless they are empty, or unless
the compiler uses the small string optimization, and they are
small enough---but things like names, provinces and cities
generally aren't). This could easily double the necessary
memory (with small string optimization) or multiply it by 5 to
10, or more (without small string optimization, but then,
sizeof(Investor) would probably be around 64).
The correct solution would be to redesign the application so that you
don't need to load all the data in the memory at once. Normally database
applications are happy to load only the data they need at the moment.

It's hard to say what the correct solution is without more
context, but there's a very good chance you're right, at least
if the actual data is stored in a relational data base.
Another solution would be to use a 64-bit machine, OS and compilation,
this lifts the limits a bit, but does not help if your database happens
to grow 100 times.

If he's near the limits on a 32 bit machine, a 64 bit machine
should multiply the number he can handle by significantly more
than a 100. Moving to 64 bits is probably the simplest
solution, *IF* the program really needs to keep all of the data
virtually in memory.
Yet another solution is to reduce the size of
Investor, for example putting all strings together in a single
std::wstring, separated by some delimiter like a zero byte. This would
make finding, accessing and changing the individual fields slower and
more cumbersome of course.
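The packed-string idea above can be sketched roughly as follows. This is only an illustration under assumed names (PackedInvestor, make_investor are not from the original program): all text fields live in one delimited wstring, and a field is recovered by scanning for the delimiter.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Sketch: all text fields packed into one wstring, separated by L'\0',
// instead of twelve separate std::wstring objects per record.
struct PackedInvestor {
    int ID;
    std::wstring fields;  // field values separated by L'\0'

    // Return the n-th field by scanning for delimiters.
    std::wstring field(std::size_t n) const {
        std::size_t begin = 0;
        for (std::size_t i = 0; i < n; ++i)
            begin = fields.find(L'\0', begin) + 1;
        std::size_t end = fields.find(L'\0', begin);  // npos for last field
        return fields.substr(begin, end - begin);
    }
};

PackedInvestor make_investor(int id, const std::vector<std::wstring>& vals) {
    PackedInvestor inv;
    inv.ID = id;
    for (std::size_t i = 0; i < vals.size(); ++i) {
        if (i != 0) inv.fields += L'\0';
        inv.fields += vals[i];
    }
    return inv;
}
```

As the text says, access becomes slower and more cumbersome; the gain is one string object and at most one heap block per record instead of twelve.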

I'm not sure that that would gain that much. What is probable,
on the other hand, is that there are a lot of duplicates in the
province and city fields; using a string pool for these
could help. And it may be possible to generate the sq_...
values algorithmically from the non sq_ value; in that case,
they could be eliminated completely from the structure. But
none of these optimizations will last for long, if the table
significantly increases in size.
 

Saeed Amrollahi

Hi Paavo.
Thank you for your feedback, and I'm sorry for a delay.

(e-mail address removed):
Size of this struct is 584 bytes with my compiler. 1500000*584 is ca 835
MB. You are holding them twice (both in vector and map), which makes it
ca 1.7 GB. This brings you already quite close to the maximum limits for
32-bit applications (2G in Windows by default), hence the memory problems
are expected.
In general, you are right, but the sizeof of Investor on my machine is
388 bytes (4 + 12 * 32), and the map and the vector take a lot of
memory. Please note I have other data structures too. I reconsidered
Investor, and I really don't need some of the data members:
RegProvince, RegDate, Type, HQAddr and sq_RegProvince.
So I don't load these fields, and the size of Investor decreases by
40%.
For the time being, my program is OK, but it's not the final, good
solution.
The correct solution would be to redesign the application so that you
don't need to load all the data in the memory at once. Normally database
applications are happy to load only the data they need at the moment.
Yes. At the moment, redesigning my program is somewhat expensive,
but in the future I have to think about a pre-loading strategy.
Another solution would be to use a 64-bit machine, OS and compilation,
this lifts the limits a bit, but does not help if your database happens
to grow 100 times. Yet another solution is to reduce the size of
Investor, for example putting all strings together in a single
std::wstring, separated by some delimiter like a zero byte. This would
make finding, accessing and changing the individual fields slower and
more cumbersome of course.
At the moment, I can't migrate to Windows 7 or 64-bit versions of
Visual Studio. BTW, as you noted, I don't like fixing a bad design
with good hardware.
Your final solution was good and gave me a clue, so I removed the
redundant data members.
Thanks again.
-- Saeed Amrollahi
 

Balog Pal

"Saeed Amrollahi"
In general, you are right, but the sizeof of Investor on my machine is
388 bytes (4 + 12 * 32), and the map and the vector take a lot of memory.
I reconsidered Investor, and I really don't need some of the
data members: RegProvince, RegDate, Type, HQAddr and
sq_RegProvince.

If you need that amount of records, you'd better drop std::wstring
(unless the strings you store happen to be very close to 31 bytes long,
guess ~14 letters if you use some UCS-2 variant and a 16-bit wchar_t).
The essential size of a string is just 4 bytes on a 32-bit platform
(though you certainly need to add the size of the actual data plus the
heap-block overhead). I bet you can find string implementations tuned
for different purposes, including a low memory footprint.
 

Saeed Amrollahi

Saeed Amrollahi said:
I am working on an application to manage Tehran Securities Exchange
organization.
In this program, I try to manipulate a lot of institutional investors
(about 1,500,000). The Investors are a struct like this:
struct Investor {
  int ID;
  std::wstring NID;
  std::wstring Name;
  std::wstring RegNum;
  std::wstring RegDate;
  std::wstring RegProvince;
  std::wstring RegCity;
  std::wstring Type;
  std::wstring HQAddr;
  // the deblanked or squeezed info
  std::wstring sq_Name;
  std::wstring sq_RegNum;
  std::wstring sq_RegProvince;
  std::wstring sq_RegCity;
};
The investors are stored in an Access Database and I read them into
memory at the program loading time. Because, each investor is
registered in a City, I tried
to use a map container: map a city to the collection of investors:
  map<wstring, vector<Investor>*> g_Investor; // map city name to the
investors located in
For the sake of efficiency,  I used  pointer to vector as a value of
the map.

Using a pointer as you do above is not inherently any more efficient
than a map<wstring, vector<Investor> > holding the vectors by value.
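For comparison, here is a sketch of the same fill with the vector stored by value in the map (Investor is trimmed to the two fields the sketch needs, which is an assumption for illustration): operator[] creates an empty vector on first use, so the find/insert dance and the explicit new both disappear.

```cpp
#include <map>
#include <string>
#include <vector>

// Illustrative, trimmed Investor (not the full struct from the post).
struct Investor {
    int ID;
    std::wstring RegCity;
};

std::map<std::wstring, std::vector<Investor> > g_Investor;

// Same grouping as the original FillInvestors, but with the vector
// held by value: no pointers, no manual allocation, no leak risk.
void FillInvestors(const std::vector<Investor>& m) {
    for (std::vector<Investor>::const_iterator cit = m.begin();
         cit != m.end(); ++cit) {
        g_Investor[cit->RegCity].push_back(*cit);
    }
}
```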
the following code fills the map:
void InvestorSearchEngine::FillInvestors(const vector<Investor>& m)
{

So at this point, 'm' has 1,500,000 members? I can only assume that it
isn't referring to some temporary structure. If it isn't, then one thing
that may help is to have the map hold a vector of Investor* instead of
Investor. That way you won't have two copies of each Investor object.
However, if the vector that 'm' refers to changes, you will have to
rebuild g_Investor all over again. There is probably a better solution.
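A sketch of that pointer-holding variant, under the caveat just stated (the master vector must outlive the index and must not reallocate afterwards; the trimmed Investor is illustrative):

```cpp
#include <map>
#include <string>
#include <vector>

// Illustrative, trimmed Investor (not the full struct from the post).
struct Investor {
    int ID;
    std::wstring RegCity;
};

// Index the master vector by city without copying any Investor: the map
// stores pointers into 'all', so 'all' must outlive the index and must
// not grow (growth may reallocate and invalidate every pointer).
std::map<std::wstring, std::vector<const Investor*> >
BuildCityIndex(const std::vector<Investor>& all) {
    std::map<std::wstring, std::vector<const Investor*> > index;
    for (std::vector<Investor>::const_iterator it = all.begin();
         it != all.end(); ++it) {
        index[it->RegCity].push_back(&*it);
    }
    return index;
}
```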

I see that your class is called "InvestorSearchEngine." If the goal of
this class is to help clients retrieve Investors that meet specific
criteria, then keeping a huge map of all the Investor objects by city
seems inappropriate.


    try {
        for (vector<Investor>::const_iterator cit = m.begin();
             cit != m.end(); ++cit) {
            map<wstring, vector<Investor>*>::iterator it =
                g_Investor.find(cit->RegCity);
            if (it != g_Investor.end()) {
                // so, there is a node with this reg. city already
                it->second->push_back(*cit);
            }
            else { // the registered city is new
                vector<Investor>* vp = new vector<Investor>();
                vp->push_back(*cit);
                g_Investor[cit->RegCity] = vp;
            }
        }
    }
    catch (const std::bad_alloc&) {
        wofstream LogFile("D:/log.txt", std::ios_base::app);
        LogFile << L"Memory exhausted ... " << L'\n';
        LogFile << g_Investor.max_size() << L'\n';
        LogFile.close();
    }
Unfortunately, sometimes the memory allocation fails and the "Memory
exhausted ..." message is written to the log file.

It sounds like you need to sacrifice some of this preloading in order to
save space.

Hi Daniel
Thank you for your answer.
I tried to make the Investor class smaller by removing unnecessary
data members.

Regards,
-- Saeed Amrollahi
 

Saeed Amrollahi

Address space for any program in 32-bit XP is, by default, 2GB (you
can raise this to 3GB, but you really shouldn't). It does not matter
that you have 4GB of memory or however big a swap file; your program's
data must fit in 2GB. It looks like you exhausted that.

You must redesign the program not to load that much data into memory.
You already have a database, and a database is made exactly for that
(among other things): manipulating amounts of data that do not fit
into memory. So use it.

Goran.

That's wise. Thank you for your answer.
I'll consider it.

Regards,
-- Saeed Amrollahi
 

James Kanze

You are right of course, I was assuming most of these strings
are short and fit in the small string optimization buffer.

What are usual values for the small string cutoff? I seem to
remember 8 being used somewhere, but if sizeof of this structure is
584, and wchar_t is 2 bytes (Windows), then I'd guess more
likely 16 (or 8 if wchar_t is 4 bytes, as is the case on most
Unix systems). With 16, you're almost certainly right---most of
these fields would fit into 16 characters (supposing the field
names mean what they seem to mean, and judging from the names I
see on maps, etc.). That does change things.
If he used UTF-8 encoded
std::string instead of std::wstring, this assumption might be true more
often (assuming the texts are mostly ASCII of course).

He said that this was for the Teheran securities exchange. The
local language in Teheran uses the Farsi alphabet, a modified
(or extended) form of the Arabic alphabet. In UTF-8, these are
all two byte characters, in the range D8 80--DB BF. If wchar_t
uses UTF-16, he gains absolutely nothing. (If it uses UTF-32,
on the other hand, he divides his memory use by 2. Maybe.)

As to whether it fits into the small string optimization or not,
all depends on how that is implemented. If the optimization
limit depends on the number of code points (char or wchar_t, and
the sizeof you mention would seem to suggest that), then he will
miss the small string optimization more often using UTF-8 than
using UTF-16 or UTF-32. (If the limit depends on the number of
bytes, then UTF-16 and UTF-8 are roughly equal, supposing most
text written in Farsi.)
The address space limits would be gone on a 64-bit machine, but the new
limits will be set by the physical RAM and disk space present. A normal
desktop PC would have e.g. 8GB of RAM, which is only 4 times larger than
the 2GB limit for Windows. The disks are maybe 200GB or something; if one
(ab)uses all of that for the pagefile, then one could theoretically
hold ca 100 times more data in process memory than a 32-bit process
(assuming one has enough patience to wait for such a process to complete
anything!).

Terabyte disks and larger are readily available, and not
expensive. On a 64 bit system, he should easily be able to get
rid of the bad_alloc. But in return, there would be so much
swapping that the system becomes unusable.
There is a lot of space overhead associated with each std::wstring. In my
implementation it seems the size of std::wstring is 32 bytes in 32-bit
compilations, of which only 16 bytes are used for the small string
buffer.

That seems like a lot, although... It's not too difficult to
implement the small string optimization with an overhead of only
two pointers. On the other hand, any quality implementation
will have a debugging option, which will require extra pointers
to keep a list of iterators, etc. From what I've seen, I have
the impression that VC++ tries to make it possible to link
debugging and non-debugging versions (although at least in VC8,
this doesn't actually work), and so uses the extra space even
when debugging is turned off.
For example, if all strings are very short and fit in the small
string buffers, then one can easily reduce the overall memory
consumption by half by packing them all in a single string. If the
strings are large then this would not help, of course.

If the strings are large, you would only have one allocation,
instead of many. And each allocation has significant overhead
as well (maybe 16 bytes), without mentioning the small string
buffer which isn't used.

No, you're right. Using a single string with a separator will
represent a significant gain (unless the strings are really,
really long, which I doubt is the case). Regardless of the
implementation of std::basic_string, and whether the strings fit
into the small string buffer or not.

Even without many duplicates: using string pooling would allow
putting all of the actual character data in a single string,
which as you correctly point out, is a win in itself. Each
string in his structure would have type PooledString, which
would consist of two size_t (indexes to the begin and end of the
string in the pool). On a 32 bit machine, that's 8 bytes,
instead of 584; there's still the overhead for the pool, but it
wouldn't surprise me if this halved the total memory use.
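The PooledString idea can be sketched as below. StringPool is an assumed name for illustration, not an existing library; the essential point is that a field shrinks to two size_t values (8 bytes on a 32-bit build) and all character data shares one buffer.

```cpp
#include <cstddef>
#include <string>

// All character data lives in one shared wstring; each logical string
// is just an (offset, length) pair into it.
class StringPool {
public:
    struct PooledString {
        std::size_t offset;
        std::size_t length;
    };

    // Append a string to the pool and return its handle.
    PooledString add(const std::wstring& s) {
        PooledString ps;
        ps.offset = data_.size();
        ps.length = s.size();
        data_ += s;
        return ps;
    }

    // Materialize a pooled string as a std::wstring copy.
    std::wstring get(PooledString ps) const {
        return data_.substr(ps.offset, ps.length);
    }

private:
    std::wstring data_;
};
```

Strings are never removed individually here; for a load-once dataset like this one, that restriction costs nothing.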
 

Kevin P. Fleming

Even without many duplicates: using string pooling would allow
putting all of the actual character data in a single string,
which as you correctly point out, is a win in itself. Each
string in his structure would have type PooledString, which
would consist of two size_t (indexes to the begin and end of the
string in the pool). On a 32 bit machine, that's 8 bytes,
instead of 584; there's still the overhead for the pool, but it
wouldn't surprise me if this halved the total memory use.

Since the combined strings would never need to be accessed as one
combined string, the choice of delimiter is irrelevant... which means
that a zero byte is just as valid as a comma. Using a zero byte to
separate
strings in the pool means that each string object only needs a pointer
to the beginning of the string (unless storing its length is valuable,
but that's a separate issue entirely). Since a pointer and a size_t are
probably the same size (certainly they are on modern x86 platforms),
each string object becomes very small, and access to them is quite fast.
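A sketch of this NUL-delimited variant (NulPool is an assumed name; it stores an index rather than a raw pointer so the handle survives buffer reallocation, a small deviation from the pointer-only description above):

```cpp
#include <cstddef>
#include <string>

// Pool where each entry is terminated by L'\0', so a single index (or
// pointer) per field suffices and reads work like plain C strings.
class NulPool {
public:
    // Append a string and return the index of its first character.
    std::size_t add(const std::wstring& s) {
        std::size_t at = data_.size();
        data_ += s;
        data_ += L'\0';  // delimiter doubles as the terminator
        return at;
    }

    // NUL-terminated view into the pool; valid until the pool grows
    // enough to reallocate its buffer.
    const wchar_t* get(std::size_t at) const {
        return data_.c_str() + at;
    }

private:
    std::wstring data_;  // embedded L'\0' separators are fine here
};
```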
 
