Pre-allocate a large amount of memory?


Carsten Gehling

I've created a small daemon that serves certain data very quickly to the
rest of our web application. It does so by pre-loading a lot of data from
our database into a special structure in memory. Boiled down, the
pre-loading procedure looks like this:

@data_hash = {}
DBI.connect(dsn, user, pass) do |dbh|
  sth = dbh.execute("some-sql-returning >300.000 rows")
  while row = sth.fetch_hash
    hash_key = <calculated-from-row-data>

    @data_hash[hash_key] ||= []
    @data_hash[hash_key] << row
  end
  sth.finish
end

It gives a @data_hash with a structure like this:

@data_hash = {
  'key1' => [row1, row2, row3, ...],
  'key2' => [row4, row5, ...],
  ...
}

I have a problem with the speed of the pre-loading, though. I am loading
~360.000 rows into memory (about 500MB). The first 50% of the rows are
read and placed in the hash pretty quickly, but after that it slows down.

I suspect that Ruby's dynamic memory allocation for the Hash is to
blame. Can I somehow pre-allocate 500MB of memory for the hash? Or tweak
the way Ruby allocates memory?

Thanks in advance!

- Carsten
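One workaround often used for bulk loads like the one above (a sketch only, assuming MRI's garbage collector is what causes the progressive slowdown; the row data and key calculation below are placeholders, not the poster's real DBI code) is to switch GC off for the duration of the load and re-enable it afterwards:

```ruby
# Sketch: bulk-load rows into a hash of arrays with GC switched off
# during the load. The rows and the key lookup are stand-ins.
def bulk_load(rows)
  data_hash = Hash.new { |h, k| h[k] = [] }  # auto-create the row arrays
  GC.disable                                 # avoid GC passes mid-load
  begin
    rows.each do |row|
      hash_key = row[:company]               # stand-in for the real key calculation
      data_hash[hash_key] << row
    end
  ensure
    GC.enable                                # always turn GC back on
    GC.start                                 # one full collection after the load
  end
  data_hash
end

rows = [{ company: "FooCorp", share: "10%" },
        { company: "FooCorp", share: "45%" },
        { company: "BarCorp", share: "20%" }]
h = bulk_load(rows)
```

Disabling GC trades peak memory for load speed, so it only helps if the machine has headroom beyond the ~500MB working set.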
 

Jason Stewart

You could probably just use memcached. It's optimized for efficient
memory allocation of hash based data stores.
 

Carsten Gehling

Jason said:
You could probably just use memcached. It's optimized for efficient
memory allocation of hash based data stores.

I thought of that too. But doesn't memcached expire its data after a
while? Or is it possible to disable that?

- Carsten
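For reference, memcached treats an expiration time of 0 as "never expire", although entries can still be evicted under memory pressure once the cache fills up. A toy in-process illustration of that TTL convention (not a memcached client, just the semantics):

```ruby
# Toy cache mimicking memcached's TTL convention: 0 means "no expiry".
# Illustration of the semantics only, not a real memcached client.
class ToyCache
  Entry = Struct.new(:value, :expires_at)

  def initialize
    @store = {}
  end

  # ttl in seconds; 0 (memcached's convention) disables expiry
  def set(key, value, ttl = 0)
    expires_at = ttl.zero? ? nil : Time.now + ttl
    @store[key] = Entry.new(value, expires_at)
  end

  def get(key)
    entry = @store[key]
    return nil unless entry
    if entry.expires_at && Time.now >= entry.expires_at
      @store.delete(key)   # lazily drop expired entries on read
      nil
    else
      entry.value
    end
  end
end

cache = ToyCache.new
cache.set("key1", [1, 2, 3])        # ttl 0: kept indefinitely
cache.set("key2", [4, 5], 0.01)     # expires after 10 ms
```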
 


Robert Klemme

2009/9/14 Carsten Gehling said:
I've created a small daemon, that serves certain data very fast to our
remaining web-application. It does so by pre-loading a lot of data from
our database into a special structure in memory. Boiled down, the
pre-loading procedure looks like this:

@data_hash = {}
DBI.connect(dsn, user, pass) do |dbh|
  sth = dbh.execute("some-sql-returning >300.000 rows")
  while row = sth.fetch_hash
    hash_key = <calculated-from-row-data>

    @data_hash[hash_key] ||= []
    @data_hash[hash_key] << row
  end
  sth.finish
end

It gives a @data_hash with a structure like this:

@data_hash = {
  'key1' => [row1, row2, row3, ...],
  'key2' => [row4, row5, ...],
  ...
}

I have a problem with the speed of the pre-loading though. I am loading
~360.000 rows into memory (about 500MB). The first 50% of the rows are
read and placed in the hash pretty quick. But after that it slows down.

I suspect that Ruby's dynamic memory allocation for the Hash is to
blame. Can I somehow pre-allocate 500MB of memory for the hash? Or tweak
the way Ruby allocates memory?

Maybe it's not such a good idea to read everything into memory. After
all, retrieving large volumes of data from disk is where RDBMSs
shine. What about querying individual records and storing them in an
LRU store? You'll find such a beast here, which can be used like a
Hash:

http://github.com/rklemme/muppet-laboratories/tree/9eb71ab7bf7d5db7396905b500bd3ac7cc153e0a/lib

Then you don't burn large amounts of memory but retain the caching effect.

Cheers

robert

--
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/
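For illustration, a minimal LRU cache along the lines Robert describes (a sketch using Ruby's insertion-ordered hashes, not the linked library; the loader block stands in for the per-record SQL query) might look like:

```ruby
# Minimal LRU cache sketch: bounded hash that evicts the
# least-recently-used entry; misses fall back to a loader block.
class LruCache
  def initialize(capacity, &loader)
    @capacity = capacity
    @loader   = loader   # e.g. a block doing the single-row SQL query
    @store    = {}       # insertion order doubles as recency order
  end

  def [](key)
    if @store.key?(key)
      @store[key] = @store.delete(key)         # move to most-recent position
    else
      @store[key] = @loader.call(key)          # cache miss: load from the DB
      @store.shift if @store.size > @capacity  # evict least-recently-used
    end
    @store[key]
  end
end

db_calls = 0
cache = LruCache.new(2) { |key| db_calls += 1; "row-for-#{key}" }
cache["a"]; cache["b"]; cache["a"]   # "a" refreshed, "b" now least recent
cache["c"]                           # capacity exceeded: evicts "b"
```

The design point is that `Hash#shift` removes the oldest entry, so re-inserting a key on every hit keeps the hash sorted by recency for free.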
 

Kirk Haines

Robert said:
Maybe it's not such a good idea to read everything into memory. After
all, retrieving large volumes of data from disk is where RDBMSs
shine. What about querying individual records and storing them in an
LRU store? You'll find such a beast here, which can be used like a
Hash:

http://github.com/rklemme/muppet-laboratories/tree/9eb71ab7bf7d5db7396905b500bd3ac7cc153e0a/lib

Then you don't burn large amounts of memory but retain the caching effect.
 

Carsten Gehling

Robert said:
Maybe it's not such a good idea to read everything in memory. After
all, retrieving large volumes of data from disk is where RDBMS's
shine.

My reason for doing this is simply to avoid multiple SQL queries in a
recursive, self-referencing data structure. Usually this can be avoided by
querying the entire subset of data needed and manipulating it in memory.
Unfortunately the data in question does not allow me to query just the
subset of data.

I will definitely have a look at Redis. Seems promising.

Thanks.

- Carsten
 

Josh Cheek


Carsten said:
My reason for doing this is simply to avoid multiple SQL queries in a
recursive, self-referencing data structure. Usually this can be avoided by
querying the entire subset of data needed and manipulating it in memory.
Unfortunately the data in question does not allow me to query just the
subset of data.

I will definitely have a look at Redis. Seems promising.

Thanks.

- Carsten
I had a situation like that, where I ended up using one of the nested set
implementations (I think it was Awesomer Nested Set) instead of a tree
hierarchy. I ended up having to write my own (very ugly) iterating
interpreter, which analyzed each node to figure out its recursive
context. It took something like 3 or 4 blocks for the different places you
could inject code based on a recursive interpretation, but it did get me
down to a single database call, and it did only require a single pass
through the data. At one point (debugging, ahem) I was very strongly
considering going with the approach you are taking. Good luck with it, and
let me know how it turns out; I might opt for that approach instead next
time.
 

Robert Klemme

Carsten said:
My reason for doing this is simply to avoid multiple SQL queries in a
recursive, self-referencing data structure. Usually this can be avoided by
querying the entire subset of data needed and manipulating it in memory.
Unfortunately the data in question does not allow me to query just the
subset of data.

Is your structure strictly hierarchical (i.e. a tree) or do you need to
query part of a graph with cycles? If it is strictly hierarchical there
is a solution that works for all RDBMS and in the latter case there are
solutions for some RDBMS.

Kind regards

robert
 

Carsten Gehling

Robert said:
Is your structure strictly hierarchical (i.e. a tree) or do you need to
query part of a graph with cycles? If it is strictly hierarchical there
is a solution that works for all RDBMS and in the latter case there are
solutions for some RDBMS.

Unfortunately it is not strictly hierarchical. It is relations between
companies and companies, and between companies and persons, that represent
shareholders, parent/subsidiary companies, etc.

My job is, given a certain company X, to extract all its owners
and, recursively, their owners, etc., until I reach the "top". Likewise,
the other way: to extract all companies owned by company X and,
recursively, all companies owned by them, etc.

Conceptually, the table (actually a view) I am querying holds the data:

CompanyA,  Direction,  CompanyB,  Share
"FooCorp", "owns",     "BarCorp", "10%"
"BarCorp", "ownedby",  "FooCorp", "10%"
"BarCorp", "owns",     "BazCorp", "45%"
"BazCorp", "ownedby",  "BarCorp", "45%"
"QweCorp", "owns",     "RteCorp", "20%"
"RteCorp", "ownedby",  "QweCorp", "20%"
etc.

The table is not of my doing. It is very difficult (at least for me) to
devise a way to query only the necessary rows in the table without
resorting to recursive calls.

In the example above: If given "FooCorp", all but the last two rows
should be extracted and used in the result. How would you go about doing
that?

I haven't found a solution yet. This is why I've gone and made a
service that holds all these data in memory (in a hash), to speed
things up.

BTW: Thanks for all your great suggestions so far. :)

- Carsten
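The "extract everything reachable from FooCorp" step Carsten describes can be sketched as a plain graph walk over rows of that view, with a visited set guarding against ownership loops (illustrative only; the sample rows come from the conceptual table above, and the method names are made up):

```ruby
# Walk the ownership graph in one direction ("owns" or "ownedby"),
# starting from a company, collecting the rows that were traversed.
# Rows mirror the conceptual view: [company_a, direction, company_b, share].
ROWS = [
  ["FooCorp", "owns",    "BarCorp", "10%"],
  ["BarCorp", "ownedby", "FooCorp", "10%"],
  ["BarCorp", "owns",    "BazCorp", "45%"],
  ["BazCorp", "ownedby", "BarCorp", "45%"],
  ["QweCorp", "owns",    "RteCorp", "20%"],
  ["RteCorp", "ownedby", "QweCorp", "20%"],
]

def reachable_rows(rows, start, direction)
  index = Hash.new { |h, k| h[k] = [] }
  rows.each { |r| index[[r[0], r[1]]] << r }   # index by (company, direction)

  visited, result, queue = {}, [], [start]
  until queue.empty?
    company = queue.shift
    next if visited[company]           # cycle guard: visit each company once
    visited[company] = true
    index[[company, direction]].each do |row|
      result << row
      queue << row[2]                  # follow the edge to the other company
    end
  end
  result
end

owned = reachable_rows(ROWS, "FooCorp", "owns")
```

Running both directions from "FooCorp" and merging the results would reproduce the "all but the last two rows" answer from the example.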
 

Josh Cheek


Carsten said:
Unfortunately it is not strictly hierarchical. It is relations between
companies and companies, and between companies and persons, that represent
shareholders, parent/subsidiary companies, etc.

My job is, given a certain company X, to extract all its owners
and, recursively, their owners, etc., until I reach the "top". Likewise,
the other way: to extract all companies owned by company X and,
recursively, all companies owned by them, etc.

Conceptually, the table (actually a view) I am querying holds the data:

CompanyA,  Direction,  CompanyB,  Share
"FooCorp", "owns",     "BarCorp", "10%"
"BarCorp", "ownedby",  "FooCorp", "10%"
"BarCorp", "owns",     "BazCorp", "45%"
"BazCorp", "ownedby",  "BarCorp", "45%"
"QweCorp", "owns",     "RteCorp", "20%"
"RteCorp", "ownedby",  "QweCorp", "20%"
etc.

The table is not of my doing. It is very difficult (at least for me) to
devise a way to query only the necessary rows in the table without
resorting to recursive calls.

In the example above: If given "FooCorp", all but the last two rows
should be extracted and used in the result. How would you go about doing
that?

I haven't found a solution yet. This is why I've gone and made a
service that holds all these data in memory (in a hash), to speed
things up.

BTW: Thanks for all your great suggestions so far. :)

- Carsten
As far as retrieving them all in a single query, nested sets would do this.
There are other hits that you take from using them, though. Changing the
hierarchy, for example, is very expensive (i.e. company A sells company B or
purchases company C), and treating it as a recursive structure after you
have retrieved it is quite painful. If you do opt for this solution, I can
send you my code to give recursive functionality through iteration. I
did change some lines in the plugin itself to get more flexible use, but I
think I marked them all.

http://github.com/collectiveidea/awesome_nested_set
 

Robert Klemme

2009/9/15 Carsten Gehling said:
Unfortunately it is not strictly hierarchical. It is relations between
companies and companies, and between companies and persons, that represent
shareholders, parent/subsidiary companies, etc.

So you do actually allow for loops, i.e. company A owns 10% of company
B which owns 10% of company A.
My job is to - given a certain company X - to extract all its owners
and, recursively, their owners, etc. until I reach the "top". Likewise
the other way to extract all companies owned by company X and,
recursively, all companies owned by them, etc.

Conceptually, the table (actually a view) I am querying holds the data:

CompanyA,  Direction,  CompanyB,  Share
"FooCorp", "owns",     "BarCorp", "10%"
"BarCorp", "ownedby",  "FooCorp", "10%"
"BarCorp", "owns",     "BazCorp", "45%"
"BazCorp", "ownedby",  "BarCorp", "45%"
"QweCorp", "owns",     "RteCorp", "20%"
"RteCorp", "ownedby",  "QweCorp", "20%"
etc.

The table is not of my doing. It is very difficult (at least for me) to
devise a way to query only the necessary rows in the table without
resorting to recursive calls.

If I understand that table design properly, it is awful, because the
semantics of columns one, three and four change based on the content of
column two. The usual way would be to model this with fixed
semantics, i.e. to have only one direction of the ownership relation in the
table. In your case you will probably have to do a normalization step
by defining a view on this table with a UNION ALL, or use a WITH clause
in the query, to ensure the query can be built in a reasonable way.
In the example above: If given "FooCorp", all but the last two rows
should be extracted and used in the result. How would you go about doing
that?

There are features in modern RDBMS which allow for recursive querying.
In PostgreSQL and Microsoft SQL Server you can use WITH expression:
http://www.postgresql.org/docs/8.4/static/queries-with.html
http://msdn.microsoft.com/en-us/library/ms175972(SQL.90).aspx

In Oracle there is CONNECT BY
http://download.oracle.com/docs/cd/B19306_01/server.102/b14200/statements_10002.htm#i2066102

The downside is that recursive queries tend to have a performance hit
as the DB engine cannot fetch everything via a single index and needs
to look at results so far to know what other records it has to
retrieve.
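Conceptually, a recursive WITH query is evaluated as a fixpoint iteration: start from the seed rows, repeatedly join the rows found so far against the relation, and stop when no new rows appear. A Ruby sketch of that evaluation model (the edge data is illustrative; the database engine does this internally):

```ruby
# Fixpoint evaluation sketch of a recursive CTE over an ownership table.
# EDGES holds (owner, owned) pairs, the normalized one-direction relation.
EDGES = [
  ["FooCorp", "BarCorp"],
  ["BarCorp", "BazCorp"],
  ["QweCorp", "RteCorp"],
]

def transitively_owned(edges, root)
  result   = {}                     # companies known to be owned by root
  frontier = [root]
  until frontier.empty?             # iterate until no new rows appear
    step = edges.select { |owner, _| frontier.include?(owner) }
                .map { |_, owned| owned }
                .reject { |c| result[c] }   # keep only newly discovered rows
    step.each { |c| result[c] = true }
    frontier = step
  end
  result.keys
end
```

Rejecting already-seen rows at each step is what terminates the iteration even when the ownership graph contains cycles.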

The nested set model (Josh mentioned it as well) might help although I
haven't thought through all implications in your case:

http://dev.mysql.com/tech-resources/articles/hierarchical-data.html
http://www.codeproject.com/KB/database/nestedsets.aspx
I haven't found a solution yet. This is why I've gone and made a
service, that holds all these data in memory (in a hash), to speed up on
things.

You still have the issue that you maintain redundant data and must
find a way to invalidate your cache when the base data changes. Since
a complete reload of the cache is expensive (as you have experienced),
this will get complicated soon, because in order to find the data that
must be refreshed you need queries similar to those that answer
the questions you posed above.
BTW: Thanks for all your great suggestions so far. :)

You're welcome!

Kind regards

robert

--
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/
 

Carsten Gehling

Robert said:
So you do actually allow for loops, i.e. company A owns 10% of company
B which owns 10% of company A.

Yes that is allowed (we define some rules in the extraction code to
avoid infinite recursion).
If I understand that table design properly it is awful because
semantics of columns one, three and four change based on content of
column two. The usual way would be to model this with fixed
semantics, i.e. only have one direction of ownership relation in the
table. In your case you will probably have to do a normalization step
by defining a view on this table with a UNION ALL or use a WITH clause
in the query to ensure the query can be built in a reasonable way.

Actually the above structure is a view, which adds this direction. The
real data table only holds one record for each relation, that is:

"BazCorp", "ownedby", "BarCorp", "45%"
"RteCorp", "ownedby", "QweCorp", "20%"
There are features in modern RDBMS which allow for recursive querying.
In PostgreSQL and Microsoft SQL Server you can use WITH expression:
http://www.postgresql.org/docs/8.4/static/queries-with.html
http://msdn.microsoft.com/en-us/library/ms175972(SQL.90).aspx

I am using Microsoft SQL Server, but I didn't know about recursive
querying.
The downside is that recursive queries tend to have a performance hit

Perhaps the best way will be to go with my current setup (i.e. loading
all the data), but use recursive querying to reload part of the data
when relations change.
The nested set model (Josh mentioned it as well) might help although I
haven't thought through all implications in your case:

I would rather not begin to alter my data structure.
You still have the issue that you maintain redundant data and must
find a way to invalidate your cache when the base data changes.

I think I may have found a way to calculate exactly which
records I need to reload if the relation between two companies is
added/changed/deleted. And SQL Server's WITH option might help me there.

I will have a look at that now. Thanks a bunch for your input - all of
you.

I will post my results, when I get there.

- Carsten
 

Robert Klemme

2009/9/15 Carsten Gehling said:
Yes that is allowed (we define some rules in the extraction code to
avoid infinite recursion).
Good.


Actually the above structure is a view, which adds this direction. The
real data-table only holds one record each relation, that is:

"BazCorp", "ownedby", "BarCorp", "45%"
"RteCorp", "ownedby", "QweCorp", "20%"
Oh.


I am using Microsoft SQL Server, but I didn't know about recursive
querying.


Perhaps the best way will be to go with my current setup (i.e. loading
all the data), but use recursive querying to reload part of the data
when relations change.

Before you do that: I would first try out the query and see how well
it performs with your data. If it performs badly, then maybe additional
indexing will help. Only if that also fails would I switch to a caching
approach.

The reason: I would try to keep the architecture simple and avoid
redundant data storage, as this tends to cause issues. If you can get
the data you need just in time via a reasonably efficient query, then
that is much preferable to your big caching-process approach.
I would rather not begin to alter my data structure.
Why?


I think I may have found a way to calculate exactly which
records I need to reload if the relation between two companies is
added/changed/deleted. And SQL Server's WITH option might help me there.

I will have a look at that now. Thanks a bunch for your input - all of
you.

I will post my results, when I get there.

We'll be waiting eagerly. :)

Cheers

robert

--
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/
 

Piyush Ranjan


My 2 cents

I have used this kind of caching method in the past, and it works really well
for me. I was able to speed up a database-with-very-heavy-nested-data-to-CSV
script by more than 30x, and this despite the fact that I had analyzed
all queries and had all necessary indexes in place.
So I guess at least it works if the amount of data is not very big. I
guess (and I can be severely wrong) that the memory problem can also be
overcome by using something like the localmemcache gem, which can dump this
data in an arbitrary memory location not handled by the Ruby process.

Piyush
 
