Data structure problem to solve Linux memory fault

D

dalyea

I am running into a memory fault error when processing the
contents of a database table into a lookup hash.

The problem is that I store item equivalencies (IE) in a db table IE
with primary key (a, b). a and b are integers and correspond
to productId values.

For IE products a and b, I insert as primary key (a, b) and, notably,
leave out (b, a). This obviously reduces the IE table size by 50%.

To build a single lookup data structure prior to writing to a lookup
file,
I try to create hash %ie which for each row has:

$ie{a}{b}=1;
$ie{b}{a}=1;

Then to get all the IE for a product a, one just looks at the slice
%ie{a} and gets all the values b1, b2, b3, ... for it.

The IE table has 1.4 million rows for 80,000 products, and sometimes
(b, a) is written [which is not a problem really]. When I create %ie,
it must
be upwards of 2.7 million rows, and hence, I am getting a memory fault
error.

The problem is, can I use a better (meaning smaller) data structure
to capture the IE data?

I could lookup each product's IE data as needed, but that would be
80,000+ single queries to the database. I don't even want to try
that.

Also, my algorithm for matching products as IE pretty much puts
all combinations of (a, b, c, d) in the table for products a b c d.
There are 4c2 or 6 combinations, not counting order, but up to 12
combinations if pairs such as (a, b) and (b, a) are entered. But it
is
possible that (a, b) and (b, c) exist, but not (a, c). Therefore, my
%ie
structure is technically incomplete as it is today. (Though I know
from the way I entered IE data that is probably 99% complete.)
The point is, a single query won't necessarily get all the IE data
for a single product. The whole data structure should be pre-loaded
all at once, as I'm trying to do.

Any ideas?

Thanks,
David
 
X

xhoster

I am running into a memory fault error when processing the
contents of a database table into a lookup hash.

The problem is that I store item equivalencies (IE) in a db table IE
with primary key (a, b). a and b are integers and correspond
to productId values.

For IE products a and b, I insert as primary key (a, b) and, notably,
leave out (b, a). This obviously reduces the IE table size by 50%.

Is that important? If your data is so large that you need to resort to
space saving tricks even when storing it in a disk-based database, then
trying to load the whole thing into memory in an uncompressed format
seems to be a losing proposition.
To build a single lookup data structure prior to writing to a lookup
file,

Why? Database systems are designed to efficiently handle such tasks.
If you need to take the data *out* of a database to make a lookup file,
then what is the point of having it in a database in the first place?
I try to create hash %ie which for each row has:

$ie{a}{b}=1;
$ie{b}{a}=1;

Then to get all the IE for a product a, one just looks at the slice
%ie{a} and gets all the values b1, b2, b3, ... for it.

The IE table has 1.4 million rows for 80,000 products, and sometimes
(b, a) is written [which is not a problem really]. When I create %ie,
it must
be upwards of 2.7 million rows, and hence, I am getting a memory fault
error.

2.7e6/80000 = almost 34 equivalences per item on average. Is that right?
The problem is, can I use a better (meaning smaller) data structure
to capture the IE data?

Yes, but perhaps you should try to implement that structure in the
database, rather than Perl's memory. Probably the best way would be to
define a canonical identifier to each equivalence class. For example, just
pick the alphabetically first member of each class to be the canonical id
of that class.

In Perl , you would have two top-level hashes. One would be a simple
single-level hash which would map each item ID to the canonical ID for the
class it belongs to, and would have 80,000 entries (the number of items).
The other would have one entry for each canonical item, and the value would
be a reference to either a hash or an array with all the other items in
that class. For example, if you have A1, A2, A3 and B1, B2, B3, with the
As being equivalent to each other and the Bs being equivalent to each
other, you would have something like:

my %equiv=(
'A3' => 'A1',
'B3' => 'B1',
'B1' => 'B1',
'A1' => 'A1',
'B2' => 'B1',
'A2' => 'A1',
);
my %class=(
'B1' => {
'B3' => undef,
'B1' => undef,
'B2' => undef,
},
'A1' => {
'A3' => undef,
'A1' => undef,
'A2' => undef,
},
);

I could lookup each product's IE data as needed, but that would be
80,000+ single queries to the database. I don't even want to try
that.

If you are going to read 1.4 million rows anyway, I don't see why issuing
80,000 queries would be out of the question.
Also, my algorithm for matching products as IE pretty much puts
all combinations of (a, b, c, d) in the table for products a b c d.

If you are motivated to save 50% by only storing one ordering of each pair,
it seems silly to use many times as much space to store each pairing
separately. You are straining at a gnat but swallowing a camel.

Xho
 
D

Dr.Ruud

(e-mail address removed) schreef:
To build a single lookup data structure prior to writing to a lookup
file, I try to create hash %ie which for each row has:

$ie{a}{b}=1;
$ie{b}{a}=1;

No need to add both, no need to give it a real value. Just make sure
that a < b, and let the value be undefined.

Then to get all the IE for a product a, one just looks at the slice
%ie{a} and gets all the values b1, b2, b3, ... for it.

So you also need a hash with a > b. (I am assuming that a and b are
never equal.)

The problem is, can I use a better (meaning smaller) data structure
to capture the IE data?

Use a proper database, with both (a,b) and (b,a) as index, and guard at
insert/update that a <= b.
The point is, a single query won't necessarily get all the IE data
for a single product. The whole data structure should be pre-loaded
all at once, as I'm trying to do.

SELECT t1.a AS a, t1.b AS b, t2.b AS c FROM ab AS t1 JOIN ab AS t2 ON
t1.b = t2.a WHERE ...
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,764
Messages
2,569,564
Members
45,039
Latest member
CasimiraVa

Latest Threads

Top