Data structure problem to solve Linux memory fault

Discussion in 'Perl Misc' started by dalyea@gmail.com, May 10, 2007.

  1. Guest

    I am running into a memory fault error when processing the
    contents of a database table into a lookup hash.

    The problem is that I store item equivalencies (IE) in a db table IE
    with primary key (a, b). a and b are integers and correspond
    to productId values.

    For IE products a and b, I insert as primary key (a, b) and, notably,
    leave out (b, a). This obviously reduces the IE table size by 50%.

    To build a single lookup data structure prior to writing to a lookup
    file,
    I try to create hash %ie which for each row has:

    $ie{a}{b}=1;
    $ie{b}{a}=1;

    Then to get all the IE for a product a, one just looks at the slice
    %ie{a} and gets all the values b1, b2, b3, ... for it.

    The IE table has 1.4 million rows for 80,000 products, and sometimes
    (b, a) is written [which is not a problem really]. When I create %ie,
    it must
    be upwards of 2.7 million rows, and hence, I am getting a memory fault
    error.

    The problem is, can I use a better (meaning smaller) data structure
    to capture the IE data?

    I could lookup each product's IE data as needed, but that would be
    80,000+ single queries to the database. I don't even want to try
    that.

    Also, my algorithm for matching products as IE pretty much puts
    all combinations of (a, b, c, d) in the table for products a b c d.
    There are 4c2 or 6 combinations, not counting order, but up to 12
    combinations if pairs such as (a, b) and (b, a) are entered. But it
    is
    possible that (a, b) and (b, c) exist, but not (a, c). Therefore, my
    %ie
    structure is technically incomplete as it is today. (Though I know
    from the way I entered IE data that is probably 99% complete.)
    The point is, a single query won't necessarily get all the IE data
    for a single product. The whole data structure should be pre-loaded
    all at once, as I'm trying to do.

    Any ideas?

    Thanks,
    David
    , May 10, 2007
    #1
    1. Advertising

  2. Guest

    "" <> wrote:
    > I am running into a memory fault error when processing the
    > contents of a database table into a lookup hash.
    >
    > The problem is that I store item equivalencies (IE) in a db table IE
    > with primary key (a, b). a and b are integers and correspond
    > to productId values.
    >
    > For IE products a and b, I insert as primary key (a, b) and, notably,
    > leave out (b, a). This obviously reduces the IE table size by 50%.


    Is that important? If your data is so large that you need to resort to
    space saving tricks even when storing it in a disk-based database, then
    trying to load the whole thing into memory in an uncompressed format
    seems to be a losing proposition.

    >
    > To build a single lookup data structure prior to writing to a lookup
    > file,


    Why? Database systems are designed to efficiently handle such tasks.
    If you need to take the data *out* of a database to make a lookup file,
    then what is the point of having it in a database in the first place?

    > I try to create hash %ie which for each row has:
    >
    > $ie{a}{b}=1;
    > $ie{b}{a}=1;
    >
    > Then to get all the IE for a product a, one just looks at the slice
    > %ie{a} and gets all the values b1, b2, b3, ... for it.
    >
    > The IE table has 1.4 million rows for 80,000 products, and sometimes
    > (b, a) is written [which is not a problem really]. When I create %ie,
    > it must
    > be upwards of 2.7 million rows, and hence, I am getting a memory fault
    > error.


    2.7e6/80000 = almost 34 equivalences per item on average. Is that right?

    >
    > The problem is, can I use a better (meaning smaller) data structure
    > to capture the IE data?


    Yes, but perhaps you should try to implement that structure in the
    database, rather than Perl's memory. Probably the best way would be to
    define a canonical identifier to each equivalence class. For example, just
    pick the alphabetically first member of each class to be the canonical id
    of that class.

    In Perl , you would have two top-level hashes. One would be a simple
    single-level hash which would map each item ID to the canonical ID for the
    class it belongs to, and would have 80,000 entries (the number of items).
    The other would have one entry for each canonical item, and the value would
    be a reference to either a hash or an array with all the other items in
    that class. For example, if you have A1, A2, A3 and B1, B2, B3, with the
    As being equivalent to each other and the Bs being equivalent to each
    other, you would have something like:

    my %equiv=(
    'A3' => 'A1',
    'B3' => 'B1',
    'B1' => 'B1',
    'A1' => 'A1',
    'B2' => 'B1',
    'A2' => 'A1',
    );
    my %class=(
    'B1' => {
    'B3' => undef,
    'B1' => undef,
    'B2' => undef,
    },
    'A1' => {
    'A3' => undef,
    'A1' => undef,
    'A2' => undef,
    },
    );


    > I could lookup each product's IE data as needed, but that would be
    > 80,000+ single queries to the database. I don't even want to try
    > that.


    If you are going to read 1.4 million rows anyway, I don't see why issuing
    80,000 queries would be out of the question.

    >
    > Also, my algorithm for matching products as IE pretty much puts
    > all combinations of (a, b, c, d) in the table for products a b c d.


    If you are motivated to save 50% by only storing one ordering of each pair,
    it seems silly to use many times as much space to store each pairing
    separately. You are straining at a gnat but swallowing a camel.

    Xho

    --
    -------------------- http://NewsReader.Com/ --------------------
    Usenet Newsgroup Service $9.95/Month 30GB
    , May 10, 2007
    #2
    1. Advertising

  3. Dr.Ruud Guest

    schreef:

    > To build a single lookup data structure prior to writing to a lookup
    > file, I try to create hash %ie which for each row has:
    >
    > $ie{a}{b}=1;
    > $ie{b}{a}=1;


    No need to add both, no need to give it a real value. Just make sure
    that a < b, and let the value be undefined.


    > Then to get all the IE for a product a, one just looks at the slice
    > %ie{a} and gets all the values b1, b2, b3, ... for it.


    So you also need a hash with a > b. (I am assuming that a and b are
    never equal.)


    > The problem is, can I use a better (meaning smaller) data structure
    > to capture the IE data?


    Use a proper database, with both (a,b) and (b,a) as index, and guard at
    insert/update that a <= b.

    > The point is, a single query won't necessarily get all the IE data
    > for a single product. The whole data structure should be pre-loaded
    > all at once, as I'm trying to do.


    SELECT t1.a AS a, t1.b AS b, t2.b AS c FROM ab AS t1 JOIN ab AS t2 ON
    t1.b = t2.a WHERE ...

    --
    Affijn, Ruud

    "Gewoon is een tijger."
    Dr.Ruud, May 11, 2007
    #3
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. Baby Lion
    Replies:
    30
    Views:
    1,276
    =?ISO-8859-1?Q?Arne_Vajh=F8j?=
    Oct 6, 2006
  2. Replies:
    2
    Views:
    586
  3. sdinu96

    Please solve this(Structure Program)

    sdinu96, May 20, 2010, in forum: C Programming
    Replies:
    1
    Views:
    286
    barati21
    May 22, 2010
  4. jaswinder
    Replies:
    15
    Views:
    3,986
    Barry Schwarz
    Aug 14, 2010
  5. A
    Replies:
    27
    Views:
    1,574
    Jorgen Grahn
    Apr 17, 2011
Loading...

Share This Page