Explaining how memory works with tie()ed hashes

Discussion in 'Perl Misc' started by botfood, Sep 20, 2006.

  1. botfood

    botfood Guest

    I would like to know more about how perl uses memory when managing a
    hash that is tie()ed to a file on disk. Using DB_File and variable
    length records....

    I have an application where the DB file has gotten quite big, not giant,
    but around 20k records and a file about 11MB in size. My
    understanding of DB_File is that while there is no programmatic limit on
    the number of records, there may be memory-driven limits depending on what
    one does with the hash. Looping through the keys is still fast, so the
    number of records doesn't seem to be a problem.

    In this particular case, each record is not that big, except for one
    specific type of 'book-keeping' record that is used to keep track of
    which records are considered 'complete' by this particular application.
    For a couple of reports, I need to whip through all complete records
    searching for various things.... And in another spot I know that the
    code looks for matches within this big record, which contains around 20k
    'words' of about 20 digits each.

    What I am wondering is whether it is likely that the sheer number of
    records eats up large amounts of memory just by being tie()ed, or whether it
    is more likely that this one particular internal index record is
    causing me problems when it gets pulled into memory to do things like
    m// or s// on its contents to find or edit a 'word', which is simply a
    list of the keys having a specific status.
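
    Roughly, the shape of what the code does with that record is something
    like this (a made-up sketch with illustrative names, not the real code):

    use DB_File;
    use Fcntl;

    # hash tied to a Berkeley DB file on disk (variable-length records)
    my %db;
    tie %db, 'DB_File', 'records.db', O_RDWR|O_CREAT, 0644, $DB_HASH
        or die "cannot tie records.db: $!";

    # ordinary records: one key per item, smallish values
    $db{'rec00017'} = 'status=open|other=stuff';

    # the one big 'book-keeping' record: a space-separated list of
    # ~20k 'words', each about 20 digits long
    my $complete = $db{'_complete_index'} || '';

    # reports pull the whole thing into memory and run m// or s// on it
    my @done = split ' ', $complete;                # all 'complete' keys
    $complete =~ s/\b12345678901234567890\b ?//;    # drop one 'word'
    $db{'_complete_index'} = $complete;

    untie %db;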

    The next part of the question is.... if it sounds like a large internal
    indexing record is likely to be the problem, what would some recommended
    techniques be to break that out? Should I create a separate DB file to
    use as an index? I am really wondering how best to 'fake' large
    database capabilities and keep track of status without eating
    tons of memory.

    TIA,
    d
     
    botfood, Sep 20, 2006
    #1

  2. botfood

    -berlin.de Guest

    botfood <> wrote in comp.lang.perl.misc:
    > I would like to know more about how perl uses memory when managing a
    > hash that is tie()ed to a file on disk. Using DB_File and variable
    > length records....


    DB_File is Berkeley DB, so that would primarily be a question about
    storage management in Berkeley DB.

    > I have an application where the DB file has gotten quite big, not giant,
    > but around 20k records and a file about 11MB in size. My


    That's a bit more than 500 bytes per record. I'm not a bit surprised.

    [...]

    > In this particular case, each record is not that big, except for one
    > specific type of 'book-keeping' record that is used to keep track of
    > what records are considered 'complete' by this particular application.
    > With a couple reports, I need to whip thru all complete records
    > searching for various things.... And in another spot I know that the
    > code looks for matches within this big record that contains around 20k
    > 'words' consisting of about 20 digits.


    For an experiment, take the extra long record(s) out of the DB and store
    it/them otherwise. See if it makes a difference. I wouldn't expect so,
    but who knows.

    > what I am wondering is whether it is likely that the simple number of
    > records eats up large amounts of memory just by being tie()ed, or if it


    Tie has nothing to do with disk storage management. That's entirely
    the DB's business.

    > is more likely that this one particular internal index record is
    > causing me problems


    See above.

    >when it gets pulled into memory to do things like
    > m// or s// on its contents to find or edit a 'word' which is simply a
    > list of the keys having a specific status.


    What has "pulling into memory" to do with disk space consumption?

    > The next part of the question is.... if it sounds like a large internal
    > indexing record is likely to be a problem, what would some recommended
    > techniques be to break that out? should I create a separate DB file to
    > use as an index? I am really wondering how best to 'fake' large
    > database capabilities to manage keeping track of status without eating
    > tons of memory.


    Databases are not primarily optimized to be as small as possible but
    to be fast and flexible. Also, they often grow in relatively large
    steps. It could well be that you could keep adding records
    for a long while before your 11MB file grows again. I'd try exactly that:
    add random records and watch how the DB grows while you do. That
    will give you a better idea of the overhead.
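
    Untested sketch of what I mean; the file name and record contents are
    arbitrary:

    use DB_File;
    use Fcntl;

    my %db;
    tie %db, 'DB_File', 'growth_test.db', O_RDWR|O_CREAT, 0644, $DB_HASH
        or die "tie: $!";

    # add dummy records in batches and watch how the file grows
    for my $batch (1 .. 20) {
        for my $i (1 .. 1000) {
            $db{"dummy_${batch}_$i"} = join ' ', map { int rand 1e9 } 1 .. 10;
        }
        (tied %db)->sync;    # flush to disk before checking the size
        printf "%6d records -> %d bytes\n", $batch * 1000, -s 'growth_test.db';
    }
    untie %db;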

    Anno
     
    -berlin.de, Sep 21, 2006
    #2

  3. botfood

    J. Gleixner Guest

    botfood wrote:
    [...]
    > In this particular case, each record is not that big, except for one
    > specific type of 'book-keeping' record that is used to keep track of
    > what records are considered 'complete' by this particular application.
    > With a couple reports, I need to whip thru all complete records
    > searching for various things.... And in another spot I know that the
    > code looks for matches within this big record that contains around 20k
    > 'words' consisting of about 20 digits.


    [...]

    > The next part of the question is.... if it sounds like a large internal
    > indexing record is likely to be a problem, what would some recommended
    > techniques be to break that out? should I create a separate DB file to
    > use as an index? I am really wondering how best to 'fake' large
    > database capabilities to manage keeping track of status without eating
    > tons of memory.



    If I understand your issue correctly, possibly using another key
    for these completed entries, or a separate DBM, would be better.

    For example, if the key were a user name, you could add an additional
    key of ${username}_completed. That could be stored in the same DBM, or
    you could create another "completed" DBM with the user name as its key;
    the 20 digits, or the "various things" you're looking for, could be
    stored wherever it makes sense.

    This way you'd know which user names were completed and could
    easily access the data for each user name in the other table.
    I'd think that'd be much more efficient than a single key of
    'completed' containing all of the user names.

    Try to design it as if it were simply a hash, which is all
    it is. Using a DBM does a good job of managing memory and
    disk space; however, you're responsible for designing the
    keys and records to work well as a hash.
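
    A rough sketch of the second approach (untested; the file and key names
    are only for illustration):

    use DB_File;
    use Fcntl;

    my (%data, %completed);
    tie %data, 'DB_File', 'data.db', O_RDWR|O_CREAT, 0644, $DB_HASH
        or die "tie data.db: $!";
    tie %completed, 'DB_File', 'completed.db', O_RDWR|O_CREAT, 0644, $DB_HASH
        or die "tie completed.db: $!";

    # marking a record complete: one small entry per user in the second DBM,
    # instead of appending to one giant index record
    my $username = 'jsmith';                      # illustrative key
    $completed{$username} = $data{$username};     # or just 1, or the 20-digit value

    # a report walks only the completed records
    while ( my ($user, $value) = each %completed ) {
        # ... whatever the report needs to do with $user / $value
    }

    untie %data;
    untie %completed;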
     
    J. Gleixner, Sep 21, 2006
    #3
  4. botfood

    botfood Guest

    botfood wrote:
    > I would like to know more about how perl uses memory when managing a
    > hash that is tie()ed to a file on disk. Using DB_File and variable
    > length records....

    ....snip
    > In this particular case, each record is not that big, except for one
    > specific type of 'book-keeping' record that is used to keep track of
    > what records are considered 'complete' by this particular application.
    > With a couple reports, I need to whip thru all complete records
    > searching for various things.... And in another spot I know that the
    > code looks for matches within this big record that contains around 20k
    > 'words' consisting of about 20 digits.
    > ------------------------------


    thanks for the comments so far, people.... it's sounding like my suspicion
    was right: the memory problems I am having are more likely to come from
    using a 'record' inside the DB to manage a large list of values, which I
    need to sift, sort, and edit, than from the memory allocated to manage
    the access and paging of the DB itself.

    Allow me to clarify: the nature of the failure SEEMED to be more
    like a limit imposed by the Apache web server running the process,
    rather than a hard limit on the memory available on the machine.

    While I think I am going to try some redesign to extract the hash of
    'complete' records into a separate DB, I am trying to get a handle on a
    short-term fix by increasing the Apache::SizeLimit parameter to
    accommodate the memory use required by the current design.

    so.... the question changes to:

    how can I estimate the memory required by perl for a m// or s//
    operation on a string that is about 20k 'words' consisting of 20 digits
    each separated by a single space.

    thanks,
    d
     
    botfood, Sep 21, 2006
    #4
  5. botfood

    J. Gleixner Guest

    botfood wrote:

    > so.... the question changes to:
    >
    > how can I estimate the memory required by perl for a m// or s//
    > operation on a string that is about 20k 'words' consisting of 20 digits
    > each separated by a single space.


    Why estimate it? Simply run it from the command line, or by some other
    method, maybe adding a fairly long sleep after the point you want to
    measure, and watch the memory usage using top, ps, etc.
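
    For example, something like this on any Linux box (just a sketch; the
    20-digit word is a dummy value):

    $ perl -e 'my $str = join " ", ("12345678901234567890") x 20_000; $str =~ s/234/xyz/g; sleep 300' &
    $ ps -o rss -p $!
    $ top -p $!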
     
    J. Gleixner, Sep 21, 2006
    #5
  6. botfood

    Guest

    "botfood" <> wrote:
    >
    > so.... the question changes to:
    >
    > how can I estimate the memory required by perl for a m// or s//
    > operation on a string that is about 20k 'words' consisting of 20 digits
    > each separated by a single space.


    I just shell out to "ps" (on Linux).

    It seems to be trivial, as long as the regex doesn't degenerate badly.


    $ perl -le 'my $x = join " ", map rand, 1..20_000; $_ =~ s/1[23]45/foobar/g; system "ps -p $$ -o rss";'
    RSS    #(this is the size in kilobytes)
    4340


    $ perl -le 'my $x = join " ", map rand, 1..20_000; $x =~ s/1[23]45/foobar/g; system "ps -p $$ -o rss";'
    RSS
    4476

    So it takes about 136K more to do the substitution than it does just to
    start perl and build the string (plus do a dummy substitution on an empty
    variable).

    Xho

    --
    -------------------- http://NewsReader.Com/ --------------------
    Usenet Newsgroup Service $9.95/Month 30GB
     
    , Sep 21, 2006
    #6
  7. botfood

    botfood Guest

    J. Gleixner wrote:
    > botfood wrote:
    >
    > > so.... the question changes to:
    > >
    > > how can I estimate the memory required by perl for a m// or s//
    > > operation on a string that is about 20k 'words' consisting of 20 digits
    > > each separated by a single space.

    >
    > Why estimate it? Simply run it from the command line, or some other
    > method, maybe adding a fairly long sleep, after the point you want to
    > measure, and watch the memory usage using top, ps, etc.

    ----------------
    the machine running the script is a webserver on a remote host... they
    don't give access to any place I can watch memory.
     
    botfood, Sep 21, 2006
    #7
  8. botfood

    botfood Guest

    wrote:

    > I just shell out to "ps" (on Linux).
    >
    > It seems to be trivial, as long as the regex doesn't degenerate badly.
    >
    >
    > $ perl -le 'my $x = join " ", map rand, 1..20_000; $_ =~ s/1[23]45/foobar/g; system "ps -p $$ -o rss";'
    > RSS    #(this is the size in kilobytes)
    > 4340
    >
    >
    > $ perl -le 'my $x = join " ", map rand, 1..20_000; $x =~ s/1[23]45/foobar/g; system "ps -p $$ -o rss";'
    > RSS
    > 4476
    >
    > So it takes about 136K more to do the substitution than it does just to
    > start perl and build the string (plus do a dummy substitution on an empty
    > variable).
    > -----------------------------------------


    I'm not sure exactly what you did in these little tests to build the test
    string, but I think the estimate is moving in the right direction. The
    exact size value I think might be low, since each of my 20k 'words' is
    20 characters long rather than a random number.
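
    for reference, I suppose a version of that test closer to my data
    (untested, since I can't run ps on the host) would be something like:

    $ perl -le 'my $x = join " ", map { sprintf "%020d", int rand 1e9 } 1..20_000; $x =~ s/1[23]45/foobar/g; system "ps -p $$ -o rss";'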

    Unfortunately I do not have access to the Linux machine, since it is a
    remote web server.....

    d
     
    botfood, Sep 21, 2006
    #8
  9. botfood

    botfood Guest

    Jim Gibson wrote:
    > > ----------------
    > > the machine running the script is a webserver on a remote host... they
    > > don't give access to any place I can watch memory.

    >
    > That makes it tough. However, you can do yourself a favor by setting up
    > a local Perl installation and running your tests on it. My guess is
    > that the regular expression engine doesn't vary very much from platform
    > to platform.
    > --------------------------------


    however, Berkeley DB and memory management are probably different between
    Win32 and Linux, so it wouldn't give me any more than a rough ballpark.
    My test server at home is not Apache (I use Xitami), so I can't really
    emulate the SizeLimit stuff that was a problem on the web server......
    kinda shooting in the dark.

    My best guess at this point is that s// on a string that is 20k 'words' of
    20 characters *probably* eats up more memory than the host server
    wants to allocate for any single process.

    d
     
    botfood, Sep 22, 2006
    #9
  10. botfood

    Guest

    "botfood" <> wrote:
    > Jim Gibson wrote:
    > > > ----------------
    > > > the machine running the script is a webserver on a remote host...
    > > > they don't give access to any place I can watch memory.

    > >
    > > That makes it tough. However, you can do yourself a favor by setting up
    > > a local Perl installation and running your tests on it. My guess is
    > > that the regular expression engine doesn't vary very much from platform
    > > to platform.
    > > --------------------------------

    >
    > however, Berkeley DB and memory management are probably different between
    > Win32 and Linux, so it wouldn't give me any more than a rough ballpark.


    Often a rough ballpark is enough.


    > My test server at home is not Apache (I use Xitami), so I can't really
    > emulate the SizeLimit stuff that was a problem on the web server......
    > kinda shooting in the dark.


    You probably don't need to emulate SizeLimit. You just need to know what
    it is.

    > best guess at this point is that s// on a string that is 20k 'words' of
    > 20 characters *probably* eats up more memory than the host server
    > wants to allocate for any single process.


    I doubt it. Or at least, if this is the case, then you are living so close
    to the edge that any random thing is going to push you over it anyway.

    It is easy enough to write a 5-line CGI which constructs a string of 20k
    words and tries a realistic s// on it; dump it on your provider's server and
    see if it runs afoul of SizeLimit or not. Then try it again with 40k, 60k,
    etc. just to see how much breathing room you have.
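
    Something along these lines, say (an untested sketch; the pattern is just
    a stand-in for your real one):

    #!/usr/bin/perl
    use strict;
    use warnings;

    # build a string of 20k 20-digit 'words' and run a realistic s/// on it;
    # bump $n to 40_000, 60_000, ... to see how much headroom you have
    my $n     = 20_000;
    my $str   = join ' ', map { sprintf "%020d", int rand 1e9 } 1 .. $n;
    my $count = ($str =~ s/1[23]45/foobar/g) || 0;

    print "Content-type: text/plain\n\n";
    print "words: $n, substitutions: $count\n";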

    Xho

    --
    -------------------- http://NewsReader.Com/ --------------------
    Usenet Newsgroup Service $9.95/Month 30GB
     
    , Sep 22, 2006
    #10
