Efficient Data Storage

Discussion in 'Perl Misc' started by Aaron DeLoach, Sep 8, 2004.

  1. Hello,

    I have run into unfamiliar ground. Some guidance would be appreciated.

    This project has grown from 1,000 or so users to over 50,000 users. The
    project has been an overall success, so it's time to spend a little on the
    investment. Currently, we are getting our own servers (in lieu of ISP shared
    servers) set up with mod_perl and are revisiting a lot of the code to make
    things more efficient. Hopefully, in a month or so we can make the switch.

    At present, each user record is stored in its own file using the
    Data::Dumper module, and the whole project works through the %user = eval
    <FILE> method. User files are stored in directories named after the first
    two characters of the user ID to keep the directories smaller and, in
    theory, make file lookups quicker. The records are read and written this
    way throughout the program.
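
    To make that concrete, here is a minimal sketch of the read path as
    described; the directory layout, names, and the assumption that each file
    holds a Data::Dumper dump of a hash reference are illustrative guesses,
    not our actual code:

    use strict;
    use warnings;

    sub read_user {
        my ($id) = @_;
        my $path = 'users/' . substr($id, 0, 2) . "/$id";  # two-char subdirectory
        open my $fh, '<', $path or die "open $path: $!";
        my $dumped = do { local $/; <$fh> };               # slurp the whole file
        close $fh;
        no strict 'vars';                                  # the dumped text assigns to $VAR1
        no warnings 'once';
        my $href = eval $dumped;                           # file holds "$VAR1 = { ... };"
        die "bad record in $path: $@" if $@;
        return %$href;                                     # my %user = read_user($id);
    }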

    I don't know how much efficiency would be gained by using an alternate
    storage method. Perhaps MySQL? None of us are very familiar with databases,
    although it doesn't seem very hard. We are looking into storing the records
    as binary files which seems promising, but would like some input on the data
    storage/retrieval methods available before we do anything.

    I should mention that the project was first written in Perl and will remain
    that way. Some suggestions were to investigate a different language. But
    that's out of the question for now. We would rather increase efficiency in
    the Perl code. Servers will remain Linux/Apache.

    Any thoughts?
    Aaron DeLoach, Sep 8, 2004
    #1

  2. Aaron DeLoach <> wrote:

    > This project has grown from 1,000 or so users to over 50,000 users.


    > At present, user records are stored each in a single file using the
    > Data::Dumper module



    > I don't know how much efficiency would be gained by using an alternate
    > storage method. Perhaps MySQL?



    Some form of relational database would be an easy way to get
    performance gains over a roll-your-own flat file approach.

    I'd recommend PostgreSQL over MySQL, though.


    > We are looking into storing the records
    > as binary files which seems promising, but would like some input on the data
    > storage/retrieval methods available before we do anything.



    If you use an RDBMS you won't _need_ to do anything with regard
    to storage and retrieval as the DB will handle all of that for you.

    That wheel has been invented and heavily refined, just roll with it! :)
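
    For a flavor of what that looks like from Perl, the per-record work might
    shrink to something like this (a sketch only; the table, columns, and
    connection details are invented for illustration):

    use DBI;

    my $user_id = 'jsmith42';
    my $dbh = DBI->connect('dbi:Pg:dbname=myapp', 'appuser', 'secret',
                           { RaiseError => 1, AutoCommit => 1 });

    # Fetch one user record as a hash reference.
    my $user = $dbh->selectrow_hashref(
        'SELECT * FROM users WHERE user_id = ?', undef, $user_id);

    # Update a single field instead of rewriting the whole record.
    $dbh->do('UPDATE users SET last_login = now() WHERE user_id = ?',
             undef, $user_id);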


    > Any thoughts?



    Use an RDBMS.


    --
    Tad McClellan SGML consulting
    Perl programming
    Fort Worth, Texas
    Tad McClellan, Sep 8, 2004
    #2

  3. >>>>> "AD" == Aaron DeLoach <> writes:

    AD> At present, user records are stored each in a single file using
    AD> the Data::Dumper module and the whole project works through the
    AD> %user = eval <FILE> method. User files are stored in directories
    AD> named after the first two characters of the user ID to keep the
    AD> directories smaller, in theory, for quicker searching of files
    AD> (?). The records are read/written throughout the use of the
    AD> program in the method described.

    as tad suggested, a dbms would be a good idea if you want to migrate from
    a flat file. but just using File::Slurp will get you some immediate
    speedups over <FILE> with almost no code changes.

    also, switching from Data::Dumper to Storable will speed things up and
    requires minimal code changes. try those before you make the leap to a
    dbms.
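
    roughly, the two swaps look like this (the path, hash contents, and file
    layout are made up for illustration):

    use File::Slurp qw(read_file write_file);
    use Storable qw(store retrieve);
    use Data::Dumper;

    my $path = 'jsmith42.rec';                  # stand-in for users/js/jsmith42
    my %user = (name => 'J. Smith', logins => 42);

    # File::Slurp in place of the <FILE> loop: one call each way.
    write_file($path, Dumper(\%user));
    my $dumped = read_file($path);

    # Storable in place of Data::Dumper + eval: binary, and much quicker.
    store(\%user, $path);
    my $user_href = retrieve($path);            # no eval needed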

    uri

    --
    Uri Guttman ------ -------- http://www.stemsystems.com
    --Perl Consulting, Stem Development, Systems Architecture, Design and Coding-
    Search or Offer Perl Jobs ---------------------------- http://jobs.perl.org
    Uri Guttman, Sep 8, 2004
    #3
  4. Aaron DeLoach wrote:

    > At present, user records are stored each in a single file using the
    > Data::Dumper module and the whole project works through the %user = eval
    > <FILE> method.


    This suggests a minor tweak that could result in big gains under
    mod_perl. Under traditional CGI, the file needs to be read and eval()'d
    for each hit on the CGI.

    Reducing the time it takes to read a user record is a good idea, but
    with mod_perl you can also reduce the number of times a record is read.
    You could take advantage of mod_perl's persistent environment here; keep
    a hash of user records, and use an "Orcish Maneuver" to read and eval a
    record only if the record you want is currently undef:

    $users{$this_user} |= get_user($this_user);

    The same sort of thing can be done for output templates, XSLT
    transformer objects, and more. It's a very common technique for writing
    mod_perl optimized code - Google for "Orcish Maneuver" for many examples.

    There are naturally trade-offs to consider too. For example, if the file
    has changed, the new data won't be read until the next time a new server
    instance spawns. If your traffic is very high, and your server instances
    have a lifetime measured in seconds, that may not be a problem. If not,
    you might need a more involved conditional that also checks for the age
    of the file, instead of the simplistic |= used above.
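
    A rough sketch of that more involved check, written in terms of the
    hypothetical get_user() from the snippet above, with the mtime
    bookkeeping added purely by way of example:

    my (%users, %mtimes);

    sub cached_user {
        my ($id)  = @_;
        my $path  = 'users/' . substr($id, 0, 2) . "/$id";
        my $mtime = (stat $path)[9];

        # Re-read only if nothing is cached or the file changed on disk.
        if (   !defined $users{$id}
            || !defined $mtimes{$id}
            || $mtimes{$id} != $mtime )
        {
            $users{$id}  = get_user($id);
            $mtimes{$id} = $mtime;
        }
        return $users{$id};
    }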

    sherm--

    --
    Cocoa programming in Perl: http://camelbones.sourceforge.net
    Hire me! My resume: http://www.dot-app.org
    Sherm Pendley, Sep 8, 2004
    #4
  5. At 2004-09-08 04:36PM, Sherm Pendley <> wrote:
    [...]
    > a hash of user records, and use an "Orcish Maneuver" to read and eval a
    > record only if the record you want is currently undef:
    >
    > $users{$this_user} |= get_user($this_user);


    you mean:
    $users{$this_user} ||= get_user($this_user);


    --
    Glenn Jackman
    NCF Sysadmin
    Glenn Jackman, Sep 9, 2004
    #5
  6. Glenn Jackman wrote:

    > you mean:
    > $users{$this_user} ||= get_user($this_user);


    Yes, of course. Dang fingers don't always type what they're told to
    type... :-(

    sherm--

    --
    Cocoa programming in Perl: http://camelbones.sourceforge.net
    Hire me! My resume: http://www.dot-app.org
    Sherm Pendley, Sep 9, 2004
    #6
  7. "Aaron DeLoach" <> wrote:
    > Hello,
    >
    > I have run into unfamiliar ground. Some guidance would be appreciated.
    >
    > This project has grown from 1,000 or so users to over 50,000 users. The
    > project has been an overall success, so it's time to spend a little on
    > the investment.


    It makes a huge difference whether those 50,000 users access one cgi page
    per week, on average, or one cgi page per minute.

    > Currently, we are getting our own servers (in lieu of ISP
    > shared servers) setup with mod_perl and are revisiting a lot of the code
    > to make things more efficient. Hopefully, in a month or so we can make
    > the switch.


    Do you have specific performance complaints? One should keep general
    efficiency in mind, but it is better to focus on specific problems if they
    exist.

    > At present, user records are stored each in a single file using the
    > Data::Dumper module and the whole project works through the %user = eval
    > <FILE> method.


    How big are these files?

    > User files are stored in directories named after the first
    > two characters of the user ID to keep the directories smaller, in theory,
    > for quicker searching of files (?). The records are read/written
    > throughout the use of the program in the method described.


    Hopefully there are a few subroutines which are invoked throughout the
    program to cause the files to be read or written. If it is raw IO code,
    rather than subroutine calls, that is scattered throughout the program,
    then any changes will be difficult. In that case, the first thing I would
    do is leave the actual physical storage the same, but consolidate all the
    IO into a few subroutines, so that you can just swap out subroutines to
    test different methods.
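
    For instance (purely illustrative names; a coderef table is just one easy
    way to make the swap):

    use Storable qw(store retrieve);

    my %io = (
        # Swap these coderefs for an eval-of-Data::Dumper pair, a DBI pair,
        # etc., and time each variant with the same calling code.
        read  => sub { retrieve($_[0]) },
        write => sub { store($_[1], $_[0]) },
    );

    sub load_record { my ($path) = @_;        return $io{read}->($path) }
    sub save_record { my ($path, $href) = @_; return $io{write}->($path, $href) }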


    > I don't know how much efficiency would be gained by using an alternate
    > storage method. Perhaps MySQL?


    My gut feeling is that it would not lead to large performance
    improvements if all you do is use MySQL instead of the file system
    as a bit bucket to store your Data::Dumper strings. Especially if
    your server has a lot of memory and aggressively caches the FS.


    > None of us are very familiar with
    > databases, although it doesn't seem very hard. We are looking into
    > storing the records as binary files which seems promising, but would like
    > some input on the data storage/retrieval methods available before we do
    > anything.


    By binary files, do you mean using Storable rather than Data::Dumper?

    I would expect that to make more of a performance difference than the
    MySQL vs file system.

    > I should mention that the project was first written in Perl and will
    > remain that way. Some suggestions were to investigate a different
    > language. But that's out of the question for now. We would rather
    > increase efficiency in the Perl code. Servers will remain Linux/Apache.
    >
    > Any thoughts?


    I'd spend some time investigating where the time is going now.
    Make a script that does something like:

    use strict;
    use blahblah;
    ## all the other preliminaries that your real programs have to go through.

    exit if $ARGV[0] == 1;

    my $data = load_file_for_user($ARGV[1]);
    exit if $ARGV[0] == 2;

    my $user_ref = eval $data;
    exit if $ARGV[0] == 3;

    Do_whatever_your_most_common_task_is($user_ref);
    exit if $ARGV[0] == 4;
    ### etc.


    then write another program:

    foreach my $level (1..4) {
        my $start = time;                    # time each level separately
        foreach (1..1000) {
            my $u = randomly_chosen_user();
            system "./first_program.pl $level $u" and die "child failed: $?";
        }
        print "Level $level: ", time - $start, " seconds\n";
    }

    if level 1 time is almost as big as level 4 time, then the overhead
    of compilation and startup is your biggest problem. etc.

    Xho

    --
    -------------------- http://NewsReader.Com/ --------------------
    Usenet Newsgroup Service $9.95/Month 30GB
    Xho, Sep 9, 2004
    #7
  8. [...]

    >> This project has grown from 1,000 or so users to over 50,000 users. The
    >> project has been an overall success, so it's time to spend a little on
    >> the investment.

    >
    > It makes a huge difference whether those 50,000 users access one cgi page
    > per week, on average, or one cgi page per minute.


    On average, there are approximately 15,000 'hits' each day. These hits go
    through two different CGI programs, each of which uses several different
    sub-programs.

    >> Currently, we are getting our own servers (in lieu of ISP
    >> shared servers) setup with mod_perl and are revisiting a lot of the code
    >> to make things more efficient. Hopefully, in a month or so we can make
    >> the switch.

    >
    > Do you have specific performance complaints? One should keep general
    > efficiency in mind, but it is better focus on specific problems if they
    > exist.


    We've noticed some performance issues as the user base grows, and wish to
    regain some by moving from the ISP to our own server(s) where we can benefit
    from unavailable modules/configurations (mod_perl, etc.). Our ISP is working
    with us to smooth the transition and share knowledge. They're a local
    company, and we were one of their first customers.

    >> At present, user records are stored each in a single file using the
    >> Data::Dumper module and the whole project works through the %user = eval
    >> <FILE> method.

    >
    > How big are these files?


    They average about 4 KB. However, we plan to store some additional
    information in them to eliminate calls to slower methods/modules, which
    will bring the average to 10 KB or so.

    > Hopefully there are a few subroutines which are invoked throughout the
    > program to cause the files to be read or written. If it is IO code, not
    > subroutines, which are throughout the program, than any changes will be
    > difficult. Then the first thing I would do is leave the actual physical
    > storage the same, but consildate all the IO into a few subroutines, so
    > that
    > you can just swap out subroutines to test different methods.


    The entire record (file) is eval(ed) into a hashref at the beginning of
    the program(s). The user actions are performed on the hashref until the
    end of the session, and the hashref is then written back to the file
    (replacing the original contents). Writing is done through the
    Data::Dumper module. Uri has suggested different methods, which we're
    looking into.
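
    The write-back step is essentially this (a sketch; the path scheme and
    names are simplified for illustration):

    use Data::Dumper;

    sub write_user {
        my ($id, $user_href) = @_;
        my $path = 'users/' . substr($id, 0, 2) . "/$id";
        open my $fh, '>', $path or die "open $path: $!";
        print {$fh} Dumper($user_href);     # writes a fresh "$VAR1 = { ... };"
        close $fh or die "close $path: $!";
    }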

    >> I don't know how much efficiency would be gained by using an alternate
    >> storage method. Perhaps MySQL?

    >
    > My gut feeling is that it would not lead to large performance
    > improvements if all you do is use MySQL instead of the file system
    > as a bit bucket to store your Data::Dumper strings. Especially if
    > your server has a lot of memory and aggresively caches the FS.


    Your gut feeling matches our initial research into the DB server/methods.
    A larger FS cache would offset any gains from moving to a database.
    Memory is not going to be an issue; it seems the mod_perl requirements
    are going to govern that anyway.

    >> None of us are very familiar with
    >> databases, although it doesn't seem very hard. We are looking into
    >> storing the records as binary files which seems promising, but would like
    >> some input on the data storage/retrieval methods available before we do
    >> anything.

    >
    > By binary files, do you mean using Storable rather than Data::Dumper?


    Uri pointed us to the Storable module. Until then, we only knew of the
    approach in general terms.

    > I would expect that to make more of a performance difference than the
    > MySQL vs file system.


    I'm glad to hear that. It's the way we were hoping to go.

    > I'd spend some time investigating where the time is going now.
    > Make a script that does something like:
    >
    > use strict;
    > use blahblah;
    > ## all the other preliminaries that your real programs have to go through.
    >
    > exit if $ARGV[0] == 1;
    >
    > my $data = load_file_for_user($ARGV[1]);
    > exit if $ARGV[0] == 2;
    >
    > my $user_ref = eval $data;
    > exit if $ARGV[0] == 3;
    >
    > Do_whatever_your_most_common_task_is($user_ref);
    > exit if $ARGV[0] == 4;
    > ### etc.
    >
    >
    > then write another program:
    >
    > foreach my $level (1..4) {
    >     my $start = time;                    # time each level separately
    >     foreach (1..1000) {
    >         my $u = randomly_chosen_user();
    >         system "./first_program.pl $level $u" and die "child failed: $?";
    >     }
    >     print "Level $level: ", time - $start, " seconds\n";
    > }
    >
    > if level 1 time is almost as big as level 4 time, then the overhead
    > of compilation and startup is your biggest problem. etc.


    Hmmm... I will update you on the results...

    Thanks!
    Aaron DeLoach, Sep 10, 2004
    #8
