Efficient Data Storage


Aaron DeLoach

Hello,

I have run into unfamiliar ground. Some guidance would be appreciated.

This project has grown from 1,000 or so users to over 50,000 users. The
project has been an overall success, so it's time to spend a little on the
investment. Currently, we are getting our own servers (in lieu of ISP shared
servers) set up with mod_perl and are revisiting a lot of the code to make
things more efficient. Hopefully, in a month or so we can make the switch.

At present, each user record is stored in a single file using the
Data::Dumper module, and the whole project works through the %user = eval
<FILE> method. User files are stored in directories named after the first
two characters of the user ID to keep the directories smaller, in theory
for quicker file lookups (?). The records are read/written throughout the
program using the method described.
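In rough outline, the read and write helpers look something like this (the
path and helper names are just illustrative):

use strict;
use warnings;
use Data::Dumper;

# Sketch of the current scheme: one file per user, bucketed by the
# first two characters of the user ID, holding a Data::Dumper image
# of the %user hash.
my $BASE = '/data/users';                  # made-up location

sub user_path {
    my ($id) = @_;
    return "$BASE/" . substr($id, 0, 2) . "/$id";
}

sub load_user {
    my ($id) = @_;
    open my $fh, '<', user_path($id) or die "Can't read record for $id: $!";
    my $dump = do { local $/; <$fh> };     # slurp the whole file
    my %user;
    eval "$dump; 1" or die "Bad record for $id: $@";   # dump text is "%user = ( ... );"
    return \%user;
}

sub save_user {
    my ($id, $user) = @_;
    open my $fh, '>', user_path($id) or die "Can't write record for $id: $!";
    print {$fh} Data::Dumper->Dump([$user], ['*user']);   # writes "%user = ( ... );"
    close $fh or die "Can't close record for $id: $!";
}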

I don't know how much efficiency would be gained by using an alternate
storage method. Perhaps MySQL? None of us are very familiar with databases,
although it doesn't seem very hard. We are looking into storing the records
as binary files which seems promising, but would like some input on the data
storage/retrieval methods available before we do anything.

I should mention that the project was first written in Perl and will remain
that way. Some suggestions were to investigate a different language. But
that's out of the question for now. We would rather increase efficiency in
the Perl code. Servers will remain Linux/Apache.

Any thoughts?
 

Tad McClellan

Aaron DeLoach said:
This project has grown from 1,000 or so users to over 50,000 users.
At present, user records are stored each in a single file using the
Data::Dumper module

I don't know how much efficiency would be gained by using an alternate
storage method. Perhaps MySQL?


Some form of relational database would be an easy way to get
performance gains over a roll-your-own flat file approach.

I'd recommend PostgreSQL over MySQL though.

We are looking into storing the records
as binary files which seems promising, but would like some input on the data
storage/retrieval methods available before we do anything.


If you use an RDBMS you won't _need_ to do anything with regard
to storage and retrieval as the DB will handle all of that for you.
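For instance, with DBI and DBD::Pg the whole load/save layer shrinks to a
couple of queries. The table and column names below are made up, but the
shape is about right:

use strict;
use warnings;
use DBI;

# Hypothetical 'users' table: user_id is the primary key, the other
# columns hold whatever the user record holds today.
my $dbh = DBI->connect('dbi:Pg:dbname=myapp', 'appuser', 'secret',
                       { RaiseError => 1, AutoCommit => 1 });

sub load_user {
    my ($id) = @_;
    return $dbh->selectrow_hashref(
        'SELECT * FROM users WHERE user_id = ?', undef, $id);
}

sub save_user {
    my ($user) = @_;
    $dbh->do('UPDATE users SET email = ?, visits = ? WHERE user_id = ?',
             undef, @{$user}{qw(email visits user_id)});
}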

That wheel has been invented and heavily refined, just roll with it! :)

Any thoughts?


Use an RDBMS.
 

Uri Guttman

AD> At present, user records are stored each in a single file using
AD> the Data::Dumper module and the whole project works through the
AD> %user = eval <FILE> method. User files are stored in directories
AD> named after the first two characters of the user ID to keep the
AD> directories smaller, in theory, for quicker searching of files
AD> (?). The records are read/written throughout the use of the
AD> program in the method described.

as tad suggested a dbms would be a good idea if you want to migrate from
a flat file. but just using File::Slurp will get you some immediate
speedups over <FILE> with almost no code changes.

changing from Data::Dumper to Storable will also speed things up, again
with minimal code changes. try those before you make the leap to a dbms.
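a rough sketch of the Storable side (helper names are made up; File::Slurp's
read_file can replace the <FILE> slurp in the same way):

use strict;
use warnings;
use Storable qw(nstore retrieve);

# with Storable the record is a binary image, so a read is a single
# retrieve() with no eval/parse step.
sub load_user {
    my ($path) = @_;
    return retrieve($path);        # returns the hashref that was stored
}

sub save_user {
    my ($path, $user) = @_;
    nstore($user, $path);          # network byte order, portable across boxes
}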

uri
 

Sherm Pendley

Aaron said:
At present, user records are stored each in a single file using the
Data::Dumper module and the whole project works through the %user = eval
<FILE> method.

This suggests a minor tweak that could result in big gains under
mod_perl. Under traditional CGI, the file needs to be read and eval()'d
for each hit on the CGI.

Reducing the time it takes to read a user record is a good idea, but
with mod_perl you can also reduce the number of times a record is read.
You could take advantage of mod_perl's persistent environment here; keep
a hash of user records, and use an "Orcish Maneuver" to read and eval a
record only if the record you want is currently undef:

$users{$this_user} |= get_user($this_user);

The same sort of thing can be done for output templates, XSLT
transformer objects, and more. It's a very common technique for writing
mod_perl optimized code - Google for "Orcish Maneuver" for many examples.

There are naturally trade-offs to consider too. For example, if the file
has changed, the new data won't be read until the next time a new server
instance spawns. If your traffic is very high, and your server instances
have a lifetime measured in seconds, that may not be a problem. If not,
you might need a more involved conditional that also checks for the age
of the file, instead of the simplistic |= used above.
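A sketch of that more involved version, assuming a get_user() loader and a
user_path() helper (both hypothetical), and using the file's mtime to spot
stale entries:

# Per-child cache that persists across requests under mod_perl.
my %users;    # user_id => { mtime => ..., record => ... }

sub cached_user {
    my ($id) = @_;
    my $mtime = (stat user_path($id))[9];         # current file timestamp

    my $slot = $users{$id};
    if (!$slot || $slot->{mtime} != $mtime) {     # missing or stale: (re)load
        $slot = $users{$id} = { mtime => $mtime, record => get_user($id) };
    }
    return $slot->{record};
}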

sherm--
 

Glenn Jackman

At 2004-09-08 04:36PM, Sherm Pendley said:
a hash of user records, and use an "Orcish Maneuver" to read and eval a
record only if the record you want is currently undef:

$users{$this_user} |= get_user($this_user);

you mean:
$users{$this_user} ||= get_user($this_user);
 

Sherm Pendley

Glenn said:
you mean:
$users{$this_user} ||= get_user($this_user);

Yes, of course. Dang fingers don't always type what they're told to
type... :-(

sherm--
 

ctcgag

Aaron DeLoach said:
Hello,

I have run into unfamiliar ground. Some guidance would be appreciated.

This project has grown from 1,000 or so users to over 50,000 users. The
project has been an overall success, so it's time to spend a little on
the investment.

It makes a huge difference whether those 50,000 users access one cgi page
per week, on average, or one cgi page per minute.
Currently, we are getting our own servers (in lieu of ISP
shared servers) setup with mod_perl and are revisiting a lot of the code
to make things more efficient. Hopefully, in a month or so we can make
the switch.

Do you have specific performance complaints? One should keep general
efficiency in mind, but it is better to focus on specific problems if they
exist.
At present, user records are stored each in a single file using the
Data::Dumper module and the whole project works through the %user = eval
<FILE> method.

How big are these files?
User files are stored in directories named after the first
two characters of the user ID to keep the directories smaller, in theory,
for quicker searching of files (?). The records are read/written
throughout the use of the program in the method described.

Hopefully there are a few subroutines, invoked throughout the program, that
cause the files to be read or written. If it is raw IO code, rather than
calls to such subroutines, that is scattered throughout the program, then any
changes will be difficult. In that case, the first thing I would do is leave
the actual physical storage the same but consolidate all the IO into a few
subroutines, so that you can just swap out subroutines to test different
methods (see the sketch below).
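Something along these lines, say, where the backend-specific helpers are
hypothetical:

# Everything else in the program calls only load_user()/save_user();
# trying a different storage backend then means touching only this spot.
our $BACKEND = 'dumper';                   # or 'storable', 'dbi', ...

sub load_user {
    my ($id) = @_;
    return $BACKEND eq 'storable' ? load_user_storable($id)
                                  : load_user_dumper($id);
}

sub save_user {
    my ($id, $user) = @_;
    return $BACKEND eq 'storable' ? save_user_storable($id, $user)
                                  : save_user_dumper($id, $user);
}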

I don't know how much efficiency would be gained by using an alternate
storage method. Perhaps MySQL?

My gut feeling is that it would not lead to large performance
improvements if all you do is use MySQL instead of the file system
as a bit bucket to store your Data::Dumper strings. Especially if
your server has a lot of memory and aggressively caches the FS.

None of us are very familiar with
databases, although it doesn't seem very hard. We are looking into
storing the records as binary files which seems promising, but would like
some input on the data storage/retrieval methods available before we do
anything.

By binary files, do you mean using Storable rather than Data::Dumper?

I would expect that to make more of a performance difference than the
choice of MySQL vs. the file system.
I should mention that the project was first written in Perl and will
remain that way. Some suggestions were to investigate a different
language. But that's out of the question for now. We would rather
increase efficiency in the Perl code. Servers will remain Linux/Apache.

Any thoughts?

I'd spend some time investigating where the time is going now.
Make a script that does something like:

use strict;
use blahblah;
## all the other preliminaries that your real programs have to go through.

exit if $ARGV[0] == 1;

my $data = load_file_for_user($ARGV[1]);
exit if $ARGV[0] == 2;

my $user_ref = eval $data;
exit if $ARGV[0] == 3;

Do_whatever_your_most_common_task_is($user_ref);
exit if $ARGV[0] == 4;
### etc.


then write another program:

my $start = time;
foreach my $level (1..4) {
    foreach (1..1000) {
        my $u = randomly_chosen_user();
        system "./first_program.pl $level $u" and die $!;
    }
    print "Level $level, ", time - $start, "\n";
}

If the level 1 time is almost as big as the level 4 time, then the overhead
of compilation and startup is your biggest problem; and so on.

Xho
 

Aaron DeLoach

[...]
It makes a huge difference whether those 50,000 users access one cgi page
per week, on average, or one cgi page per minute.

On average, there are approximately 15,000 'hits' each day. These come
through two different CGI programs, each using several different sub-programs.
Do you have specific performance complaints? One should keep general
efficiency in mind, but it is better to focus on specific problems if they
exist.

We've noticed some performance issues as the user base grows, and hope to
regain some by moving from the ISP to our own server(s), where we can benefit
from modules/configurations not available to us there (mod_perl, etc.). Our ISP is working
with us to smooth the transition and share knowledge. They're a local
company, and we were one of their first customers.
How big are these files?

They average 4 KB. However, we plan to store some additional information in
them to eliminate calls to slower methods/modules, which will bring the
average to 10 KB or so.
Hopefully there are a few subroutines, invoked throughout the program, that
cause the files to be read or written. If it is raw IO code, rather than
calls to such subroutines, that is scattered throughout the program, then any
changes will be difficult. In that case, the first thing I would do is leave
the actual physical storage the same but consolidate all the IO into a few
subroutines, so that you can just swap out subroutines to test different
methods.

The entire record (file) is eval()'ed into a hashref at the beginning of the
program(s). The user's actions are performed on that hashref until the end of
the session, and the hashref is then written back to the file (replacing the
original contents). Writing is done through the Data::Dumper module. Uri
has suggested different methods, which we're looking into.
My gut feeling is that it would not lead to large performance
improvements if all you do is use MySQL instead of the file system
as a bit bucket to store your Data::Dumper strings. Especially if
your server has a lot of memory and aggressively caches the FS.

Your gut feeling matches our initial research into database servers/methods.
Increasing the FS cache would offset any gains from using a database. Memory
is not going to be an issue; it seems the mod_perl requirements are going to
govern that anyway.
By binary files, do you mean using Storable rather than Data::Dumper?

Uri introduced us to the Storable module. Until then, we knew of the
approach only in general terms.
I would expect that to make more of a performance difference than the
choice of MySQL vs. the file system.

I'm glad to hear that. It's the way we were hoping to go.
I'd spend some time investigating where the time is going now.
Make a script that does something like:

use strict;
use blahblah;
## all the other preliminaries that your real programs have to go through.

exit if $ARGV[0] == 1;

my $data = load_file_for_user($ARGV[1]);
exit if $ARGV[0] == 2;

my $user_ref = eval $data;
exit if $ARGV[0] == 3;

Do_whatever_your_most_common_task_is($user_ref);
exit if $ARGV[0] == 4;
### etc.


then write another program:

my $start = time;
foreach my $level (1..4) {
    foreach (1..1000) {
        my $u = randomly_chosen_user();
        system "./first_program.pl $level $u" and die $!;
    }
    print "Level $level, ", time - $start, "\n";
}

If the level 1 time is almost as big as the level 4 time, then the overhead
of compilation and startup is your biggest problem; and so on.

Hmmm... I will update you on the results...

Thanks!
 
