concurrency with DBI questions


Andy

Hi,

I'm trying to look into the best way of speeding up multiple insert or
update statements into a database table. The values to insert or update
are in a file which could contain up to 100,000 records.

My plan was to have multiple children do the database calls and feed
them via a process that just reads the data from a file. I believe using
DBI with threads is a bad idea, and to be honest my current view of Perl
and threads is that they are not stable enough to be used in production
systems, but I'm happy to be told I'm wrong.

So the only other method I can think of is to fork multiple child
processes. Each child will need to create its own database connection,
and then the main process will need to feed in the data for the
database calls.

So questions:

1. Is fork my only option, or am I overlooking a much better solution?

2. How do I produce a "queue" for the forked processes to work from?
The only thing I can think of is using some sort of PIPE; I was actually
looking at "Safe Pipes", documented here:
http://www.fnal.gov/docs/products/perl/pod.new/5.00503/pod/perlipc.html#Safe_Pipe_Opens

Any other ideas?

3. What happens if the child process can't get a database connection
for some reason (the database refuses the connection)? How can it tell
the master not to send requests, as they cannot be processed? If there
was a single queue to work from this would be fine (as the child
wouldn't pick any requests up), but with a queue per child, as there
would be with "Safe Pipes", I think I'm stuck.

4. Do you think I even need to try and make this job concurrent, or
will a single thread be quick enough? Does anyone have any stats on how
many insert/update statements can be run through DBI per second?

Thanks in advance for any help/suggestions.

Cheers

Andy
 

DJ Stunks

Andy said:
Hi,

I'm trying to look into the best way of speeding up multiple insert or
update statements into a database table. The values to insert or update
are in a file which could contain up to 100,000 records.

I can only speak from a MySQL perspective, but the fastest way to load
records (and by a significant margin) is via LOAD DATA INFILE. Since
your records are already in a file, my suggestion would be to use Perl
to massage your input file into a delimited record format suitable for
insertion via LOAD DATA INFILE, and then do a single $dbh->do() call to
load them.

$0.02

-jp
 

xhoster

Andy said:
Hi,

I'm trying to look into the best way of speeding up multiple insert or
update statements into a database table. The values to insert or update
are in a file which could contain up to 100,000 records.

Generally, the fastest way to bulk load data is to use the database's
bulk loading tool, like sqlldr for Oracle or MySQL's LOAD DATA INFILE.
I tend to want to do more transformation/error-checking/etc. than is
convenient with those tools, but they are still pretty good at what
they specialize in.
My plan was to have multiple children to do the database calls and feed
them via a process just reading the data from a file. I believe using
DBI with threads is a bad idea

I wouldn't try using DBI with threads in the first place since fork is
so easy, but other than that I see no reason not to do so, as long as
you do it right (i.e. each thread opens its own independent DB
connection).

So the only other method I can think of is to fork multiple child
processes. Each child will need to create its own database connection
and then the main process will need to feed in the data to do the
database calls

Why have a main process at all? Have all forked processes be peers.
So questions:

1. Is fork my only option, or am I overlooking a much better solution?

No, and maybe. There are other options, and using the specialized bulk
loader that came with your database is likely to be much better.

2. How do I produce a "queue" for the forked processes to work from?
The only thing I can think of is using some sort of PIPE,

warn "transaction logic left to the reader";
my $parallel=4;
my $pm = new Parallel::ForkManager($parallel);

foreach my $id (0..$parallel-1) {
$pm->start and next; # do the fork
my $dbh=Get_connect();
my $sth = $dbh->prepare('whatever');
open my $fh, "input.txt" or die $!;
while (<$fh>) { chomp;
next unless ($. % $parallel == $id);
do_whatever($sth,$_);
};
$sth->finish;
$dbh->disconnect;
$pm->finish; # do the exit in the child process
}
$pm->wait_all_children;


Alternatively, split the file into chunks beforehand (using, say, the
unix "split" util) and give each chunk to a child.
any other ideas

3. What happens if the child process can't get a database connection
for some reason (database refuses connection) ? how can it tell the
master not to send requests as they cannot be processed, if there was a
single queue to work from this would be fine (as the child wouldn't
pick any requests up) but if using a queue per child as I would be
using "Safe Pipes" then I think i'm stuck

No true master, no problem. If a child dies, it logs this fact. You
look into the log, and restart just that one child.
4. Do you think I even need to try and make this job concurrent or
will a single thread be quick enough ?

How could anyone possibly know this but you?
anyone have any stats on how
many insert / update statements can be run through DBI per second ?

It depends on the size of the data to be updated, the contention on the
database, the size of the database server, the speed of the server's
storage system, the structure (indexes and constraints) of the tables,
the commit frequency, etc., etc., etc.

Due to contention on the DB end, splitting the data into multiple streams
may not speed up much at all, and in fact may slow things down. You will
have to try it and see.
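
If you want a number for your own setup, a quick-and-dirty single-stream
timing loop will tell you more than anyone else's stats. The DSN, table
and columns below are invented, so point it at a copy of your real table:

use strict;
use warnings;
use DBI;
use Time::HiRes qw(time);

my $dbh = DBI->connect('dbi:Oracle:mydb', 'user', 'password',
                       { RaiseError => 1, AutoCommit => 0 });
my $sth = $dbh->prepare('INSERT INTO test_copy (id, payload) VALUES (?, ?)');

my $n     = 10_000;
my $start = time;
for my $i (1 .. $n) {
    $sth->execute($i, "row $i");
}
$dbh->commit;
my $elapsed = time - $start;

printf "%d rows in %.2f s (%.0f rows/s)\n", $n, $elapsed, $n / $elapsed;
$dbh->disconnect;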

Xho
 

xhoster

Jim Gibson said:
I do not have much experience with this problem myself, but I do
remember the general advice given for large database inserts to a table
that has one or more indexes (indices?) defined is to delete the index,
do the inserts, and then recreate the index.

This is almost always a bad idea, unless the table is a new one being
populated for the first time. It isn't that bad an idea if the table
stays about the same size because the mass upload is accompanied by a
mass delete of almost all of the old data, but even then there are
better ways to accomplish this. For example, it would generally be
better to copy the data you want to keep into a new table, load the
bulk data into that new table, then build the indexes and rename the
new table to replace the old one.
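
For the curious, that shuffle looks roughly like this through DBI. The
table, column and index names are invented and the RENAME syntax is
Oracle-flavored:

use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('dbi:Oracle:mydb', 'user', 'password',
                       { RaiseError => 1, AutoCommit => 1 });

# 1. Copy the rows worth keeping into a fresh, index-free table.
$dbh->do(q{CREATE TABLE big_table_new AS
           SELECT * FROM big_table WHERE keep_flag = 'Y'});

# 2. Bulk load the new data into big_table_new here (sqlldr or similar).

# 3. Build the indexes once, after all the data is in place.
$dbh->do(q{CREATE INDEX big_table_new_ix1 ON big_table_new (some_col)});

# 4. Swap the tables.
$dbh->do(q{RENAME big_table TO big_table_old});
$dbh->do(q{RENAME big_table_new TO big_table});

$dbh->disconnect;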

I had one person who was oh-so-clever and "optimized" a loading program
by dropping all the indices before a bulk load and remaking them after.
This reduced the overall time to load 300,000 data points into the table
by 20 or 40%, like from 5 minutes to 3 or 4 minutes or something, back when
the table only had 500,000 rows to start with. Fast forward 3 years and
the table has 100,000,000 rows. Just rebuilding the indices afterwards
took over 6 hours--far longer than it took to do the load with the indices
still there. And of course during that time the database would be
completely unusable.
This is faster than
revising the index after each insert. However, it is possible that a
bulk LOAD operation will do the same thing automatically.

Bulk load generally takes a middle road. It updates the indices in
batches, in a way which is faster than doing it one row at a time but
doesn't throw away the vast amount of work it already did to index the
pre-existing data.

Xho
 

Dr.Ruud

Andy wrote:
I'm trying to look into the best way of speeding up multiple insert or
update statements into a database table. The values to insert or
update are in a file which could contain up to 100,000 records.

You need to tell us more. For example: are indexes and constraints
involved?

You talk about inserts and updates on a single table. Isn't the data
normalized?

Often it is best to bulk upload the new data into a single
non-normalized temporary table, and create separate queries for the
updates and the inserts. First make all foreign-key-related tables (if
any) up-to-date with the new data in the temporary table.
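
For example, the foreign-key step might look like this through DBI (the
DSN, table and column names are all invented; the real schema will
differ):

use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('dbi:Oracle:mydb', 'user', 'password',
                       { RaiseError => 1 });

# The raw rows are assumed to be bulk loaded into load_tmp already.
# Add any customer names we have never seen before to the customers
# lookup table, so the main insert can reference them afterwards.
$dbh->do(q{
    INSERT INTO customers (name)
    SELECT DISTINCT t.cust_name
    FROM   load_tmp t
    WHERE  NOT EXISTS (SELECT 1 FROM customers c WHERE c.name = t.cust_name)
});

$dbh->disconnect;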

My plan was to have multiple children to do the database calls and
feed them via a process just reading the data from a file.

No need for that. Just bulk load the data into the temporary table, and
leave the rest to the DBMS.

You can use Perl to prepare the file for the import functionality of the
DBMS, and maybe to create a text-file with all the update and insert
queries.
 

Andy

Hi,

Thanks all for the suggestions. This is an Oracle database, and I've
used sqlldr before using a temp table; in fact, with 9i I can even use
the fancy MERGE command, which takes care of insert or update. However,
the reason I wasn't going this route is that even though I may need to
update/insert 100,000 rows, it may only be a few thousand (a file is
delivered every hour). I was going for the worst-case situation; most
likely it will be around 10-30 thousand records.
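
(For reference, the MERGE I mean looks roughly like this when run
through DBI; the table and column names are made up, and sqlldr would
populate stage_tmp first:)

use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('dbi:Oracle:mydb', 'user', 'password',
                       { RaiseError => 1, AutoCommit => 0 });

# One statement handles both the updates and the inserts.
$dbh->do(q{
    MERGE INTO main_table m
    USING stage_tmp s
    ON (m.id = s.id)
    WHEN MATCHED THEN
        UPDATE SET m.value = s.value
    WHEN NOT MATCHED THEN
        INSERT (id, value) VALUES (s.id, s.value)
});

$dbh->commit;
$dbh->disconnect;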

Also, the table will have about 15 million records, so it is a fairly
small insert/update in relation to the entire table, and I thought
having a dedicated temp table etc. may be overkill.

There will be around 3 indexes on the table and no constraints. Also, I
will rebuild the indexes weekly in a nightly script; I see no need to
rebuild indexes any more often than that.

Cheers

Andy
 

Dr.Ruud

Andy wrote:
Also, the table will have about 15 million records, so it is a fairly
small insert/update in relation to the entire table, and I thought
having a dedicated temp table etc. may be overkill.

Best put that temp-table in its own separate database then, and use
(maybe Oracle-specific) SQL to append and update (and maybe delete?)
data from the current dataset.

For each INSERT query, create a sub-query (or view) that only returns
new data.

For each UPDATE query, create a sub-query (or view) that only returns
changed data.

It is even better to first remove from the temp-table all rows with
unchanged data.

In those ways, you optimally use the functionality of your DBMS.
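
Sketched as SQL pushed through DBI, with made-up table and column
names, that could look like:

use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('dbi:Oracle:mydb', 'user', 'password',
                       { RaiseError => 1, AutoCommit => 0 });

# 1. Drop staging rows whose data is already current in the main table.
$dbh->do(q{
    DELETE FROM load_tmp t
    WHERE EXISTS (SELECT 1 FROM main_table m
                  WHERE m.id = t.id AND m.value = t.value)
});

# 2. UPDATE: rows that exist in the main table but whose data changed.
$dbh->do(q{
    UPDATE main_table m
    SET    m.value = (SELECT t.value FROM load_tmp t WHERE t.id = m.id)
    WHERE  EXISTS  (SELECT 1 FROM load_tmp t WHERE t.id = m.id)
});

# 3. INSERT: rows that are genuinely new.
$dbh->do(q{
    INSERT INTO main_table (id, value)
    SELECT t.id, t.value
    FROM   load_tmp t
    WHERE  NOT EXISTS (SELECT 1 FROM main_table m WHERE m.id = t.id)
});

$dbh->commit;
$dbh->disconnect;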
 
