Can this database update script be optimized?

Cheez

Hi there,

I have a script that inserts and updates a MySQL table using data from
a text file. I am checking for duplicate records by first doing a
SELECT. If the record exists, I 1) increase a counter by one and 2)
concatenate a string onto a text field. If the record doesn't
exist, I INSERT the data normally.

It's slow. I am performing this update with about 50,000 records at a
time. I think there's an exponential increase in processing time as
the table grows in size. Initially I have about 40 of these files to
process to create a table with about 2 million records.

My questions are:

1. Using Perl, is this the most efficient approach?
2. Would a native SQL query using the database GUI be faster?
3. I am really poor at choosing when and where to use 'die' statements.
Any pointers on where I should put these would be great.

Thanks,
Cheez


code below:


print "\nINSERTER_SEQUENCE: inserting data into table\n";

use DBI;

my $dbh = DBI->connect("dbi:mysql:mydb", "", "");

open(FILE, "seqparsed.txt");

while (my $data = <FILE>) {
    ($wherefound, $sequence) = split /\t/, $data;
    dupcheck();    # this sub does all the work
}

close(FILE);

sub dupcheck {
    # selecting any records that match $sequence
    $sth = $dbh->prepare("SELECT sequence, wherefound FROM sequence5 WHERE sequence=?");
    $sth->execute($sequence);

    if (@row = $sth->fetchrow_array) {
        # a match results in frequency++
        # and an update of where it was found
        my ($seq, $wheref) = @row;
        $wherefound = "$wherefound, $wheref";

        $sth = $dbh->prepare("UPDATE sequence5 SET frequency=frequency+1, wherefound=? WHERE sequence=?");
        $sth->execute($wherefound, $sequence);
    }
    else {
        # if no records match $sequence then INSERT new data
        $sth = $dbh->prepare("INSERT INTO sequence5 (wherefound, sequence) VALUES (?, ?)");
        $sth->execute($wherefound, $sequence);
    }
}
 
gnari

Cheez said:
Hi there,


It's slow. I am performing this update with about 50,000 records at a
time. I think there's an exponential increase in processing time as
the table grows in size.

Have you checked your indexes?
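A quick way to check, assuming the table from the original post (the index name below is arbitrary): without an index on sequence, every SELECT in dupcheck() scans the whole table, which is exactly why run time grows as the table does.

    SHOW INDEX FROM sequence5;

    -- if nothing covers the lookup column, add an index;
    -- making it UNIQUE also lets MySQL reject duplicates for you
    CREATE UNIQUE INDEX idx_sequence ON sequence5 (sequence);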

gnari
 
pkent

I have a script that inserts and updates a mySQL table using data from
a text file. I am checking for duplicate records by first doing a ....
2. would a native SQL query using the database GUI be faster?

Ghod no, with all that GUI stuff and dialogs and menus in the way :)
.... or do you mean the database's native client libraries? If you do,
well, the Perl DBI hooks directly into them (in all cases I know of) so
you get the best speed.

3. I am really poor at choosing when and where for 'die' statements.
Any pointers of where I should put these would be great.

Use the 'RaiseError' => 1 option in the DBI connect() call - there are
examples in the docs. And always check the return value of open().

And always use strict and warnings (or -w on older versions of perl).

sub dupcheck {

$sth = $dbh->prepare("SELECT sequence, wherefound FROM sequence5 WHERE sequence=?");

This is a bad, or at least inefficient, thing to do, because you call
this sub once for every line - and you're prepare()ing the same thing
every time. You need to prepare() _once_ outside the loop and then use
the resulting statement handle over and over inside the loop.

$sth = $dbh->prepare("INSERT INTO sequence5 (wherefound, sequence) VALUES (?, ?)");

ditto for this one - prepare once, use many times.
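A minimal sketch of that restructuring, reusing the names from the original script (untested against a live database, but the shape is the important part: each prepare() runs once, only execute() runs per line):

    my $select = $dbh->prepare("SELECT sequence, wherefound FROM sequence5 WHERE sequence=?");
    my $update = $dbh->prepare("UPDATE sequence5 SET frequency=frequency+1, wherefound=? WHERE sequence=?");
    my $insert = $dbh->prepare("INSERT INTO sequence5 (wherefound, sequence) VALUES (?, ?)");

    while (my $data = <FILE>) {
        chomp $data;
        my ($wherefound, $sequence) = split /\t/, $data;

        $select->execute($sequence);
        if (my ($seq, $wheref) = $select->fetchrow_array) {
            # existing record: bump frequency, append location
            $update->execute("$wherefound, $wheref", $sequence);
        }
        else {
            $insert->execute($wherefound, $sequence);
        }
    }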

P
 
ctcgag

Hi there,

I have a script that inserts and updates a MySQL table using data from
a text file. I am checking for duplicate records by first doing a
SELECT. If the record exists, I 1) increase a counter by one and 2)
concatenate text in a field with a string.

Are you sure you don't want master=>detail tables, rather than
concatting into a field?
If the record doesn't
exist, I INSERT the data normally.

It's slow.

How slow is it? How fast does it need to be?
I am performing this update with about 50,000 records at a
time.

You are performing the update one record at a time, 50,000 times.
I think there's an exponential increase in processing time as
the table grows in size.

I really doubt it. I bet it is, at worst, N^2. If you can demonstrate
that it is exponential, that would be quite interesting (and perhaps useful
for finding the problem).

Initially I have about 40 of these files to
process to create a table with about 2 million records.

40*50,000 = 2 million. So how many times does the "pre-existing record"
branch of the code actually get executed?
My questions are:

1. using Perl is this the most efficient approach?

I would move all the prepares outside the inner loop, although
for MySQL I doubt it matters. If the insert branch is frequent and the
update branch is rare, I'd do the insert pre-emptively, and then do the
update only if the insert throws a primary key violation.
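Something like this sketch, assuming sequence has a UNIQUE index so a duplicate INSERT fails, and RaiseError is on so eval traps it (real code should inspect $@ for the specific duplicate-key error rather than assuming):

    eval { $insert->execute($wherefound, $sequence) };
    if ($@) {
        # presumably a duplicate key: bump the counter and append
        # to wherefound inside SQL, so no separate SELECT is needed
        $update->execute($wherefound, $sequence);
    }

where $update is prepared once as:

    UPDATE sequence5
       SET frequency=frequency+1,
           wherefound=CONCAT(?, ', ', wherefound)
     WHERE sequence=?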

But most likely it isn't a Perl problem but a MySQL one. Is "sequence" a
primary key or otherwise indexed?
2. would a native SQL query using the database GUI be faster?

I have no idea what you are talking about. I doubt that this is a
question for a perl group.
3. I am really poor at choosing when and where for 'die' statements.
Any pointers of where I should put these would be great.

First, use strict. Pass $wherefound and $sequence into dupcheck() as
arguments. You need to die (or otherwise handle) a failure to open.

Use RaiseError=>1 in your db connection. If you don't do that, then
you should have "die" statements on damn near every line.

Xho
 
Cheez

Thanks again all for the replies. Very useful stuff. It did
end up taking a very long time: only about 20,000 records/hour were
updated using this script! That's too slow. I am going to build a hash
table using 'sequence' as the key and incrementing a value by one for
duplicates. Maybe try it with all 3 million records?
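For what it's worth, a rough sketch of that pre-aggregation idea: tally everything in memory first, then do plain INSERTs with no dup-checking at all. This assumes the whole data set fits in RAM (a few million short keys usually does) and that frequency can be set directly on INSERT:

    my (%freq, %where);

    while (my $data = <FILE>) {
        chomp $data;
        my ($wherefound, $sequence) = split /\t/, $data;
        $freq{$sequence}++;                       # count duplicates
        $where{$sequence} = exists $where{$sequence}
            ? "$wherefound, $where{$sequence}"    # append location
            : $wherefound;
    }

    my $insert = $dbh->prepare("INSERT INTO sequence5 (wherefound, sequence, frequency) VALUES (?, ?, ?)");
    for my $sequence (keys %freq) {
        $insert->execute($where{$sequence}, $sequence, $freq{$sequence});
    }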

I just can't say enough about how useful these comments are.

Cheers,
Cheez
 
Bruce Horrocks

Cheez said:
2. would a native SQL query using the database GUI be faster?

See my response to your previous "SQL insert/update duplicate entries"
post.

Regards,
 
