Can this database update script be optimized?

Cheez

Hi there,

I have a script that inserts and updates a MySQL table using data from
a text file. I am checking for duplicate records by first doing a
SELECT. If the record exists, I 1) increase a counter by one and 2)
concatenate a string onto a text field. If the record doesn't
exist, I INSERT the data normally.

It's slow. I am performing this update with about 50,000 records at a
time. I think there's an exponential increase in processing time as
the table grows in size. Initially I have about 40 of these files to
process to create a table with about 2 million records.

My questions are:

1. Using Perl, is this the most efficient approach?
2. Would a native SQL query using the database GUI be faster?
3. I am really poor at choosing when and where to use 'die' statements.
Any pointers on where I should put these would be great.

Thanks,
Cheez


code below:


print "\nINSERTER_SEQUENCE: inserting data into table\n";

use DBI;

my $dbh = DBI->connect("dbi:mysql:mydb", "", "");

open(FILE, "seqparsed.txt");

while (my $data = <FILE>) {
    ($wherefound, $sequence) = split /\t/, $data;
    dupcheck();    # this sub does all the work
}

close(FILE);

sub dupcheck {
    # selecting any records that match $sequence
    $sth = $dbh->prepare("SELECT sequence, wherefound FROM sequence5 WHERE sequence=?");
    $sth->execute($sequence);

    if (@row = $sth->fetchrow_array) {
        # a match results in frequency++
        # and an update of where it was found
        my ($seq, $wheref) = @row;
        $wherefound = "$wherefound, $wheref";

        $sth = $dbh->prepare("UPDATE sequence5 SET frequency=frequency+1, wherefound=? WHERE sequence=?");
        $sth->execute($wherefound, $sequence);
    }
    else {
        # if no records match $sequence then INSERT new data
        $sth = $dbh->prepare("INSERT INTO sequence5 (wherefound, sequence) VALUES (?, ?)");
        $sth->execute($wherefound, $sequence);
    }
}
 
gnari

Cheez said:
Hi there,


It's slow. I am performing this update with about 50,000 records at a
time. I think there's an exponential increase in processing time as
the table grows in size.

Have you checked your indexes?
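A quick way to check, assuming the table from the original post (the index name below is arbitrary): without an index on sequence, every SELECT in dupcheck() scans the whole table, which is exactly why run time grows as the table does.

    SHOW INDEX FROM sequence5;

    -- if nothing covers the lookup column, add an index;
    -- making it UNIQUE also lets MySQL reject duplicates for you
    CREATE UNIQUE INDEX idx_sequence ON sequence5 (sequence);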

gnari
 
pkent

I have a script that inserts and updates a mySQL table using data from
a text file. I am checking for duplicate records by first doing a ....
2. would a native SQL query using the database GUI be faster?

Ghod no, with all that GUI stuff and dialogs and menus in the way :)
.... or do you mean the database's native client libraries? If you do,
well, the Perl DBI hooks directly into them (in all cases I know of) so
you get the best speed.

3. I am really poor at choosing when and where for 'die' statements.
Any pointers of where I should put these would be great.

Use the 'RaiseError' => 1 option in the DBI connect() call - there are
examples in the docs. And always check the return value of open().

And always use strict and warnings (or -w on older versions of perl).

sub dupcheck {

$sth = $dbh->prepare("SELECT sequence, wherefound FROM sequence5 WHERE sequence=?");

This is a bad, or at least inefficient, thing to do, because you call
this sub once for every line - and you're prepare()ing the same thing
every time. You need to prepare() _once_ outside the loop and then use
the resulting statement handle over and over inside the loop.

$sth = $dbh->prepare("INSERT INTO sequence5 (wherefound, sequence) VALUES (?, ?)");

ditto for this one - prepare once, use many times.
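A minimal sketch of that restructuring, reusing the names from the original script (untested against a live database, but the shape is the important part: each prepare() runs once, only execute() runs per line):

    my $select = $dbh->prepare("SELECT sequence, wherefound FROM sequence5 WHERE sequence=?");
    my $update = $dbh->prepare("UPDATE sequence5 SET frequency=frequency+1, wherefound=? WHERE sequence=?");
    my $insert = $dbh->prepare("INSERT INTO sequence5 (wherefound, sequence) VALUES (?, ?)");

    while (my $data = <FILE>) {
        chomp $data;
        my ($wherefound, $sequence) = split /\t/, $data;

        $select->execute($sequence);
        if (my ($seq, $wheref) = $select->fetchrow_array) {
            # existing record: bump frequency, append location
            $update->execute("$wherefound, $wheref", $sequence);
        }
        else {
            $insert->execute($wherefound, $sequence);
        }
    }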

P
 
ctcgag

Hi there,

I have a script that inserts and updates a MySQL table using data from
a text file. I am checking for duplicate records by first doing a
SELECT. If the record exists, I 1) increase a counter by one and 2)
concatenate text in a field with a string.

Are you sure you don't want master=>detail tables, rather than
concatting into a field?
If the record doesn't
exist, I INSERT the data normally.

It's slow.

How slow is it? How fast does it need to be?
I am performing this update with about 50,000 records at a
time.

You are performing the update one record at a time, 50,000 times.
I think there's an exponential increase in processing time as
the table grows in size.

I really doubt it. I bet it is, at worst, N^2. If you can demonstrate
that it is exponential, that would be quite interesting (and perhaps useful
for finding the problem).

Initially I have about 40 of these files to
process to create a table with about 2 million records.

40*50,000 = 2 million. So how many times does the "pre-existing record"
branch of the code actually get executed?
My questions are:

1. using Perl is this the most efficient approach?

I would move all the prepares outside the inner loop, although
for MySQL I doubt it matters. If the insert branch is frequent and the
update branch is rare, I'd do the insert pre-emptively, and then do the
update only if the insert throws a primary key violation.
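Something like this sketch, assuming sequence has a UNIQUE index so a duplicate INSERT fails, and RaiseError is on so eval traps it (real code should inspect $@ for the specific duplicate-key error rather than assuming):

    eval { $insert->execute($wherefound, $sequence) };
    if ($@) {
        # presumably a duplicate key: bump the counter and append
        # to wherefound inside SQL, so no separate SELECT is needed
        $update->execute($wherefound, $sequence);
    }

where $update is prepared once as:

    UPDATE sequence5
       SET frequency=frequency+1,
           wherefound=CONCAT(?, ', ', wherefound)
     WHERE sequence=?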

But most likely it isn't a Perl problem but a MySQL one. Is "sequence" a
primary key or otherwise indexed?
2. would a native SQL query using the database GUI be faster?

I have no idea what you are talking about. I doubt that this is a
question for a perl group.
3. I am really poor at choosing when and where for 'die' statements.
Any pointers of where I should put these would be great.

First, use strict. Pass $wherefound and $sequence into dupcheck() as
arguments. You need to die (or otherwise handle) a failure to open.

Use RaiseError=>1 in your db connection. If you don't do that, then
you should have "die" statements on damn near every line.

Xho
 
Cheez

Thanks again all for the replies. Very useful stuff. It did
end up taking a very long time: only about 20,000 records/hour were
updated using this script! That's too slow. I am going to build a hash
table using 'sequence' as the key and incrementing a value by one for
duplicates. Maybe try it with all 3 million records?
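For what it's worth, a rough sketch of that pre-aggregation idea: tally everything in memory first, then do plain INSERTs with no dup-checking at all. This assumes the whole data set fits in RAM (a few million short keys usually does) and that frequency can be set directly on INSERT:

    my (%freq, %where);

    while (my $data = <FILE>) {
        chomp $data;
        my ($wherefound, $sequence) = split /\t/, $data;
        $freq{$sequence}++;                       # count duplicates
        $where{$sequence} = exists $where{$sequence}
            ? "$wherefound, $where{$sequence}"    # append location
            : $wherefound;
    }

    my $insert = $dbh->prepare("INSERT INTO sequence5 (wherefound, sequence, frequency) VALUES (?, ?, ?)");
    for my $sequence (keys %freq) {
        $insert->execute($where{$sequence}, $sequence, $freq{$sequence});
    }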

I just can't say enough about how useful these comments are.

Cheers,
Cheez
 
Bruce Horrocks

Cheez said:
2. would a native SQL query using the database GUI be faster?

See my response to your previous "SQL insert/update duplicate entries"
post.

Regards,
 
