Can this database update script be optimized?

Discussion in 'Perl Misc' started by Cheez, Jan 22, 2004.

  1. Cheez

    Cheez Guest

    Hi there,

    I have a script that inserts and updates a mySQL table using data from
    a text file. I am checking for duplicate records by first doing a
    SELECT. If the record exists, I 1) increase a counter by one and 2)
    concatenate text in a field with a string. If the record doesn't
    exist, I INSERT the data normally.

    It's slow. I am performing this update with about 50,000 records at a
    time. I think there's an exponential increase in processing time as
    the table grows in size. Initially I have about 40 of these files to
    process to create a table with about 2 million records.

    My questions are:

    1. using Perl, is this the most efficient approach?
    2. would a native SQL query using the database GUI be faster?
    3. I am really poor at choosing when and where to use 'die' statements.
    Any pointers on where I should put them would be great.

    Thanks,
    Cheez


    code below:


    print "\nINSERTER_SEQUENCE: inserting data into table\n";

    use DBI;

    my $dbh = DBI->connect("dbi:mysql:mydb", "", "");

    open(FILE, "seqparsed.txt");

    while (my $data = <FILE>) {

    ($wherefound,$sequence) = split /\t/, $data;

    dupcheck (); #this sub does all the work

    }

    close (FILE);

    sub dupcheck {

    $sth = $dbh->prepare("SELECT sequence, wherefound FROM
    sequence5 WHERE sequence=?");

    $sth->execute($sequence);

    #selecting any records that match $sequence

    if (@row = $sth->fetchrow_array) {

    #a match results in frequency++
    #and update of where it was found

    my ($seq, $wheref) = @row;

    $wherefound = "$wherefound, $wheref";

    $sth = $dbh->prepare("update sequence5 SET
    frequency=frequency+1, wherefound=? WHERE sequence=?");

    $sth->execute($wherefound,$sequence);

    }

    else {

    # if no records match $sequence then INSERT new data

    $sth = $dbh->prepare("INSERT INTO sequence5 (wherefound,
    sequence) VALUES (?, ?)");

    $sth->execute( $wherefound, $sequence);

    }
    }
    Cheez, Jan 22, 2004
    #1

  2. gnari

    gnari Guest

    "Cheez" <> wrote in message
    news:...
    > Hi there,
    >
    >
    > It's slow. I am performing this update with about 50,000 records at a
    > time. I think there's an exponential increase in processing time as
    > the table grows in size.


    have you checked your indexes?
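
    if 'sequence' has no index, every one of those SELECTs is a full
    table scan, and each scan gets slower as the table grows. untested
    sketch (the 255 prefix length is a guess, adjust to your data):

    # add an index so the duplicate-check SELECT can seek instead of
    # scanning every row (a prefix index, since long text columns
    # can't be indexed whole)
    $dbh->do("CREATE INDEX idx_sequence ON sequence5 (sequence(255))");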

    gnari
    gnari, Jan 22, 2004
    #2

  3. pkent

    pkent Guest

    In article <>,
    (Cheez) wrote:

    > I have a script that inserts and updates a mySQL table using data from
    > a text file. I am checking for duplicate records by first doing a

    ....
    > 2. would a native SQL query using the database GUI be faster?


    Ghod no, with all that GUI stuff and dialogs and menus in the way :)
    ... or do you mean the database's native client libraries? If you do,
    well the perl DBI hooks directly into them (in all cases I know of) so
    you get the best speed.


    > 3. I am really poor at choosing when and where to use 'die' statements.
    > Any pointers on where I should put them would be great.


    use the 'RaiseError' => 1 option in the DBI connect() call - there are
    examples in the docs. And always check the return value of open().

    And always use strict and warnings (or -w on older versions of perl).
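
    Something like this at the top of the script (an untested sketch,
    but it's the shape I mean):

    use strict;
    use warnings;
    use DBI;

    # RaiseError => 1 makes DBI die with a useful message whenever
    # connect/prepare/execute fails, so you don't need a die after
    # every single database call
    my $dbh = DBI->connect("dbi:mysql:mydb", "", "",
        { RaiseError => 1 });

    # open() won't die on its own - always check it yourself
    open(FILE, "seqparsed.txt")
        or die "can't open seqparsed.txt: $!";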


    > sub dupcheck {
    >
    > $sth = $dbh->prepare("SELECT sequence, wherefound FROM
    > sequence5 WHERE sequence=?");


    This is a bad, or at least inefficient, thing to do, because you call
    this sub once for every line - and you're prepare()ing the same thing
    every time. You need to prepare() _once_ outside the loop and then use
    the resulting statement handle over and over inside the loop.


    > $sth = $dbh->prepare("INSERT INTO sequence5 (wherefound,
    > sequence) VALUES (?, ?)");


    ditto for this one - prepare once, use many times.
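
    Roughly this shape, i.e. an untested sketch of your loop with the
    handles reused:

    # prepare each statement once, outside the loop
    my $sel = $dbh->prepare(
        "SELECT sequence, wherefound FROM sequence5 WHERE sequence=?");
    my $upd = $dbh->prepare(
        "UPDATE sequence5 SET frequency=frequency+1, wherefound=? WHERE sequence=?");
    my $ins = $dbh->prepare(
        "INSERT INTO sequence5 (wherefound, sequence) VALUES (?, ?)");

    while (my $data = <FILE>) {
        chomp $data;
        my ($wherefound, $sequence) = split /\t/, $data;

        # inside the loop you only execute() the prepared handles
        $sel->execute($sequence);
        if (my ($seq, $wheref) = $sel->fetchrow_array) {
            $upd->execute("$wherefound, $wheref", $sequence);
        }
        else {
            $ins->execute($wherefound, $sequence);
        }
    }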

    P

    --
    pkent 77 at yahoo dot, er... what's the last bit, oh yes, com
    Remove the tea to reply
    pkent, Jan 22, 2004
    #3
  4. Xho

    Guest

    (Cheez) wrote:
    > Hi there,
    >
    > I have a script that inserts and updates a mySQL table using data from
    > a text file. I am checking for duplicate records by first doing a
    > SELECT. If the record exists, I 1) increase a counter by one and 2)
    > concatenate text in a field with a string.


    Are you sure you don't want master=>detail tables, rather than
    concatting into a field?
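
    Something like this, say (hypothetical table and column names), so
    each place found is its own row instead of a growing comma-list:

    # master table: one row per distinct sequence
    $dbh->do("CREATE TABLE sequences (
        seq_id   INT AUTO_INCREMENT PRIMARY KEY,
        sequence VARCHAR(255) NOT NULL UNIQUE
    )");

    # detail table: one row per place a sequence was found
    $dbh->do("CREATE TABLE sightings (
        seq_id     INT NOT NULL,
        wherefound VARCHAR(255) NOT NULL
    )");

    The frequency then just falls out of a COUNT(*) per seq_id, and
    there is no string surgery at all.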

    > If the record doesn't
    > exist, I INSERT the data normally.
    >
    > It's slow.


    How slow is it? How fast does it need to be?

    > I am performing this update with about 50,000 records at a
    > time.


    You are performing the update one record at a time, 50,000 times.

    > I think there's an exponential increase in processing time as
    > the table grows in size.


    I really doubt it. I bet it is, at worst, N^2: without an index, each
    of the N lookups scans a table that is itself growing towards N rows.
    If you can demonstrate that it is exponential, that would be quite
    interesting (and perhaps useful for finding the problem).


    > Initially I have about 40 of these files to
    > process to create a table with about 2 million records.


    40*50,000 = 2 million. So how many times does the "pre-existing record"
    branch of the code actually get executed?

    > My questions are:
    >
    > 1. using Perl is this the most efficient approach?


    I would move all the prepares to outside the inner loops, although
    for MySQL I doubt it matters. If the insert branch is frequent and the
    update branch is rare, I'd do the insert pre-emptively, and then do the
    update only if the insert throws you a primary key violation.
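
    Untested sketch of that insert-first idea (it assumes a UNIQUE
    index on sequence, so the duplicate insert actually fails, and
    RaiseError turned on so the failure is a die that eval can catch):

    my $ins = $dbh->prepare(
        "INSERT INTO sequence5 (wherefound, sequence) VALUES (?, ?)");
    my $upd = $dbh->prepare(
        "UPDATE sequence5 SET frequency=frequency+1,
                wherefound=CONCAT(?, ', ', wherefound)
         WHERE sequence=?");

    # try the INSERT first; a duplicate key makes execute() die,
    # which eval turns into a false return value
    unless (eval { $ins->execute($wherefound, $sequence) }) {
        # ideally check $dbh->err for 1062 (ER_DUP_ENTRY) here rather
        # than treating every failure as a duplicate
        $upd->execute($wherefound, $sequence);
    }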

    But most likely, it isn't a Perl problem but a mysql one. Is "sequence" a
    primary key or otherwise indexed?

    > 2. would a native SQL query using the database GUI be faster?


    I have no idea what you are talking about. I doubt that this is a
    question for a perl group.

    > 3. I am really poor at choosing when and where to use 'die' statements.
    > Any pointers on where I should put them would be great.


    First, use strict. Pass $wherefound and $sequence into dupcheck() as
    arguments. You need to die (or otherwise handle) a failure to open.

    Use RaiseError=>1 in your db connection. If you don't do that, then
    you should have "die" statements in damn near every line.

    Xho

    --
    -------------------- http://NewsReader.Com/ --------------------
    Usenet Newsgroup Service New Rate! $9.95/Month 50GB
    Xho, Jan 22, 2004
    #4
  5. Cheez

    Cheez Guest

    Thanks again, all, for the replies. Very useful stuff. It did end up
    taking a very long time: only about 20,000 records/hour were updated
    using this script! That's too slow. I am going to build a hash table
    using 'sequence' as the key and then increment the value by one for
    duplicates. Maybe try with all 3 million records?
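
    Something like this is what I have in mind (untested, and it
    assumes the whole file's worth of keys fits in memory):

    # pass 1: aggregate everything in memory
    my (%freq, %places);
    while (my $data = <FILE>) {
        chomp $data;
        my ($wherefound, $sequence) = split /\t/, $data;
        $freq{$sequence}++;
        push @{ $places{$sequence} }, $wherefound;
    }
    close(FILE);

    # pass 2: exactly one INSERT per distinct sequence
    my $ins = $dbh->prepare(
        "INSERT INTO sequence5 (sequence, frequency, wherefound)
         VALUES (?, ?, ?)");
    for my $sequence (keys %freq) {
        $ins->execute($sequence, $freq{$sequence},
                      join(', ', @{ $places{$sequence} }));
    }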

    I just can't say enough about how useful these comments are.

    Cheers,
    Cheez
    Cheez, Jan 23, 2004
    #5
  6. Bruce Horrocks

    In message <>, Cheez <> writes:
    >2. would a native SQL query using the database GUI be faster?


    See my response to your previous "SQL insert/update duplicate entries"
    post.

    Regards,
    --
    Bruce Horrocks
    Surrey
    England
    <firstname>@<surname>.plus.com -- fix the obvious for email
    Bruce Horrocks, Jan 23, 2004
    #6
