Comparing a large amount of data


Perl Lover

I am writing a generic tool in Perl DBI to compare data between
two tables and sync them up. The tables are guaranteed to have
a primary key or unique index. One table is designated as the
base table and the other as the target table, where changes
from the base table will be applied.

The logic I am following is to select all rows from the base table
and pick its primary key, row by row. Then do a select on the
target table based on the primary key and compare the result
of the two rows (one from base table and the other from the target
table).
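
In rough DBI terms the loop looks something like this (a simplified
sketch; the DSNs, table names, and key column below are just
placeholders):

use strict;
use warnings;
use DBI;

# Placeholder connection details and table/column names.
my $base_dbh   = DBI->connect('dbi:Oracle:basedb',   'user', 'pass',
                              { RaiseError => 1 });
my $target_dbh = DBI->connect('dbi:Oracle:targetdb', 'user', 'pass',
                              { RaiseError => 1 });

my $base_sth = $base_dbh->prepare('SELECT * FROM basetable');
$base_sth->execute;

my $target_sth = $target_dbh->prepare(
    'SELECT * FROM target WHERE primary_key = ?');

# Walk the base table row by row, fetching the matching target row by key.
while ( my $base_row = $base_sth->fetchrow_hashref ) {
    $target_sth->execute( $base_row->{primary_key} );
    my $target_row = $target_sth->fetchrow_hashref;
    $target_sth->finish;

    if ( !defined $target_row ) {
        # row is missing from the target: insert it
    }
    elsif ( grep { ( $base_row->{$_} // '' ) ne ( $target_row->{$_} // '' ) }
            keys %$base_row ) {
        # at least one column differs: update the target row
    }
}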

Can I make it faster by any other method?
 

Thrill5

Perl Lover said:
I am writing a generic tool in Perl DBI to compare data between
two tables and sync them up. The tables are guaranteed to have
a primary key or unique index. One table is designated as the
base table and the other as the target table, where changes
from the base table will be applied.

The logic I am following is to select all rows from the base table
and pick its primary key, row by row. Then do a select on the
target table based on the primary key and compare the result
of the two rows (one from base table and the other from the target
table).

Can I make it faster by any other method?

Most of the heavy lifting could be done using SQL commands:

#Returns a list of primary keys from basetable that are not in target
SELECT primary_key FROM basetable
EXCEPT
SELECT primary_key FROM target

#Returns all the records in basetable that do not have an identical
#record in target
SELECT basetable.*
FROM basetable
WHERE NOT EXISTS (SELECT 1 FROM target
                  WHERE basetable.field1 = target.field1 AND
                        basetable.field2 = target.field2 AND
                        ...)

You could optimize further by using these to INSERT the new records or
perform the UPDATEs directly, instead of just finding them.
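
For instance, the missing rows can be pushed across in one statement
straight from DBI (a sketch: it assumes both tables are visible to the
same connection, and the DSN and column list are placeholders):

use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('dbi:mysql:syncdb', 'user', 'pass',
                       { RaiseError => 1 });

# Copy over rows whose primary key is missing from target, entirely
# inside the database.
my $copied = $dbh->do(q{
    INSERT INTO target (primary_key, field1, field2)
    SELECT primary_key, field1, field2
    FROM basetable
    WHERE primary_key NOT IN (SELECT primary_key FROM target)
});
printf "%d new rows copied\n", $copied;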

Scott
 

Brian McCauley

Thrill5 said:
#Returns a list of primary keys from basetable that are not in target
SELECT primary_key FROM basetable
EXCEPT
SELECT primary_key FROM target

#Returns all the records in basetable that do not have an identical
#record in target
SELECT basetable.*
FROM basetable
WHERE NOT EXISTS (SELECT 1 FROM target
                  WHERE basetable.field1 = target.field1 AND
                        basetable.field2 = target.field2 AND
                        ...)

That's quite a challenge for the SQL optimiser. I prefer to do it
with outer joins...

#Returns a list of primary_keys from basetable that are not in target
SELECT basetable.primary_key
FROM basetable
LEFT OUTER JOIN target
ON basetable.primary_key = target.primary_key
WHERE target.primary_key IS NULL

# Returns all the records in basetable that are not identical
# to the records in target

SELECT basetable.*
FROM basetable
LEFT OUTER JOIN target
ON basetable.field1 = target.field1 AND
basetable.field2 = target.field2 AND
...
WHERE target.primary_key IS NULL

This, of course, has nothing to do with Perl.
 

Perl Lover

Brian McCauley said:
That's quite a challenge for the SQL optimiser. I prefer to do it
with outer joins...

#Returns a list of primary_keys from basetable that are not in target
SELECT basetable.primary_key
FROM basetable
LEFT OUTER JOIN target
ON basetable.primary_key = target.primary_key
WHERE target.primary_key IS NULL

# Returns all the records in basetable that are not identical
# to the records in target

SELECT basetable.*
FROM basetable
LEFT OUTER JOIN target
ON basetable.field1 = target.field1 AND
basetable.field2 = target.field2 AND
...
WHERE target.primary_key IS NULL

This, of course, has nothing to do with Perl.

This whole exercise started in Perl because doing it in SQL
just killed the RDBMS. Imagine doing it on a table with
30 million rows.
 

Mark Clements

Perl Lover said:
this whole exercise started in Perl because doing it in SQL
just killed the RDBMS. Imagine doing it on a table with
30 million rows.

If you get the indexes right, then it shouldn't be a problem. Use your
RDBMS's analysis tools (EXPLAIN PLAN or whatever) to check that your
queries are, in fact, using the indexes that have been set.
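
For instance, DBI can run the plan query directly so the tool itself can
warn when a full table scan is about to happen (a sketch using
MySQL-style EXPLAIN; the statement and its output columns differ per
RDBMS, and the DSN is a placeholder):

use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('dbi:mysql:syncdb', 'user', 'pass',
                       { RaiseError => 1 });

# MySQL-style EXPLAIN; other databases use EXPLAIN PLAN, SHOWPLAN, etc.
my $plan = $dbh->selectall_arrayref(q{
    EXPLAIN
    SELECT basetable.primary_key
    FROM basetable
    LEFT OUTER JOIN target
      ON basetable.primary_key = target.primary_key
    WHERE target.primary_key IS NULL
}, { Slice => {} });

for my $step (@$plan) {
    printf "table=%s key=%s rows=%s\n",
        $step->{table} // '?', $step->{key} // 'NO INDEX', $step->{rows} // '?';
}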

Mark
 

Thrill5

Mark Clements said:
If you get the indexes right, then it shouldn't be a problem. Use your
RDBMS's analysis tools (EXPLAIN PLAN or whatever) to check that your
queries are, in fact, using the indexes that have been set.

Mark

You could also add another field to the tables that holds the date/time of
the last change, and use this one field to test for changes. You could
update this field in the basetable database using an ON UPDATE trigger.
Comparing each field in each row is probably what is killing the RDBMS. The
only thing you offload by doing this in another language is the comparison
operation, which means what you need to optimize is how to determine when a
record has been updated. Doing the compare on another system in another
language is not going to be much faster than doing it in SQL, and is
probably going to be slower.
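
A rough sketch (MySQL column syntax shown here; most other databases
would keep the column current with an ON UPDATE trigger instead, and the
DSN and names are placeholders):

use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('dbi:mysql:syncdb', 'user', 'pass',
                       { RaiseError => 1 });

# One-time setup: MySQL updates the column automatically on every change.
$dbh->do(q{
    ALTER TABLE basetable
    ADD COLUMN last_changed TIMESTAMP
        DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP
});

# Each sync run then only has to look at rows changed since the last run.
my $last_sync = '2007-01-01 00:00:00';   # placeholder: saved at end of each run
my $changed = $dbh->selectall_arrayref(
    'SELECT primary_key FROM basetable WHERE last_changed > ?',
    undef, $last_sync);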

Scott
 

DJ Stunks

Mark said:
If you get the indexes right, then it shouldn't be a problem. Use your
RDBMS's analysis tools (EXPLAIN PLAN or whatever) to check that your
queries are, in fact, using the indexes that have been set.

exactly. or do portions at a time: IDs between 0 and 399_999, then
400_000 to 799_999, or something similar.

there's absolutely no way Perl can outperform a database at this type
of task. I love Perl, but use the right tool for the job.
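
for example (a rough sketch, assuming a numeric primary key; the chunk
size, DSN, and column names are placeholders):

use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('dbi:mysql:syncdb', 'user', 'pass',
                       { RaiseError => 1 });

my $chunk = 400_000;   # placeholder chunk size
my ($max_id) = $dbh->selectrow_array('SELECT MAX(primary_key) FROM basetable');

# Compare one slice of the key range at a time instead of the whole table.
my $sth = $dbh->prepare(q{
    SELECT basetable.primary_key
    FROM basetable
    LEFT OUTER JOIN target
      ON basetable.primary_key = target.primary_key
    WHERE target.primary_key IS NULL
      AND basetable.primary_key BETWEEN ? AND ?
});

for ( my $low = 0; $low <= $max_id; $low += $chunk ) {
    $sth->execute( $low, $low + $chunk - 1 );
    while ( my ($pk) = $sth->fetchrow_array ) {
        # queue $pk for insert into target
    }
}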

-jp
 
