Comparing a large amount of data


Perl Lover

I am writing a generic tool in Perl DBI to compare data between
two tables and sync them up. The tables are guaranteed to have
a primary key or unique index. One table is designated as the
base table and the other as the target table, where changes
from the base table will be applied.

The logic I am following is to select all rows from the base table
and pick its primary key, row by row. Then do a select on the
target table based on the primary key and compare the result
of the two rows (one from base table and the other from the target
table).
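
In rough DBI terms the loop looks something like this (a simplified
sketch; the DSNs, table names, and key column below are just
placeholders):

use strict;
use warnings;
use DBI;

# Placeholder connection details and table/column names.
my $base_dbh   = DBI->connect('dbi:Oracle:basedb',   'user', 'pass',
                              { RaiseError => 1 });
my $target_dbh = DBI->connect('dbi:Oracle:targetdb', 'user', 'pass',
                              { RaiseError => 1 });

my $base_sth = $base_dbh->prepare('SELECT * FROM basetable');
$base_sth->execute;

my $target_sth = $target_dbh->prepare(
    'SELECT * FROM target WHERE primary_key = ?');

# Walk the base table row by row, fetching the matching target row by key.
while ( my $base_row = $base_sth->fetchrow_hashref ) {
    $target_sth->execute( $base_row->{primary_key} );
    my $target_row = $target_sth->fetchrow_hashref;
    $target_sth->finish;

    if ( !defined $target_row ) {
        # row is missing from the target: insert it
    }
    elsif ( grep { ( $base_row->{$_} // '' ) ne ( $target_row->{$_} // '' ) }
            keys %$base_row ) {
        # at least one column differs: update the target row
    }
}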

Can I make it faster by any other method?
 

Thrill5

Perl Lover said:
I am writing a generic tool in Perl DBI to compare data between
two tables and sync them up. The tables are guaranteed to have
a primary key or unique index. One table is designated as the
base table and the other as the target table, where changes
from the base table will be applied.

The logic I am following is to select all rows from the base table
and pick its primary key, row by row. Then do a select on the
target table based on the primary key and compare the result
of the two rows (one from base table and the other from the target
table).

Can I make it faster by any other method?

Most of the heavy lifting could be done using SQL commands:

#Returns a list of primary keys from basetable that are not in target
SELECT primary_key FROM basetable
EXCEPT
SELECT primary_key FROM target

#Returns all the records in basetable that do not have an identical
#record in target
SELECT basetable.*
FROM basetable
WHERE NOT EXISTS (SELECT 1 FROM target
                  WHERE basetable.field1 = target.field1 AND
                        basetable.field2 = target.field2 AND
                        ...)

You could optimize further by using these to INSERT the new records or
perform the UPDATEs directly, instead of just finding them.
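
For instance, the missing rows can be pushed across in one statement
straight from DBI (a sketch: it assumes both tables are visible to the
same connection, and the DSN and column list are placeholders):

use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('dbi:mysql:syncdb', 'user', 'pass',
                       { RaiseError => 1 });

# Copy over rows whose primary key is missing from target, entirely
# inside the database.
my $copied = $dbh->do(q{
    INSERT INTO target (primary_key, field1, field2)
    SELECT primary_key, field1, field2
    FROM basetable
    WHERE primary_key NOT IN (SELECT primary_key FROM target)
});
printf "%d new rows copied\n", $copied;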

Scott
 

Brian McCauley

Thrill5 said:
#Returns a list of primary keys from basetable that are not in target
SELECT primary_key FROM basetable
EXCEPT
SELECT primary_key FROM target

#Returns all the records in basetable that do not have an identical
#record in target
SELECT basetable.*
FROM basetable
WHERE NOT EXISTS (SELECT 1 FROM target
                  WHERE basetable.field1 = target.field1 AND
                        basetable.field2 = target.field2 AND
                        ...)

That's quite a challenge for the SQL optimiser. I prefer to do it
with outer joins...

#Returns a list of primary_keys from basetable that are not in target
SELECT basetable.primary_key
FROM basetable
LEFT OUTER JOIN target
ON basetable.primary_key = target.primary_key
WHERE target.primary_key IS NULL

# Returns all the records in basetable that are not identical
# to the records in target

SELECT basetable.*
FROM basetable
LEFT OUTER JOIN target
ON basetable.field1 = target.field1 AND
basetable.field2 = target.field2 AND
...
WHERE target.primary_key IS NULL

This, of course, has nothing to do with Perl.
 

Perl Lover

Brian McCauley said:
That's quite a challenge for the SQL optimiser. I prefer to do it
with outer joins...

#Returns a list of primary_keys from basetable that are not in target
SELECT basetable.primary_key
FROM basetable
LEFT OUTER JOIN target
ON basetable.primary_key = target.primary_key
WHERE target.primary_key IS NULL

# Returns all the records in basetable that are not identical
# to the records in target

SELECT basetable.*
FROM basetable
LEFT OUTER JOIN target
ON basetable.field1 = target.field1 AND
basetable.field2 = target.field2 AND
...
WHERE target.primary_key IS NULL

This, of course, has nothing to do with Perl.

This whole exercise started in Perl because doing it in SQL
just killed the RDBMS. Imagine doing it on a table with
30 million rows.
 

Mark Clements

Perl Lover said:
this whole exercise started in Perl because doing it in SQL
just killed the RDBMS. Imagine doing it on a table with
30 million rows.

If you get the indexes right, then it shouldn't be a problem. Use your
RDBMS's analysis tools (EXPLAIN PLAN or whatever) to check that your
queries are, in fact, using the indexes that have been set.
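
For instance, DBI can run the plan query directly so the tool itself can
warn when a full table scan is about to happen (a sketch using
MySQL-style EXPLAIN; the statement and its output columns differ per
RDBMS, and the DSN is a placeholder):

use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('dbi:mysql:syncdb', 'user', 'pass',
                       { RaiseError => 1 });

# MySQL-style EXPLAIN; other databases use EXPLAIN PLAN, SHOWPLAN, etc.
my $plan = $dbh->selectall_arrayref(q{
    EXPLAIN
    SELECT basetable.primary_key
    FROM basetable
    LEFT OUTER JOIN target
      ON basetable.primary_key = target.primary_key
    WHERE target.primary_key IS NULL
}, { Slice => {} });

for my $step (@$plan) {
    printf "table=%s key=%s rows=%s\n",
        $step->{table} // '?', $step->{key} // 'NO INDEX', $step->{rows} // '?';
}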

Mark
 

Thrill5

Mark Clements said:
If you get the indexes right, then it shouldn't be a problem. Use your
RDBMS's analysis tools (EXPLAIN PLAN or whatever) to check that your
queries are, in fact, using the indexes that have been set.

Mark

You could also add another field to the tables that holds the date/time of
the last change, and use this one field to test for changes. You could
update this field in the basetable database using an ON UPDATE trigger.
Comparing each field in each row is probably what is killing the RDBMS. The
only thing you offload by doing this in another language is the comparison
operation, which means what you need to optimize is how to determine when a
record has been updated. Doing the compare on another system in another
language is not going to be much faster than doing it in SQL, and is
probably going to be slower.
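
A rough sketch (MySQL column syntax shown here; most other databases
would keep the column current with an ON UPDATE trigger instead, and the
DSN and names are placeholders):

use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('dbi:mysql:syncdb', 'user', 'pass',
                       { RaiseError => 1 });

# One-time setup: MySQL updates the column automatically on every change.
$dbh->do(q{
    ALTER TABLE basetable
    ADD COLUMN last_changed TIMESTAMP
        DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP
});

# Each sync run then only has to look at rows changed since the last run.
my $last_sync = '2007-01-01 00:00:00';   # placeholder: saved at end of each run
my $changed = $dbh->selectall_arrayref(
    'SELECT primary_key FROM basetable WHERE last_changed > ?',
    undef, $last_sync);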

Scott
 

DJ Stunks

Mark said:
If you get the indexes right, then it shouldn't be a problem. Use your
RDBMS's analysis tools (EXPLAIN PLAN or whatever) to check that your
queries are, in fact, using the indexes that have been set.

exactly. or do portions at a time: IDs between 0 and 399_999, then
400_000 to 799_999, or something similar.

there's absolutely no way Perl can outperform a database at this type
of task. I love Perl, but use the right tool for the job.
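
for example (a rough sketch, assuming a numeric primary key; the chunk
size, DSN, and column names are placeholders):

use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('dbi:mysql:syncdb', 'user', 'pass',
                       { RaiseError => 1 });

my $chunk = 400_000;   # placeholder chunk size
my ($max_id) = $dbh->selectrow_array('SELECT MAX(primary_key) FROM basetable');

# Compare one slice of the key range at a time instead of the whole table.
my $sth = $dbh->prepare(q{
    SELECT basetable.primary_key
    FROM basetable
    LEFT OUTER JOIN target
      ON basetable.primary_key = target.primary_key
    WHERE target.primary_key IS NULL
      AND basetable.primary_key BETWEEN ? AND ?
});

for ( my $low = 0; $low <= $max_id; $low += $chunk ) {
    $sth->execute( $low, $low + $chunk - 1 );
    while ( my ($pk) = $sth->fetchrow_array ) {
        # queue $pk for insert into target
    }
}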

-jp
 
