comparing lists

ccc31807

A normal task: sorting a large data file by some criterion, breaking
it into sub-files, and sending each sub-file to a particular client
based on the criterion.

During the next several weeks, I've been tasked with taking three data
files, comparing the keys of each file, and, if the keys are identical,
processing the file, but if not, printing out a list of differences,
which in effect means printing out the differing keys. The keys are
all seven-digit integers. (Each file is to be generated by a different
query of the same database.)

Okay, I could use diff for this, but I'd like to do it
programmatically. Using brute force, I could generate three files with
just the keys and compare them line by line, but I'd rather not do
this, for several reasons, but mostly because the data files are pretty
much guaranteed to be identical and we don't expect there to be any
differences.

I'm thinking about hashing the keys in the three files and comparing
the key digests, with the assumption that identical hashes mean
identical files.

Ideas?

Thanks, CC.
 
Jürgen Exner

ccc31807 said:
> During the next several weeks, I've been tasked with taking three data
> files, comparing the keys of each file, and, if the keys are identical,
> processing the file, but if not, printing out a list of differences,
> which in effect means printing out the differing keys. The keys are
> all seven-digit integers. (Each file is to be generated by a different
> query of the same database.) [...]
> I'm thinking about hashing the keys in the three files and comparing
> the key digests, with the assumption that identical hashes mean
> identical files.

Seems to be rather simple and straightforward. Read the keys from each
file into a hash (be careful to treat them as strings, so that you
don't run into potential int overflow problems), then compare the hashes
as described in "perldoc -q intersection".
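A minimal sketch of that approach, assuming one seven-digit key per line
(reading the file names from the command line; the output wording is made up):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Read one key per line into a hash; keys are kept as strings.
sub read_keys {
    my ($file) = @_;
    open my $fh, '<', $file or die "Cannot open $file: $!";
    my %keys;
    while (my $line = <$fh>) {
        chomp $line;
        $keys{$line} = 1 if length $line;
    }
    return \%keys;
}

# Return, sorted, every key missing from at least one of the given hashes.
sub differing_keys {
    my @sets = @_;
    my %union;
    for my $set (@sets) {
        $union{$_} = 1 for keys %$set;
    }
    return sort grep {
        my $k = $_;
        grep { !exists $_->{$k} } @sets;    # true if any set lacks $k
    } keys %union;
}

# Usage: perl diffkeys.pl file1 file2 file3
if (@ARGV == 3) {
    my @diff = differing_keys(map { read_keys($_) } @ARGV);
    if (@diff) {
        print "$_\n" for @diff;
    } else {
        print "All three key sets are identical\n";
    }
}
```

Keys that appear in only one or two of the files end up in the difference
list; with identical files the list is empty and nothing needs to be printed.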

jue
 
Xho Jingleheimerschmidt

ccc31807 said:
> During the next several weeks, I've been tasked with taking three data
> files, comparing the keys of each file, and, if the keys are identical,
> processing the file, but if not, printing out a list of differences,

With what information? Just the name of the key that fails to appear in
all files, or do you have to identify which one or two out of three it
appears in?

> which in effect means printing out the differing keys. The keys are
> all seven-digit integers. (Each file is to be generated by a different
> query of the same database.)

Since you already got it in a database, how about something like:

    select key, count(1)
    from (union of all three queries)
    group by key
    having count(1) != 3;

> Okay, I could use diff for this, but I'd like to do it
> programmatically. Using brute force, I could generate three files with
> just the keys and compare them line by line, but I'd rather not do
> this, for several reasons, but mostly because the data files are pretty
> much guaranteed to be identical and we don't expect there to be any
> differences.

That reason doesn't make much sense. The fact that the files are pretty
much guaranteed to be identical can be used to argue against *any*
proposed method, not just the line-by-line method.

> I'm thinking about hashing the keys in the three files and comparing
> the key digests, with the assumption that identical hashes mean
> identical files.

I don't know of any hashing functions that have both a very low chance
of collision, and are indifferent to the order in which the strings are
added into it. And if you have to sort the keys so they are in the same
order, then you might as well do the line by line thing.

Xho
 
Peter J. Holzer

[comparing three files]
> Okay, I could use diff for this, but I'd like to do it
> programmatically.

diff isn't a program?

> Using brute force, I could generate three files with
> just the keys and compare them line by line, but I'd rather not do
> this, for several reasons, but mostly because the data files are pretty
> much guaranteed to be identical and we don't expect there to be any
> differences.

If the files are "pretty much guaranteed to be identical" you could just
compute a hash for each file and compare the hashes. If they are the
same, you are done. Only if they aren't (which is "pretty much
guaranteed" not to happen) do you need to worry about finding the
differences.
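A sketch of that fast path using the core Digest::SHA module; the keys are
sorted before digesting so that row order in the files doesn't matter (the
command-line handling and messages are assumptions):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Digest::SHA;

# Digest the sorted keys of a file: two files containing the same keys
# in a different order still produce the same digest.
sub key_digest {
    my ($file) = @_;
    open my $fh, '<', $file or die "Cannot open $file: $!";
    chomp(my @keys = <$fh>);
    my $sha = Digest::SHA->new(256);
    $sha->add("$_\n") for sort @keys;
    return $sha->hexdigest;
}

# Usage: perl checkkeys.pl file1 file2 file3
if (@ARGV == 3) {
    my @digests = map { key_digest($_) } @ARGV;
    if ($digests[0] eq $digests[1] && $digests[1] eq $digests[2]) {
        print "Key sets match; nothing to report\n";
    } else {
        print "Digests differ; fall back to a key-by-key comparison\n";
    }
}
```

This keeps the common case (identical files) to one cheap comparison, and
only the rare mismatch pays for the full comparison.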

hp
 
ccc31807

> diff isn't a program?

I process the (main) file with a Perl script, and I don't want to do
in two steps what I can do in one, that is, by including a function in
the existing script to compare the three files.

> If the files are "pretty much guaranteed to be identical" you could just
> compute a hash for each file and compare the hashes. If they are the
> same, you are done. Only if they aren't (which is "pretty much
> guaranteed" not to happen) do you need to worry about finding the
> differences.

As it turns out, with a couple of days' experience and several
attempts, I wound up creating three hashes, one for each file, with
the IDs as keys and the name of the file as the values. I iterate
through the 'main' hash, and if the element exists in all three
hashes I delete it. I then print the hashes. It's kinda crude, but it
was easy to do, doesn't take long, and gives me what I need.
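That approach might be sketched roughly like this (the one-ID-per-line
layout and the command-line handling are assumptions):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Build one hash per file: ID => file name.
sub load_ids {
    my ($file) = @_;
    open my $fh, '<', $file or die "Cannot open $file: $!";
    my %ids;
    while (my $id = <$fh>) {
        chomp $id;
        $ids{$id} = $file if length $id;
    }
    return \%ids;
}

# Delete every ID that appears in all of the hashes; whatever is left
# in any hash afterwards is a difference.
sub prune_common {
    my ($main, @others) = @_;
    for my $id (keys %$main) {
        if (!grep { !exists $_->{$id} } @others) {   # present everywhere
            delete $_->{$id} for $main, @others;
        }
    }
}

# Usage: perl prune.pl main.txt second.txt third.txt
if (@ARGV == 3) {
    my @hashes = map { load_ids($_) } @ARGV;
    prune_common(@hashes);
    for my $h (@hashes) {
        print "$_ (in $h->{$_})\n" for sort keys %$h;
    }
}
```

Storing the file name as the value means the leftover entries also say
which file each stray ID came from, for free.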

Thanks, CC.
 
ccc31807

> With what information? Just the name of the key that fails to appear in
> all files, or do you have to identify which one or two out of three it
> appears in?

Just the key.
> Since you already got it in a database, how about something like:

Unfortunately, this is a non-SQL, non-relational, non-first-normal-form
flat-file database (IBM's UniData) over a WAN connection, and
it's a lot more practical to glob the data and process it locally.

> That reason doesn't make much sense. The fact that the files are pretty
> much guaranteed to be identical can be used to argue against *any*
> proposed method, not just the line-by-line method.

See my reply to PJH. The 'official' query is highly impractical for my
unit, and we have written two other queries to replace it. We just
want to make sure that the data derived from all three queries is the
same before we make any changes.

> I don't know of any hashing functions that have both a very low chance
> of collision, and are indifferent to the order in which the strings are
> added into it. And if you have to sort the keys so they are in the same
> order, then you might as well do the line by line thing.

Obviously, the keys would have to be in order. As it turns out, the
size of the files is much smaller than I anticipated, so O(n) works just
fine.

CC.
 
Peter J. Holzer

> As it turns out, with a couple of days' experience and several
> attempts, I wound up creating three hashes,

I just realized that my use of the word "hash" was ambiguous: I meant
the result of a strong hash function such as SHA-1, not a Perl hash.

hp
 
ccc31807

> I just realized that my use of the word "hash" was ambiguous: I meant
> the result of a strong hash function such as SHA-1, not a Perl hash.

That's okay. I figured out what you meant.

CC.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,769
Messages
2,569,580
Members
45,055
Latest member
SlimSparkKetoACVReview

Latest Threads

Top