comparing lists

ccc31807

A normal task: sorting a large data file by some criterion, breaking
it into sub-files, and sending each sub-file to a particular client
based on the criterion.

During the next several weeks, I've been tasked with taking three data
files, comparing the keys of each file, and, if the keys are identical,
processing the file, but if not, printing out a list of differences,
which in effect means printing out the differing keys. The keys are
all seven-digit integers. (Each file is to be generated by a different
query of the same database.)

Okay, I could use diff for this, but I'd like to do it
programmatically. Using brute force, I could generate three files with
just the keys and compare them line by line, but I'd rather not do
this, for several reasons, but mostly because the data files are pretty
much guaranteed to be identical and we don't expect there to be any
differences.

I'm thinking about hashing the keys in the three files and comparing
the key digests, with the assumption that identical hashes mean
identical files.

Ideas?

Thanks, CC.
 
Jürgen Exner

ccc31807 said:
> During the next several weeks, I've been tasked with taking three data
> files, comparing the keys of each file, and, if the keys are identical,
> processing the file, but if not, printing out a list of differences,
> which in effect means printing out the differing keys. The keys are
> all seven-digit integers. (Each file is to be generated by a different
> query of the same database.) [...]
> I'm thinking about hashing the keys in the three files and comparing
> the key digests, with the assumption that identical hashes mean
> identical files.

Seems to be rather simple and straightforward. Read the keys from each
file into a hash (be careful to treat them as strings, so that you
don't run into potential int overflow problems), then compare the hashes
as described in "perldoc -q intersection".
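A minimal sketch of that approach, assuming one seven-digit key per line
(reading the file names from the command line; the output wording is made up):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Read one key per line into a hash; keys are kept as strings.
sub read_keys {
    my ($file) = @_;
    open my $fh, '<', $file or die "Cannot open $file: $!";
    my %keys;
    while (my $line = <$fh>) {
        chomp $line;
        $keys{$line} = 1 if length $line;
    }
    return \%keys;
}

# Return, sorted, every key missing from at least one of the given hashes.
sub differing_keys {
    my @sets = @_;
    my %union;
    for my $set (@sets) {
        $union{$_} = 1 for keys %$set;
    }
    return sort grep {
        my $k = $_;
        grep { !exists $_->{$k} } @sets;    # true if any set lacks $k
    } keys %union;
}

# Usage: perl diffkeys.pl file1 file2 file3
if (@ARGV == 3) {
    my @diff = differing_keys(map { read_keys($_) } @ARGV);
    if (@diff) {
        print "$_\n" for @diff;
    } else {
        print "All three key sets are identical\n";
    }
}
```

Keys that appear in only one or two of the files end up in the difference
list; with identical files the list is empty and nothing needs to be printed.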

jue
 
Xho Jingleheimerschmidt

ccc31807 said:
> During the next several weeks, I've been tasked with taking three data
> files, comparing the keys of each file, and, if the keys are identical,
> processing the file, but if not, printing out a list of differences,

With what information? Just the name of the key that fails to appear in
all files, or do you have to identify which one or two out of three it
appears in?

> which in effect means printing out the differing keys. The keys are
> all seven-digit integers. (Each file is to be generated by a different
> query of the same database.)

Since you already got it in a database, how about something like:

    select key, count(1)
    from (union of all three queries)
    group by key
    having count(1) != 3;

> Okay, I could use diff for this, but I'd like to do it
> programmatically. Using brute force, I could generate three files with
> just the keys and compare them line by line, but I'd rather not do
> this, for several reasons, but mostly because the data files are pretty
> much guaranteed to be identical and we don't expect there to be any
> differences.

That reason doesn't make much sense. The fact that the files are pretty
much guaranteed to be identical can be used to argue against *any*
proposed method, not just the line-by-line method.

> I'm thinking about hashing the keys in the three files and comparing
> the key digests, with the assumption that identical hashes mean
> identical files.

I don't know of any hashing functions that have both a very low chance
of collision, and are indifferent to the order in which the strings are
added into it. And if you have to sort the keys so they are in the same
order, then you might as well do the line by line thing.

Xho
 
Peter J. Holzer

[comparing three files]
> Okay, I could use diff for this, but I'd like to do it
> programmatically.

diff isn't a program?

> Using brute force, I could generate three files with
> just the keys and compare them line by line, but I'd rather not do
> this, for several reasons, but mostly because the data files are pretty
> much guaranteed to be identical and we don't expect there to be any
> differences.

If the files are "pretty much guaranteed to be identical" you could just
compute a hash for each file and compare the hashes. If they are the
same, you are done. Only if they aren't (which is "pretty much
guaranteed" not to happen) do you need to worry about finding the
differences.
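A sketch of that fast path using the core Digest::SHA module; the keys are
sorted before digesting so that row order in the files doesn't matter (the
command-line handling and messages are assumptions):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Digest::SHA;

# Digest the sorted keys of a file: two files containing the same keys
# in a different order still produce the same digest.
sub key_digest {
    my ($file) = @_;
    open my $fh, '<', $file or die "Cannot open $file: $!";
    chomp(my @keys = <$fh>);
    my $sha = Digest::SHA->new(256);
    $sha->add("$_\n") for sort @keys;
    return $sha->hexdigest;
}

# Usage: perl checkkeys.pl file1 file2 file3
if (@ARGV == 3) {
    my @digests = map { key_digest($_) } @ARGV;
    if ($digests[0] eq $digests[1] && $digests[1] eq $digests[2]) {
        print "Key sets match; nothing to report\n";
    } else {
        print "Digests differ; fall back to a key-by-key comparison\n";
    }
}
```

This keeps the common case (identical files) to one cheap comparison, and
only the rare mismatch pays for the full comparison.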

hp
 
ccc31807

> diff isn't a program?

I process the (main) file with a Perl script, and I don't want to do
in two steps what I can do in one, that is, by including a function in
the existing script to compare the three files.

> If the files are "pretty much guaranteed to be identical" you could just
> compute a hash for each file and compare the hashes. If they are the
> same, you are done. Only if they aren't (which is "pretty much
> guaranteed" not to happen) do you need to worry about finding the
> differences.

As it turns out, with a couple of days' experience and several
attempts, I wound up creating three hashes, one for each file, with
the IDs as keys and the name of the file as the values. I iterate
through the 'main' hash, and if the element exists in all three
hashes I delete it. I then print the hashes. It's kinda crude, but it
was easy to do, doesn't take long, and gives me what I need.
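That approach might be sketched roughly like this (the one-ID-per-line
layout and the command-line handling are assumptions):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Build one hash per file: ID => file name.
sub load_ids {
    my ($file) = @_;
    open my $fh, '<', $file or die "Cannot open $file: $!";
    my %ids;
    while (my $id = <$fh>) {
        chomp $id;
        $ids{$id} = $file if length $id;
    }
    return \%ids;
}

# Delete every ID that appears in all of the hashes; whatever is left
# in any hash afterwards is a difference.
sub prune_common {
    my ($main, @others) = @_;
    for my $id (keys %$main) {
        if (!grep { !exists $_->{$id} } @others) {   # present everywhere
            delete $_->{$id} for $main, @others;
        }
    }
}

# Usage: perl prune.pl main.txt second.txt third.txt
if (@ARGV == 3) {
    my @hashes = map { load_ids($_) } @ARGV;
    prune_common(@hashes);
    for my $h (@hashes) {
        print "$_ (in $h->{$_})\n" for sort keys %$h;
    }
}
```

Storing the file name as the value means the leftover entries also say
which file each stray ID came from, for free.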

Thanks, CC.
 
ccc31807

> With what information? Just the name of the key that fails to appear in
> all files, or do you have to identify which one or two out of three it
> appears in?

Just the key.
> Since you already got it in a database, how about something like:

Unfortunately, this is a non-SQL, non-relational, non-first-normal-form
flat-file database (IBM's UniData) over a WAN connection, and
it's a lot more practical to glob the data and process it locally.

> That reason doesn't make much sense. The fact that the files are pretty
> much guaranteed to be identical can be used to argue against *any*
> proposed method, not just the line-by-line method.

See my reply to PJH. The 'official' query is highly impractical for my
unit, and we have written two other queries to replace it. We just
want to make sure that the data derived from all three queries is the
same before we make any changes.

> I don't know of any hashing functions that have both a very low chance
> of collision, and are indifferent to the order in which the strings are
> added into it. And if you have to sort the keys so they are in the same
> order, then you might as well do the line by line thing.

Obviously, the keys would have to be in order. As it turns out, the
size of the files is much smaller than I anticipated, so O(n) works just
fine.

CC.
 
Peter J. Holzer

> As it turns out, with a couple of days' experience and several
> attempts, I wound up creating three hashes,

I just realized that my use of the word "hash" was ambiguous: I meant
the result of a strong hash function such as SHA-1, not a Perl hash.

hp
 
ccc31807

> I just realized that my use of the word "hash" was ambiguous: I meant
> the result of a strong hash function such as SHA-1, not a Perl hash.

That's okay. I figured out what you meant.

CC.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,769
Messages
2,569,580
Members
45,055
Latest member
SlimSparkKetoACVReview

Latest Threads

Top