comparing lists

Discussion in 'Perl Misc' started by ccc31807, Feb 10, 2010.

  1. ccc31807

    ccc31807 Guest

    A normal task: sorting a large data file by some criterion, breaking
    it into sub-files, and sending each sub-file to a particular client
    based on the criterion.

    During the next several weeks, I've been tasked with taking three data
    files, comparing the keys of each file, and if the keys are identical,
    processing the file but if not, printing out a list of differences,
    which in effect means printing out the different keys. The keys are
    all seven digit integers. (Each file is to be generated by a different
    query of the same database.)

    Okay, I could use diff for this, but I'd like to do it
    programmatically. Using brute force, I could generate three files with
    just the keys and compare them line by line, but I'd like not to do
    this for several reasons, but mostly because the data files are pretty
    much guaranteed to be identical and we don't expect there to be any
    differences.

    I'm thinking about hashing the keys in the three files and comparing
    the key digests, with the assumption that identical hashes mean
    identical files.

    Ideas?

    Thanks, CC.
     
    ccc31807, Feb 10, 2010
    #1

  2. ccc31807 <> wrote:
    >During the next several weeks, I've been tasked with taking three data
    >files, comparing the keys of each file, and if the keys are identical,
    >processing the file but if not, printing out a list of differences,
    >which in effect means printing out the different keys. The keys are
    >all seven digit integers. (Each file is to be generated by a different
    >query of the same database.)

    [...]
    >I'm thinking about hashing the keys in the three files and comparing
    >the key digests, with the assumption that identical hashes mean
    >identical files.


    Seems rather simple and straightforward. Read the keys from each
    file into a hash (be careful to treat them as strings, such that you
    don't run into potential int overflow problems), then compare the hashes
    as described in "perldoc -q intersection".
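
    For illustration, a minimal sketch of that approach. The three key
    lists are hardcoded stand-ins for the keys read from the three files
    (the sample IDs are made up); keys are kept as strings throughout:

```perl
use strict;
use warnings;

# Count in how many lists each key appears, then report the keys that
# are missing from at least one list -- the "difference" described in
# perldoc -q intersection, generalized to three sets.
my @lists = (
    [qw(1000001 1000002 1000003)],
    [qw(1000001 1000002 1000003)],
    [qw(1000001 1000002 1000004)],   # one key differs here
);

my %seen;
for my $list (@lists) {
    $seen{$_}++ for @$list;
}

# keys that do not appear in all three lists
my @diff = sort grep { $seen{$_} != @lists } keys %seen;
print "differing keys: @diff\n";   # differing keys: 1000003 1000004
```

    If all three files agree, @diff comes back empty and you can go
    ahead and process the file.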

    jue
     
    Jürgen Exner, Feb 10, 2010
    #2

  3. ccc31807 wrote:
    >
    > During the next several weeks, I've been tasked with taking three data
    > files, comparing the keys of each file, and if the keys are identical,
    > processing the file but if not, printing out a list of differences,


    With what information? Just the name of the key that fails to appear in
    all files, or do you have to identify which one or two out of three it
    appears in?

    > which in effect means printing out the different keys. The keys are
    > all seven digit integers. (Each file is to be generated by a different
    > query of the same database.)


    Since you already got it in a database, how about something like:

    select key, count(1) from (union all of all three queries) group by key
    having count(1) != 3;

    (union all rather than plain union, since union would collapse the
    duplicate keys that the count relies on.)

    > Okay, I could use diff for this, but I'd like to do it
    > programmatically. Using brute force, I could generate three files with
    > just the keys and compare them line by line, but I'd like not to do
    > this for several reasons, but mostly because the data files are pretty
    > much guaranteed to be identical and we don't expect there to be any
    > differences.


    That reason doesn't make much sense. The fact that the files are pretty
    much guaranteed to be identical can be used to argue against *any*
    proposed method, not just the line-by-line method.

    > I'm thinking about hashing the keys in the three files and comparing
    > the key digests, with the assumption that identical hashes mean
    > identical files.


    I don't know of any hashing functions that have both a very low chance
    of collision, and are indifferent to the order in which the strings are
    added into it. And if you have to sort the keys so they are in the same
    order, then you might as well do the line by line thing.

    Xho
     
    Xho Jingleheimerschmidt, Feb 11, 2010
    #3
  4. On 2010-02-10 14:36, ccc31807 <> wrote:

    [comparing three files]

    > Okay, I could use diff for this, but I'd like to do it
    > programmatically.


    diff isn't a program?

    > Using brute force, I could generate three files with
    > just the keys and compare them line by line, but I'd like not to do
    > this for several reasons, but mostly because the data files are pretty
    > much guaranteed to be identical and we don't expect there to be any
    > differences.


    If the files are "pretty much guaranteed to be identical" you could just
    compute a hash for each file and compare the hashes. If they are the
    same, you are done. Only if they aren't (which is "pretty much
    guaranteed" not to happen) do you need to worry about finding the
    differences.
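
    A sketch of that fast path using the core Digest::SHA module. The
    three sample files are created inline here purely for illustration;
    in real use they would be the three query outputs, and this assumes
    the queries produce byte-identical files when the keys agree:

```perl
use strict;
use warnings;
use Digest::SHA;

# Create three tiny sample files standing in for the query outputs.
# The first two are identical; the third differs by one key.
my %contents = (
    'sample1.dat' => "1000001\n1000002\n",
    'sample2.dat' => "1000001\n1000002\n",
    'sample3.dat' => "1000001\n1000003\n",
);
while (my ($file, $data) = each %contents) {
    open my $fh, '>', $file or die "Cannot write $file: $!";
    print $fh $data;
    close $fh;
}

# Fast path: one digest per file instead of a key-by-key comparison.
sub file_digest {
    my ($file) = @_;
    return Digest::SHA->new('sha1')->addfile($file)->hexdigest;
}

my @files   = sort keys %contents;
my @digests = map { file_digest($_) } @files;

if ($digests[0] eq $digests[1] && $digests[1] eq $digests[2]) {
    print "all files identical, proceed with processing\n";
} else {
    print "digests differ, fall back to a key-by-key comparison\n";
}
```

    In the expected case the digests match and you are done in O(n) with
    almost no memory; only on a mismatch do you pay for the detailed
    comparison.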

    hp
     
    Peter J. Holzer, Feb 11, 2010
    #4
  5. ccc31807

    ccc31807 Guest

    On Feb 11, 7:20 am, "Peter J. Holzer" <> wrote:
    > > Okay, I could use diff for this, but I'd like to do it
    > > programmatically.

    >
    > diff isn't a program?


    I process the (main) file with a Perl script, and I don't want to do
    in two steps what I can do in one, that is, by including a function in
    the existing script to compare the three files.

    > If the files are "pretty much guaranteed to be identical" you could just
    > compute a hash for each file and compare the hashes. If they are the
    > same, you are done. Only if they aren't (which is "pretty much
    > guaranteed" not to happen) do you need to worry about finding the
    > differences.


    As it turns out, with a couple of days' experience and several
    attempts, I wound up creating three hashes, one for each file, with
    the IDs as keys and the name of the file as the values. I iterate
    through the 'main' hash, and if the hash element exists in all three
    hashes I delete it from each. I then print the hashes. It's kinda
    crude, but it was easy to do, doesn't take long, and gives me what I
    need.
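
    That approach can be sketched roughly like this; the IDs and file
    names are made-up stand-ins for the real data:

```perl
use strict;
use warnings;

# Three hashes, one per file: IDs as keys, file name as the value.
my %main   = map { $_ => 'main.dat'   } qw(1000001 1000002 1000003);
my %second = map { $_ => 'second.dat' } qw(1000001 1000002);
my %third  = map { $_ => 'third.dat'  } qw(1000001 1000002 1000004);

# Iterate the 'main' hash; delete from all three any ID that exists in
# all three. (Safe: keys %main is evaluated once, before the loop.)
for my $id (keys %main) {
    if (exists $second{$id} && exists $third{$id}) {
        delete $main{$id};
        delete $second{$id};
        delete $third{$id};
    }
}

# Whatever remains are the IDs that do not appear in every file.
for my $h (\%main, \%second, \%third) {
    print "$_ (in $h->{$_})\n" for sort keys %$h;
}
```

    For the sample data this leaves 1000003 in %main and 1000004 in
    %third, i.e. the two keys that fail to appear in all three files.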

    Thanks, CC.
     
    ccc31807, Feb 11, 2010
    #5
  6. ccc31807

    ccc31807 Guest

    On Feb 10, 11:21 pm, Xho Jingleheimerschmidt <>
    wrote:
    > With what information?  Just the name of the key that fails to appear in
    > all files, or do you have to identify which one or two out of three it
    > appears in?


    Just the key.

    > Since you already got it in a database, how about something like:


    Unfortunately, this is a non-SQL, non-relational, non-first-normal-
    form flat file database (IBM's UniData) over a WAN connection, and
    it's a lot more practical to glob the data and process it locally.


    > > this for several reason but mostly because the data files are pretty
    > > much guaranteed to be identical and we don't expect there to be any
    > > differences.

    >
    > That reason doesn't make much sense.  The fact that the files are pretty
    > much guaranteed to be identical can be used to argue against *any*
    > proposed method, not just the line-by-line method.


    See my reply to PJH. The 'official' query is highly impractical for my
    unit, and we have written two other queries to replace it. We just
    want to make sure that the data derived from all three queries is the
    same before we make any changes.

    > I don't know of any hashing functions that have both a very low chance
    > of collision, and are indifferent to the order in which the strings are
    > added into it.  And if you have to sort the keys so they are in the same
    > order, then you might as well do the line by line thing.


    Obviously, the keys would have to be in order. As it turns out, the
    size of the files is much smaller than I anticipated, so O(n) works
    just fine.

    CC.
     
    ccc31807, Feb 11, 2010
    #6
  7. On 2010-02-11 14:50, ccc31807 <> wrote:
    > On Feb 11, 7:20 am, "Peter J. Holzer" <> wrote:
    >> If the files are "pretty much guaranteed to be identical" you could just
    >> compute a hash for each file and compare the hashes. If they are the
    >> same, you are done. Only if they aren't (which is "pretty much
    >> guaranteed" not to happen) do you need to worry about finding the
    >> differences.

    >
    > As it turns out, with a couple of days experience and several
    > attempts, I wound up creating three hashes,


    I just realized that my use of the word "hash" was ambiguous: I meant
    result of a strong hash-function such as SHA-1, not a Perl hash.

    hp
     
    Peter J. Holzer, Feb 11, 2010
    #7
  8. ccc31807

    ccc31807 Guest

    On Feb 11, 11:46 am, "Peter J. Holzer" <> wrote:
    > I just realized that my use of the word "hash" was ambiguous: I meant
    > result of a strong hash-function such as SHA-1, not a Perl hash.


    That's okay. I figured out what you meant.

    CC.
     
    ccc31807, Feb 11, 2010
    #8
