Compare two extremely large lists?

Discussion in 'Perl Misc' started by Joe Young, Jan 17, 2011.

  1. Joe Young

    Joe Young Guest

    I have a list of several thousands of numerical ids


    and in another file I have a database dump of hundreds of thousands of
    records


    I need to parse the first list, and with each id select the
    corresponding record from the database dump.



    file1
    20121
    2193403
    334
    4343
    43434
    3535340
    948548
    34543
    And so on.......



    file 2
    72371 more.jpg green No Friday
    034 Leicester.png Yes
    8213 sport.jpeg No Saturday Two Pass
    2313 feline.jpg Yes Wednesday
     
    Joe Young, Jan 17, 2011
    #1

  2. smallpond

    smallpond Guest

    On Jan 17, 8:37 am, Joe Young wrote:
    > I have a list of several thousands of numerical ids
    >
    > and in another file I have a database dump of hundreds of thousands of
    > records
    >
    > I need to parse the first list, and with each id select the
    > corresponding record from the database dump.
    >
    > file1
    > 20121
    > 2193403
    > 334
    > 4343
    > 43434
    > 3535340
    > 948548
    > 34543
    > And so on.......
    >
    > file 2
    > 72371  more.jpg green No Friday
    > 034     Leicester.png Yes
    > 8213   sport.jpeg No Saturday Two Pass
    > 2313   feline.jpg Yes Wednesday



    Why would you not use the database for this? That's what databases are for.

    You can put it all in a hash in memory using id as a key. Several
    hundred thousand short lines of text is only a few MB.
     
    smallpond, Jan 17, 2011
    #2

  3. Peter Makholm

    Joe Young writes:

    > I have a list of several thousands of numerical ids


    Several thousand ids is, in my opinion, not necessarily an extremely
    large list. I would just do the naïve thing: parse file2 into a hash,
    then read file1 line by line and output the relevant data from the
    hash.

    Based on your examples that should be doable using a meager 1MB of
    memory for storing the data in-memory.
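
A minimal sketch of the approach described above, assuming each record starts with its numeric id followed by whitespace. The helper name lookup_records and the in-memory stand-ins for file1 and file2 are hypothetical and only there for illustration; the sample lines are taken from the thread.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): index the dump by its leading
# id, then walk the id list and collect the matching records in list order.
sub lookup_records {
    my ($ids_fh, $dump_fh) = @_;

    my %record_for;
    while (my $line = <$dump_fh>) {
        chomp $line;
        # Assumes the id is the run of digits at the start of each record.
        next unless $line =~ /^\s*(\d+)/;
        $record_for{$1} = $line;
    }

    my @found;
    while (my $id = <$ids_fh>) {
        chomp $id;
        # Keys are compared as strings, so "034" and "34" are different ids.
        push @found, $record_for{$id} if exists $record_for{$id};
    }
    return @found;
}

# In-memory stand-ins for file1 and file2, using a few lines from the thread.
my $ids  = "20121\n2193403\n334\n4343\n";
my $dump = "334 Leicester.png Yes\n"
         . "2313 feline.jpg Yes Wednesday\n"
         . "4343 buzzaldrin.jpg Yes\n"
         . "20121 bounty.png Yes Monday\n";

open my $ids_fh,  '<', \$ids  or die "Cannot open id list: $!";
open my $dump_fh, '<', \$dump or die "Cannot open dump: $!";

print "$_\n" for lookup_records($ids_fh, $dump_fh);
```

Hashing the dump costs more memory than hashing the id list, but it returns the records in the order of the ids in file1, which may or may not matter here.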

    //Makholm
     
    Peter Makholm, Jan 17, 2011
    #3
  4. J. Gleixner

    J. Gleixner Guest

    Joe Young wrote:
    > I have a list of several thousands of numerical ids
    >
    >
    > and in another file I have a database dump of hundreds of thousands of
    > records
    >
    >
    > I need to parse the first list, and with each id select the
    > corresponding record from the database dump.


    open file1 for reading
    while reading file1, line by line:
        parse the line for the ID
        store the ID as a key in a hash
    close file1

    open file2 for reading
    while reading through file2, line by line:
        parse the line for the id and the record information
        print the record information if the id exists as a key in the file1 hash
    close file2

    perldoc perlopentut



    Or, insert all ids from file1 into a table and use
    the database to select the record information for
    all rows where the ids match.
     
    J. Gleixner, Jan 17, 2011
    #4
  5. Skye Shaw!@#$

    On Jan 17, 5:37 am, Joe Young wrote:
    > I have a list of several thousands of numerical ids
    >
    > and in another file I have a database dump of hundreds of thousands of
    > records
    >
    > I need to parse the first list, and with each id select the
    > corresponding record from the database dump.


    If that's all you have to do, try using join:

    Skyes-MacBook-Pro-15:~ sshaw$ sort file1 > sorted1 # lines should be sorted (lexically, for join)
    Skyes-MacBook-Pro-15:~ sshaw$ sort file2 > sorted2
    Skyes-MacBook-Pro-15:~ sshaw$ join sorted1 sorted2
    334 Leicester.png Yes
    4343 feline.jpg Yes Wednesday

    -Skye
     
    Skye Shaw!@#$, Jan 17, 2011
    #5
  6. Andrzej Adam Filip

    Joe Young wrote:
    > I have a list of several thousands of numerical ids
    >
    >
    > and in another file I have a database dump of hundreds of thousands of
    > records
    >
    >
    > I need to parse the first list, and with each id select the
    > corresponding record from the database dump.
    >
    >
    >
    > file1
    > 20121
    > 2193403
    > 334
    > 4343
    > 43434
    > 3535340
    > 948548
    > 34543
    > And so on.......
    >
    >
    >
    > file 2
    > 72371 more.jpg green No Friday
    > 034 Leicester.png Yes
    > 8213 sport.jpeg No Saturday Two Pass
    > 2313 feline.jpg Yes Wednesday


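    use strict;
    use warnings;

    # Remember every id listed in file1.
    my %Keys;
    open( my $F1, '<', 'file1' ) or die "Cannot open file1: $!";
    while (<$F1>) {
        chomp;
        $Keys{$_}++;
    }
    close $F1;

    # Print each file2 record whose leading id was seen in file1.
    open( my $F2, '<', 'file2' ) or die "Cannot open file2: $!";
    while (<$F2>) {
        die "Unexpected line in file2: $_" unless /^(\d+)\s+(\S.*)$/;
        print if $Keys{$1};
    }
    close $F2;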

    --
    [pl>en Andrew] Andrzej Adam Filip
    I have a hard time being attracted to anyone who can beat me up.
    -- John McGrath, Atlanta sportswriter, on women weightlifters.
     
    Andrzej Adam Filip, Jan 17, 2011
    #6
  7. Joe Young

    Joe Young Guest


    > my %Keys;
    > open( my $F1,'<','file1') or die;
    > while(<$F1>) {
    >   chomp; $Keys{$_}++;}
    >
    > open( my $F2,'<','file2') or die;
    > while(<$F2>) {
    >   die unless /^\d+)\s+(\S.*)$/;
    >   print if $Keys{$1};
    >
    > }
    >



    Thanks Andrzej,

    [1] It could be useful to do a test run before posting. (There was a
    bracket missing in the regex.) Very helpful post though. Thanks.
    [2] The print if $Keys{$1}:
    what is that saying exactly? Print if there is an entry for the
    first item in each key pair?

    Because (using the test data below) it should print 8 lines of
    data and it only prints 5. It ignores
    2313 feline.jpg Yes Wednesday
    8213 sport.jpeg No Saturday Two Pass
    72371 more.jpg green No Friday
    for no good reason that I can see. Those lines have keys the same as
    any others.






    I've changed the data for testing to
    file1
    334
    4343
    20121
    34543
    43434
    948548
    2193403
    3535340

    file2
    334 Leicester.png Yes
    2313 feline.jpg Yes Wednesday
    4343 buzzaldrin.jpg Yes
    8213 sport.jpeg No Saturday Two Pass
    20121 bounty.png Yes Monday
    43434 peckerwood.jpeg
    72371 more.jpg green No Friday
    2193403 go_green.jpg No Wednesday One
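
A small sketch of what print if $Keys{$1} does, run against the test data above. It assumes the regex with the missing bracket restored; the arrays @printed and @skipped are hypothetical names added only to show which ids pass the test. $1 holds whatever the first pair of parentheses matched (the leading id of the file2 line), and $Keys{$1} is true only if that same id was read from file1.

```perl
use strict;
use warnings;

# The ids from the test file1 and the records from the test file2 above.
my @file1_ids   = qw(334 4343 20121 34543 43434 948548 2193403 3535340);
my @file2_lines = (
    '334 Leicester.png Yes',
    '2313 feline.jpg Yes Wednesday',
    '4343 buzzaldrin.jpg Yes',
    '8213 sport.jpeg No Saturday Two Pass',
    '20121 bounty.png Yes Monday',
    '43434 peckerwood.jpeg',
    '72371 more.jpg green No Friday',
    '2193403 go_green.jpg No Wednesday One',
);

# Each id from file1 becomes a hash key with a true (non-zero) value.
my %Keys;
$Keys{$_}++ for @file1_ids;

my (@printed, @skipped);
for my $line (@file2_lines) {
    # The first parenthesised group captures the leading id into $1.
    next unless $line =~ /^(\d+)\s+(\S.*)$/;
    my $id = $1;
    if ( $Keys{$id} ) { push @printed, $id }   # id was listed in file1
    else              { push @skipped, $id }   # id never appeared in file1
}

print "printed: @printed\n";
print "skipped: @skipped\n";
```

With this data the skipped ids are 2313, 8213 and 72371, none of which appear in the test file1, so five printed lines is what the lookup should give.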
     
    Joe Young, Jan 18, 2011
    #7
  8. Joe Young

    Joe Young Guest

    Scrub that last post!!

    The ids are not the same! There's been a mixup in my sort!

    Sorry for the waste of time!!
     
    Joe Young, Jan 18, 2011
    #8