Compare two extremely large lists?

Joe Young

I have a list of several thousands of numerical ids


and in another file I have a database dump of hundreds of thousands of
records


I need to parse the first list, and with each id select the
corresponding record from the database dump.



file1
20121
2193403
334
4343
43434
3535340
948548
34543
And so on.......



file 2
72371 more.jpg green No Friday
034 Leicester.png Yes
8213 sport.jpeg No Saturday Two Pass
2313 feline.jpg Yes Wednesday
 
smallpond

I have a list of several thousands of numerical ids

and in another file I have a database dump of hundreds of thousands of
records

I need to parse the first list, and with each id select the
corresponding record from the database dump.

file1
20121
2193403
334
4343
43434
3535340
948548
34543
And so on.......

file 2
72371  more.jpg green No Friday
034     Leicester.png Yes
8213   sport.jpeg No Saturday Two Pass
2313   feline.jpg Yes Wednesday


Why would you not use the database for this? That's what they're for.

You can put it all in a hash in memory using id as a key. Several
hundred thousand short lines of text is only a few MB.
 
Peter Makholm

Joe Young said:
I have a list of several thousands of numerical ids

Several thousand ids is, in my opinion, not necessarily an extremely large
list. I would just do the naïve thing: parse file2 into a hash, then
read file1 line by line and output the relevant data from the
hash.

Based on your examples, that should be doable using a meager 1MB of memory
for storing the data in-memory.
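A minimal sketch of that approach, assuming the id is the first whitespace-separated field of each dump line. The helper name select_records and the in-memory sample data (borrowed from the test data posted later in the thread) are illustrative only; real code would open file1 and file2 instead.

```perl
use strict;
use warnings;

# Hypothetical helper: index the dump by id, then look up each wanted id.
sub select_records {
    my ($ids_fh, $dump_fh) = @_;

    # Pass 1: load the dump (file2) into a hash keyed by its leading id.
    my %record_for;
    while (my $line = <$dump_fh>) {
        chomp $line;
        my ($id) = $line =~ /^\s*(\d+)\b/ or next;   # skip lines without an id
        $record_for{$id} = $line;
    }

    # Pass 2: read the id list (file1) line by line and collect the hits.
    my @found;
    while (my $id = <$ids_fh>) {
        chomp $id;
        $id =~ s/^\s+|\s+$//g;                        # trim stray whitespace
        next unless length $id;
        push @found, $record_for{$id} if exists $record_for{$id};
    }
    return @found;
}

# Demo on in-memory data; swap these for open(..., '<', 'file1') etc.
my $ids  = "334\n4343\n20121\n34543\n";
my $dump = "334 Leicester.png Yes\n"
         . "2313 feline.jpg Yes Wednesday\n"
         . "4343 buzzaldrin.jpg Yes\n"
         . "20121 bounty.png Yes Monday\n";

open my $ids_fh,  '<', \$ids  or die "Cannot open in-memory id list: $!";
open my $dump_fh, '<', \$dump or die "Cannot open in-memory dump: $!";

print "$_\n" for select_records($ids_fh, $dump_fh);
```

Records come out in file1 order, and ids with no record in the dump (34543 here) are silently skipped. Note that the hash compares ids as strings, so "034" and "34" would count as different ids.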

//Makholm
 
J. Gleixner

Joe said:
I have a list of several thousands of numerical ids


and in another file I have a database dump of hundreds of thousands of
records


I need to parse the first list, and with each id select the
corresponding record from the database dump.

open file1 for reading.
while reading file1, line by line:
    parse the line for the ID
    store the ID as a key in a hash
close file1.

open file2 for reading.
while reading through file2, line by line:
    parse the line for the id and the record information
    print the record information if the id exists as a key in the file1 hash
close file2.

perldoc perlopentut



Or, insert all ids from file1 into a table and use
the database to select the record information for
all rows where the ids match.
 
Skye Shaw!@#$

I have a list of several thousands of numerical ids

and in another file I have a database dump of hundreds of thousands of
records

I need to parse the first list, and with each id select the
corresponding record from the database dump.

If that's all you have to do, try using join:

Skyes-MacBook-Pro-15:~ sshaw$ sort file1 > sorted1 # lines should be sorted lexically (not with -n), as join expects
Skyes-MacBook-Pro-15:~ sshaw$ sort file2 > sorted2
Skyes-MacBook-Pro-15:~ sshaw$ join sorted1 sorted2
334 Leicester.png Yes
4343 feline.jpg Yes Wednesday

-Skye
 
Andrzej Adam Filip

Joe Young said:
I have a list of several thousands of numerical ids


and in another file I have a database dump of hundreds of thousands of
records


I need to parse the first list, and with each id select the
corresponding record from the database dump.



file1
20121
2193403
334
4343
43434
3535340
948548
34543
And so on.......



file 2
72371 more.jpg green No Friday
034 Leicester.png Yes
8213 sport.jpeg No Saturday Two Pass
2313 feline.jpg Yes Wednesday

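my %Keys;
# Remember every id listed in file1.
open( my $F1,'<','file1') or die "Cannot open file1: $!";
while(<$F1>) {
  chomp; $Keys{$_}++;
}
# Print each file2 record whose leading id was seen in file1.
open( my $F2,'<','file2') or die "Cannot open file2: $!";
while(<$F2>) {
  die unless /^(\d+)\s+(\S.*)$/;
  print if $Keys{$1};
}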
 
Joe Young

my %Keys;
open( my $F1,'<','file1') or die;
while(<$F1>) {
  chomp; $Keys{$_}++;
}

open( my $F2,'<','file2') or die;
while(<$F2>) {
  die unless /^(\d+)\s+(\S.*)$/;
  print if $Keys{$1};
}


Thanks Andrzej,

[1] It could be useful to test-run code before posting. (A bracket was
missing in the regex as originally posted.) Very helpful post though. Thanks.
[2] The line print if $Keys{$1}:
what is that saying exactly? Print if there is an entry in the
first entry in each key pair?

Because (using the test data below) it should print 8 lines of
data and it only prints 5
it ignores
2313 feline.jpg Yes Wednesday
8213 sport.jpeg No Saturday Two Pass
72371 more.jpg green No Friday
for no good reason that I can see. Those lines have keys the same as
any others.






I've changed the data for testing to
file1
334
4343
20121
34543
43434
948548
2193403
3535340

file2
334 Leicester.png Yes
2313 feline.jpg Yes Wednesday
4343 buzzaldrin.jpg Yes
8213 sport.jpeg No Saturday Two Pass
20121 bounty.png Yes Monday
43434 peckerwood.jpeg
72371 more.jpg green No Friday
2193403 go_green.jpg No Wednesday One
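
A quick way to check this is to list which file2 ids actually appear in file1. The sketch below hard-codes the test data above rather than reading the files, and the names %wanted, @matched and @unmatched are illustrative only. The test $wanted{$id} plays the same role as $Keys{$1} in the script: it is true only when the id captured from the start of a file2 line was also seen in file1.

```perl
use strict;
use warnings;

# Test data copied from the lists above.
my @file1 = qw(334 4343 20121 34543 43434 948548 2193403 3535340);
my @file2 = (
    '334 Leicester.png Yes',
    '2313 feline.jpg Yes Wednesday',
    '4343 buzzaldrin.jpg Yes',
    '8213 sport.jpeg No Saturday Two Pass',
    '20121 bounty.png Yes Monday',
    '43434 peckerwood.jpeg',
    '72371 more.jpg green No Friday',
    '2193403 go_green.jpg No Wednesday One',
);

# Same idea as %Keys: every id from file1 becomes a hash key.
my %wanted = map { $_ => 1 } @file1;

my (@matched, @unmatched);
for my $line (@file2) {
    my ($id) = $line =~ /^(\d+)/ or next;   # leading digits are the id
    if ($wanted{$id}) { push @matched,   $id }
    else              { push @unmatched, $id }
}

printf "matched:   %s\n", join ' ', @matched;
printf "unmatched: %s\n", join ' ', @unmatched;
```

On this data it reports five matched ids (334 4343 20121 43434 2193403) and three unmatched ones (2313 8213 72371), so the three skipped records are exactly those whose ids are not in file1.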
 
Joe Young

Scrub that last post!!

The ids are not the same! There's been a mixup in my sort!

Sorry for the waste of time!!
 
