Fast iterative reads


Paul Branon

Hypothetically speaking:


Say I have a file containing 100,000 numbers, and each number
corresponds to a database ID,

and I have another file which contains all of the database IDs plus
the name of the file that corresponds to each.
(Why couldn't I do an SQL query which puts all the info in one file?
Because I don't have access to the database!)
So, I have these two files! (The second file is much larger than the
first, say 300,000 records or more.)
Now, the only way that I can match database IDs to filenames is to
traverse list one and, for each entry in list one, iteratively
traverse list two until a match is made.
(In shell you would just use grep.) What I'm doing is:

while (<FILENAME>) {
    if ( $number =~ /\b$othernumber\b/ ) {
        carryoutmatch_behaviour();
    }
}

It works. But it's relatively slow! (perhaps not surprisingly!)


Question: How would you solve the problem (more quickly)?
 

Jim Gibson

Ben Morrow said:
Use a hash. Read the second file into a hash mapping id to filename,
then you can quickly look up the filename when you iterate over the
first file. If you run out of memory (not likely with only 300_000
records) look into one of the DBM_File modules.
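
A minimal sketch of that hash approach, assuming the second file holds
whitespace-separated "id filename" lines (the file names here are made up):

use strict;
use warnings;

my %filename_for;    # id => filename

open my $long, '<', 'longlist.txt' or die "longlist.txt: $!";
while (<$long>) {
    my ($id, $filename) = split ' ';
    $filename_for{$id} = $filename;
}
close $long;

open my $short, '<', 'shortlist.txt' or die "shortlist.txt: $!";
while (my $id = <$short>) {
    chomp $id;
    print "$id => $filename_for{$id}\n" if exists $filename_for{$id};
}
close $short;

This reads each file exactly once: building the hash is one pass over the
long file, and each lookup afterwards is constant time.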

Alternatively, if you run out of memory, sort the two files by ID.
Then, read a line from file one and start reading lines from file two.
If the ID from file 2 is the same as the ID from file 1, use the line
and read the next line from file 1. If the ID is less, discard it and
read the next line from file 2. If the ID is greater, the ID doesn't
exist in file 2, so skip it and read the next line from file 1.
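
In code, that merge might look like this (a sketch, assuming numeric IDs,
both files pre-sorted numerically by ID, and the same made-up layouts as
above):

use strict;
use warnings;

open my $short, '<', 'shortlist.sorted' or die "shortlist.sorted: $!";
open my $long,  '<', 'longlist.sorted'  or die "longlist.sorted: $!";

my $want = <$short>;   # current ID line from file 1
my $have = <$long>;    # current "id filename" line from file 2

while (defined $want and defined $have) {
    chomp(my $id = $want);
    my ($long_id, $filename) = split ' ', $have;

    if ($long_id == $id) {          # match: use it, advance file 1
        print "$id => $filename\n";
        $want = <$short>;
    }
    elsif ($long_id < $id) {        # too small: discard, advance file 2
        $have = <$long>;
    }
    else {                          # $id not in file 2: skip, advance file 1
        $want = <$short>;
    }
}
close $short;
close $long;

This keeps only one line from each file in memory at a time.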
 

sln

Paul Branon said:

Hypothetically speaking:


Say I have a file containing 100,000 numbers, and each number
corresponds to a database ID,

and I have another file which contains all of the database IDs plus
the name of the file that corresponds to each.
(Why couldn't I do an SQL query which puts all the info in one file?
Because I don't have access to the database!)
So, I have these two files! (The second file is much larger than the
first, say 300,000 records or more.)
Now, the only way that I can match database IDs to filenames is to
traverse list one and, for each entry in list one, iteratively
traverse list two until a match is made.
(In shell you would just use grep.) What I'm doing is:

while (<FILENAME>) {
    if ( $number =~ /\b$othernumber\b/ ) {
        carryoutmatch_behaviour();
    }
}

It works. But it's relatively slow! (perhaps not surprisingly!)


Question: How would you solve the problem (more quickly)?

Why would you need file-1? Do you need to keep track of the 100,000
numbers? I don't see why.

File-2 looks to be two fields; since it's 300,000 records, there may
be multiple filenames per ID.
But that doesn't matter.
And with only 300,000 records it doesn't matter as far as memory goes,
either; otherwise an in-file lookup of indexed positions might be more
efficient.

Track it all:

my %dbhash;

# parse1() and parse2() stand in for whatever splits your records
while (<file1>) {
    my ($control, $id) = parse1($_);
    $dbhash{$id}->[0] = $control if !exists $dbhash{$id};
}
while (<file2>) {
    my ($id, @filenames) = parse2($_);
    push @{ $dbhash{$id} }, @filenames if exists $dbhash{$id};
}

# look up a couple of test IDs
for my $testid ('123', '653') {
    if (exists $dbhash{$testid}) {
        my ($control, @filenames) = @{ $dbhash{$testid} };
        print "id: $testid, control: $control\n";
        print "  $_\n" for @filenames;
    }
}
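
Where parse1() and parse2() might be as simple as this (the record
layouts, "control id" and "id filename ...", are guesses):

sub parse1 {    # file 1: "control id" per line (assumed layout)
    my ($line) = @_;
    chomp $line;
    return split ' ', $line;
}

sub parse2 {    # file 2: "id filename [filename ...]" (assumed layout)
    my ($line) = @_;
    chomp $line;
    return split ' ', $line;
}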

-sln
 

Paul Branon

I like all these ideas. It's a shame the sort and discard idea won't
work in this instance because the long list comes from the same
database as the short list! (Why couldn't the original query contain
all the relevant data? In theory it could. In practice they don't do
it that way.)

As for storing the data from the long list in a hash, isn't it the
case that I'd need an array of hashes? Otherwise each iterative read
of the longlist would overwrite the entries from the previous list.
And worse still, at the end of the file the hash would be empty again.
(So to access my data I would need to have stored the entries all in
an array.)



open(LONGLIST, "<$databaselonglist") or die "$!\n";
while (<LONGLIST>) {
    $ItemID = "";
    chomp;

    # You can get the Item-ID: split on spaces, it's array item [1]
    @getitemid        = split(/\s+/, $_);  # print " Item-ID $getitemid[1]\n";
    @getfileextension = split(/\./,  $_);  # print " FileExtension $getfileextension[1]\n";
    if ( $getitemid[1] =~ /^\d+/ ) {
        $ItemID = $getitemid[1];
    }

    # this needs to be conditional
    %hash_table = ('ItemID'        => $getitemid[1],
                   'FileExtension' => $getfileextension[1]);

    # prints key names plus input data
    foreach $key (keys %hash_table) {
        print $key, ",", $hash_table{$key}, "\n";
    }
}
 

Paul Branon

Note: there is an error above in that the hash table is called my_hash
in one place and hash_table in another. That's just from my
transposition of the code into the forum.
 

sln

Paul Branon said:

I like all these ideas. It's a shame the sort and discard idea won't
work in this instance because the long list comes from the same
database as the short list!

I take it the "short list" is unique IDs and
the "long list" is ID/filename pairs,
where the "long list" contains duplicate IDs but
unique ID/filename pairs?
(Why couldn't the original query contain
all the relevant data? In theory it could. In practice they don't do
it that way.)

As for storing the data from the long list in a hash, isn't it the
case that I'd need an array of hashes?

An AoH is:

@array = (
    {}, {}, {},
);

Is this what you mean?
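
(For contrast, what my earlier sketch builds is a single hash of arrays,
one array per ID; the values here are made up for illustration:)

%dbhash = (
    '123' => [ 'control-a', 'file1.dat', 'file2.dat' ],
    '653' => [ 'control-b', 'file3.dat' ],
);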
Otherwise each iterative read
of the longlist would overwrite the entries from the previous list.

Wasn't the idea to avoid iterative reads of the "long list"?
And worse still, at the end of the file the hash would be empty again.
(So to access my data I would have needed to have stored the entries
all in an array)

Are you using the "short list" to query the "long list"?
Do you expect the "long list" IDs to be unique?
In other words, one ID per filename?
If not, is a duplicate an error, or is it possible that an
ID can have multiple filenames?
open(LONGLIST, "<$databaselonglist") or die "$!\n";
while (<LONGLIST>) {
    $ItemID = "";
    chomp;

    # You can get the Item-ID: split on spaces, it's array item [1]
    @getitemid        = split(/\s+/, $_);  # print " Item-ID $getitemid[1]\n";
    @getfileextension = split(/\./,  $_);  # print " FileExtension $getfileextension[1]\n";
    if ( $getitemid[1] =~ /^\d+/ ) {
        $ItemID = $getitemid[1];
    }

    # this needs to be conditional
    %hash_table = ('ItemID'        => $getitemid[1],
                   'FileExtension' => $getfileextension[1]);

    # prints key names plus input data
    foreach $key (keys %hash_table) {
        print $key, ",", $hash_table{$key}, "\n";
    }
}

Is this pseudocode? Because it stops compiling after too many
errors to display.

Post a small data sample exhibiting all the facets of the shape you
expect.

-sln
 

Jim Gibson

Paul Branon said:
I like all these ideas. It's a shame the sort and discard idea won't
work in this instance because the long list comes from the same
database as the short list! (Why couldn't the original query contain
all the relevant data? In theory it could. In practice they don't do
it that way.)

Please be aware that most people in this group do not use Google Groups
to read and post. As such, you need to give a little context in each
post so that people will know what you are talking about.
As for storing the data from the long list in a hash, isn't it the
case that I'd need an array of hashes? Otherwise each iterative read
of the longlist would overwrite the entries from the previous list.
And worse still, at the end of the file the hash would be empty again.
(So to access my data I would have needed to have stored the entries
all in an array)

You said you had two files, one long and one short, and the task was to
match up database IDs in the two files. So, no, you don't need an array
of hashes. The suggestion was to read the longer file into one hash,
indexed by ID, one time only, then read the shorter file and see if the
corresponding entry (by ID) exists in the hash. As long as the longer
file doesn't change, you don't need to read it again. You never
overwrite the hash, because you only create it one time. The hash will
never become empty, since after it has been created you only read from
it. If your database IDs in the long file are unique and you use them
as keys, no hash entry will be overwritten.

You should have:

use strict;
use warnings;
open(LONGLIST, "<$databaselonglist") or die "$!\n";

You should use the 3-argument form of open.
while (<LONGLIST>) {
    $ItemID = "";
    chomp;

    # You can get the Item-ID: split on spaces, it's array item [1]
    @getitemid        = split(/\s+/, $_);  # print " Item-ID $getitemid[1]\n";
    @getfileextension = split(/\./,  $_);  # print " FileExtension $getfileextension[1]\n";
    if ( $getitemid[1] =~ /^\d+/ ) {
        $ItemID = $getitemid[1];
    }

    # this needs to be conditional
    %hash_table = ('ItemID'        => $getitemid[1],
                   'FileExtension' => $getfileextension[1]);

The above hash only holds data for one database record. A better format
would be to have one hash entry per record, with the database ID as key
and the file name as value. So the above line should be:

$hash_table{$getitemid[1]} = $getfileextension[1];

    # prints key names plus input data
    foreach $key (keys %hash_table) {
        print $key, ",", $hash_table{$key}, "\n";
    }
}
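
Putting those suggestions together, a corrected sketch of the whole
thing might look like this (the file names are hypothetical, and the
field layout is the assumed one from the original: the ID is the
second whitespace-separated field, and the extension follows the
first dot):

use strict;
use warnings;

my $databaselonglist  = 'longlist.txt';    # hypothetical file names
my $databaseshortlist = 'shortlist.txt';

my %ext_for;    # database ID => file extension, one entry per record

open my $long, '<', $databaselonglist
    or die "Cannot open $databaselonglist: $!";
while (my $line = <$long>) {
    chomp $line;
    my $item_id = (split /\s+/, $line)[1];   # ID is the second field
    my $ext     = (split /\./,  $line)[1];   # extension follows the first dot
    $ext_for{$item_id} = $ext if defined $item_id and $item_id =~ /^\d+$/;
}
close $long;

open my $short, '<', $databaseshortlist
    or die "Cannot open $databaseshortlist: $!";
while (my $id = <$short>) {
    chomp $id;
    print "$id,$ext_for{$id}\n" if exists $ext_for{$id};
}
close $short;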

There are other ways in which your code could be improved. Keep reading
this newsgroup to find out what they are.
 

Xho Jingleheimerschmidt

Paul said:
I like all these ideas.

What ideas? You haven't quoted any context.

It's a shame the sort and discard idea won't
work in this instance because the long list comes from the same
database as the short list! (Why couldn't the original query contain
all the relevant data? In theory it could. In practice they don't do
it that way.)

As for storing the data from the long list in a hash, isn't it the
case that I'd need an array of hashes? Otherwise each iterative read
of the longlist would overwrite the entries from the previous list.

Why would you read the long list repeatedly? The whole point of storing
it is that you only need to read it once.

<Unreadable code snipped>

Xho
 
