Faster file iteration

vijay · Mar 13, 2008

use strict;

my $file_1 = '1.txt'; # File 1
my $file_2 = '2.txt'; # File 2

if(open(FH1 , $file_1)){
    print "File $file_1 Opened\n";
}else{
    print "Failed to Open file $file_1\n";
    exit;
}

if(open(FH2 , $file_2)){
    print "File $file_2 Opened\n";
}else{
    print "Failed to Open file $file_2\n";
    close FH1;
    exit;
}

while(chomp(my $line_2 = <FH2>)){
    my($dummy21,$file21_no,$file21_date) = split(/\s+/,$line_2);
    next if($file21_no !~ /\d+/);
    my $counter1 = 0;
    my $least_date1 = 0;
    seek(FH1,0,0);
    $least_date1 = date_compare($file21_date);
    while(chomp(my $line_1 = <FH1>)){
        my($d,$file1_no,$file1_date) = split(/;/,$line_1);
        if($file1_no == $file21_no){
            $file1_date =~/(\d\d\d\d)(\d\d)(\d\d)/;
            my $yr1 = $1;
            $file21_date =~/(\d\d\d\d)(\d\d)(\d\d)/;
            if(($yr1 - $1) < 5){
                $counter1++;
            }
        }
    }
    $least_date1 = 0 if($counter1 == 0);
    print "$dummy21\t$file21_no\t$file21_date\t$counter1\t$least_date1\n";
    print FH3 "$dummy21\t$file21_no\t$file21_date\t$counter1\t$least_date1\n";
}

Here $file_1 has around 12,000,000 records, and it takes 2 minutes to
process a single record from $file_2.

Any suggestions to make it faster?

Martijn Lievaart · Mar 13, 2008

vijay said:
> Here $file_1 has around 12000000 records , it takes 2 mins to go for a
> single record in $file_2.
>
> Any suggestion to make it fast ?

Read file_1 once and store it in an appropriate data structure (a hash
comes to mind). It may still take two minutes to read, but after that
searching is fast.

This does take some memory, but 12 million records should take less than
100 Megs.

M4
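
A rough sketch of the hash approach suggested above, not a drop-in replacement for the posted script. It assumes the layouts the posted code implies (file 1 split on ';', file 2 split on whitespace) and keeps the original year-only test. The sample data and the names index_file1, count_matches and %dates_for are made up for illustration; in-memory filehandles stand in for 1.txt and 2.txt.

```perl
use strict;
use warnings;

# Hypothetical sample data standing in for 1.txt and 2.txt.
my $file1_data = "1234567;7654321;20080225\n"
               . "1234765;5464354;19821111\n"
               . "342312A;5464354;19990101\n";
my $file2_data = "serialno fileno date\n"
               . "132 5464354 19811111\n"
               . "2 2233442 20081130\n";

# Pass 1: read file 1 once and group its dates by file number.
sub index_file1 {
    my ($fh) = @_;
    my %dates_for;
    while (my $line = <$fh>) {
        chomp $line;
        my (undef, $no, $date) = split /;/, $line;
        next unless defined $no && length $no;
        push @{ $dates_for{$no} }, $date;
    }
    return \%dates_for;
}

# Pass 2: one hash lookup per file-2 record instead of a full rescan of file 1.
sub count_matches {
    my ($fh, $dates_for) = @_;
    my @rows;
    while (my $line = <$fh>) {
        chomp $line;
        my ($serial, $no, $date) = split ' ', $line;
        next unless defined $no && $no =~ /^\d+$/;       # skips the header line
        next unless defined $date && $date =~ /^(\d{4})/;
        my $yr2   = $1;
        my $count = 0;
        for my $date1 (@{ $dates_for->{$no} || [] }) {
            next unless defined $date1 && $date1 =~ /^(\d{4})/;   # skip '0' or empty dates
            $count++ if $1 - $yr2 < 5;                   # same year-only test as the posted code
        }
        push @rows, [$serial, $no, $date, $count];
    }
    return \@rows;
}

open my $fh1, '<', \$file1_data or die "open: $!";
open my $fh2, '<', \$file2_data or die "open: $!";
my $rows = count_matches($fh2, index_file1($fh1));
print join("\t", @$_), "\n" for @$rows;
```

File 1 is read exactly once, so the cost per file-2 record drops from a full scan to one lookup plus a loop over the handful of rows sharing that file number.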

vijay · Mar 13, 2008

BugBear said:
> Are the two files in date-sorted order?

No, they are not sorted by date, and there is no unique key.

xhoster · Mar 13, 2008

vijay said:
> use strict;
>
> my $file_1 = '1.txt'; # File 1
> my $file_2 = '2.txt'; # File 2
>
> if(open(FH1 , $file_1)){
>     print "File $file_1 Opened\n";
> }else{
>     print "Failed to Open file $file_1\n";
>     exit;
> }
>
> if(open(FH2 , $file_2)){
>     print "File $file_2 Opened\n";
> }else{
>     print "Failed to Open file $file_2\n";
>     close FH1;
>     exit;
> }
>
> while(chomp(my $line_2 = <FH2>)){
>     my($dummy21,$file21_no,$file21_date) = split(/\s+/,$line_2);
>     next if($file21_no !~ /\d+/);
>     my $counter1 = 0;
>     my $least_date1 = 0;
>     seek(FH1,0,0);
>     $least_date1 = date_compare($file21_date);
>     while(chomp(my $line_1 = <FH1>)){
>         my($d,$file1_no,$file1_date) = split(/;/,$line_1);
>         if($file1_no == $file21_no){

You could pre-load file1 into a hash (keyed by $file1_no) of lists of
the lines that have that $file1_no. That way, for each line in file2, you
only need to go through those lines of file1 that already meet the
above condition. This by itself should greatly improve things, unless
most of the data shares the same one or just a few $file1_no values.

>             $file1_date =~/(\d\d\d\d)(\d\d)(\d\d)/;
>             my $yr1 = $1;
>             $file21_date =~/(\d\d\d\d)(\d\d)(\d\d)/;
>             if(($yr1 - $1) < 5){
>                 $counter1++;
>             }

And within a given $file1_no hashed list, you could sort by file1_date,
so that once you meet a non-qualifying date you can abort the loop
early rather than testing all the rest. (This improvement would probably
be quite small compared to the previous one.)

Xho

--
-------------------- http://NewsReader.Com/ --------------------
The costs of publication of this article were defrayed in part by the
payment of page charges. This article must therefore be hereby marked
advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate
this fact.
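
The second suggestion above (sort each per-number list by date, then stop at the first non-qualifying date) might be sketched as follows. The %dates_for index and its contents are hypothetical sample data, count_within_5_years is an illustrative name, and the year-only test is carried over from the posted code.

```perl
use strict;
use warnings;

# Hypothetical in-memory index: file number => unsorted list of YYYYMMDD dates.
my %dates_for = (
    5464354 => [19990101, 19821111, 19850630],
    7654321 => [20080225],
);

# Sort every list once, ascending, right after loading.
@$_ = sort { $a <=> $b } @$_ for values %dates_for;

# Count the dates whose year is less than 5 years after the year of $date2.
# Because the list is sorted, the first failing date ends the loop.
sub count_within_5_years {
    my ($dates, $date2) = @_;
    my $yr2   = substr $date2, 0, 4;
    my $count = 0;
    for my $date1 (@$dates) {
        last if substr($date1, 0, 4) - $yr2 >= 5;   # every later date fails too
        $count++;
    }
    return $count;
}

my $n = count_within_5_years($dates_for{5464354} || [], 19811111);
print "$n\n";
```

As noted in the post, the gain from the early exit is likely small next to the gain from the hash itself, since each list is usually short.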

Jürgen Exner · Mar 13, 2008

[code snipped]
Thank you for posting the code. But what is it _supposed_ to do?
What are the requirements? Unless you tell us, we can't know if you are
doing something unnecessary in your code.

vijay said:
> Here $file_1 has around 12000000 records , it takes 2 mins to go for a
> single record in $file_2.
>
> Any suggestion to make it fast ?

Give us a spec and maybe someone will be able to come up with a better
algorithm.

jue

Mario D'Alessio · Mar 13, 2008

....

vijay said:
> Here $file_1 has around 12000000 records , it takes 2 mins to go for a
> single record in $file_2.
>
> Any suggestion to make it fast ?

Obvious answer: If you have the memory, read file1 into memory
and process it from there.

Mario

vijay · Mar 14, 2008

Jürgen Exner said:
> Give us a spec and maybe someone will be able to come up with a better
> algorithm.

the specs

We have two files. The first file, say 'one.txt', has data arranged in
three columns separated by semicolons, something like this:

1234567;7654321;20080225
1234765;5464354;19821111
342312A;5464354;19990101
ABC12;9876544;0
I002222;ACD222;19991130
.........

Note that the three columns are not of fixed length. The first two
columns have a maximum length of 7 and can contain alphanumeric
characters. The third column is the date column (in YYYYMMDD format). It
can also contain '0' or be empty.

The second file, say 'two.txt', also has three columns, separated by
spaces, something like:

serialno fileno date
123 1234567 20080315
2 2233442 20081130
311 1232231 20031221
44 1232123 19990831
23 2131312 20000101
132 5464354 19811111
......

The entire file contains only numerals from the second line onwards. The
first column is 1 to 3 digits long. The second column is strictly
7 digits long. The third column is the date column, strictly in
YYYYMMDD format.

Now, the requirement would be to add two additional columns in
'two.txt'. The fourth and fifth columns will be tab separated and
labeled 'label4' and 'label5' respectively.

The values to be populated under 'label4' should be computed as
follows:

Read the 7-digit number in the second column (under fileno) of
'two.txt'. Compare it with the alphanumeric value in the second column
of 'one.txt'. On finding an exact match, increment a counter. Repeat
this for the subsequent lines of 'one.txt', incrementing the counter
each time you find a match. The fourth column should then be populated
with the final value of the counter for that fileno, which is the
number of exact matches found. If there is no match, populate the entry
with '0' (zero).

There is one condition to check before incrementing the counter: the
date difference for each pair of rows should be less than or equal to
5 years. To check this, take the date next to that fileno in 'two.txt'
and the date from the third column of the matching row in 'one.txt',
and take the difference. If the difference is more than 5 years, do not
increment the counter.

*NOTE: the date in file 'one.txt' is always greater (or later) than the
corresponding date in 'two.txt'. The dates range from 19900101 to
20041231 in file 'two.txt' and from 19750101 to 20011225 in file
'one.txt'.
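
One way the "less than or equal to 5 years" condition could be checked on the full date, sketched under the assumption (taken from the note above) that the 'one.txt' date is never earlier than the 'two.txt' date. The function name within_5_years is made up for illustration. Note that a year-only test such as ($yr1 - $yr2) < 5 rejects a pair that is exactly 5 years apart, which this version accepts.

```perl
use strict;
use warnings;

# True if $date1 (from one.txt) is at most 5 calendar years after $date2
# (from two.txt). Both must be YYYYMMDD; anything else ('0', empty) never matches.
sub within_5_years {
    my ($date1, $date2) = @_;
    return 0 unless defined $date1 && $date1 =~ /^\d{8}$/;
    return 0 unless defined $date2 && $date2 =~ /^\d{8}$/;
    # Adding 50000 to a YYYYMMDD number moves the year forward by 5
    # and leaves the month and day untouched.
    return $date1 <= $date2 + 50000 ? 1 : 0;
}

print within_5_years('19821111', '19811111'), "\n";   # 1 year apart
print within_5_years('19861111', '19811111'), "\n";   # exactly 5 years apart
print within_5_years('19861112', '19811111'), "\n";   # 5 years and 1 day apart
print within_5_years('0',        '19811111'), "\n";   # no usable date
```

Because the file numbers in 'one.txt' can contain letters, the match on fileno itself is probably safer with string equality (eq) than with the numeric == used in the posted code.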

In the above example, the new 'two.txt' will look something like

serialnumber fileno date label4
123 1234567 20080315 0
2 2233442 20081130 0
311 1232231 20031221 0
44 1232123 19990831 0
23 2131312 20000101 0
132 5464354 19811111 1

*Label5: We know that the dates in 'one.txt' end on 12/25/2001. For
every matched file number in 'two.txt', please do the following:
1. If the date in 'two.txt' is earlier than 12/25/2001 by 5 years or
more, mark it as 5 yrs.
2. If it is between 12/25/1996 and 12/25/2001, mark the exact
difference in years, months and days.
3. If it is later than 12/25/2001, up to 12/31/2004, mark the exact
difference in years, months and days, but put a '-' (minus sign) in
front of it.
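
The three label5 rules could be sketched like this. It is only an illustration: the "Y-M-D" output format, the "5 Yrs" text, and the helper names days_in_month, ymd_diff and label5 are assumptions, and the difference is computed with plain borrow arithmetic rather than with a date module.

```perl
use strict;
use warnings;

my $CUTOFF = 20011225;   # last date present in 'one.txt', per the spec

sub days_in_month {
    my ($y, $m) = @_;
    return 29 if $m == 2 && $y % 4 == 0 && ($y % 100 != 0 || $y % 400 == 0);
    return (31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31)[$m - 1];
}

# Years, months and days from $from to $to (both YYYYMMDD, $from <= $to).
# The single borrow is enough here because one end is always the 25th of
# December; it is not a general-purpose date difference.
sub ymd_diff {
    my ($from, $to) = @_;
    my ($y1, $m1, $d1) = $from =~ /^(\d{4})(\d\d)(\d\d)$/;
    my ($y2, $m2, $d2) = $to   =~ /^(\d{4})(\d\d)(\d\d)$/;
    my ($y, $m, $d) = ($y2 - $y1, $m2 - $m1, $d2 - $d1);
    if ($d < 0) {
        # Borrow the length of the month before $to's month.
        my ($py, $pm) = $m2 == 1 ? ($y2 - 1, 12) : ($y2, $m2 - 1);
        $d += days_in_month($py, $pm);
        $m--;
    }
    if ($m < 0) {
        $m += 12;
        $y--;
    }
    return ($y, $m, $d);
}

sub label5 {
    my ($date2) = @_;
    return '5 Yrs' if $date2 <= $CUTOFF - 50000;                                # rule 1
    return sprintf '%d-%d-%d', ymd_diff($date2, $CUTOFF) if $date2 <= $CUTOFF;  # rule 2
    return sprintf '-%d-%d-%d', ymd_diff($CUTOFF, $date2);                      # rule 3
}

print label5($_), "\n" for 19811111, 20000101, 20031221;
```

The posted script uses Delta_YMD from Date::Calc for the same purpose; the hand-rolled version here only avoids the dependency so the sketch runs on a bare perl.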

vijay · Mar 26, 2008

BugBear said:
> Overall performance is O(NlogN) + O(N) + O(NlogN) which is O(NlogN)
> which is rather better than your present O(N^2)

Any suggestions on using Thread?

#!/usr/bin/perl

use strict;
#use Data::Dumper;
#use CGI;
use Date::Calc qw(Delta_YMD);
use Thread;

my $file_1 = '1.txt'; # File 1
my $file_2 = '2.txt'; # File 2
my $file_3 = 'f.txt'; # Final output file

if(open(FH1 , $file_1)){
    print "File $file_1 Opened\n";
    close(FH1);
}else{
    print "Failed to Open file $file_1\n";
    exit;
}

if(open(FH2 , $file_2)){
    print "File $file_2 Opened\n";
}else{
    print "Failed to Open file $file_2\n";
    close FH1;
    exit;
}
if(open(FH3,">$file_3")){
    print "File $file_3 Opened\n";
    print FH3 "serialno\tfileno\tdate\tlabel4\tlabel5\n";
}else{
    print "Failed to Open file $file_3\n";
    close FH1;close FH2;
    exit;
}

while(my $line_2 = <FH2>){
    chomp($line_2);print $line_2."\n";
    my($dummy,$file2_no,$file2_date) = split(/\s+/,$line_2);
    next if($file2_no !~ /\d+/);
    my $counter = 0;
    my $least_date = date_compare($file2_date);
    my $thr = new Thread \&traverse, $dummy,$file2_no,$file2_date,
        $counter,$least_date;
    #$counter = traverse($file2_no,$file2_date);
}
sleep(500);
close FH1;
close FH2;
close FH3;

sub traverse{
    my($dummy,$file2_no,$file2_date,$counter,$least_date) = @_;
    my $counter = 0;
    open(FHT , $file_1);
    seek(FHT,0,0);
    while(my $line_1 = <FHT>){
        chomp($line_1);
        my ($d,$file1_no,$file1_date) = split(/;/,$line_1);
        if($file1_no == $file2_no){
            #print $file1_date."=".$file2_date."\n";
            if((date_compare5($file1_date,$file2_date)) == 1){
                $counter++;
            }
        }
    }
    close(FHT);
    $least_date = 0 if($counter == 0);
    print "$dummy\t$file2_no\t$file2_date\t$counter\t$least_date\n";
    print FH3 "$dummy\t$file2_no\t$file2_date\t$counter\t$least_date\n";
    return $counter;
}
sub date_compare5{ # Comparision for 5 Years
    my($date_1,$date_2) = @_;
    $date_1 =~/(\d\d\d\d)(\d\d)(\d\d)/;
    my $yr1 = $1;

    $date_2 =~/(\d\d\d\d)(\d\d)(\d\d)/;
    my $yr2 = $1;

    #print "$yr1=$mn1=$dt1: ";print "$yr2=$mn2=$dt2\n";
    if(($yr1 - $yr2) < 5){
        #print "$yr1=$mn1=$dt1: ";print "$yr2=$mn2=$dt2\n";
        return 1;
    }
    return -1;
}
sub date_compare{ # Comparision for actual date , return 1 if date1 is big otherwise -1 , if equal then 0
    my($date_1) = @_;
    $date_1 =~/(\d\d\d\d)(\d\d)(\d\d)/;
    my($yr1,$mn1,$dt1) = ($1,$2,$3);

    if($yr1 < 1996){
        return "5 Yrs";
    }elsif($yr1 == 1996 && $mn1 < 12){
        return "5 Yrs";
    }elsif($yr1 == 1996 && $mn1 == 12 && $dt1 <= 25 ){
        return "5 Yrs";
    }elsif($yr1 < 2001 && $yr1 > 1996){
        return delta($yr1,$mn1,$dt1);
    }elsif($yr1 == 1996 && $mn1 == 12 && $dt1 >=25){
        return delta($yr1,$mn1,$dt1);
    }elsif($yr1 == 2001 && $mn1 < 12 ){
        return delta($yr1,$mn1,$dt1);
    }elsif($yr1 == 2001 && $mn1 == 12 && $dt1 <=24){
        return delta($yr1,$mn1,$dt1);
    }elsif($yr1 > 2001){
        return delta($yr1,$mn1,$dt1);
    }elsif($yr1 == 2001 && $mn1 == 12 && $dt1 > 24 ){
        return delta($yr1,$mn1,$dt1);
    }else{
        return "No case ".$date_1;
    }
}
sub delta{
    my $yr = shift;my $mn = shift; my $dt= shift;
    ($yr,$mn,$dt) = Delta_YMD($yr,$mn,$dt,2001,12,25);
    return "$yr-$mn-$dt";
}

xhoster · Mar 26, 2008

vijay said:
> Any suggestions on using Thread?

God, I hope not. It seems like you want to try every bad way to solve
this problem. What about the suggestions you already received--ones that
would actually work and make things fast?

Xho

Ben Morrow · Mar 26, 2008

Quoth vijay:
> Any suggestions on using Thread?

Thread.pm is deprecated: it supported the old 5005threads threading
model, which never worked right and was removed in perl 5.10. Thread.pm
is now just a passthrough to threads.pm; new code should use threads.pm
directly.

Ben
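
For reference, a minimal sketch of what using threads.pm directly looks like, with join collecting each result instead of the fixed sleep(500) in the posted script. The sum_to worker is a made-up stand-in for real per-record work, and the Config check is only there so the sketch also runs on a perl built without thread support.

```perl
use strict;
use warnings;
use Config;

# Hypothetical stand-in for per-record work: sum the integers 1..$n.
sub sum_to {
    my ($n) = @_;
    my $sum = 0;
    $sum += $_ for 1 .. $n;
    return $sum;
}

my @results;
if ($Config{useithreads}) {
    require threads;   # loaded at run time so the script compiles on any perl
    my @workers = map { threads->create(\&sum_to, $_) } (10, 20);
    # join blocks until the thread finishes and hands back its return value.
    @results = map { $_->join } @workers;
}
else {
    # No thread support in this perl: do the same work sequentially.
    @results = map { sum_to($_) } (10, 20);
}
print "@results\n";
```

This only shows the API. As pointed out earlier in the thread, starting one thread per record still rescans file 1 for every record, so the hash-based approach remains the fix for the actual slowness.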
