Faster file iteration

Discussion in 'Perl Misc' started by vijay@iavian.com, Mar 13, 2008.

  1. Guest

    use strict;

    my $file_1 = '1.txt'; # File 1
    my $file_2 = '2.txt'; # File 2

    if(open(FH1 , $file_1)){
        print "File $file_1 Opened\n";
    }else{
        print "Failed to Open file $file_1\n";
        exit;
    }

    if(open(FH2 , $file_2)){
        print "File $file_2 Opened\n";
    }else{
        print "Failed to Open file $file_2\n";
        close FH1;
        exit;
    }

    while(chomp(my $line_2 = <FH2>)){
        my($dummy21,$file21_no,$file21_date) = split(/\s+/,$line_2);
        next if($file21_no !~ /\d+/);
        my $counter1 = 0;
        my $least_date1 = 0;
        seek(FH1,0,0);
        $least_date1 = date_compare($file21_date);
        while(chomp(my $line_1 = <FH1>)){
            my($d,$file1_no,$file1_date) = split(/;/,$line_1);
            if($file1_no == $file21_no){
                $file1_date =~/(\d\d\d\d)(\d\d)(\d\d)/;
                my $yr1 = $1;
                $file21_date =~/(\d\d\d\d)(\d\d)(\d\d)/;
                if(($yr1 - $1) < 5){
                    $counter1++;
                }
            }
        }
        $least_date1 = 0 if($counter1 == 0);
        print "$dummy21\t$file21_no\t$file21_date\t$counter1\t$least_date1\n";
        print FH3 "$dummy21\t$file21_no\t$file21_date\t$counter1\t$least_date1\n";
    }

    Here $file_1 has around 12000000 records, and it takes 2 minutes to
    process a single record in $file_2.

    Any suggestions to make it faster?
    , Mar 13, 2008
    #1

  2. On Thu, 13 Mar 2008 06:41:59 -0700, wrote:

    > Here $file_1 has around 12000000 records , it takes 2 mins to go for a
    > single record in $file_2.
    >
    > Any suggestion to make it fast ?


    Read file_1 once and store it in an appropriate data structure (a hash
    comes to mind). It may still take two minutes to read, but after that
    searching is fast.

    This does take some memory, but 12 million records should take less
    than 100 Megs.

    M4
    Martijn Lievaart, Mar 13, 2008
    #2
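
The suggestion in post #2 can be sketched roughly as follows. This is a minimal illustration, not the poster's code: the sub name build_index and the sample records are hypothetical, and the field layout (semicolon-separated, file number in the second field) is taken from the original script. An in-memory filehandle stands in for the large file.

```perl
use strict;
use warnings;

# Build the index in one pass: file number => number of records carrying it.
# build_index is a hypothetical helper name used only for this sketch.
sub build_index {
    my ($fh) = @_;
    my %count_for;
    while (my $line = <$fh>) {
        chomp $line;
        my (undef, $file_no) = split /;/, $line;
        next unless defined $file_no && length $file_no;
        $count_for{$file_no}++;
    }
    return \%count_for;
}

# Hypothetical sample data standing in for the 12-million-line file.
my $data = "a;1001;20000101\nb;1002;19990505\nc;1001;20010310\n";
open my $fh, '<', \$data or die "Cannot open in-memory file: $!";
my $index = build_index($fh);
close $fh;

# Each lookup is now a single hash access instead of a full rescan.
for my $wanted (qw(1001 1002 9999)) {
    my $hits = $index->{$wanted} // 0;
    print "$wanted\t$hits\n";
}
```

The file is read exactly once; every record of the second file then costs one hash lookup rather than another pass over all 12 million lines.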

  3. Guest

    On Mar 13, 7:52 pm, bugbear <bugbear@trim_papermule.co.uk_trim> wrote:
    > wrote:
    >
    > > Here $file_1 has around 12000000 records , it takes 2 mins to go for a
    > > single record in $file_2.

    >
    > > Any suggestion to make it fast ?

    >
    > Are the two files in date-sorted order?
    >
    > BugBear


    No, they are not sorted by date, and there is no unique key.
    , Mar 13, 2008
    #3
  4. Guest

    "" <> wrote:
    > use strict;
    >
    > my $file_1 = '1.txt'; # File 1
    > my $file_2 = '2.txt'; # File 2
    >
    > if(open(FH1 , $file_1)){
    > print "File $file_1 Opened\n";
    > }else{
    > print "Failed to Open file $file_1\n";
    > exit;
    > }
    >
    > if(open(FH2 , $file_2)){
    > print "File $file_2 Opened\n";
    > }else{
    > print "Failed to Open file $file_2\n";
    > close FH1;
    > exit;
    > }
    >
    > while(chomp(my $line_2 = <FH2>)){
    > my($dummy21,$file21_no,$file21_date) = split(/\s+/,$line_2);
    > next if($file21_no !~ /\d+/);
    > my $counter1 = 0;
    > my $least_date1 = 0;
    > seek(FH1,0,0);
    > $least_date1 = date_compare($file21_date);
    > while(chomp(my $line_1 = <FH1>)){
    > my($d,$file1_no,$file1_date) = split(/;/,$line_1);
    > if($file1_no == $file21_no){


    You could pre-load file1 into a hash (keyed by $file1_no) of lists of
    the lines that have that $file1_no. That way, for each line in file2,
    you only need to go through those lines of file1 that already meet the
    above condition. This by itself should greatly improve things, unless
    most of the data shares the same one or just a few $file1_no values.


    > $file1_date =~/(\d\d\d\d)(\d\d)(\d\d)/;
    > my $yr1 = $1;
    > $file21_date =~/(\d\d\d\d)(\d\d)(\d\d)/;
    > if(($yr1 - $1) < 5){
    > $counter1++;
    > }


    And within a given $file1_no hashed list, you could sort by file1_date,
    that way once you meet a non-qualifying date you could abort the loop
    early rather than testing all the rest. (This improvement would probably
    be quite small, compared to the previous one)


    Xho

    --
    -------------------- http://NewsReader.Com/ --------------------
    The costs of publication of this article were defrayed in part by the
    payment of page charges. This article must therefore be hereby marked
    advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate
    this fact.
    , Mar 13, 2008
    #4
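
A rough sketch of the two ideas in post #4, a hash of lists keyed by file number and a date-sorted list that allows an early exit. The sub names load_dates and count_recent and the sample rows are hypothetical; the year test mirrors the `($yr1 - $yr2) < 5` check in the original script, and skipping dates that are not eight digits is an assumption of this sketch.

```perl
use strict;
use warnings;

# file number => array of YYYYMMDD dates, sorted ascending after loading.
sub load_dates {
    my ($fh) = @_;
    my %dates_for;
    while (my $line = <$fh>) {
        chomp $line;
        my (undef, $file_no, $date) = split /;/, $line;
        # Assumption: records with a '0' or empty date can never qualify.
        next unless defined $date && $date =~ /^\d{8}$/;
        push @{ $dates_for{$file_no} }, $date;
    }
    @$_ = sort { $a <=> $b } @$_ for values %dates_for;
    return \%dates_for;
}

# Count dates whose year is less than 5 years after the reference year.
sub count_recent {
    my ($dates, $ref_date) = @_;
    my $ref_year = substr $ref_date, 0, 4;
    my $n = 0;
    for my $date (@{ $dates || [] }) {
        # The list is sorted, so once one date fails all later ones fail too.
        last if substr($date, 0, 4) - $ref_year >= 5;
        $n++;
    }
    return $n;
}

# Hypothetical sample rows in the semicolon-separated layout of file 1.
my $data = "x;5464354;19990101\ny;5464354;19821111\n"
         . "z;5464354;0\nw;7654321;20080225\n";
open my $fh, '<', \$data or die "Cannot open in-memory file: $!";
my $dates_for = load_dates($fh);
close $fh;

print count_recent($dates_for->{5464354}, '19811111'), "\n";
print count_recent($dates_for->{1111111}, '19811111'), "\n";
```

As the post notes, the early exit matters much less than the hash itself: the hash already cuts each lookup down to the handful of records that share the file number.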
  5. "" <> wrote:
    [code snipped]
    Thank you for posting the code. But what is it _supposed_ to do?
    What are the requirements? Unless you tell us we can't know if you are doing
    something unnecessary in your code.

    >Here $file_1 has around 12000000 records , it takes 2 mins to go for a
    >single record in $file_2.
    >
    >Any suggestion to make it fast ?


    Give us a spec and maybe someone will be able to come up with a better
    algorithm.

    jue
    Jürgen Exner, Mar 13, 2008
    #5
  6. <> wrote in message
    news:...
    ....
    > Here $file_1 has around 12000000 records , it takes 2 mins to go for a
    > single record in $file_2.
    >
    > Any suggestion to make it fast ?


    Obvious answer: If you have the memory, read file1 into memory
    and process it from there.

    Mario
    Mario D'Alessio, Mar 13, 2008
    #6
  7. Guest

    On Mar 13, 9:23 pm, Jürgen Exner <> wrote:
    > Give us a spec and maybe someone will be able to come up with a better
    > algorithm.



    the specs

    We have two files. The first file, say 'one.txt', has data arranged in
    three columns separated by semicolons, something like this:

    1234567;7654321;20080225
    1234765;5464354;19821111
    342312A;5464354;19990101
    ABC12;9876544;0
    I002222;ACD222;19991130
    .........

    Note that the three columns are not of fixed length. The first two
    columns are of a maximum length 7 and can contain alpha-numerals.
    The third column is the date column (in YYYYMMDD format). It can also
    contain '0' or can be empty too.

    The second file, say 'two.txt', also has three columns, separated by
    spaces, something like:

    serialno fileno date
    123 1234567 20080315
    2 2233442 20081130
    311 1232231 20031221
    44 1232123 19990831
    23 2131312 20000101
    132 5464354 19811111
    ......

    The entire file contains only numerals from the second line onwards.
    The first column is 1 to 3 digits long. The second column is always
    exactly 7 digits long. The third column is the date column, strictly
    in YYYYMMDD format.

    Now, the requirement would be to add two additional columns in
    'two.txt'. The fourth and fifth columns will be tab separated and
    labeled 'label4' and 'label5' respectively.

    The values to be populated under 'label4' should be computed as
    follows:

    Read the 7-digit number in the second column (under fileno) of
    'two.txt'. Compare it with the alphanumeric value in the second
    column of each line of 'one.txt'. On finding an exact match,
    increment a counter, and repeat for the remaining lines of
    'one.txt'. The fourth column should then be populated with the
    final value of the counter for that fileno, which is the number of
    exact matches found. If no match is found, populate the entry with
    '0' (zero). There is one condition to check before counting a
    match: the date difference for that row must be less than or equal
    to 5 years. To check this, take the date next to that fileno in
    'two.txt' and the date in the third column of the matching line in
    'one.txt', and compute the difference. If the difference is more
    than 5 years, do not increment the counter. *NOTE: the date in
    file 'one.txt' is always greater (or later) than the corresponding
    date in 'two.txt'. The dates range from 19900101 to 20041231 in
    file 'two.txt' and from 19750101 to 20011225 in file 'one.txt'.

    In the above example, the new 'two.txt' will look something like

    serialnumber fileno date label4
    123 1234567 20080315 0
    2 2233442 20081130 0
    311 1232231 20031221 0
    44 1232123 19990831 0
    23 2131312 20000101 0
    132 5464354 19811111 1

    *Label5: We know that the dates in 'one.txt' end on 12/25/2001. For
    every matched file number in 'two.txt', please do the following:
    1. If the date in 'two.txt' is earlier than 12/25/2001 by 5 years
    or more, mark it as 5 yrs.
    2. If it is between 12/25/1996 and 12/25/2001, mark the exact
    difference in years, months and days.
    3. If it is later than 12/25/2001, up to 12/31/2004, mark the exact
    difference in years, months and days, with a '-' (minus sign) in
    front of it.
    , Mar 14, 2008
    #7
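
The label4 rule from the spec in post #7 could be sketched like this, using the sample rows given there. Several things are assumptions of the sketch rather than part of the spec: the sub name label4, treating "within 5 years" as "the 'one.txt' date lies between the 'two.txt' date and the same day five years later", and ignoring 'one.txt' dates that are '0' or empty. File numbers are used as hash keys, so alphanumeric values such as ACD222 are compared as strings rather than with `==`.

```perl
use strict;
use warnings;

# Sample rows copied from the spec in post #7.
my @one = (
    '1234567;7654321;20080225', '1234765;5464354;19821111',
    '342312A;5464354;19990101', 'ABC12;9876544;0',
    'I002222;ACD222;19991130',
);
my @two = (
    'serialno fileno date', '123 1234567 20080315', '2 2233442 20081130',
    '311 1232231 20031221', '44 1232123 19990831', '23 2131312 20000101',
    '132 5464354 19811111',
);

# Index 'one.txt' once: second column => list of valid YYYYMMDD dates.
my %dates_for;
for my $row (@one) {
    my (undef, $file_no, $date) = split /;/, $row;
    next unless defined $date && $date =~ /^\d{8}$/;   # skip '0' and empty
    push @{ $dates_for{$file_no} }, $date;
}

# Count matching records dated at most 5 years after the 'two.txt' date.
# Adding 50000 to a YYYYMMDD number gives the same day five years later.
sub label4 {
    my ($file_no, $date) = @_;
    return scalar grep { $_ >= $date && $_ <= $date + 50000 }
                  @{ $dates_for{$file_no} || [] };
}

my @out;
for my $row (@two) {
    my ($serial, $file_no, $date) = split ' ', $row;
    if ($file_no !~ /^\d{7}$/) {          # header line
        push @out, "$row label4";
        next;
    }
    push @out, join ' ', $serial, $file_no, $date, label4($file_no, $date);
}
print "$_\n" for @out;
```

On the sample data this yields 0 for every row except `132 5464354 19811111`, which gets 1, in line with the example table in the post. Label5 is not covered here.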
  8. Guest

    On Mar 14, 4:33 pm, bugbear <bugbear@trim_papermule.co.uk_trim> wrote:
    >
    > Overall performance is O(NlogN) + O(N) + O(NlogN) which is O(NlogN)
    > which is rather better than your present O(N^2)
    >
    > BugBear


    Any suggestions on using Thread?

    #!/usr/bin/perl

    use strict;
    #use Data::Dumper;
    #use CGI;
    use Date::Calc qw(Delta_YMD);
    use Thread;

    my $file_1 = '1.txt'; # File 1
    my $file_2 = '2.txt'; # File 2
    my $file_3 = 'f.txt'; # Final output file

    if(open(FH1 , $file_1)){
        print "File $file_1 Opened\n";
        close(FH1);
    }else{
        print "Failed to Open file $file_1\n";
        exit;
    }

    if(open(FH2 , $file_2)){
        print "File $file_2 Opened\n";
    }else{
        print "Failed to Open file $file_2\n";
        close FH1;
        exit;
    }
    if(open(FH3,">$file_3")){
        print "File $file_3 Opened\n";
        print FH3 "serialno\tfileno\tdate\tlabel4\tlabel5\n";
    }else{
        print "Failed to Open file $file_3\n";
        close FH1;close FH2;
        exit;
    }

    while(my $line_2 = <FH2>){
        chomp($line_2);print $line_2."\n";
        my($dummy,$file2_no,$file2_date) = split(/\s+/,$line_2);
        next if($file2_no !~ /\d+/);
        my $counter = 0;
        my $least_date = date_compare($file2_date);
        my $thr = new Thread \&traverse, $dummy,$file2_no,$file2_date,
            $counter,$least_date;
        #$counter = traverse($file2_no,$file2_date);
    }
    sleep(500);
    close FH1;
    close FH2;
    close FH3;

    sub traverse{
        my($dummy,$file2_no,$file2_date,$counter,$least_date) = @_;
        my $counter = 0;
        open(FHT , $file_1);
        seek(FHT,0,0);
        while(my $line_1 = <FHT>){
            chomp($line_1);
            my ($d,$file1_no,$file1_date) = split(/;/,$line_1);
            if($file1_no == $file2_no){
                #print $file1_date."=".$file2_date."\n";
                if((date_compare5($file1_date,$file2_date)) == 1){
                    $counter++;
                }
            }
        }
        close(FHT);
        $least_date = 0 if($counter == 0);
        print "$dummy\t$file2_no\t$file2_date\t$counter\t$least_date\n";
        print FH3 "$dummy\t$file2_no\t$file2_date\t$counter\t$least_date\n";
        return $counter;
    }
    sub date_compare5{ # Comparison for 5 Years
        my($date_1,$date_2) = @_;
        $date_1 =~/(\d\d\d\d)(\d\d)(\d\d)/;
        my $yr1 = $1;

        $date_2 =~/(\d\d\d\d)(\d\d)(\d\d)/;
        my $yr2 = $1;

        #print "$yr1=$mn1=$dt1: ";print "$yr2=$mn2=$dt2\n";
        if(($yr1 - $yr2) < 5){
            #print "$yr1=$mn1=$dt1: ";print "$yr2=$mn2=$dt2\n";
            return 1;
        }
        return -1;
    }
    sub date_compare{ # Comparison for actual date , return 1 if date1 is
                      # big otherwise -1 , if equal then 0
        my($date_1) = @_;
        $date_1 =~/(\d\d\d\d)(\d\d)(\d\d)/;
        my($yr1,$mn1,$dt1) = ($1,$2,$3);

        if($yr1 < 1996){
            return "5 Yrs";
        }elsif($yr1 == 1996 && $mn1 < 12){
            return "5 Yrs";
        }elsif($yr1 == 1996 && $mn1 == 12 && $dt1 <= 25 ){
            return "5 Yrs";
        }elsif($yr1 < 2001 && $yr1 > 1996){
            return delta($yr1,$mn1,$dt1);
        }elsif($yr1 == 1996 && $mn1 == 12 && $dt1 >=25){
            return delta($yr1,$mn1,$dt1);
        }elsif($yr1 == 2001 && $mn1 < 12 ){
            return delta($yr1,$mn1,$dt1);
        }elsif($yr1 == 2001 && $mn1 == 12 && $dt1 <=24){
            return delta($yr1,$mn1,$dt1);
        }elsif($yr1 > 2001){
            return delta($yr1,$mn1,$dt1);
        }elsif($yr1 == 2001 && $mn1 == 12 && $dt1 > 24 ){
            return delta($yr1,$mn1,$dt1);
        }else{
            return "No case ".$date_1;
        }
    }
    sub delta{
        my $yr = shift;my $mn = shift; my $dt= shift;
        ($yr,$mn,$dt) = Delta_YMD($yr,$mn,$dt,2001,12,25);
        return "$yr-$mn-$dt";
    }
    , Mar 26, 2008
    #8
  9. Guest

    "" <> wrote:
    > On Mar 14, 4:33 pm, bugbear <bugbear@trim_papermule.co.uk_trim> wrote:
    > >
    > > Overall performance is O(NlogN) + O(N) + O(NlogN) which is O(NlogN)
    > > which is rather better than your present O(N^2)
    > >
    > > BugBear

    >
    > Any suggestions on using Thread?


    God, I hope not. It seems like you want to try every bad way to solve
    this problem. What about the suggestions you already received--ones that
    would actually work and make things fast?

    Xho

    , Mar 26, 2008
    #9
  10. Ben Morrow Guest

    Quoth "" <>:
    > On Mar 14, 4:33 pm, bugbear <bugbear@trim_papermule.co.uk_trim> wrote:
    > >
    > > Overall performance is O(NlogN) + O(N) + O(NlogN) which is O(NlogN)
    > > which is rather better than your present O(N^2)

    >
    > Any suggestions on using Thread?


    Thread.pm is deprecated: it supported the old 5005-threads threading
    model, which never worked right and was removed in perl 5.10. Thread.pm
    is now just a passthrough to threads.pm; new code should use threads.pm
    directly.

    Ben
    Ben Morrow, Mar 26, 2008
    #10