please help with creating a special iterator

Discussion in 'Perl Misc' started by Mathematisch, Oct 18, 2010.

  1. Mathematisch

    Mathematisch Guest

    Hi,

    The problem: I would like to create an iterator to iterate through a
    csv file with the following structure:


    field_1,field_2,...field_14
    field_1,field_2,...field_14
    (...)



    Note that this is a csv file with 14 fields and it is already sorted
    by field_1 and then by field_2. There are usually only 5-10 lines
    having the same field_1 and field_2 value.

    There could be up to hundreds of millions of lines in the file. The
    desired iterator should work like this: At each "next_entry" call, the
    iterator should return a reference to an array of the lines having the
    identical field_1 and field_2 values.

    Because of my lack of understanding the iterator concept, I could not
    come up with a solution yet. The file is too big to use the field_1
    and field_2 as a hash key to achieve the same goal of grouping the
    entries.

    Thank you very much for any help on this. I hope I can learn from the
    eventual proposed solutions.

    Kind regards.
    F.
    Mathematisch, Oct 18, 2010
    #1

  2. Mathematisch

    J. Gleixner Guest

    Mathematisch wrote:
    > Hi,
    >
    > The problem: I would like to create an iterator to iterate through a
    > csv file with the following structure:
    >
    >
    > field_1,field_2,...field_14
    > field_1,field_2,...field_14
    > (...)
    >
    >
    >
    > Note that this is a csv file with 14 fields and it is already sorted
    > by field_1 and then by field_2. There are usually only 5-10 lines
    > having the same field_1 and field_2 value.
    >
    > There could be up to hundreds of millions of lines in the file. The
    > desired iterator should work like this: At each "next_entry" call, the
    > iterator should return a reference to an array of the lines having the
    > identical field_1 and field_2 values.
    >
    > Because of my lack of understanding the iterator concept, I could not
    > come up with a solution yet. The file is too big to use the field_1
    > and field_2 as a hash key to achieve the same goal of grouping the
    > entries.


    You don't say what you want to do with the data. One option is
    to store everything in a database; then, using GROUP BY and
    ORDER BY, you could process your data easily.

    However, since you say that everything is already sorted
    by those keys, you could process things as you read the
    file, keeping track of when those fields change. Wrapping a
    next_entry around this, so that it returns the data from the
    process_data call, would be simple enough. I rarely
    bother with creating an 'iterator', but that's just me. :)

    Hopefully you're using Text::CSV or some other module to
    parse the CSV file.

    my ( $prev_f1, $prev_f2, @data );
    while( my $line = <> )
    {
        chomp( $line );
        # parse_line_somehow() is a placeholder for your CSV parser
        my ( $f1, $f2, @fields ) = parse_line_somehow( $line );

        if( $f1 eq $prev_f1 && $f2 eq $prev_f2 )
        {
            push( @data, \@fields );
        }
        else
        {
            process_data( $prev_f1, $prev_f2, \@data );
            $prev_f1 = $f1;
            $prev_f2 = $f2;
            undef @data;
            push( @data, \@fields );
        }
    }
    process_data( $prev_f1, $prev_f2, \@data ) if @data;

    sub process_data
    {
        my $f1 = shift;
        my $f2 = shift;
        my $data_aref = shift;

        # do whatever you want...
    }
    J. Gleixner, Oct 18, 2010
    #2

  3. Mathematisch

    J. Gleixner Guest

    J. Gleixner wrote:
    [...]
    > my ( $prev_f1, $prev_f2, @data );
    > while( my $line = <> )
    > {
    > chomp( $line );
    > my ( $f1, $f2, @fields ) = parse-line-somehow();
    >


    FYI: Just looking at this again.

    Dumb bug here: on the first line read, $prev_f1 and $prev_f2
    are still undefined, so the test always goes to the else and
    process_data is called with empty data. You'll have to modify
    this if() accordingly.

    > if( $f1 eq $prev_f1 && $f2 eq $prev_f2 )
    > {
    > push( @data, \@fields );
    > }
    > else
    > {
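
    One possible way to modify the test, as a minimal sketch: only
    flush a group when a previous key exists and differs from the
    current one. The in-memory sample data, the naive split (standing
    in for a real CSV parser such as Text::CSV) and the @groups array
    are assumptions for illustration, not part of the original code.

    ```perl
    use strict;
    use warnings;

    # Hypothetical sample input standing in for the real file.
    my $csv = join "\n", 'a,1,x', 'a,1,y', 'a,2,z', 'b,1,w', '';
    open my $fh, '<', \$csv or die "Cannot open in-memory file: $!";

    my ( $prev_f1, $prev_f2, @data, @groups );
    while ( my $line = <$fh> ) {
        chomp $line;
        # Naive split; assumes no quoted fields or embedded commas.
        my ( $f1, $f2, @fields ) = split /,/, $line;

        # A key change on any line but the first closes the current group.
        if ( defined $prev_f1 && ( $f1 ne $prev_f1 || $f2 ne $prev_f2 ) ) {
            process_data( $prev_f1, $prev_f2, [@data] );
            @data = ();
        }
        ( $prev_f1, $prev_f2 ) = ( $f1, $f2 );
        push @data, \@fields;
    }
    # Flush the last group, if the input was not empty.
    process_data( $prev_f1, $prev_f2, [@data] ) if @data;

    sub process_data {
        my ( $f1, $f2, $data_aref ) = @_;
        # Here we only record the key and the group size.
        push @groups, [ $f1, $f2, scalar @$data_aref ];
    }
    ```

    With this arrangement process_data is never called with an empty
    group, so it needs no extra check of its own.
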
    J. Gleixner, Oct 18, 2010
    #3
  4. Mathematisch

    Guest

    On Mon, 18 Oct 2010 11:09:11 -0500, "J. Gleixner" <> wrote:

    >Mathematisch wrote:
    >> Hi,
    >>
    >> The problem: I would like to create an iterator to iterate through a
    >> csv file with the following structure:
    >>
    >>
    >> field_1,field_2,...field_14
    >> field_1,field_2,...field_14
    >> (...)
    >>
    >>
    >>
    >> Note that this is a csv file with 14 fields and it is already sorted
    >> by field_1 and then by field_2. There are usually only 5-10 lines
    >> having the same field_1 and field_2 value.
    >>
    >> There could be up to hundreds of millions of lines in the file. The
    >> desired iterator should work like this: At each "next_entry" call, the
    >> iterator should return a reference to an array of the lines having the
    >> identical field_1 and field_2 values.
    >>
    >> Because of my lack of understanding the iterator concept, I could not
    >> come up with a solution yet. The file is too big to use the field_1
    >> and field_2 as a hash key to achieve the same goal of grouping the
    >> entries.

    >
    >You don't say what you want to do with the data, however
    >you could store everything into a database, then using
    >group by, order by, you could process your data easily.
    >
    >However, since you say that everything is already sorted
    >by those keys, you could process things as you read the
    >file, keeping track of when those fields change. Throwing a
    >next_entry around this and having it return the data
    >of calling process_data, would be simple enough, I rarely
    >bother with creating an 'iterator'.. but that's just me.. :)
    >
    >Hopefully you're using Text::CSV or some other module to
    >parse the CSV file.
    >
    >my ( $prev_f1, $prev_f2, @data );
    >while( my $line = <> )
    >{
    > chomp( $line );
    > my ( $f1, $f2, @fields ) = parse-line-somehow();
    >
    > if( $f1 eq $prev_f1 && $f2 eq $prev_f2 )
    > {
    > push( @data, \@fields );
    > }
    > else
    > {
    > process_data( $prev_f1, $prev_f2, \@data );
    > $prev_f1 = $f1;
    > $prev_f2 = $f2;
    > undef @data;
    > push( @data, \@fields );
    > }
    >}
    >process_data( $prev_f1, $prev_f2, \@data ) if @data;
    >
    >sub process_data
    >{
    > my $f1 = shift;
    > my $f2 = shift;
    > my $data_aref = shift;
    >
    > # do whatever you want...
    >}


    Since it's set up to process_data() on every
    non-match (including the first line), the check could
    be in the function.

    ----------

    my ( $prev_f1, $prev_f2, @data );
    while( my $line = <> )
    {
    ...
    }
    process_data( $prev_f1, $prev_f2, \@data ); # if @data;

    sub process_data
    {
        my ( $f1, $f2, $data_aref ) = @_;
        return unless @{$data_aref};

        if ( @{$data_aref} > 1 ) {
            # process multiple records (all with same f1 f2 val's)
        }
        else {
            # process single record (or not)
        }
    }

    -sln
    , Oct 19, 2010
    #4
  5. Mathematisch wrote:
    > Hi,
    >
    > The problem: I would like to create an iterator to iterate through a
    > csv file with the following structure:
    >
    >
    > field_1,field_2,...field_14
    > field_1,field_2,...field_14
    > (...)
    >
    > Note that this is a csv file with 14 fields and it is already sorted
    > by field_1 and then by field_2. There are usually only 5-10 lines
    > having the same field_1 and field_2 value.


    What is usually the case is of precious little value. If the unusual
    case causes ICBMs to be erroneously launched, where is the comfort in
    the fact that this is unusual? What is the *maximum plausible* number
    of lines with the same field_1 and field_2?

    > There could be up to hundreds of millions of lines in the file. The
    > desired iterator should work like this: At each "next_entry" call, the
    > iterator should return a reference to an array of the lines having the
    > identical field_1 and field_2 values.
    >
    > Because of my lack of understanding the iterator concept, I could not
    > come up with a solution yet. The file is too big to use the field_1
    > and field_2 as a hash key to achieve the same goal of grouping the
    > entries.


    package whatever;
    sub new {
        shift; # not meant for subclassing
        my $path = shift;
        open my $fh, '<', $path or die "Cannot open $path: $!";
        # read ahead one line; it is the first line of the next group
        my $x = <$fh>;
        chomp $x if defined $x;
        return bless [ $fh, $x ];
    }

    sub next_entry {
        my $this = shift;
        my $fh   = $this->[0];
        return unless defined $this->[1];    # end of file already reached
        my @return = $this->[1];
        my @line   = split /,/, $this->[1];
        while (1) {
            $this->[1] = <$fh>;
            return [@return] unless defined $this->[1];
            chomp $this->[1];
            my @line2 = split /,/, $this->[1];
            # a key change ends the group; the line stays buffered
            return [@return]
                unless $line2[0] eq $line[0] and $line2[1] eq $line[1];
            push @return, $this->[1];
        }
    }
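
    The same read-ahead idea can also be sketched as a closure
    instead of a blessed array. The name make_group_iterator, the
    sample data and the in-memory filehandle are hypothetical, and
    the split on commas again assumes no quoted fields.

    ```perl
    use strict;
    use warnings;

    # Returns a closure. Each call yields an array ref of the raw lines
    # sharing the first two fields, or nothing at end of input.
    sub make_group_iterator {
        my ($fh) = @_;
        my $pending = <$fh>;    # read-ahead: first line of the next group
        chomp $pending if defined $pending;

        return sub {
            return unless defined $pending;
            my @group = ($pending);
            my ( $k1, $k2 ) = split /,/, $pending;
            while ( defined( $pending = <$fh> ) ) {
                chomp $pending;
                my ( $f1, $f2 ) = split /,/, $pending;
                # Key changed: leave the line in $pending for the next call.
                last if $f1 ne $k1 or $f2 ne $k2;
                push @group, $pending;
            }
            return \@group;
        };
    }

    # Hypothetical sample input standing in for the real file.
    my $csv = join "\n", 'a,1,x', 'a,1,y', 'a,2,z', 'b,1,w', '';
    open my $fh, '<', \$csv or die "Cannot open in-memory file: $!";

    my $next_entry = make_group_iterator($fh);
    my @sizes;
    while ( my $group = $next_entry->() ) {
        push @sizes, scalar @$group;    # number of lines in this group
    }
    ```

    Only one group plus one buffered line is held in memory at a
    time, so the size of the file does not matter, only the size of
    the largest group.
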



    >
    > Thank you very much for any help on this. I hope I can learn from the
    > eventual proposed solutions.
    >
    > Kind regards.
    > F.
    >
    >
    >
    Xho Jingleheimerschmidt, Oct 19, 2010
    #5
