please help with creating a special iterator

Mathematisch · Oct 18, 2010

Hi,

The problem: I would like to create an iterator to iterate through a
csv file with the following structure:

field_1,field_2,...field_14
field_1,field_2,...field_14
(...)

Note that this is a csv file with 14 fields and it is already sorted
by field_1 and then by field_2. There are usually only 5-10 lines
having the same field_1 and field_2 value.

There could be up to hundreds of millions of lines in the file. The
desired iterator should work like this: At each "next_entry" call, the
iterator should return a reference to an array of the lines having the
identical field_1 and field_2 values.

Because of my lack of understanding the iterator concept, I could not
come up with a solution yet. The file is too big to use the field_1
and field_2 as a hash key to achieve the same goal of grouping the
entries.

Thank you very much for any help on this. I hope I can learn from the
eventual proposed solutions.

Kind regards.
F.

J. Gleixner · Oct 18, 2010

Mathematisch said:
Hi,

The problem: I would like to create an iterator to iterate through a
csv file with the following structure:

field_1,field_2,...field_14
field_1,field_2,...field_14
(...)

Note that this is a csv file with 14 fields and it is already sorted
by field_1 and then by field_2. There are usually only 5-10 lines
having the same field_1 and field_2 value.

There could be up to hundreds of millions of lines in the file. The
desired iterator should work like this: At each "next_entry" call, the
iterator should return a reference to an array of the lines having the
identical field_1 and field_2 values.

Because of my lack of understanding the iterator concept, I could not
come up with a solution yet. The file is too big to use the field_1
and field_2 as a hash key to achieve the same goal of grouping the
entries.

You don't say what you want to do with the data. One option
is to store everything in a database; with GROUP BY and
ORDER BY you could then process your data easily.

However, since you say that everything is already sorted
by those keys, you could process things as you read the
file, keeping track of when those fields change. Wrapping a
next_entry around this, so that it returns the data that would
otherwise be passed to process_data, would be simple enough.
I rarely bother with creating an 'iterator', but that's just me.

Hopefully you're using Text::CSV or some other module to
parse the CSV file.

my ( $prev_f1, $prev_f2, @data );
while( my $line = <> )
{
    chomp( $line );
    # parse_line_somehow() is a placeholder, e.g. a Text::CSV call
    my ( $f1, $f2, @fields ) = parse_line_somehow( $line );

    if( $f1 eq $prev_f1 && $f2 eq $prev_f2 )
    {
        push( @data, \@fields );
    }
    else
    {
        process_data( $prev_f1, $prev_f2, \@data );
        $prev_f1 = $f1;
        $prev_f2 = $f2;
        undef @data;
        push( @data, \@fields );
    }
}
process_data( $prev_f1, $prev_f2, \@data ) if @data;

sub process_data
{
    my $f1 = shift;
    my $f2 = shift;
    my $data_aref = shift;

    # do whatever you want...
}
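For the next_entry wrapper mentioned above, one possible shape is a closure that keeps a single line of look-ahead. This is only a sketch: make_group_iterator is a hypothetical name, the fields are split naively on commas (an assumption; a module such as Text::CSV would be needed for quoted commas), and every line is assumed to have at least two fields. The sample data is read from an in-memory filehandle just to keep the example self-contained.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a closure that yields one group per call,
# as a reference to an array of the raw lines sharing field_1 and field_2.
sub make_group_iterator {
    my ($fh) = @_;
    my $pending = <$fh>;                   # one line of look-ahead
    chomp $pending if defined $pending;

    return sub {
        return unless defined $pending;    # input exhausted
        my ( $k1, $k2 ) = split /,/, $pending, 3;
        my @group = ($pending);
        while ( defined( $pending = <$fh> ) ) {
            chomp $pending;
            my ( $f1, $f2 ) = split /,/, $pending, 3;
            # Key changed: keep this line buffered for the next call.
            last unless $f1 eq $k1 && $f2 eq $k2;
            push @group, $pending;
        }
        return \@group;
    };
}

# Made-up sample data, sorted by the first two fields.
my $csv = "a,1,x\na,1,y\na,2,z\nb,2,w\n";
open my $fh, '<', \$csv or die "Cannot open in-memory file: $!";

my $next_entry = make_group_iterator($fh);
while ( my $group = $next_entry->() ) {
    print scalar(@$group), " line(s): @$group\n";
}
```

Only one group is held in memory at a time, so the size of the file does not matter, only the size of the largest group.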

J. Gleixner · Oct 18, 2010

J. Gleixner wrote:
[...]

my ( $prev_f1, $prev_f2, @data );
while( my $line = <> )
{
    chomp( $line );
    my ( $f1, $f2, @fields ) = parse-line-somehow();

FYI: Just looking at this again..

Dumb bug in the if() that follows: $prev_f1 and $prev_f2 are
still undefined on the first line read, so it will always go
to the else and call process_data with an empty @data. You'll
have to modify that if() accordingly.
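One way the if() could be modified is sketched below. It is not necessarily what was intended: for_each_group is a hypothetical wrapper, the callback stands in for process_data, and the naive split on commas is an assumption that only holds for fields without quoted commas.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the loop; $process stands in for process_data().
sub for_each_group {
    my ( $fh, $process ) = @_;
    my ( $prev_f1, $prev_f2, @data );

    while ( my $line = <$fh> ) {
        chomp $line;
        my ( $f1, $f2, @fields ) = split /,/, $line, -1;

        # @data is empty only before the first line, so testing it first
        # keeps the undefined $prev_* values out of the comparison.
        if ( @data && $f1 eq $prev_f1 && $f2 eq $prev_f2 ) {
            push @data, \@fields;
            next;
        }

        # Key changed: flush the finished group, but never an empty one.
        $process->( $prev_f1, $prev_f2, [@data] ) if @data;
        ( $prev_f1, $prev_f2 ) = ( $f1, $f2 );
        @data = ( \@fields );
    }
    $process->( $prev_f1, $prev_f2, [@data] ) if @data;
    return;
}

# Made-up sample data, sorted by the first two fields.
my $csv = "a,1,x\na,1,y\nb,1,z\n";
open my $fh, '<', \$csv or die "Cannot open in-memory file: $!";

for_each_group(
    $fh,
    sub {
        my ( $f1, $f2, $rows ) = @_;
        print "$f1,$f2: ", scalar(@$rows), " row(s)\n";
    }
);
```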

sln · Oct 19, 2010

J. Gleixner wrote:
[...]

Since it's set up to process_data() on every
non-match (including the first line), the check could
be in the function.

----------

my ( $prev_f1, $prev_f2, @data );
while( my $line = <> )
{
    ...
}
process_data( $prev_f1, $prev_f2, \@data ); # if @data;

sub process_data
{
    my ( $f1, $f2, $data_aref ) = @_;
    return unless @{$data_aref};    # nothing collected yet (e.g. first line)

    if ( @{$data_aref} > 1 ) {
        # process multiple records (all with same f1 f2 val's)
    }
    else {
        # process single record (or not)
    }
}

-sln

Xho Jingleheimerschmidt · Oct 19, 2010

Mathematisch said:
Hi,

The problem: I would like to create an iterator to iterate through a
csv file with the following structure:

field_1,field_2,...field_14
field_1,field_2,...field_14
(...)

Note that this is a csv file with 14 fields and it is already sorted
by field_1 and then by field_2. There are usually only 5-10 lines
having the same field_1 and field_2 value.

What is usually the case is of precious little value. If the unusual
case causes ICBMs to be erroneously launched, where is the comfort in
the fact that this is unusual? What is the *maximum plausible* number
of lines with the same field_1 and field_2?

There could be up to hundreds of millions of lines in the file. The
desired iterator should work like this: At each "next_entry" call, the
iterator should return a reference to an array of the lines having the
identical field_1 and field_2 values.

Because of my lack of understanding the iterator concept, I could not
come up with a solution yet. The file is too big to use the field_1
and field_2 as a hash key to achieve the same goal of grouping the
entries.

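package whatever;

sub new {
    shift;    # class name ignored; not meant for subclassing
    my $file = shift;
    open my $fh, '<', $file or die "Cannot open $file: $!";
    my $x = <$fh>;             # look-ahead: first line of the next group
    chomp $x if defined $x;    # stays undef if the file is empty
    return bless [ $fh, $x ];
}

sub next_entry {
    my $this = shift;
    my $fh   = $this->[0];
    return unless defined $this->[1];    # no more groups
    my @return = $this->[1];
    my @line   = split /,/, $this->[1];  # naive split; no quoted commas
    while (1) {
        $this->[1] = <$fh>;
        return [@return] unless defined $this->[1];    # end of file
        chomp $this->[1];
        my @line2 = split /,/, $this->[1];
        # different key: leave the line buffered for the next call
        return [@return]
            unless $line2[0] eq $line[0] and $line2[1] eq $line[1];
        push @return, $this->[1];
    }
}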
