please help with creating a special iterator

Mathematisch

Hi,

The problem: I would like to create an iterator to iterate through a
csv file with the following structure:


field_1,field_2,...field_14
field_1,field_2,...field_14
(...)



Note that this is a csv file with 14 fields and it is already sorted
by field_1 and then by field_2. There are usually only 5-10 lines
having the same field_1 and field_2 value.

There could be up to hundreds of millions of lines in the file. The
desired iterator should work like this: At each "next_entry" call, the
iterator should return a reference to an array of the lines having the
identical field_1 and field_2 values.

Because of my lack of understanding the iterator concept, I could not
come up with a solution yet. The file is too big to use the field_1
and field_2 as a hash key to achieve the same goal of grouping the
entries.

Thank you very much for any help on this. I hope I can learn from the
eventual proposed solutions.

Kind regards.
F.
 
J. Gleixner

Mathematisch said:
[...]

You don't say what you want to do with the data. However, you could
store everything in a database; then, using GROUP BY and ORDER BY,
you could process your data easily.

Since you say that everything is already sorted by those keys, though,
you could process things as you read the file, keeping track of when
those fields change. Throwing a next_entry around this, and having it
return the data from calling process_data, would be simple enough. I
rarely bother with creating an 'iterator', but that's just me. :)

Hopefully you're using Text::CSV or some other module to
parse the CSV file.

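my ( $prev_f1, $prev_f2, @data );
while ( my $line = <> )
{
    chomp( $line );
    # parse_line_somehow() is a placeholder for your CSV parser
    my ( $f1, $f2, @fields ) = parse_line_somehow( $line );

    if ( $f1 eq $prev_f1 && $f2 eq $prev_f2 )
    {
        # same key as the previous line: add to the current group
        push( @data, \@fields );
    }
    else
    {
        # key changed: hand off the finished group and start a new one
        process_data( $prev_f1, $prev_f2, \@data );
        $prev_f1 = $f1;
        $prev_f2 = $f2;
        undef @data;
        push( @data, \@fields );
    }
}
# hand off the last group once the input is exhausted
process_data( $prev_f1, $prev_f2, \@data ) if @data;

sub process_data
{
    my $f1        = shift;
    my $f2        = shift;
    my $data_aref = shift;    # array ref of array refs (fields 3..14)

    # do whatever you want...
}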
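
As a rough sketch of what "throwing a next_entry around this" might look like, the grouping loop can be wrapped in a closure that keeps one look-ahead row between calls. The name make_group_iterator and the plain split on commas are assumptions for illustration only; a real CSV parser such as Text::CSV would replace the split.

```perl
use strict;
use warnings;

# Hypothetical constructor: returns a next_entry code ref for a filehandle
# whose lines are sorted by their first two comma-separated fields.
sub make_group_iterator {
    my ($fh) = @_;
    my $pending;    # look-ahead row that belongs to the next group

    my $read_row = sub {
        my $line = <$fh>;
        return unless defined $line;
        chomp $line;
        return [ split /,/, $line, -1 ];    # assumption: no quoted commas
    };

    return sub {
        my $first = defined $pending ? $pending : $read_row->();
        return unless defined $first;       # end of file
        undef $pending;
        my @group = ($first);
        while ( defined( my $row = $read_row->() ) ) {
            if ( $row->[0] eq $first->[0] && $row->[1] eq $first->[1] ) {
                push @group, $row;          # same key: extend the group
            }
            else {
                $pending = $row;            # first row of the next group
                last;
            }
        }
        return \@group;
    };
}

# Demo on an in-memory filehandle.
my $csv = "a,1,x\na,1,y\na,2,z\nb,1,w\n";
open my $fh, '<', \$csv or die "Cannot open in-memory file: $!";
my $next_entry = make_group_iterator($fh);
my @sizes;
while ( my $group = $next_entry->() ) {
    push @sizes, scalar @$group;
}
print "@sizes\n";    # group sizes in file order
```

Only one group plus a single look-ahead row is held in memory at a time, so the size of the file should not matter as long as each group is small.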
 
J. Gleixner

J. Gleixner wrote:
[...]
my ( $prev_f1, $prev_f2, @data );
while ( my $line = <> )
{
    chomp( $line );
    my ( $f1, $f2, @fields ) = parse_line_somehow( $line );

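FYI: Just looking at this again.

There is a dumb bug here: on the first line read, $prev_f1 and
$prev_f2 are still undefined, so the if() that follows always falls
through to the else and process_data() gets called with an empty
@data. You'll have to modify this if() accordingly.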
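
One way to make that modification, as a minimal sketch: treat an undefined $prev_f1 as "no group yet" and only flush when @data actually holds something. The helper name group_rows, the callback argument, and the naive split on commas are assumptions for illustration, standing in for the real read loop and CSV parser.

```perl
use strict;
use warnings;

# Hypothetical helper: groups already-parsed rows by their first two fields.
# Each row is an array ref: [ f1, f2, remaining fields... ].
sub group_rows {
    my ( $rows, $process_data ) = @_;
    my ( $prev_f1, $prev_f2, @data );
    for my $row (@$rows) {
        my ( $f1, $f2, @fields ) = @$row;
        if ( defined $prev_f1 && $f1 eq $prev_f1 && $f2 eq $prev_f2 ) {
            push @data, \@fields;    # same key: extend the current group
        }
        else {
            # Flush the previous group, but only if there is one;
            # on the first row @data is still empty.
            $process_data->( $prev_f1, $prev_f2, [@data] ) if @data;
            ( $prev_f1, $prev_f2 ) = ( $f1, $f2 );
            @data = ( \@fields );
        }
    }
    # Flush the last group once the input is exhausted.
    $process_data->( $prev_f1, $prev_f2, [@data] ) if @data;
    return;
}

# Naive split for the demo only; real CSV needs a proper parser.
my @rows = map { [ split /,/ ] } ( 'a,1,x', 'a,1,y', 'a,2,z', 'b,1,w' );
my @groups;
group_rows(
    \@rows,
    sub {
        my ( $f1, $f2, $data ) = @_;
        push @groups, join( ',', $f1, $f2 ) . ':' . scalar @$data;
    }
);
print "$_\n" for @groups;    # one "key:count" line per group
```

Passing a copy ([@data]) rather than \@data means the callback can keep the group around without it being emptied when the next group starts.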
 
sln

J. Gleixner wrote:
[...]

Since the loop is set up to call process_data() on every
non-match (including the first line, when @data is still empty),
the check for empty data could go inside the function.

----------

my ( $prev_f1, $prev_f2, @data );
while ( my $line = <> )
{
    ...
}
process_data( $prev_f1, $prev_f2, \@data ); # if @data;

sub process_data
{
    my ( $f1, $f2, $data_aref ) = @_;
    return unless @{$data_aref};    # nothing collected yet (e.g. first line)

    if ( @{$data_aref} > 1 ) {
        # process multiple records (all with same f1 f2 values)
    }
    else {
        # process single record (or not)
    }
}

-sln
 
Xho Jingleheimerschmidt

Mathematisch said:
[...]
Note that this is a csv file with 14 fields and it is already sorted
by field_1 and then by field_2. There are usually only 5-10 lines
having the same field_1 and field_2 value.

What is usually the case is of precious little value. If the unusual
case causes ICBMs to be erroneously launched, where is the comfort in
the fact that this is unusual? What is the *maximum plausible* number
of lines with the same field_1 and field_2?

[...]

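package whatever;

# Object layout: [ filehandle, look-ahead line (undef at end of file) ]
sub new {
    shift;    # not meant for subclassing
    my $file = shift;
    open my $fh, '<', $file or die "Cannot open $file: $!";
    my $x = <$fh>;
    chomp $x if defined $x;
    return bless [ $fh, $x ];
}

# Returns a reference to an array of the lines sharing field_1 and
# field_2, or nothing once the file is exhausted.
sub next_entry {
    my $this = shift;
    my $fh   = $this->[0];
    return unless defined $this->[1];
    my @return = $this->[1];
    # plain split: assumes no quoted commas in the first two fields
    my @line = split /,/, $this->[1];
    while (1) {
        $this->[1] = <$fh>;
        return [@return] unless defined $this->[1];
        chomp $this->[1];
        my @line2 = split /,/, $this->[1];
        return [@return]
            unless $line2[0] eq $line[0] and $line2[1] eq $line[1];
        push @return, $this->[1];
    }
}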
 
