Comments on parsing solution.

Prabh

Hello all,
This is about grepping, regexps and parsing data.
I do have a solution, but I was wondering if anyone could direct me to
a more efficient one.
I have a log file of the following format, which contains info. on a
series of files after a process.

===============================
File1: Info. on File1
File2: Info. on File2
File1: Info. on File1
File3: Info. on File3
File1: Info. on File1
and so on...
===============================

I want to display the output as...

============================
n1 lines of info on File1
n2 lines of info on File2
n3 lines of info on File3
============================

This is what I came up with, but when the input log file is of
gigantic proportions the parsing takes a lot of time. Could anyone
recommend a better solution, please?

#!/usr/local/bin/perl
#======================

#====================
# Foo.txt is the log
#--------------------
open(FDL, "Foo.txt");
chomp(@arr = <FDL>);
close(FDL);

#===============================
# Get all the files in the log
#-------------------------------
undef @files;
foreach $line (@arr) {
    push(@files, (split(/:/, $line))[0]);
}

#==========================================
# Sort the files, find the unique files.
# For each such file, grep the original log
# for all occurrences and count.
#------------------------------------------
foreach $file (&uniq(sort @files)) {
    undef $info;
    $info = grep { /^$file:/ } @arr;
    printf "$info lines of info on $file\n";
}


#=============================
# subroutine to do a Unixy uniq
#-----------------------------
sub uniq {
    @uniq = @_;
    #=======================================================
    # For each array element, compare with its predecessor.
    # If they match, it is already present, so splice it out.
    #-------------------------------------------------------
    for ($i = 1; $i < @uniq; $i++) {
        if ($uniq[$i] eq $uniq[$i-1]) {
            splice(@uniq, $i-1, 1);
            $i--;
        }
    }

    return @uniq;
}


Thanks,
Prab
 
Jürgen Exner

Prabh said:
Hello all,
This is about grepping, regexps and parsing data.
I do have a solution, but I was wondering if anyone could direct me to
a more efficient one.
I have a log file of the following format, which contains info. on a
series of files after a process.

===============================
File1: Info. on File1
File2: Info. on File2
File1: Info. on File1
File3: Info. on File3
File1: Info. on File1
and so on...
===============================

I want to display the output as...

============================
n1 lines of info on File1
n2 lines of info on File2
n3 lines of info on File3
============================

This is what I came up with, but when the input log file is of
gigantic proportions the parsing takes a lot of time. Could anyone
recommend a better solution, please?

[snip program]


Whenever you see "unique" you should automatically think "hash". For your
problem that means a better data structure would be a hash of (references to
arrays).


In your program you are looping through the list half a dozen times,
including a read, a sort, a grep, and a unique operation.
That's 3n + n log n already!
Instead you could do the work once while reading the file line by line and
build your target data structure incrementally in linear time.
To do this just read the next line, extract the file name, add this line to
the array that is the hash value for this file name.
When done reading the whole file just sort the keys of the hash and print
each value in sequence (pseudo-code for clarifying the logical program flow,
not fit and polished Perl!):

open FDL or die ....;
while ( <FDL> ) {                        # for each line
    my ($fname) = split /:/, $_, 2;      # get the file name
    push @{ $myhash{$fname} }, $_;       # and push the current line into
                                         # the hash at key $fname
}
for my $fname ( sort keys %myhash ) {    # for each file name in sorted order
    print @{ $myhash{$fname} };          # print the array with the lines
}
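Filled in as runnable Perl, with the sample log from the post inlined so the sketch is self-contained (a real run would read Foo.txt line by line instead):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sample log inlined; the real script would read Foo.txt here.
my $log = <<'END';
File1: Info. on File1
File2: Info. on File2
File1: Info. on File1
File3: Info. on File3
File1: Info. on File1
END

# One pass: push each line onto the array stored under its file name.
my %myhash;
for my $line (split /\n/, $log) {
    my ($fname) = split /:/, $line, 2;    # name before the first colon
    push @{ $myhash{$fname} }, $line;
}

# Report in sorted order: one count per file.
for my $fname (sort keys %myhash) {
    printf "%d lines of info on %s\n", scalar @{ $myhash{$fname} }, $fname;
}
```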
 
Glenn Jackman

Prabh said:
I do have a solution, but I was wondering if anyone could direct me to
a more efficient one.
I have a log file of the following format, which contains info. on a
series of files after a process.

===============================
File1: Info. on File1
File2: Info. on File2
File1: Info. on File1
File3: Info. on File3
File1: Info. on File1
and so on...
===============================

I want to display the output as...

============================
n1 lines of info on File1
n2 lines of info on File2
n3 lines of info on File3
============================

#!/usr/local/bin/perl
use strict;
use warnings;

# always check the return value of open()
open F, "file" or die "can't open file: $!\n";
my %hash;
while (<F>) {
    $hash{ (split /:/)[0] }++;
}
close F;
foreach my $f (sort keys %hash) {
    print "$hash{$f} lines of info on $f\n";
}
 
Tore Aursand

#!/usr/local/bin/perl

You _need_ this:

use strict;
use warnings;
open(FDL,"Foo.txt") ;
chomp(@arr = <FDL> ) ;
close(FDL) ;

Always check the return value of open():

open( FDL, 'Foo.txt' ) or die "$!\n";
undef @files ;
foreach $line ( @arr ) {
push(@files,(split(/\:/,$line))[0]) ;
}

Why do you want to set @files to undefined? This should do it, and it
keeps @files unique too:

my @files = ();
my %seen = ();
foreach ( @arr ) {
    my $file = ( split(/\:/) )[0];
    push( @files, $file ) unless $seen{$file}++;
}
foreach $file ( &uniq(sort @files ) ) {
undef $info ;
$info = grep {/^$file\:/} @arr ;
printf "$info lines of info on $file\n";
}

And this could be written as (no need for 'printf'):

foreach my $file ( sort @files ) {
    my $info = grep { /^$file\:/ } @arr;
    print "$info lines of info on $file\n";
}
sub uniq {

AFAICT, this won't work if you give it an array of files where two
identical filenames don't follow each other; see

perldoc -q duplicate

You don't need this function, though, as my code (above) keeps the array
unique at the point it's being populated.
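For comparison, the %seen idiom that `perldoc -q duplicate` describes can also be packaged as a drop-in uniq(); unlike the splice-based sub it preserves order and does not need sorted input (hypothetical standalone snippet):

```perl
use strict;
use warnings;

# Order-preserving uniq: a hash remembers what has been seen,
# so duplicates need not be adjacent.
sub uniq {
    my %seen;
    return grep { !$seen{$_}++ } @_;
}

my @files = qw(File1 File2 File1 File3 File1);
print join(' ', uniq(@files)), "\n";   # prints "File1 File2 File3"
```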
 
Jeff 'japhy' Pinyan

[posted & mailed]

I have a log file of the following format, which contains info. on a
series of files after a process.

===============================
File1: Info. on File1
File2: Info. on File2
File1: Info. on File1
File3: Info. on File3
File1: Info. on File1
===============================

I want to display the output as...

============================
n1 lines of info on File1
n2 lines of info on File2
n3 lines of info on File3
============================
This is what I came up with, but when the input log file is of
gigantic proportions the parsing takes a lot of time. Could anyone
recommend a better solution, please?

That's because you slurp the ENTIRE file into memory, which takes time and
space:
open(FDL,"Foo.txt") ;
chomp(@arr = <FDL> ) ;
close(FDL) ;

Then you make an array of the same number of elements, when you should
really be using a hash:
foreach $line ( @arr ) {
push(@files,(split(/\:/,$line))[0]) ;
}

Then you sort the list of files, and then iterate over the ENTIRE file's
contents for EACH file.
foreach $file ( &uniq(sort @files ) ) {
undef $info ;
$info = grep {/^$file\:/} @arr ;
printf "$info lines of info on $file\n";
}

You've now made ONE pass over the file, ONE pass over the array of the
file, and then ANOTHER pass over the array of the file for EACH unique
filename. For a file with 3 unique names, that's basically FIVE passes.

I would strongly suggest using a hash, and making only ONE pass over the
file:

#!/usr/bin/perl

use strict;
use warnings;

my %records;

open FDL, "Foo.txt" or die "can't read Foo.txt: $!";
while (<FDL>) {
    my ($rec) = split /:/;
    ++$records{$rec};
}
close FDL;

for (keys %records) {
    print "$records{$_} lines of info on $_\n";
}

Something like that. You might want to keep an array of the ORDER of the
filenames:

my (%records, @order);

open ...;
while (<FDL>) {
    my ($rec) = split /:/;
    $records{$rec}++ or push @order, $rec;
}
close ...;

for (@order) {
    # ...
}
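Assembled into a runnable whole, with the sample log inlined and the loop body reusing the print from the first version (a sketch, not Jeff's literal script):

```perl
use strict;
use warnings;

# Sample log inlined; a real run would read the FDL filehandle as above.
my $log = <<'END';
File1: Info. on File1
File2: Info. on File2
File1: Info. on File1
File3: Info. on File3
File1: Info. on File1
END

my (%records, @order);
for (split /\n/, $log) {
    my ($rec) = split /:/;
    # post-increment is 0 (false) only on the first sighting,
    # so each name lands in @order exactly once
    $records{$rec}++ or push @order, $rec;
}

# Output follows first-appearance order, not sort order.
for my $rec (@order) {
    print "$records{$rec} lines of info on $rec\n";
}
```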
 
Eric J. Roode

#!/usr/local/bin/perl
use strict;
use warnings;

# always check the return value of open()
open F, "file" or die "can't open file: $!\n";
my %hash;
while (<F>) {
    $hash{ (split /:/)[0] }++;
}
close F;
foreach my $f (sort keys %hash) {
    print "$hash{$f} lines of info on $f\n";
}

Are you golfing, or trying to help? If the latter, perhaps you would be
so kind as to provide a bit of explanation, instead of just throwing some
fairly dense code at the novice?

--
Eric
$_ = reverse sort $ /. r , qw p ekca lre uJ reh
ts p , map $ _. $ " , qw e p h tona e and print

 
Eric J. Roode

my @files = ();
my %seen = ();

Why not simply

my @files;
my %seen;

?
Less typing, less chance for typos.

--
Eric
$_ = reverse sort $ /. r , qw p ekca lre uJ reh
ts p , map $ _. $ " , qw e p h tona e and print

 
Anno Siegel

Eric J. Roode said:

#!/usr/local/bin/perl
use strict;
use warnings;

# always check the return value of open()
open F, "file" or die "can't open file: $!\n";
my %hash;
while (<F>) {
    $hash{ (split /:/)[0] }++;
}
close F;
foreach my $f (sort keys %hash) {
    print "$hash{$f} lines of info on $f\n";
}

Are you golfing, or trying to help? If the latter, perhaps you would be
so kind as to provide a bit of explanation, instead of just throwing some
fairly dense code at the novice?

Oh, come on. The OP had this (after reading the file into @arr):
foreach $line ( @arr ) {
push(@files,(split(/\:/,$line))[0]) ;
}

Lose the file slurping and replace @arr with %hash, and you end up
more or less with Eric's code. That's not too much of a step.

Anno
 
Tore Aursand

Why not simply

my @files;
my %seen;

?
Less typing, less chance for typos.

You have a point, of course. My personal style, however, is that I
set each variable when I declare it, even if it's not necessary and
even when it's empty.
 
Uri Guttman

TA> You have a point, of course. My personal style, however, is that I
TA> set each variable when I declare it, even if it's not necessary and
TA> even when it's empty.

my has a runtime effect of clearing variables.

uri
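That runtime clearing can be seen in a loop (a small hypothetical snippet, not from the thread): the lexical @chars starts out empty on every pass, with no explicit `= ()` needed.

```perl
use strict;
use warnings;

my @lengths;
for my $word (qw(one three seven)) {
    my @chars;                        # cleared afresh each iteration by 'my'
    push @chars, split //, $word;
    push @lengths, scalar @chars;     # no leftovers between passes
}
print "@lengths\n";                   # prints "3 5 5"
```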
 
Anno Siegel

Uri Guttman said:
TA> You have a point, of course. My personal style, however, is that I
TA> set each variable when I declare it, even if it's not necessary and
TA> even when it's empty.

my has a runtime effect of clearing variables.

Ah, but Tore knows that. His style rule says to initialize every variable,
whether it needs it or not. That's how I read his remark, and it's what
I did for a long while too.

I don't do it anymore. For one, every redundant statement in a source
leaves a nagging doubt whether the author perhaps *thought* it necessary,
thereby revealing a lack of acquaintance with the language. A good style
should build confidence that the author knows what they're doing, not under-
mine it.

Another reason for not always initializing is that you can tell the reader
something by initializing only where necessary. In saying:

my $x = 0;
# some code involving $x
print $x;

I'm giving a subtle hint that "some code ..." may *not* set $x under some
circumstances. Without the initialization the reader knows that I believe
$x will always be set. Always initializing all variables takes this bit of
expressiveness away.

Anno
 
Tore Aursand

my has a runtime effect of clearing variables.

That's right, but you must be a real speed-demon if you're hoping to gain
anything. But - I guess - a little here and a little there adds up to
something very big somewhere else. :)

Just for the fun of it, I benchmarked this. Setting a scalar, an array
and a hash explicitly took more than twice as long as "leaving them
alone".
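A sketch of how such a measurement could be made with the core Benchmark module (the iteration count and the relative numbers are illustrative; results vary by machine and perl version):

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Compare bare 'my' declarations against explicit empty initialization.
cmpthese(500_000, {
    bare => sub { my $s;         my @a;      my %h;      },
    init => sub { my $s = undef; my @a = (); my %h = (); },
});
```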
 
Uri Guttman

TA> That's right, but you must be a real speed-demon if you're hoping to gain
TA> anything. But - I guess - a little here and a little there sums up to be
TA> something very big somewhere else. :)

TA> Just for the fun of it, I benchmark'ed this. Setting a scalar, an array
TA> and a hash explicit took more than twice the time than "leaving them
TA> alone".

good to know but i don't assign () or undef in my as it is redundant and
poor style IMO. the higher speed is nice as well.

uri
 
