Comments on parsing solution.

Prabh

Hello all,
This is about grepping, regexps and parsing data.
I do have a solution, but I was wondering if anyone could direct me to
a more efficient one.
I have a log file of the following format, which contains info. on a
series of files after a process.

===============================
File1: Info. on File1
File2: Info. on File2
File1: Info. on File1
File3: Info. on File3
File1: Info. on File1
and so on...
===============================

I want to display the output as...

============================
n1 lines of info on File1
n2 lines of info on File2
n3 lines of info on File3
============================

This is what I came up with, but when the input log file is of
gigantic proportions the parsing takes a lot of time. Could anyone
recommend a better solution, please?

#!/usr/local/bin/perl
#======================

#====================
# Foo.txt is the log
#--------------------
open(FDL, "Foo.txt");
chomp(@arr = <FDL>);
close(FDL);

#===============================
# Get all the files in the log
#-------------------------------
undef @files;
foreach $line (@arr) {
    push(@files, (split(/:/, $line))[0]);
}

#==========================================
# Sort the files, find the unique files.
# For each such file, grep the original log
# for all occurrences and count.
#------------------------------------------
foreach $file (&uniq(sort @files)) {
    undef $info;
    $info = grep { /^$file:/ } @arr;
    printf "$info lines of info on $file\n";
}


#=============================
# subroutine to do a Unixy uniq
#-----------------------------
sub uniq {
    @uniq = @_;
    #=======================================================
    # For each array element, compare with its predecessor.
    # If they match, it is already present, so splice it out.
    #-------------------------------------------------------
    for ($i = 1; $i < @uniq; $i++) {
        if ($uniq[$i] eq $uniq[$i-1]) {
            splice(@uniq, $i-1, 1);
            $i--;
        }
    }

    return @uniq;
}


Thanks,
Prab
 
Jürgen Exner

Prabh said:
Hello all,
This is about grepping, regexps and parsing data.
I do have a solution, but I was wondering if anyone could direct me to
a more efficient one.
I have a log file of the following format, which contains info. on a
series of files after a process.

===============================
File1: Info. on File1
File2: Info. on File2
File1: Info. on File1
File3: Info. on File3
File1: Info. on File1
and so on...
===============================

I want to display the output as...

============================
n1 lines of info on File1
n2 lines of info on File2
n3 lines of info on File3
============================

This is what I came up with, but when the input log file is of
gigantic proportions the parsing takes a lot of time. Could anyone
recommend a better solution, please?

[snip program]


Whenever you see "unique" you should automatically think "hash". For your
problem that means a better data structure would be a hash of (references to
arrays).


In your program you are looping through the list half a dozen times,
including a read, a sort, a grep, and a unique operation.
That's 3n + n log n already!
Instead you could do the work once while reading the file line by line and
build your target data structure incrementally in linear time.
To do this just read the next line, extract the file name, add this line to
the array that is the hash value for this file name.
When done reading the whole file just sort the keys of the hash and print
each value in sequence (pseudo-code for clarifying the logical program flow,
not fit and polished Perl!):

open FDL or die ....;
while ( <FDL> ) {                        # for each line
    my ($fname) = split /:/, $_, 2;      # get the file name
    push @{ $myhash{$fname} }, $_;       # and push the current line into
                                         # the hash at key $fname
}
for my $fname ( sort keys %myhash ) {    # for each file name in sorted order
    print @{ $myhash{$fname} };          # print the array with the lines
}
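Filled in as runnable Perl, with the sample log from the post inlined so the sketch is self-contained (a real run would read Foo.txt line by line instead):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sample log inlined; the real script would read Foo.txt here.
my $log = <<'END';
File1: Info. on File1
File2: Info. on File2
File1: Info. on File1
File3: Info. on File3
File1: Info. on File1
END

# One pass: push each line onto the array stored under its file name.
my %myhash;
for my $line (split /\n/, $log) {
    my ($fname) = split /:/, $line, 2;    # name before the first colon
    push @{ $myhash{$fname} }, $line;
}

# Report in sorted order: one count per file.
for my $fname (sort keys %myhash) {
    printf "%d lines of info on %s\n", scalar @{ $myhash{$fname} }, $fname;
}
```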
 
Glenn Jackman

Prabh said:
I do have a solution, but I was wondering if anyone could direct me to
a more efficient one.
I have a log file of the following format, which contains info. on a
series of files after a process.

===============================
File1: Info. on File1
File2: Info. on File2
File1: Info. on File1
File3: Info. on File3
File1: Info. on File1
and so on...
===============================

I want to display the output as...

============================
n1 lines of info on File1
n2 lines of info on File2
n3 lines of info on File3
============================

#!/usr/local/bin/perl
use strict;
use warnings;

# always check the return value of open()
open F, "file" or die "can't open file: $!\n";
my %hash;
while (<F>) {
    $hash{ (split /:/)[0] }++;
}
close F;
foreach my $f (sort keys %hash) {
    print "$hash{$f} lines of info on $f\n";
}
 
Tore Aursand

#!/usr/local/bin/perl

You _need_ this:

use strict;
use warnings;
open(FDL,"Foo.txt") ;
chomp(@arr = <FDL> ) ;
close(FDL) ;

Always check the return value of open():

open( FDL, 'Foo.txt' ) or die "$!\n";
undef @files ;
foreach $line ( @arr ) {
push(@files,(split(/\:/,$line))[0]) ;
}

Why do you want to set @files to undefined? This should do it, and it
keeps @files unique too:

my @files = ();
my %seen = ();
foreach ( @arr ) {
    my $file = ( split(/\:/) )[0];
    push( @files, $file ) unless $seen{$file}++;
}
foreach $file ( &uniq(sort @files ) ) {
undef $info ;
$info = grep {/^$file\:/} @arr ;
printf "$info lines of info on $file\n";
}

And this could be written as (no need for 'printf'):

foreach my $file ( sort @files ) {
    my $info = grep { /^$file\:/ } @arr;
    print "$info lines of info on $file\n";
}
sub uniq {

AFAICT, this won't work if you give it an array of files where two
identical filenames don't follow each other; see

perldoc -q duplicate

You don't need this function, though, as my code (above) keeps the array
unique at the point it's being populated.
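For comparison, the %seen idiom that `perldoc -q duplicate` describes can also be packaged as a drop-in uniq(); unlike the splice-based sub it preserves order and does not need sorted input (hypothetical standalone snippet):

```perl
use strict;
use warnings;

# Order-preserving uniq: a hash remembers what has been seen,
# so duplicates need not be adjacent.
sub uniq {
    my %seen;
    return grep { !$seen{$_}++ } @_;
}

my @files = qw(File1 File2 File1 File3 File1);
print join(' ', uniq(@files)), "\n";   # prints "File1 File2 File3"
```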
 
Jeff 'japhy' Pinyan

[posted & mailed]

I have a log file of the following format, which contains info. on a
series of files after a process.

===============================
File1: Info. on File1
File2: Info. on File2
File1: Info. on File1
File3: Info. on File3
File1: Info. on File1
===============================

I want to display the output as...

============================
n1 lines of info on File1
n2 lines of info on File2
n3 lines of info on File3
============================
This is what I came up with, but when the input log file is of
gigantic proportions the parsing takes a lot of time. Could anyone
recommend a better solution, please?

That's because you slurp the ENTIRE file into memory, which takes time and
space:
open(FDL,"Foo.txt") ;
chomp(@arr = <FDL> ) ;
close(FDL) ;

Then you make an array of the same number of elements, when you should
really be using a hash:
foreach $line ( @arr ) {
push(@files,(split(/\:/,$line))[0]) ;
}

Then you sort the list of files, and then iterate over the ENTIRE file's
contents for EACH file.
foreach $file ( &uniq(sort @files ) ) {
undef $info ;
$info = grep {/^$file\:/} @arr ;
printf "$info lines of info on $file\n";
}

You've now made ONE pass over the file, ONE pass over the array of the
file, and then ANOTHER pass over the array of the file for EACH unique
filename. For a file with 3 unique names, that's basically FIVE passes.

I would strongly suggest using a hash, and making only ONE pass over the
file:

#!/usr/bin/perl

use strict;
use warnings;

my %records;

open FDL, "Foo.txt" or die "can't read Foo.txt: $!";
while (<FDL>) {
    my ($rec) = split /:/;
    ++$records{$rec};
}
close FDL;

for (keys %records) {
    print "$records{$_} lines of info on $_\n";
}

Something like that. You might want to keep an array of the ORDER of the
filenames:

my (%records, @order);

open ...;
while (<FDL>) {
    my ($rec) = split /:/;
    $records{$rec}++ or push @order, $rec;
}
close ...;

for (@order) {
    # ...
}
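Assembled into a runnable whole, with the sample log inlined and the loop body reusing the print from the first version (a sketch, not Jeff's literal script):

```perl
use strict;
use warnings;

# Sample log inlined; a real run would read the FDL filehandle as above.
my $log = <<'END';
File1: Info. on File1
File2: Info. on File2
File1: Info. on File1
File3: Info. on File3
File1: Info. on File1
END

my (%records, @order);
for (split /\n/, $log) {
    my ($rec) = split /:/;
    # post-increment is 0 (false) only on the first sighting,
    # so each name lands in @order exactly once
    $records{$rec}++ or push @order, $rec;
}

# Output follows first-appearance order, not sort order.
for my $rec (@order) {
    print "$records{$rec} lines of info on $rec\n";
}
```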
 
Eric J. Roode

#!/usr/local/bin/perl
use strict;
use warnings;

# always check the return value of open()
open F, "file" or die "can't open file: $!\n";
my %hash;
while (<F>) {
    $hash{ (split /:/)[0] }++;
}
close F;
foreach my $f (sort keys %hash) {
    print "$hash{$f} lines of info on $f\n";
}

Are you golfing, or trying to help? If the latter, perhaps you would be
so kind as to provide a bit of explanation, instead of just throwing some
fairly dense code at the novice?

--
Eric
$_ = reverse sort $ /. r , qw p ekca lre uJ reh
ts p , map $ _. $ " , qw e p h tona e and print

 
Eric J. Roode

my @files = ();
my %seen = ();

Why not simply

my @files;
my %seen;

?
Less typing, less chance for typos.

--
Eric
$_ = reverse sort $ /. r , qw p ekca lre uJ reh
ts p , map $ _. $ " , qw e p h tona e and print

 
Anno Siegel

Eric J. Roode said:

#!/usr/local/bin/perl
use strict;
use warnings;

# always check the return value of open()
open F, "file" or die "can't open file: $!\n";
my %hash;
while (<F>) {
    $hash{ (split /:/)[0] }++;
}
close F;
foreach my $f (sort keys %hash) {
    print "$hash{$f} lines of info on $f\n";
}

Are you golfing, or trying to help? If the latter, perhaps you would be
so kind as to provide a bit of explanation, instead of just throwing some
fairly dense code at the novice?

Oh, come on. The OP had this (after reading the file into @arr):
foreach $line ( @arr ) {
push(@files,(split(/\:/,$line))[0]) ;
}

Lose the file slurping and replace @arr with %hash, and you end up
more or less with Eric's code. That's not too much of a step.

Anno
 
Tore Aursand

Why not simply

my @files;
my %seen;

?
Less typing, less chance for typos.

You have a point, of course. My personal style, however, is that I
set each variable when I declare it, even if it's not necessary and
even when it's empty.
 
Uri Guttman

TA> You have a point, of course. My personal style, however, is that I
TA> set each variable when I declare it, even if it's not necessary and
TA> even when it's empty.

my has a runtime effect of clearing variables.

uri
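That runtime clearing can be seen in a loop (a small hypothetical snippet, not from the thread): the lexical @chars starts out empty on every pass, with no explicit `= ()` needed.

```perl
use strict;
use warnings;

my @lengths;
for my $word (qw(one three seven)) {
    my @chars;                        # cleared afresh each iteration by 'my'
    push @chars, split //, $word;
    push @lengths, scalar @chars;     # no leftovers between passes
}
print "@lengths\n";                   # prints "3 5 5"
```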
 
Anno Siegel

Uri Guttman said:
TA> You have a point, of course. My personal style, however, is that I
TA> set each variable when I declare it, even if it's not necessary and
TA> even when it's empty.

my has a runtime effect of clearing variables.

Ah, but Tore knows that. His style rule says to initialize every variable,
whether it needs it or not. That's how I read his remark, and it's what
I did for a long while too.

I don't do it anymore. For one, every redundant statement in a source
leaves a nagging doubt whether the author perhaps *thought* it necessary,
thereby revealing a lack of acquaintance with the language. A good style
should build confidence that the author knows what they're doing, not under-
mine it.

Another reason for not always initializing is that you can tell the reader
something by initializing only where necessary. In saying:

my $x = 0;
# some code involving $x
print $x;

I'm giving a subtle hint that "some code ..." may *not* set $x under some
circumstances. Without the initialization the reader knows that I believe
$x will always be set. Always initializing all variables takes this bit of
expressiveness away.

Anno
 
Tore Aursand

my has a runtime effect of clearing variables.

That's right, but you must be a real speed-demon if you're hoping to gain
anything. But - I guess - a little here and a little there adds up to
something very big somewhere else. :)

Just for the fun of it, I benchmarked this. Setting a scalar, an array
and a hash explicitly took more than twice as long as "leaving them
alone".
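A sketch of how such a measurement could be made with the core Benchmark module (the iteration count and the relative numbers are illustrative; results vary by machine and perl version):

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Compare bare 'my' declarations against explicit empty initialization.
cmpthese(500_000, {
    bare => sub { my $s;         my @a;      my %h;      },
    init => sub { my $s = undef; my @a = (); my %h = (); },
});
```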
 
Uri Guttman

TA> That's right, but you must be a real speed-demon if you're hoping to gain
TA> anything. But - I guess - a little here and a little there sums up to be
TA> something very big somewhere else. :)

TA> Just for the fun of it, I benchmark'ed this. Setting a scalar, an array
TA> and a hash explicit took more than twice the time than "leaving them
TA> alone".

good to know but i don't assign () or undef in my as it is redundant and
poor style IMO. the higher speed is nice as well.

uri
 
