Comments on parsing solution.

Discussion in 'Perl Misc' started by Prabh, Nov 20, 2003.

  1. Prabh

    Prabh Guest

    Hello all,
    This is about grepping, regexps and parsing data.
    I do have a solution, but I was wondering if anyone could direct me to
    a more efficient one.
    I have a log file of the following format, which contains info. on a
    series of files after a process.

    ===============================
    File1: Info. on File1
    File2: Info. on File2
    File1: Info. on File1
    File3: Info. on File3
    File1: Info. on File1
    and so on...
    ===============================

    I want to display the output as...

    ============================
    n1 lines of info on File1
    n2 lines of info on File2
    n3 lines of info on File3
    ============================

    This is what I came up with, but when the input log file is of
    gigantic proportions, the parsing takes a lot of time. Could anyone
    recommend a better solution, please?

    #!/usr/local/bin/perl
    #======================

    #====================
    # Foo.txt is the log
    #--------------------
    open(FDL,"Foo.txt") ;
    chomp(@arr = <FDL> ) ;
    close(FDL) ;

    #===============================
    # Get all the files in the log
    #-------------------------------
    undef @files ;
    foreach $line ( @arr ) {
        push(@files,(split(/\:/,$line))[0]) ;
    }

    #==========================================
    # Sort the files, find the unique files.
    # For each such file, grep the original log
    # for all occurrences and count.
    #------------------------------------------
    foreach $file ( &uniq(sort @files ) ) {
        undef $info ;
        $info = grep {/^$file\:/} @arr ;
        printf "$info lines of info on $file\n";
    }


    #=============================
    # subroutine to do Unixy-uniq
    #-----------------------------
    sub uniq {
        @uniq = @_ ;
        #==========================================================
        # For each array element, compare it with its predecessor.
        # If they're equal, it's already present, so splice it out.
        #----------------------------------------------------------
        for ( $i = 1; $i < @uniq ; $i++ ) {
            if ( $uniq[$i] eq $uniq[$i-1] ) {
                splice( @uniq,$i-1,1 ) ;
                $i--;
            }
        }

        return @uniq ;
    }


    Thanks,
    Prab
     
    Prabh, Nov 20, 2003
    #1

  2. Prabh wrote:
    > Hello all,
    > This is about grepping, regexps and parsing data.
    > I do have a solution, but I was wondering if anyone could direct me to
    > a more efficient one.
    > I have a log file of the following format, which contains info. on a
    > series of files after a process.
    >
    > ===============================
    > File1: Info. on File1
    > File2: Info. on File2
    > File1: Info. on File1
    > File3: Info. on File3
    > File1: Info. on File1
    > and so on...
    > ===============================
    >
    > I want to display the output as...
    >
    > ============================
    > n1 lines of info on File1
    > n2 lines of info on File2
    > n3 lines of info on File3
    > ============================
    >
    > This is what I came up with, but when the input log file is of
    > gigantic proportions, the parsing takes a lot of time, could anyone
    > recommend a better solution, please?


    [snip program]


    Whenever you see "unique" you should automatically think "hash". For your
    problem that means a better data structure would be a hash of array
    references.


    In your program you are looping through the list half a dozen times,
    including a read, a sort, a grep, and a unique operation.
    That's already 3n + n*log(n)!
    Instead you could do the work once while reading the file line by line and
    build your target data structure incrementally in linear time.
    To do this just read the next line, extract the file name, add this line to
    the array that is the hash value for this file name.
    When done reading the whole file just sort the keys of the hash and print
    each value in sequence (pseudo-code for clarifying the logical program flow,
    not fit and polished Perl!):

    open FDL or die ....;
    while ( <FDL> ) {                     # for each line
        ($fname) = split(/:/, $_, 2);     # get the file name
        push @{$myhash{$fname}}, $_;      # and push the current line into the hash at key $fname
    }
    for $fname ( sort keys %myhash ) {    # for each file name in sorted order
        print @{$myhash{$fname}};         # print the array with the lines
    }
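
    For the record, here is one way to flesh that pseudo-code out into real
    Perl. It is only a sketch: it assumes the Foo.txt name from the original
    post, and the variable names (%lines_for, $fdl) are placeholders of my
    own choosing.

    #!/usr/local/bin/perl
    use strict;
    use warnings;

    my %lines_for;    # file name => reference to an array of its log lines

    open my $fdl, '<', 'Foo.txt' or die "can't open Foo.txt: $!";
    while ( my $line = <$fdl> ) {
        my ($fname) = split /:/, $line, 2;      # everything before the first colon
        push @{ $lines_for{$fname} }, $line;    # append the line to that file's array
    }
    close $fdl;

    for my $fname ( sort keys %lines_for ) {
        my $count = @{ $lines_for{$fname} };    # array in scalar context is the line count
        print "$count lines of info on $fname\n";
    }

    One pass over the file, then one pass over the (much smaller) set of
    keys, so the work grows linearly with the size of the log.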
     
    Jürgen Exner, Nov 20, 2003
    #2

  3. Prabh <> wrote:
    > I do have a solution, but I was wondering if anyone could direct me to
    > a more efficient one.
    > I have a log file of the following format, which contains info. on a
    > series of files after a process.
    >
    > ===============================
    > File1: Info. on File1
    > File2: Info. on File2
    > File1: Info. on File1
    > File3: Info. on File3
    > File1: Info. on File1
    > and so on...
    > ===============================
    >
    > I want to display the output as...
    >
    > ============================
    > n1 lines of info on File1
    > n2 lines of info on File2
    > n3 lines of info on File3
    > ============================


    #!/usr/local/bin/perl
    use strict;
    use warnings;

    # always check the return value of open()
    open F, "file" or die "can't open file: $!\n";
    my %hash;
    while (<F>) {
    $hash{(split /:/)[0]} ++;
    }
    close F;
    foreach my $f (sort keys %hash) {
    print "$hash{$f} lines of info on $f\n";
    }


    --
    Glenn Jackman
    NCF Sysadmin
     
    Glenn Jackman, Nov 20, 2003
    #3
  4. Prabh

    Tore Aursand Guest

    On Thu, 20 Nov 2003 06:00:09 -0800, Prabh wrote:
    > #!/usr/local/bin/perl


    You _need_ this:

    use strict;
    use warnings;

    > open(FDL,"Foo.txt") ;
    > chomp(@arr = <FDL> ) ;
    > close(FDL) ;


    Always check the return value of open():

    open( FDL, 'Foo.txt' ) or die "$!\n";

    > undef @files ;
    > foreach $line ( @arr ) {
    > push(@files,(split(/\:/,$line))[0]) ;
    > }


    Why do you want to undef @files first? This should do it, and it
    keeps @files unique, too:

    my @files = ();
    my %seen = ();
    foreach ( @arr ) {
        my $file = ( split(/\:/) )[0];
        push( @files, $file ) unless $seen{$file}++;
    }

    > foreach $file ( &uniq(sort @files ) ) {
    > undef $info ;
    > $info = grep {/^$file\:/} @arr ;
    > printf "$info lines of info on $file\n";
    > }


    And this could be written as (no need for 'printf'):

    foreach my $file ( sort @files ) {
        my $info = grep { /^$file\:/ } @arr;
        print "$info lines of info on $file\n";
    }

    > sub uniq {


    AFAICT, this won't work if you give it an array of files where two
    identical filenames don't follow each other; see

    perldoc -q duplicate

    You don't need this function, though, as my code (above) keeps the array
    unique at the point it's being populated.
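
    (That FAQ entry boils down to a %seen hash, e.g.

    my %seen;
    my @unique = grep { !$seen{$_}++ } @files;

    which is just the one-shot version of the push/unless loop above.)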


    --
    Tore Aursand <>
    "A teacher is never a giver of truth - he is a guide, a pointer to the
    truth that each student must find for himself. A good teacher is
    merely a catalyst." -- Bruce Lee
     
    Tore Aursand, Nov 20, 2003
    #4
  5. [posted & mailed]

    On 20 Nov 2003, Prabh wrote:

    >I have a log file of the following format, which contains info. on a
    >series of files after a process.
    >
    >===============================
    >File1: Info. on File1
    >File2: Info. on File2
    >File1: Info. on File1
    >File3: Info. on File3
    >File1: Info. on File1
    >===============================
    >
    >I want to display the output as...
    >
    >============================
    >n1 lines of info on File1
    >n2 lines of info on File2
    >n3 lines of info on File3
    >============================


    >This is what I came up with, but when the input log file is of
    >gigantic proportions, the parsing takes a lot of time, could anyone
    >recommend a better solution, please?


    That's because you slurp the ENTIRE file into memory, which takes time and
    space:

    >open(FDL,"Foo.txt") ;
    >chomp(@arr = <FDL> ) ;
    >close(FDL) ;


    Then you make an array of the same number of elements, when you should
    really be using a hash:

    >foreach $line ( @arr ) {
    > push(@files,(split(/\:/,$line))[0]) ;
    >}


    Then you sort the list of files, and then iterate over the ENTIRE file's
    contents for EACH file.

    >foreach $file ( &uniq(sort @files ) ) {
    > undef $info ;
    > $info = grep {/^$file\:/} @arr ;
    > printf "$info lines of info on $file\n";
    >}


    You've now made ONE pass over the file, ONE pass over the array of the
    file, and then ANOTHER pass over the array of the file for EACH unique
    filename. For a file with 3 unique names, that's basically FIVE passes.

    I would strongly suggest using a hash, and making only ONE pass over the
    file:

    #!/usr/bin/perl

    use strict;
    use warnings;

    my %records;

    open FDL, "Foo.txt" or die "can't read Foo.txt: $!";
    while (<FDL>) {
        my ($rec) = split /:/;
        ++$records{$rec};
    }
    close FDL;

    for (keys %records) {
        print "$records{$_} lines of info on $_\n";
    }

    Something like that. You might want to keep an array of the ORDER of the
    filenames:

    my (%records, @order);

    open ...;
    while (<FDL>) {
        my ($rec) = split /:/;
        $records{$rec}++ or push @order, $rec;
    }
    close ...;

    for (@order) {
        # ...
    }
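
    Spelled out in full, with the elided open/close and the print filled in
    from the first snippet (nothing here that isn't already in the two
    snippets above):

    #!/usr/bin/perl

    use strict;
    use warnings;

    my (%records, @order);    # counts per file name, plus first-seen order

    open FDL, "Foo.txt" or die "can't read Foo.txt: $!";
    while (<FDL>) {
        my ($rec) = split /:/;
        $records{$rec}++ or push @order, $rec;    # remember the name the first time it appears
    }
    close FDL;

    for my $rec (@order) {
        print "$records{$rec} lines of info on $rec\n";
    }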

    --
    Jeff Pinyan RPI Acacia Brother #734 2003 Rush Chairman
    "And I vos head of Gestapo for ten | Michael Palin (as Heinrich Bimmler)
    years. Ah! Five years! Nein! No! | in: The North Minehead Bye-Election
    Oh. Was NOT head of Gestapo AT ALL!" | (Monty Python's Flying Circus)
     
    Jeff 'japhy' Pinyan, Nov 20, 2003
    #5
  6. Glenn Jackman <> wrote in news::

    > Prabh <> wrote:
    >
    > #!/usr/local/bin/perl
    > use strict;
    > use warnings;
    >
    > # always check the return value of open()
    > open F, "file" or die "can't open file: $!\n";
    > my %hash;
    > while (<F>) {
    > $hash{(split /:/)[0]} ++;
    > }
    > close F;
    > foreach my $f (sort keys %hash) {
    > print "$hash{$f} lines of info on $f\n";
    > }


    Are you golfing, or trying to help? If the latter, perhaps you would be
    so kind as to provide a bit of explanation, instead of just throwing some
    fairly dense code at the novice?

    --
    Eric
    $_ = reverse sort $ /. r , qw p ekca lre uJ reh
    ts p , map $ _. $ " , qw e p h tona e and print

     
    Eric J. Roode, Nov 21, 2003
    #6
  7. Tore Aursand <> wrote in news:pan.2003.11.20.16.28.22.46067@aursand.no:

    > my @files = ();
    > my %seen = ();


    Why not simply

    my @files;
    my %seen;

    ?
    Less typing, less chance for typos.

    --
    Eric
    $_ = reverse sort $ /. r , qw p ekca lre uJ reh
    ts p , map $ _. $ " , qw e p h tona e and print

     
    Eric J. Roode, Nov 21, 2003
    #7
  8. Prabh

    Anno Siegel Guest

    Eric J. Roode <> wrote in comp.lang.perl.misc:
    > Glenn Jackman <> wrote in
    > news::
    >
    > > Prabh <> wrote:
    > >
    > > #!/usr/local/bin/perl
    > > use strict;
    > > use warnings;
    > >
    > > # always check the return value of open()
    > > open F, "file" or die "can't open file: $!\n";
    > > my %hash;
    > > while (<F>) {
    > > $hash{(split /:/)[0]} ++;
    > > }
    > > close F;
    > > foreach my $f (sort keys %hash) {
    > > print "$hash{$f} lines of info on $f\n";
    > > }

    >
    > Are you golfing, or trying to help? If the latter, perhaps you would be
    > so kind as to provide a bit of explanation, instead of just throwing some
    > fairly dense code at the novice?


    Oh, come on. The OP had this (after reading the file into @arr):

    > foreach $line ( @arr ) {
    > push(@files,(split(/\:/,$line))[0]) ;
    > }


    Lose the file slurping and replace @arr with %hash, and you end up
    more or less with Glenn's code. That's not too much of a step.
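
    Concretely, the step is just this (a sketch only; %hash and the Foo.txt
    name are taken from the posts above):

    my %hash;
    open FDL, "Foo.txt" or die "can't open Foo.txt: $!";
    while ( my $line = <FDL> ) {             # read line by line instead of slurping into @arr
        $hash{ (split /:/, $line)[0] }++;    # was: push @files, (split /:/, $line)[0]
    }
    close FDL;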

    Anno
     
    Anno Siegel, Nov 21, 2003
    #8
  9. Prabh

    Tore Aursand Guest

    On Thu, 20 Nov 2003 19:42:35 -0600, Eric J. Roode wrote:
    >> my @files = ();
    >> my %seen = ();

    >
    > Why not simply
    >
    > my @files;
    > my %seen;
    >
    > ?
    > Less typing, less chance for typos.


    You have a point, of course. My personal style, however, is to
    initialize each variable when I declare it, even if it's not necessary
    and even when it's empty.


    --
    Tore Aursand <>
    "A teacher is never a giver of truth - he is a guide, a pointer to the
    truth that each student must find for himself. A good teacher is
    merely a catalyst." -- Bruce Lee
     
    Tore Aursand, Nov 22, 2003
    #9
  10. Prabh

    Uri Guttman Guest

    >>>>> "TA" == Tore Aursand <> writes:

    TA> On Thu, 20 Nov 2003 19:42:35 -0600, Eric J. Roode wrote:
    >>> my @files = ();
    >>> my %seen = ();

    >>
    >> Why not simply
    >>
    >> my @files;
    >> my %seen;
    >>
    >> ?
    >> Less typing, less chance for typos.


    TA> You have a point, of course. My personal style, however, implies that I
    TA> set each variable when I declare them. Even if it's not necessary, and
    TA> even when they're empty.

    my has a runtime effect of clearing variables.
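
    for example, each pass through this loop gets a fresh, empty @a without
    any explicit assignment:

    for (1 .. 3) {
        my @a;            # cleared anew on every iteration; no '= ()' needed
        push @a, $_;
        print "@a\n";     # prints 1, then 2, then 3
    }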

    uri

    --
    Uri Guttman ------ -------- http://www.stemsystems.com
    --Perl Consulting, Stem Development, Systems Architecture, Design and Coding-
    Search or Offer Perl Jobs ---------------------------- http://jobs.perl.org
     
    Uri Guttman, Nov 22, 2003
    #10
  11. Prabh

    Anno Siegel Guest

    Uri Guttman <> wrote in comp.lang.perl.misc:
    > >>>>> "TA" == Tore Aursand <> writes:

    >
    > TA> On Thu, 20 Nov 2003 19:42:35 -0600, Eric J. Roode wrote:
    > >>> my @files = ();
    > >>> my %seen = ();
    > >>
    > >> Why not simply
    > >>
    > >> my @files;
    > >> my %seen;
    > >>
    > >> ?
    > >> Less typing, less chance for typos.

    >
    > TA> You have a point, of course. My personal style, however, implies that I
    > TA> set each variable when I declare them. Even if it's not necessary, and
    > TA> even when they're empty.
    >
    > my has a runtime effect of clearing variables.


    Ah, but Tore knows that. His style rule says to initialize every variable,
    whether it needs it or not. That's how I read his remark, and it's what
    I did for a long while too.

    I don't do it anymore. For one, every redundant statement in a source
    leaves a nagging doubt whether the author perhaps *thought* it necessary,
    thereby revealing a lack of acquaintance with the language. A good style
    should build confidence that the author knows what they're doing, not
    undermine it.

    Another reason for not always initializing is that you can tell the reader
    something by initializing only where necessary. In saying:

    my $x = 0;
    # some code involving $x
    print $x;

    I'm giving a subtle hint that "some code ..." may *not* set $x under some
    circumstances. Without the initialization the reader knows that I believe
    $x will always be set. Always initializing all variables takes this bit of
    expressiveness away.

    Anno
     
    Anno Siegel, Nov 22, 2003
    #11
  12. Prabh

    Tore Aursand Guest

    On Sat, 22 Nov 2003 05:07:26 +0000, Uri Guttman wrote:
    >>> Why not simply
    >>>
    >>> my @files;
    >>> my %seen;
    >>>
    >>> ?
    >>> Less typing, less chance for typos.


    >> You have a point, of course. My personal style, however, implies that
    >> I set each variable when I declare them. Even if it's not necessary,
    >> and even when they're empty.


    > my has a runtime effect of clearing variables.


    That's right, but you must be a real speed-demon if you're hoping to gain
    anything. But - I guess - a little here and a little there sums up to be
    something very big somewhere else. :)

    Just for the fun of it, I benchmarked this. Initializing a scalar, an
    array and a hash explicitly took more than twice as long as "leaving
    them alone".


    --
    Tore Aursand <>
    "A teacher is never a giver of truth - he is a guide, a pointer to the
    truth that each student must find for himself. A good teacher is
    merely a catalyst." -- Bruce Lee
     
    Tore Aursand, Nov 24, 2003
    #12
  13. Prabh

    Uri Guttman Guest

    >>>>> "TA" == Tore Aursand <> writes:

    >> my has a runtime effect of clearing variables.


    TA> That's right, but you must be a real speed-demon if you're hoping to gain
    TA> anything. But - I guess - a little here and a little there sums up to be
    TA> something very big somewhere else. :)

    TA> Just for the fun of it, I benchmark'ed this. Setting a scalar, an array
    TA> and a hash explicit took more than twice the time than "leaving them
    TA> alone".

    good to know but i don't assign () or undef in my as it is redundant and
    poor style IMO. the higher speed is nice as well.

    uri

    --
    Uri Guttman ------ -------- http://www.stemsystems.com
    --Perl Consulting, Stem Development, Systems Architecture, Design and Coding-
    Search or Offer Perl Jobs ---------------------------- http://jobs.perl.org
     
    Uri Guttman, Nov 24, 2003
    #13
