counting word occurances

Discussion in 'Perl Misc' started by Rodrick Brown, Jun 3, 2005.

  1. Hello,

    Just learning Perl so bare with me.

    I have the following output file:

    pear
    apple
    apple
    orange
    mango
    mango
    pear
    cherry
    apple

    ill would like the count the ammount of occurances for each fruit.

    I spent a few hours trying to do this and just gave up if someone can help
    me out with an example or a better way to do this than the method i'm trying
    to use

    This is as far as I got

    #!/usr/bin/perl -w

    use strict;

    my @keys;
    my @fruits;
    my %cnt;
    my $types;
    my $f = 1;
    my $m;

    open(LOG,"/tmp/fruits.txt") || die("Can't open file: $!\n");
    while(<LOG>)
    {
    next if(/^\s+/);
    push(@fruits,$_);
    }

    # Give all fruits a default value of 1
    foreach my $types (@fruits)
    {
    $cnt{$types} = $f;
    }

    foreach $types (@fruits)
    {
    @keys = keys %cnt;
    while(@keys)
    {
    my $fruitnames = pop(@keys);
    if($types =~ m/$fruitnames/)
    {
    $cnt{$types}++;
    print "$cnt{$types} $fruitnames";
    }
    }
    }

    The code doesnt work and i'm a bit fustrated that I couldnt get it working,
    many times I thought I had it but I never did get the results I expected.

    --
    RB
     
    Rodrick Brown, Jun 3, 2005
    #1
    1. Advertising

  2. Rodrick Brown

    John Bokma Guest

    Rodrick Brown wrote:

    > Hello,
    >
    > Just learning Perl so bare with me.
    >
    > I have the following output file:
    >
    > pear
    > apple
    > apple
    > orange
    > mango
    > mango
    > pear
    > cherry
    > apple
    >
    > ill would like the count the ammount of occurances for each fruit.
    >
    > I spent a few hours trying to do this and just gave up if someone can
    > help me out with an example or a better way to do this than the method
    > i'm trying to use
    >
    > This is as far as I got
    >
    > #!/usr/bin/perl -w


    don't use -w, use warnings; instead:

    > use strict;


    use warnings;

    > my @keys;
    > my @fruits;
    > my %cnt;
    > my $types;
    > my $f = 1;
    > my $m;


    do this when you need it, not ahead of time. I replace them with:

    my $filename = '/tmp/fruits.txt';
    my %count;

    > open(LOG,"/tmp/fruits.txt") || die("Can't open file: $!\n");


    open my $fh, $filename or die "Can't open '$filename' for reading: $!";

    while ( my $line = <$fh> ) {

    $line =~ s/^\s+//; # remove leading whitespace
    $line =~ s/\s+$//; # remove trailing whitespace (and \n)

    next if $line eq ''; # skip empty lines

    $count{ $line }++;
    }

    close $fh or die "Can't close '$filename' after reading: $!"

    Note that some magic happens here and there, like incrementing an
    undefined entry in a hash table (%count) assumes it had a value of zero.

    Also note that I use an undefined variable in open, so it can be used as
    a file handle.

    > foreach $types (@fruits)
    > {
    > @keys = keys %cnt;
    > while(@keys)
    > {
    > my $fruitnames = pop(@keys);
    > if($types =~ m/$fruitnames/)
    > {
    > $cnt{$types}++;
    > print "$cnt{$types} $fruitnames";
    > }
    > }
    > }


    I don't even want to guess what's going on here :)

    print "$count{$_} $_\n"
    for sort { $count{ $b } <=> $count{ $a } } keys %count;

    Since I have $b to the left, it sorts the keys of %count descending
    based on the count of each item.

    I recommend reading a bit more on hash tables, the use of for(each), and
    open.

    (all code untested)

    --
    John Small Perl scripts: http://johnbokma.com/perl/
    Perl programmer available: http://castleamber.com/
    Happy Customers: http://castleamber.com/testimonials.html
     
    John Bokma, Jun 3, 2005
    #2
    1. Advertising

  3. Rodrick Brown wrote:
    > Just learning Perl so bare with me.


    There isn't really much Perl involved here except for the hash.

    > I have the following output file:


    I guess you mean input file?

    > pear
    > apple
    > apple
    > orange
    > mango
    > mango
    > pear
    > cherry
    > apple
    >
    > ill would like the count the ammount of occurances for each fruit.
    >
    > I spent a few hours trying to do this and just gave up if someone can
    > help me out with an example or a better way to do this than the
    > method i'm trying to use
    >
    > This is as far as I got

    [code snipped]

    Sorry, this is so convoluted, I'm not even trying to understand what you may
    have been thinking when writing it.

    The following code works:

    use warnings; use strict;
    my %cnt;
    open(LOG,"/tmp/fruits.txt") or die("Can't open file: $!\n");
    while(<LOG>){
    s/^\s*//; #remove leading white space
    s/\s*$//; #remove trailing white space
    $cnt{$_}++; #count this fruit
    }
    delete $cnt{''}; #delete empty key in case we picked up an empty line

    for (keys(%cnt)){#print the whole set
    print "$cnt{$_} $_\n";
    }

    jue
     
    Jürgen Exner, Jun 3, 2005
    #3
  4. Rodrick Brown

    John Bokma Guest

    John Bokma, Jun 3, 2005
    #4
  5. "Jürgen Exner" <> wrote in
    news:ngRne.38532$GN3.26737@trnddc04:

    > Rodrick Brown wrote:
    >> Just learning Perl so bare with me.


    I'd rather not be naked with strangers ;)

    ....

    >> This is as far as I got

    > [code snipped]
    >
    > Sorry, this is so convoluted,


    Agreed.

    > while(<LOG>){
    > s/^\s*//; #remove leading white space
    > s/\s*$//; #remove trailing white space
    > $cnt{$_}++; #count this fruit
    > }
    > delete $cnt{''}; #delete empty key in case we picked up an empty line


    Or:

    while(<LOG>) {
    next unless /^\s*(\w+)\s*$/;
    $cnt{$1}++;
    }

    Sinan
    --
    A. Sinan Unur <>
    (reverse each component and remove .invalid for email address)

    comp.lang.perl.misc guidelines on the WWW:
    http://mail.augustmail.com/~tadmc/clpmisc/clpmisc_guidelines.html
     
    A. Sinan Unur, Jun 3, 2005
    #5
  6. A. Sinan Unur wrote:
    > "Jürgen Exner" <> wrote in
    >> s/^\s*//; #remove leading white space
    >> s/\s*$//; #remove trailing white space

    >
    > Or:
    > next unless /^\s*(\w+)\s*$/;


    See
    perldoc -q "strip blank"

    Another difference between our solutions would be the handling of lines that
    contain more than one single word, e.g. "green grapes" or "mini-tomatos".
    Which behaviour the OP wants is everybody's guess.

    jue
     
    Jürgen Exner, Jun 3, 2005
    #6
  7. John Bokma <> wrote:
    > Jürgen Exner wrote:
    >
    >> delete $cnt{''}; #delete empty key in case we picked up an empty line

    >
    > Must remember that one, more readable then next if $line eq '';



    More readable than what I use too:

    next unless length $line;


    --
    Tad McClellan SGML consulting
    Perl programming
    Fort Worth, Texas
     
    Tad McClellan, Jun 3, 2005
    #7
  8. Tad McClellan wrote:
    > John Bokma <> wrote:
    >> Jürgen Exner wrote:
    >>
    >>> delete $cnt{''}; #delete empty key in case we picked up an empty
    >>> line

    >>
    >> Must remember that one, more readable then next if $line eq '';

    >
    > More readable than what I use too:
    > next unless length $line;


    I think my approach should be faster, too, because it eliminates the "if"
    test for every single line.

    jue
     
    Jürgen Exner, Jun 3, 2005
    #8
  9. "Jürgen Exner" <> wrote in
    news:V3Yne.1300$mb2.1255@trnddc07:

    > A. Sinan Unur wrote:
    >> "Jürgen Exner" <> wrote in
    >>> s/^\s*//; #remove leading white space
    >>> s/\s*$//; #remove trailing white space

    >>
    >> Or:
    >> next unless /^\s*(\w+)\s*$/;

    >
    > See
    > perldoc -q "strip blank"


    Hasty post on my part. However, note a couple of differences between the
    comparison in the FAQ and my suggestion:

    Although the simplest approach would seem to be

    $string =~ s/^\s*(.*?)\s*$/$1/;

    not only is this unnecessarily slow and destructive, it also fails
    with embedded newlines.

    Well, my suggestion does not involve s///, so the bit about
    'destructive' is not applicable. Embedded newlines also are not an issue
    because we are reading line-by-line from a file. As for speed:

    #! /usr/bin/perl

    use strict;
    use warnings;

    use Benchmark ':all';

    my $INPUT = [
    'pear ',
    ' apple ',
    'apple',
    ' orange ',
    ' mango ',
    'mango',
    ' pear',
    ' cherry ',
    'apple',
    '',
    ];

    sub capture {
    my @input = @{ $INPUT };
    my %counts;

    for (@input) {
    if( /^\s*(\w+)\s*$/ ) {
    $counts{$1}++;
    }
    }
    }

    sub strip {
    my @input = @{ $INPUT };
    my %counts;

    for (@input) {
    s/^\s*//;
    s/\s*$//;
    $counts{$_}++;
    }
    delete $counts{''};
    }

    cmpthese 0, {
    capture => \&capture,
    strip => \&strip,
    };

    __END__

    D:\Home> perl -v
    This is perl, v5.8.6 built for MSWin32-x86-multi-thread

    D:\Home> st
    Rate capture strip
    capture 29936/s -- -2%
    strip 30640/s 2% --

    OK, you have a point there (and I knew it even before I ran the
    benchmark.

    > Another difference between our solutions would be the handling of
    > lines that contain more than one single word, e.g. "green grapes" or
    > "mini-tomatos". Which behaviour the OP wants is everybody's guess.


    On the other hand, *this* is the crux of the matter, isn't it? Being as
    expressive as one can be (in Perl) about what part of the input string
    one wants to use enables others to be able to figure out what the code
    was meant to do. So, in that sense, me using (\w+) is not such a good
    idea. After all, words really do not contain digits.

    So, I might even use:

    if( /^\s*([[:alpha:]]+)\s*$/ ) {
    $counts{$1}++;
    }

    or even

    my %accept = map { $_ => 1 } qw{pear apple mango cherry};

    ....

    if( /^\s*(.+?)\s*$/ and $accept{$1}) {
    $counts{$1}++;
    }

    This is even slower, but it allows me to count only the input I want to
    count.

    There is some value in that.

    Sinan

    --
    A. Sinan Unur <>
    (reverse each component and remove .invalid for email address)

    comp.lang.perl.misc guidelines on the WWW:
    http://mail.augustmail.com/~tadmc/clpmisc/clpmisc_guidelines.html
     
    A. Sinan Unur, Jun 3, 2005
    #9
  10. Jürgen Exner wrote:
    > A. Sinan Unur wrote:
    >> "Jürgen Exner" <> wrote in
    >>>
    >>> s/^\s*//; #remove leading white space
    >>> s/\s*$//; #remove trailing white space

    >>
    >> Or:
    >> next unless /^\s*(\w+)\s*$/;

    >
    > See
    > perldoc -q "strip blank"


    That FAQ entry comments on the s/// operator. Is that applicable to
    capturing a value via the m// operator too?

    > Another difference between our solutions would be the handling of lines that
    > contain more than one single word, e.g. "green grapes" or "mini-tomatos".


    while (<LOG>) { /(\S(?:.*\S))/ and $cnt{$1}++ or next }

    --
    Gunnar Hjalmarsson
    Email: http://www.gunnar.cc/cgi-bin/contact.pl
     
    Gunnar Hjalmarsson, Jun 3, 2005
    #10
  11. Rodrick Brown

    vali Guest

    Jürgen Exner wrote:
    > A. Sinan Unur wrote:
    >
    >>"Jürgen Exner" <> wrote in
    >>
    >>> s/^\s*//; #remove leading white space
    >>> s/\s*$//; #remove trailing white space

    >>
    >>Or:
    >> next unless /^\s*(\w+)\s*$/;

    >
    >
    > See
    > perldoc -q "strip blank"
    >
    > Another difference between our solutions would be the handling of lines that
    > contain more than one single word, e.g. "green grapes" or "mini-tomatos".
    > Which behaviour the OP wants is everybody's guess.
    >
    > jue
    >
    >


    Wasn't aware about the above faq. I've been using for years:
    s/(^\s+|\s+$)//g;
    which seems to be the same (or not ?!) as:
    s/^\s*//; s/\s*$//;

    __Vali
     
    vali, Jun 3, 2005
    #11
  12. Gunnar Hjalmarsson wrote:
    > Jürgen Exner wrote:
    >> A. Sinan Unur wrote:
    >>> "Jürgen Exner" <> wrote in
    >>>>
    >>>> s/^\s*//; #remove leading white space
    >>>> s/\s*$//; #remove trailing white space
    >>>
    >>> Or:
    >>> next unless /^\s*(\w+)\s*$/;

    >>
    >> See
    >> perldoc -q "strip blank"

    >
    > That FAQ entry comments on the s/// operator. Is that applicable to
    > capturing a value via the m// operator too?


    Judging from Sinan's benchmark it's not.
    http://groups-beta.google.com/group/comp.lang.perl.misc/msg/76160d73413fba1c

    --
    Gunnar Hjalmarsson
    Email: http://www.gunnar.cc/cgi-bin/contact.pl
     
    Gunnar Hjalmarsson, Jun 3, 2005
    #12
  13. vali <> wrote in
    news:r80oe.6452$:

    > Jürgen Exner wrote:


    >> See
    >> perldoc -q "strip blank"
    >>
    >> Another difference between our solutions would be the handling of
    >> lines that contain more than one single word, e.g. "green grapes" or
    >> "mini-tomatos". Which behaviour the OP wants is everybody's guess.


    ....

    > Wasn't aware about the above faq. I've been using for years:
    > s/(^\s+|\s+$)//g;
    > which seems to be the same (or not ?!) as:
    > s/^\s*//; s/\s*$//;



    Not functionally the same. Your expression requires at least one \s either
    at the beginning or the end.

    Second, it uses alternation in the regex which is generally more expensive.

    Third, you are unnecessarily capturing.

    use strict;
    use warnings;
    #! /usr/bin/perl

    use strict;
    use warnings;

    use Benchmark ':all';

    my $INPUT = [
    'pear ',
    ' apple ',
    'apple',
    ' orange ',
    ' mango ',
    'mango',
    ' pear',
    ' cherry ',
    'apple',
    '',
    ];

    sub faq {
    my @input = @{ $INPUT };
    for (@input) {
    s/^\s*//;
    s/\s*$//;
    }
    }

    sub vali {
    my @input = @{ $INPUT };
    for (@input) {
    s/:)?^\s*)|:)?\s*$)//g;
    }
    }

    cmpthese 0, {
    faq => \&faq,
    vali => \&vali,
    };

    __END__

    D:\Home>perl -v

    This is perl, v5.8.6 built for MSWin32-x86-multi-thread

    D:\Home>perl t.pl
    Rate vali faq
    vali 2655/s -- -63%
    faq 7198/s 171% --


    Oooops!

    Sinan
     
    A. Sinan Unur, Jun 3, 2005
    #13
  14. A. Sinan Unur wrote:
    > vali <> wrote in
    > news:r80oe.6452$:
    >
    >>Jürgen Exner wrote:

    >
    >>>See
    >>> perldoc -q "strip blank"
    >>>
    >>>Another difference between our solutions would be the handling of
    >>>lines that contain more than one single word, e.g. "green grapes" or
    >>>"mini-tomatos". Which behaviour the OP wants is everybody's guess.

    >
    >>Wasn't aware about the above faq. I've been using for years:
    >>s/(^\s+|\s+$)//g;
    >>which seems to be the same (or not ?!) as:
    >>s/^\s*//; s/\s*$//;

    >
    > Not functionally the same. Your expression requires at least one \s either
    > at the beginning or the end.
    >
    > Second, it uses alternation in the regex which is generally more expensive.
    >
    > Third, you are unnecessarily capturing.
    >
    > use strict;
    > use warnings;
    > #! /usr/bin/perl
    >
    > use strict;
    > use warnings;
    >
    > use Benchmark ':all';
    >
    > my $INPUT = [
    > 'pear ',
    > ' apple ',
    > 'apple',
    > ' orange ',
    > ' mango ',
    > 'mango',
    > ' pear',
    > ' cherry ',
    > 'apple',
    > '',
    > ];
    >
    > sub faq {
    > my @input = @{ $INPUT };
    > for (@input) {
    > s/^\s*//;
    > s/\s*$//;
    > }
    > }


    The FAQ uses \s+ instead of \s* which is more efficient.



    John
    --
    use Perl;
    program
    fulfillment
     
    John W. Krahn, Jun 3, 2005
    #14
  15. "John W. Krahn" <> wrote in
    news:HX2oe.31147$on1.29725@clgrps13:

    > A. Sinan Unur wrote:
    >> vali <> wrote in
    >> news:r80oe.6452$:


    ....

    >>>Wasn't aware about the above faq. I've been using for years:
    >>>s/(^\s+|\s+$)//g;
    >>>which seems to be the same (or not ?!) as:
    >>>s/^\s*//; s/\s*$//;


    ....
    >> sub faq {
    >> my @input = @{ $INPUT };
    >> for (@input) {
    >> s/^\s*//;
    >> s/\s*$//;
    >> }
    >> }

    >
    > The FAQ uses \s+ instead of \s* which is more efficient.


    Thank you for the correction. I failed to notice the typo in vali's post.

    I was suprised to see just how much more efficient \s+ was compared to \s*.

    Finally, this supports my assertion that s/(^\s+|\s+$)//g; is not the same
    as what the answer to the FAQ recommends.

    #! /usr/bin/perl

    use strict;
    use warnings;

    use Benchmark ':all';

    my $INPUT = [
    'pear ',
    ' apple ',
    'apple',
    ' orange ',
    ' mango ',
    'mango',
    ' pear',
    ' cherry ',
    'apple',
    '',
    ];

    sub s1 {
    my @input = @{ $INPUT };

    for (@input) {
    s/^\s*//;
    s/\s*$//;
    }
    }

    sub faq {
    my @input = @{ $INPUT };

    for (@input) {
    s/^\s+//;
    s/\s+$//;
    }
    }

    cmpthese 0, {
    s1 => \&s1,
    faq => \&faq,
    };

    __END__

    D:\Home>perl t.pl
    Rate s1 faq
    s1 8638/s -- -25%
    faq 11489/s 33% --

    Sinan
     
    A. Sinan Unur, Jun 3, 2005
    #15
  16. A. Sinan Unur <> wrote:

    > I was suprised to see just how much more efficient \s+ was compared to \s*.



    That shouldn't be too surprising after applying some intuition.

    Patterns with required elements describe a smaller set of matching
    strings, and they allow the regex engine to "fail early" as soon
    as it is determined that the required thing is not where it is
    required to be.


    --
    Tad McClellan SGML consulting
    Perl programming
    Fort Worth, Texas
     
    Tad McClellan, Jun 3, 2005
    #16
  17. On Sat, 04 Jun 2005 23:57:05 +0200, Abigail <> wrote:

    > Rodrick Brown () wrote on MMMMCCXCIV September
    > MCMXCIII in <URL:news:c7Qne.4832$>:
    > [] Hello,
    > []
    > [] Just learning Perl so bare with me.
    > []
    > [] I have the following output file:
    > []
    > [] pear
    > [] apple
    > [] apple
    > [] orange
    > [] mango
    > [] mango
    > [] pear
    > [] cherry
    > [] apple
    > []
    > [] ill would like the count the ammount of occurances for each fruit.
    >
    >
    > No need for perl here. Just some standard shell tools will do a fine job:
    >
    > grep -v '^ ' input_file | sort | uniq -c
    >
    > Abigail


    No, that will not count " apple" and " orange" or any other lines
    starting
    with one or more spaces. This will work better:

    perl -ple 's/^\s+//' input_file | sort | uniq -c

    Or:

    perl -nle 's/^\s+//;$c{$_}++;END{printf("%6d %s\n",$_,$c{$_}) for
    sort{$c{$a}<=>$c{$b}}keys%c}' input_file


    --
    Kjetil Skotheim
     
    Kjetil Skotheim, Jun 5, 2005
    #17
  18. Rodrick Brown

    Anno Siegel Guest

    A. Sinan Unur <> wrote in comp.lang.perl.misc:
    > "Jürgen Exner" <> wrote in
    > news:V3Yne.1300$mb2.1255@trnddc07:


    > D:\Home> st
    > Rate capture strip
    > capture 29936/s -- -2%
    > strip 30640/s 2% --
    >
    > OK, you have a point there (and I knew it even before I ran the
    > benchmark.


    No. 2% difference in a benchmark doesn't constitute a point. Another
    benchmark could have them the other way 'round. Results differ much
    more than that across compilers and/or machines. On mine, benchmark
    has "capture" in favor by 12%.

    Anno
     
    Anno Siegel, Jun 6, 2005
    #18
  19. Rodrick Brown

    Anno Siegel Guest

    A. Sinan Unur <> wrote in comp.lang.perl.misc:
    > vali <> wrote in
    > news:r80oe.6452$:
    >
    > > Jürgen Exner wrote:

    >
    > >> See
    > >> perldoc -q "strip blank"
    > >>
    > >> Another difference between our solutions would be the handling of
    > >> lines that contain more than one single word, e.g. "green grapes" or
    > >> "mini-tomatos". Which behaviour the OP wants is everybody's guess.

    >
    > ...
    >
    > > Wasn't aware about the above faq. I've been using for years:
    > > s/(^\s+|\s+$)//g;
    > > which seems to be the same (or not ?!) as:
    > > s/^\s*//; s/\s*$//;

    >
    >
    > Not functionally the same. Your expression requires at least one \s either
    > at the beginning or the end.
    >
    > Second, it uses alternation in the regex which is generally more expensive.
    >
    > Third, you are unnecessarily capturing.


    Fourth, the alternation keeps the second part from matching right away
    when there are both leading and trailing blanks. The regex must be
    applied again (via /g), so that its first alternative can fail, and the
    second one can match the trailing blanks.

    A pattern that starts with "^" and ends with "$" shouldn't normally
    need /g.

    Anno
     
    Anno Siegel, Jun 6, 2005
    #19
  20. -berlin.de (Anno Siegel) wrote in news:d81rrt$pst
    $-Berlin.DE:

    > A. Sinan Unur <> wrote in comp.lang.perl.misc:
    >> "Jürgen Exner" <> wrote in
    >> news:V3Yne.1300$mb2.1255@trnddc07:

    >
    >> D:\Home> st
    >> Rate capture strip
    >> capture 29936/s -- -2%
    >> strip 30640/s 2% --
    >>
    >> OK, you have a point there (and I knew it even before I ran the
    >> benchmark.

    >
    > No. 2% difference in a benchmark doesn't constitute a point.
    > Another benchmark could have them the other way 'round.
    > Results differ much more than that across compilers and/or machines.
    > On mine, benchmark has "capture" in favor by 12%.


    Indeed. Two hasty posts on my part in the same thread.

    asu1@recex:/home/asu1/.tmp > perl -v

    This is perl, v5.8.6 built for i386-freebsd-64int

    asu1@recex:/home/asu1/.tmp > perl ttt.pl
    Rate strip capture
    strip 20283/s -- -13%
    capture 23272/s 15% --

    Thanks for pointing this out.

    Sinan

    --
    A. Sinan Unur <>
    (reverse each component and remove .invalid for email address)

    comp.lang.perl.misc guidelines on the WWW:
    http://mail.augustmail.com/~tadmc/clpmisc/clpmisc_guidelines.html
     
    A. Sinan Unur, Jun 6, 2005
    #20
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. Rodrick Brown

    counting word occurances

    Rodrick Brown, Jun 1, 2005, in forum: Perl
    Replies:
    2
    Views:
    2,826
    Jürgen Exner
    Jun 2, 2005
  2. =?Utf-8?B?SmltIEhlYXZleQ==?=

    Multiple Occurances Of Value In String

    =?Utf-8?B?SmltIEhlYXZleQ==?=, Jun 29, 2004, in forum: ASP .Net
    Replies:
    4
    Views:
    535
    Martin Marinov
    Jun 29, 2004
  3. Franz Steinhaeusler
    Replies:
    9
    Views:
    510
    Fredrik Lundh
    Dec 15, 2004
  4. John
    Replies:
    8
    Views:
    397
    Fredrik Lundh
    Mar 11, 2006
  5. Sandman
    Replies:
    7
    Views:
    229
    Anno Siegel
    Aug 3, 2004
Loading...

Share This Page