counting word occurances

Rodrick Brown · Jun 3, 2005

Hello,

Just learning Perl so bare with me.

I have the following output file:

pear
apple
apple
orange
mango
mango
pear
cherry
apple

ill would like the count the ammount of occurances for each fruit.

I spent a few hours trying to do this and just gave up if someone can help
me out with an example or a better way to do this than the method i'm trying
to use

This is as far as I got

#!/usr/bin/perl -w

use strict;

my @keys;
my @fruits;
my %cnt;
my $types;
my $f = 1;
my $m;

open(LOG,"/tmp/fruits.txt") || die("Can't open file: $!\n");
while(<LOG>)
{
next if(/^\s+/);
push(@fruits,$_);
}

# Give all fruits a default value of 1
foreach my $types (@fruits)
{
$cnt{$types} = $f;
}

foreach $types (@fruits)
{
@keys = keys %cnt;
while(@keys)
{
my $fruitnames = pop(@keys);
if($types =~ m/$fruitnames/)
{
$cnt{$types}++;
print "$cnt{$types} $fruitnames";
}
}
}

The code doesnt work and i'm a bit fustrated that I couldnt get it working,
many times I thought I had it but I never did get the results I expected.

John Bokma · Jun 3, 2005

Rodrick said:
Hello,

Just learning Perl so bare with me.

I have the following output file:

pear
apple
apple
orange
mango
mango
pear
cherry
apple

ill would like the count the ammount of occurances for each fruit.

I spent a few hours trying to do this and just gave up if someone can
help me out with an example or a better way to do this than the method
i'm trying to use

This is as far as I got

#!/usr/bin/perl -w

don't use -w, use warnings; instead:

use strict;

use warnings;

my @keys;
my @fruits;
my %cnt;
my $types;
my $f = 1;
my $m;

do this when you need it, not ahead of time. I replace them with:

my $filename = '/tmp/fruits.txt';
my %count;

open(LOG,"/tmp/fruits.txt") || die("Can't open file: $!\n");

open my $fh, $filename or die "Can't open '$filename' for reading: $!";

while ( my $line = <$fh> ) {

$line =~ s/^\s+//; # remove leading whitespace
$line =~ s/\s+$//; # remove trailing whitespace (and \n)

next if $line eq ''; # skip empty lines

$count{ $line }++;
}

close $fh or die "Can't close '$filename' after reading: $!"

Note that some magic happens here and there, like incrementing an
undefined entry in a hash table (%count) assumes it had a value of zero.

Also note that I use an undefined variable in open, so it can be used as
a file handle.

foreach $types (@fruits)
{
@keys = keys %cnt;
while(@keys)
{
my $fruitnames = pop(@keys);
if($types =~ m/$fruitnames/)
{
$cnt{$types}++;
print "$cnt{$types} $fruitnames";
}
}
}

I don't even want to guess what's going on here

print "$count{$_} $_\n"
for sort { $count{ $b } <=> $count{ $a } } keys %count;

Since I have $b to the left, it sorts the keys of %count descending
based on the count of each item.

I recommend reading a bit more on hash tables, the use of for(each), and
open.

(all code untested)

Jürgen Exner · Jun 3, 2005

Rodrick said:
Just learning Perl so bare with me.

There isn't really much Perl involved here except for the hash.

I have the following output file:

I guess you mean input file?

pear
apple
apple
orange
mango
mango
pear
cherry
apple

ill would like the count the ammount of occurances for each fruit.

I spent a few hours trying to do this and just gave up if someone can
help me out with an example or a better way to do this than the
method i'm trying to use

This is as far as I got

[code snipped]

Sorry, this is so convoluted, I'm not even trying to understand what you may
have been thinking when writing it.

The following code works:

use warnings; use strict;
my %cnt;
open(LOG,"/tmp/fruits.txt") or die("Can't open file: $!\n");
while(<LOG>){
s/^\s*//; #remove leading white space
s/\s*$//; #remove trailing white space
$cnt{$_}++; #count this fruit
}
delete $cnt{''}; #delete empty key in case we picked up an empty line

for (keys(%cnt)){#print the whole set
print "$cnt{$_} $_\n";
}

jue

John Bokma · Jun 3, 2005

Jürgen Exner said:
delete $cnt{''}; #delete empty key in case we picked up an empty line

Must remember that one, more readable then next if $line eq '';

A. Sinan Unur · Jun 3, 2005

I'd rather not be naked with strangers

....

This is as far as I got

Click to expand...

[code snipped]

Sorry, this is so convoluted,
Agreed.

while(<LOG>){
s/^\s*//; #remove leading white space
s/\s*$//; #remove trailing white space
$cnt{$_}++; #count this fruit
}
delete $cnt{''}; #delete empty key in case we picked up an empty line

Or:

while(<LOG>) {
next unless /^\s*(\w+)\s*$/;
$cnt{$1}++;
}

Sinan

Jürgen Exner · Jun 3, 2005

A. Sinan Unur said:
Or:
next unless /^\s*(\w+)\s*$/;

See
perldoc -q "strip blank"

Another difference between our solutions would be the handling of lines that
contain more than one single word, e.g. "green grapes" or "mini-tomatos".
Which behaviour the OP wants is everybody's guess.

jue

Tad McClellan · Jun 3, 2005

John Bokma said:
Must remember that one, more readable then next if $line eq '';

More readable than what I use too:

next unless length $line;

Jürgen Exner · Jun 3, 2005

Tad said:
More readable than what I use too:
next unless length $line;

I think my approach should be faster, too, because it eliminates the "if"
test for every single line.

jue

A. Sinan Unur · Jun 3, 2005

See
perldoc -q "strip blank"

Hasty post on my part. However, note a couple of differences between the
comparison in the FAQ and my suggestion:

Although the simplest approach would seem to be

$string =~ s/^\s*(.*?)\s*$/$1/;

not only is this unnecessarily slow and destructive, it also fails
with embedded newlines.

Well, my suggestion does not involve s///, so the bit about
'destructive' is not applicable. Embedded newlines also are not an issue
because we are reading line-by-line from a file. As for speed:

#! /usr/bin/perl

use strict;
use warnings;

use Benchmark ':all';

my $INPUT = [
'pear ',
' apple ',
'apple',
' orange ',
' mango ',
'mango',
' pear',
' cherry ',
'apple',
'',
];

sub capture {
my @input = @{ $INPUT };
my %counts;

for (@input) {
if( /^\s*(\w+)\s*$/ ) {
$counts{$1}++;
}
}
}

sub strip {
my @input = @{ $INPUT };
my %counts;

for (@input) {
s/^\s*//;
s/\s*$//;
$counts{$_}++;
}
delete $counts{''};
}

cmpthese 0, {
capture => \&capture,
strip => \&strip,
};

__END__

D:\Home> perl -v
This is perl, v5.8.6 built for MSWin32-x86-multi-thread

D:\Home> st
Rate capture strip
capture 29936/s -- -2%
strip 30640/s 2% --

OK, you have a point there (and I knew it even before I ran the
benchmark.

Another difference between our solutions would be the handling of
lines that contain more than one single word, e.g. "green grapes" or
"mini-tomatos". Which behaviour the OP wants is everybody's guess.

On the other hand, *this* is the crux of the matter, isn't it? Being as
expressive as one can be (in Perl) about what part of the input string
one wants to use enables others to be able to figure out what the code
was meant to do. So, in that sense, me using (\w+) is not such a good
idea. After all, words really do not contain digits.

So, I might even use:

if( /^\s*([[:alpha:]]+)\s*$/ ) {
$counts{$1}++;
}

or even

my %accept = map { $_ => 1 } qw{pear apple mango cherry};

....

if( /^\s*(.+?)\s*$/ and $accept{$1}) {
$counts{$1}++;
}

This is even slower, but it allows me to count only the input I want to
count.

There is some value in that.

Sinan

Gunnar Hjalmarsson · Jun 3, 2005

Jürgen Exner said:
See
perldoc -q "strip blank"

That FAQ entry comments on the s/// operator. Is that applicable to
capturing a value via the m// operator too?

Another difference between our solutions would be the handling of lines that
contain more than one single word, e.g. "green grapes" or "mini-tomatos".

while (<LOG>) { /(\S(?:.*\S))/ and $cnt{$1}++ or next }

vali · Jun 3, 2005

Jürgen Exner said:
See
perldoc -q "strip blank"

Another difference between our solutions would be the handling of lines that
contain more than one single word, e.g. "green grapes" or "mini-tomatos".
Which behaviour the OP wants is everybody's guess.

jue

Wasn't aware about the above faq. I've been using for years:
s/(^\s+|\s+$)//g;
which seems to be the same (or not ?!) as:
s/^\s*//; s/\s*$//;

__Vali

Gunnar Hjalmarsson · Jun 3, 2005

Gunnar said:
That FAQ entry comments on the s/// operator. Is that applicable to
capturing a value via the m// operator too?

Judging from Sinan's benchmark it's not.
http://groups-beta.google.com/group/comp.lang.perl.misc/msg/76160d73413fba1c

A. Sinan Unur · Jun 3, 2005

Jürgen Exner wrote:
....

Wasn't aware about the above faq. I've been using for years:
s/(^\s+|\s+$)//g;
which seems to be the same (or not ?!) as:
s/^\s*//; s/\s*$//;

Not functionally the same. Your expression requires at least one \s either
at the beginning or the end.

Second, it uses alternation in the regex which is generally more expensive.

Third, you are unnecessarily capturing.

use strict;
use warnings;
#! /usr/bin/perl

use strict;
use warnings;

use Benchmark ':all';

my $INPUT = [
'pear ',
' apple ',
'apple',
' orange ',
' mango ',
'mango',
' pear',
' cherry ',
'apple',
'',
];

sub faq {
my @input = @{ $INPUT };
for (@input) {
s/^\s*//;
s/\s*$//;
}
}

sub vali {
my @input = @{ $INPUT };
for (@input) {
s/

?^\s*)|

?\s*$)//g;
}
}

cmpthese 0, {
faq => \&faq,
vali => \&vali,
};

__END__

D:\Home>perl -v

This is perl, v5.8.6 built for MSWin32-x86-multi-thread

D:\Home>perl t.pl
Rate vali faq
vali 2655/s -- -63%
faq 7198/s 171% --

Oooops!

Sinan

John W. Krahn · Jun 3, 2005

A. Sinan Unur said:
Jürgen Exner wrote:

Click to expand...

Wasn't aware about the above faq. I've been using for years:
s/(^\s+|\s+$)//g;
which seems to be the same (or not ?!) as:
s/^\s*//; s/\s*$//;

Click to expand...

Not functionally the same. Your expression requires at least one \s either
at the beginning or the end.

Second, it uses alternation in the regex which is generally more expensive.

Third, you are unnecessarily capturing.

use strict;
use warnings;
#! /usr/bin/perl

use strict;
use warnings;

use Benchmark ':all';

my $INPUT = [
'pear ',
' apple ',
'apple',
' orange ',
' mango ',
'mango',
' pear',
' cherry ',
'apple',
'',
];

sub faq {
my @input = @{ $INPUT };
for (@input) {
s/^\s*//;
s/\s*$//;
}
}

The FAQ uses \s+ instead of \s* which is more efficient.

John

A. Sinan Unur · Jun 3, 2005

....

....

The FAQ uses \s+ instead of \s* which is more efficient.

Thank you for the correction. I failed to notice the typo in vali's post.

I was suprised to see just how much more efficient \s+ was compared to \s*.

Finally, this supports my assertion that s/(^\s+|\s+$)//g; is not the same
as what the answer to the FAQ recommends.

#! /usr/bin/perl

use strict;
use warnings;

use Benchmark ':all';

my $INPUT = [
'pear ',
' apple ',
'apple',
' orange ',
' mango ',
'mango',
' pear',
' cherry ',
'apple',
'',
];

sub s1 {
my @input = @{ $INPUT };

for (@input) {
s/^\s*//;
s/\s*$//;
}
}

sub faq {
my @input = @{ $INPUT };

for (@input) {
s/^\s+//;
s/\s+$//;
}
}

cmpthese 0, {
s1 => \&s1,
faq => \&faq,
};

__END__

D:\Home>perl t.pl
Rate s1 faq
s1 8638/s -- -25%
faq 11489/s 33% --

Sinan

Tad McClellan · Jun 3, 2005

A. Sinan Unur said:
I was suprised to see just how much more efficient \s+ was compared to \s*.

That shouldn't be too surprising after applying some intuition.

Patterns with required elements describe a smaller set of matching
strings, and they allow the regex engine to "fail early" as soon
as it is determined that the required thing is not where it is
required to be.

Kjetil Skotheim · Jun 5, 2005

Rodrick Brown ([email protected]) wrote on MMMMCCXCIV September
MCMXCIII in <URL:[] Hello,
[]
[] Just learning Perl so bare with me.
[]
[] I have the following output file:
[]
[] pear
[] apple
[] apple
[] orange
[] mango
[] mango
[] pear
[] cherry
[] apple
[]
[] ill would like the count the ammount of occurances for each fruit.

No need for perl here. Just some standard shell tools will do a fine job:

grep -v '^ ' input_file | sort | uniq -c

Abigail

No, that will not count " apple" and " orange" or any other lines
starting
with one or more spaces. This will work better:

perl -ple 's/^\s+//' input_file | sort | uniq -c

Or:

perl -nle 's/^\s+//;$c{$_}++;END{printf("%6d %s\n",$_,$c{$_}) for
sort{$c{$a}<=>$c{$b}}keys%c}' input_file

Anno Siegel · Jun 6, 2005

A. Sinan Unur said:
news:V3Yne.1300$mb2.1255@trnddc07:

D:\Home> st
Rate capture strip
capture 29936/s -- -2%
strip 30640/s 2% --

OK, you have a point there (and I knew it even before I ran the
benchmark.

No. 2% difference in a benchmark doesn't constitute a point. Another
benchmark could have them the other way 'round. Results differ much
more than that across compilers and/or machines. On mine, benchmark
has "capture" in favor by 12%.

Anno

Anno Siegel · Jun 6, 2005

A. Sinan Unur said:
Not functionally the same. Your expression requires at least one \s either
at the beginning or the end.

Second, it uses alternation in the regex which is generally more expensive.

Third, you are unnecessarily capturing.

Fourth, the alternation keeps the second part from matching right away
when there are both leading and trailing blanks. The regex must be
applied again (via /g), so that its first alternative can fail, and the
second one can match the trailing blanks.

A pattern that starts with "^" and ends with "$" shouldn't normally
need /g.

Anno

A. Sinan Unur · Jun 6, 2005

(e-mail address removed)-berlin.de (Anno Siegel) wrote in [email protected]:

No. 2% difference in a benchmark doesn't constitute a point.
Another benchmark could have them the other way 'round.
Results differ much more than that across compilers and/or machines.
On mine, benchmark has "capture" in favor by 12%.

Indeed. Two hasty posts on my part in the same thread.

asu1@recex:/home/asu1/.tmp > perl -v

This is perl, v5.8.6 built for i386-freebsd-64int

asu1@recex:/home/asu1/.tmp > perl ttt.pl
Rate strip capture
strip 20283/s -- -13%
capture 23272/s 15% --

Thanks for pointing this out.

Sinan

Efficiently de-duping an array	8	Aug 24, 2007
counting word occurances	2	Jun 1, 2005
parsing a variable length record from a mixed format file	13	May 8, 2007
pulling data from a file/piping variables to if statement...	2	Feb 1, 2011
Term::Prompt broken? Or am I misreading documentation?	3	Feb 10, 2004
fix a per script	2	Jul 30, 2008
uninitialized value	4	Jun 23, 2007
Cannot have locale word characters in a variable	9	Sep 2, 2013

counting word occurances

Rodrick Brown

John Bokma

Jürgen Exner

John Bokma

A. Sinan Unur

Jürgen Exner

Tad McClellan

Jürgen Exner

A. Sinan Unur

Gunnar Hjalmarsson

vali

Gunnar Hjalmarsson

A. Sinan Unur

John W. Krahn

A. Sinan Unur

Tad McClellan

Kjetil Skotheim

Anno Siegel

Anno Siegel

A. Sinan Unur

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads