counting words in a file

wana

This morning, I wrote a little script to count words in a file. I did
this one on my PDA, which happened to have 'Moby Dick' saved on the
storage card. Writing Perl scripts on the PDA is a little like
playing video games... I wanted to print to a file the list of words
and the number of occurrences of each, in descending order of
occurrence. The following seemed to work, although I initially wrote
my counting loops wrong and only counted the first occurrence per
line. I was wondering if I got it right and if there is a better way
to do it. 'Programming Perl' mentions the Schwartzian map-sort-map
technique, which I thought might apply. Also, for the loop that does
the counting below, I tried it this way:

$words{lc $1}++ while /(\w+)/gi for (<INF>);

but it did not work.


#!/usr/bin/perl

use strict;
my $fn = "/textucation/moby10b.txt";
open (INF, $fn) or die "error: $!";
my %words;
for (<INF>)
{
$words{lc $1}++ while /(\w+)/gi;
}
my @n = %words;
@n = reverse @n;
%words = @n;
my @k = keys %words;
sub num {$b <=> $a}
@k = sort num @k;
open (OUTF, ">file count.txt") or die "error: $!";
print OUTF "$_ $words{$_}\r\n" for (@k);

thanks!

wana
 
James Willmore

I put a little something together :)

=begin
#!/usr/bin/perl

use strict;
use warnings;

my $infile = 'words_in.txt';
my $outfile = 'words_out.txt';
my %words;

open IN, $infile
or die "Can't open $infile for reading: $!\n";

while(<IN>) {
#get rid of newlines
chomp;
#change all non-words into spaces
s/\W/ /g;
#split on whitespace and place the results into an array
my @line = split;
#place each lowercase word from @line as a key into hash %words
#increment the count for the key
$words{lc $_}++ for(@line);
# ... or, you could do ... and get rid of the above 2 steps
#$words{lc $_}++ for( split );
}

close IN;

open OUT, '+>', $outfile
or die "Can't open $outfile for writing: $!\n";

my $total = 0;
#sort the keys alphabetically, then reverse the sorted list
for(reverse sort keys %words) {
print OUT "$_: $words{$_}\n";
#increment total word count
$total += $words{$_};
}

print OUT "total: $total\n";

close OUT;
=cut

HTH

Jim
 
Anno Siegel

wana said:
> This morning, I wrote a little script to count words in a file. I did
> this one on my pda, which happened to have 'Moby Dick' saved on the
> storage card. Writing Perl scripts on the pda is a little like
> playing video games... I wanted to print to a file the list of words
> and the number of occurrence in descending order of occurrence. The
> following seemed to work, although I initially did my counting loops
> wrong and only counted the first occurrence per line. I was wondering
> if I got it right and if there is a better way to do it. 'Programming
> Perl' mentions the Schwartzian map-sort-map technique which I thought
> might apply.

A Schwartzian won't gain much here. Once you get the sorting right
(see below), access the sort key through a hash. With a Schwartzian
you'd access it through an array built only for that purpose.

> Also, the loop that does the counting below; I tried it
> this way:
>
> $words{lc $1}++ while /(\w+)/gi for (<INF>);
>
> but it did not work.

No. Statement modifiers only work on simple statements, not ones
already modified.
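
To illustrate the point about modifiers, here is a minimal sketch of two forms that do parse. It assumes the lines are held in an in-memory list (@lines is a made-up stand-in for <INF>), and %words2 exists only to show the second variant:

```perl
use strict;
use warnings;

# Stand-in for the lines read from <INF>; the sample text is made up.
my @lines = ("The whale, the whale!\n", "Call me Ishmael.\n");

my %words;

# One modifier per statement: the outer loop becomes a block,
# the inner loop stays a statement modifier.
for (@lines) {
    $words{lc $1}++ while /(\w+)/g;
}

# Alternative without the inner loop: a match in list context returns
# every capture on the line, so a single 'for' modifier is enough.
my %words2;
$words2{lc $_}++ for map { /(\w+)/g } @lines;

print "$_ $words{$_}\n" for sort keys %words;
```

Both variants should produce the same counts; which one reads better is a matter of taste.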
> #!/usr/bin/perl
>
> use strict;

No warnings?

> my $fn = "/textucation/moby10b.txt";
> open (INF, $fn) or die "error: $!";
> my %words;
> for (<INF>)
> {
> $words{lc $1}++ while /(\w+)/gi;

The /i modifier does nothing and shouldn't be there.

I'd use split without arguments for that. You break "gable-ended
Spouter-Inn" into "gable" and "ended" plus "Spouter" and "Inn"; split
keeps the hyphenated words.

Okay so far, you have a hash of words and their count now. What follows
is wrong.

> my @n = %words;
> @n = reverse @n;
> %words = @n;

%words = reverse %words;

That would do the same, but it would be equally wrong. You can only
reverse a hash without loss of information when the values are unique,
but your word counts aren't. Looking at your output, have you noticed
that there appears to be only one word Melville used exactly once?
Every other word he must have used at least twice. That is unlikely,
and is indeed an artifact of this hash reversal.
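
A small sketch of that information loss, using made-up counts in which two words share the count 1 (%by_count is a hypothetical name for the reversed hash):

```perl
use strict;
use warnings;

# Made-up counts: 'ship' and 'sea' both occur once.
my %words = (whale => 3, ship => 1, sea => 1);

# Reversing swaps keys and values. 'ship' and 'sea' now compete for
# the key 1, and only one of them survives.
my %by_count = reverse %words;

printf "%d words before, %d entries after\n",
    scalar(keys %words), scalar(keys %by_count);
# prints "3 words before, 2 entries after"
```

Which of the two colliding words survives depends on hash ordering, so it should not be relied on.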
> my @k = keys %words;
> sub num {$b <=> $a}
> @k = sort num @k;

Scratch everything after the counting loop and simply do

my @k = sort { $words{$b} <=> $words{$a} } keys %words;

That's your sort.
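
As a sketch of that sort on made-up counts: the alphabetical tie-break after `or` is an addition for illustration, not part of the suggestion above; it only keeps words with equal counts in a predictable order.

```perl
use strict;
use warnings;

# Made-up counts with one tie ('whale' and 'sea').
my %words = (the => 5, whale => 3, sea => 3, descry => 1);

# Descending by count; ties fall back to alphabetical order so the
# output is the same from run to run.
my @k = sort { $words{$b} <=> $words{$a} or $a cmp $b } keys %words;

print "$_ $words{$_}\n" for @k;
```

The sort key is looked up directly in %words, which is why no map-sort-map wrapper is needed here.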
> open (OUTF, ">file count.txt") or die "error: $!";
> print OUTF "$_ $words{$_}\r\n" for (@k);

Code untested.

Anno
 
James Willmore

> I didn't test it, and maybe it isn't the most efficient solution...
> So what about:
>
> while(<IN>){
> chomp;
> s/\W/ /g;
> $file.=' '.$_;
> }
>
> @words=split(/\s+/,$file);
> $wordCount=@words;

No reverse or sort in your example, which is what the OP wanted :-(
Also, the OP is working with a PDA. PDAs have a limited amount of
memory. I wrote what I wrote to take this into account. Your example may
work fine on a "real" system, but is not likely to work as kindly on a
system with limited resources (like a PDA). And ... you're counting *all*
the words together, instead of each occurrence of each word as the OP did
in their example (the total was my idea). And ... there are far easier ways
to read a whole file into a scalar value (think 'undef $/').
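
For reference, a minimal sketch of the 'undef $/' idea, written with `local $/` so the change stays confined to one block. The temporary file and its text are made up so that the example is self-contained:

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Create a small throwaway file; the text is made up.
my ($tmp, $path) = tempfile(UNLINK => 1);
print {$tmp} "Call me Ishmael.\nSome years ago\n";
close $tmp or die "Can't close $path: $!";

open my $in, '<', $path or die "Can't open $path for reading: $!";
my $text = do {
    local $/;   # input record separator is undef inside this block only
    <$in>;      # a single read now returns the whole file
};
close $in;

my @words = split ' ', $text;
print scalar(@words), " words\n";
```

Slurping keeps the whole text in memory at once, which is the concern raised above for a PDA. Note that `for (<INF>)` also reads every line into a list before looping, whereas `while (<INF>)` processes one line at a time.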

Jim
 
ioneabu

> No warnings?

Sorry, I had no good reason to leave it out, really. Interestingly, as
fast as the program runs on the PDA (400 MHz XScale, 32 MB RAM), any
'use' statements add a second or two to the run time. It's no big deal
while playing around, but if I were to actually use a script for real
work, I would probably comment out 'use strict' and 'use warnings' and
avoid any absolutely unnecessary modules.

> The /i modifier does nothing and shouldn't be there.

Don't you need the /i so words like 'The' and 'the' are counted as the
same word?

> I'd use split without arguments for that. You break "gable-ended
> Spouter-Inn" into "gable" and "ended" plus "Spouter" and "Inn", split
> keeps the hyphenated words.
>
> Okay so far, you have a hash of words and their count now. What follows
> is wrong.
>
> %words = reverse %words;
>
> That would do the same, but it would be equally wrong. You can only
> reverse a hash without loss of information when the values are unique,
> but your word counts aren't. Looking at your output, have you noticed
> that there appears to be only one word Melville used exactly once?
> Every other word he must have used at least twice. That is unlikely,
> and is indeed an artifact of this hash reversal.
>
> Scratch everything after the counting loop and simply do
>
> my @k = sort { $words{$b} <=> $words{$a} } keys %words;

Cool, thanks! Good thing I was doing this for fun and not being paid
for the results :) I guess I should test on a smaller text file with
known results. Thank you for trying out my code on the same text. If
anyone is looking for vast stores of text to play with, the Gutenberg
Project (http://promo.net/pg/) has many gigabytes of plain text files of
classic literature. I have to make sure to leave out the introduction
GP puts at the beginning of the file in my searches. Melville
certainly didn't use the word 'computer', but my search came up with a
couple of instances because of their intro.

> That's your sort.
>
> Anno

Thanks again!

wana
 
John W. Krahn

> don't you need the /i so words like 'The' and 'the' are counted as the
> same word?

The \w character class includes all upper and lower case letters so using /i
is superfluous.
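
A quick way to check this, sketched on a made-up sample line (@with_i and @without_i are illustrative names):

```perl
use strict;
use warnings;

my $line = "The whale and the Whale";   # made-up sample text

# \w already matches letters of both cases, so the captures are
# identical with and without /i.
my @with_i    = $line =~ /(\w+)/gi;
my @without_i = $line =~ /(\w+)/g;

# Case folding for the count comes from lc, not from the regex.
my %words;
$words{lc $_}++ for @without_i;

print "$_ $words{$_}\n" for sort keys %words;
```

The /i flag only matters when the pattern contains literal letters, as in a search for one specific word.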


John
 
wana

> don't you need the /i so words like 'The' and 'the' are counted as the
> same word?

I feel dumb about this one. Right after posting, I realized that it
doesn't matter if 'The' and 'the' are not matched as the same word.
They will be counted as the same word in the hash because they are
converted first by the lc function. Either way, it still wouldn't
matter since I am not looking for a particular word but using (\w+).
Now I remember, this was left over from the original version where I
was just counting a single word.

About using split without arguments, do you mean like this?:

for (<INF>) {$words{lc $_}++ for split}

I tried this and also got a lot of words with punctuation stuck to
them. Like:
"the
really?
also,
oh.

maybe I could use a locale where - is a word character or just specify
my character set to match:

for (<INF>) {$words{lc $1}++ while /([-a-zA-Z0-9]+)/g}

or even leave out the numbers:

for (<INF>) {$words{lc $1}++ while /([-a-zA-Z]+)/g}

or find dates only:

for (<INF>) {$words{lc $1}++ while /([0-9]{4})/g}

A little Perl could probably be really useful for some English
professors and students who hate computers and programming :) Then
again, they might not care about finding interesting patterns in
classic literature :-(

Maybe a quick way to look for misspellings and other problems for book
editors?

works great...Thanks!
 
Anno Siegel

[...]
> do you mean like this?:
>
> for (<INF>) {$words{lc $_}++ for split}
>
> I tried this and also got a lot of words with punctuation stuck to
> them. Like:
> "the
> really?
> also,
> oh.

Sure. But that's *more* differentiation than you want and could be
corrected even after the fact. If you split too much, you
irretrievably count parts of a compound with the partial words.

> maybe I could use a locale where - is a word character or just specify
> my character set to match:
>
> for (<INF>) {$words{lc $1}++ while /([-a-zA-Z0-9]+)/g}
>
> or even leave out the numbers:
>
> for (<INF>) {$words{lc $1}++ while /([-a-zA-Z]+)/g}
>
> or find dates only:
>
> for (<INF>) {$words{lc $1}++ while /([0-9]{4})/g}

There is a lot one can do, even more with alphabets that aren't
entirely ASCII. Then there's hyphenated words at the end of a line...

Here's my approximation (untested):

for ( <INF> ) {
$words{ lc()} ++ for split /[.,;:!?)]?\s+[(]?/;
}

(No, it doesn't deal with hyphen-separation. That would require a
line-spanning action, always a nuisance.) It does deal with single
punctuation characters at the end of a word and with single parentheses.
The tendency is to be conservative and register funny words as such,
but unify the most predictable cases.
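
To see what that split pattern does, here is a small sketch that applies it to one made-up line, with the newline left in place as it would be when the line is read from a file:

```perl
use strict;
use warnings;

# Made-up sample line, trailing newline included.
my $line = "Oh, really? (the gable-ended Spouter-Inn) also.\n";

my %words;
# Separator: one optional trailing punctuation character, the
# whitespace itself, and one optional opening parenthesis.
$words{lc $_}++ for split /[.,;:!?)]?\s+[(]?/, $line;

print "$_\n" for sort keys %words;
```

On this line the keys come out as also, gable-ended, oh, really, spouter-inn and the. Two things to keep in mind: the final period is only removed because the newline follows it, so a chomped line would keep "also."; and a line that starts with whitespace yields an empty leading field, which would be counted under the empty string.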

> A little Perl could probably be really useful for some English
> professors and students who hate computers and programming :) Then
> again, they might not care about finding interesting patterns in
> classic literature :-(
>
> Maybe a quick way to look for mispellings and other problems for book
> editors?

Oh, I suppose they have all the software they can use, from standard
word processors (possibly with specialized spelling lists) to high-priced
proprietary software. It would take more than a casual word counting
algorithm to offer something interesting to these folks.

But here's something I'd like to know. When I read (a good part of)
Moby Dick a while back, one of the words I had to look up was "descry".
I had never heard it before, but it came up again and again. Melville
seems to love it. So, what's the count of "descry" and "descried" in
Moby Dick, please?

Anno
 
ioneabu

That's a slightly easier problem I think. If you want the total of the
two:

#!/usr/bin/perl
$fn = "/textucation/moby10b.txt";
open (INF, $fn) or die "error: $!";
my $c;
for (<INF>){$c++ while /descry|descried/gi;}
print "found $c";

I get 26 total.
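
If separate counts for the two forms are wanted, a possible variant is sketched below on made-up sample lines (it has not been run against the book, so it makes no claim about the real numbers). The \b anchors are an addition; without them the pattern also matches inside longer words such as "descrying".

```perl
use strict;
use warnings;

# Made-up sample lines standing in for the book text.
my @lines = (
    "We descried a sail; to descry it sooner\n",
    "Descrying nothing, he descried the coast.\n",
);

my %count;
for (@lines) {
    # \b keeps 'descrying' out; lc folds 'Descry' and 'descry' together.
    $count{lc $1}++ while /\b(descry|descried)\b/gi;
}

print "$_: $count{$_}\n" for sort keys %count;
```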

Sorry for the poor posting format; I was forced into it by the new
Google Groups2 beta. It gets the posts up much faster, but it forces
this format in a reply to a post. Hopefully they will fix it so people
can reply in the format consistent with the group they are posting to.
wana
 
