count the number of occurrences of each different word of a text

Mr_Noob · Dec 26, 2007

Hi all

I have a text that contains nearly 30000 words.
I'd like to know the number of occurrences of each word of this text
and sort the result in descending order.
Is this easily feasible using Perl? If so, how? (I am a huge noob..)

thanks in advance

Jürgen Exner · Dec 26, 2007

Mr_Noob said:
I have a text that contains nearly 30000 words.
I'd like to know the number of occurrences of each word of this text
and sort the result in descending order.
Is this easily feasible using Perl?

Very simple actually.

If so, how ? (i am a huge noob..)

Just split() the string into words and count each word pattern in a hash,
using the word pattern as the key and the counter as the value.
Then sort the keys of that hash into an array by order of their values.
And then print them.

my $unhelpfulFAQ =
' How can I count the number of occurrences of a substring within a
string?
There are a number of ways, with varying efficiency. If you want
a count of a certain single character (X) within a string, you
can use the "tr///" function like so:

$string = "ThisXlineXhasXsomeXxsXinXit";
$count = ($string =~ tr/X//);
print "There are $count X characters in the string";

This is fine if you are just looking for a single character.
However, if you are trying to count multiple character
substrings within a larger string, "tr///" wont work. What you
can do is wrap a while() loop around a global pattern match. For
example, lets count negative integers:

$string = "-9 55 48 -2 23 -76 4 14 -44";
while ($string =~ /-\d+/g) { $count++ }
print "There are $count negative numbers in the string";';

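my %count;
# grep { length } drops the empty leading field that split returns
# when the string starts with a non-word character (here, a space)
for (grep { length } split /\W+/, $unhelpfulFAQ) {
    $count{$_}++;
}
# order the words by descending count
my @sorted = sort {
    $count{$b} <=> $count{$a}
} keys %count;
for (@sorted) {
    print "$_: \t $count{$_}\n";
}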

nolo contendere · Dec 26, 2007

Mr_Noob said:
I have a text that contains nearly 30000 words.
I'd like to know the number of occurrences of each word of this text
and sort the result in descending order.
Is this easily feasible using Perl? If so, how? (I am a huge noob..)

use strict; use warnings;

my %word_count;
my $text = 'the quick brown fox jumped over the lazy dog. the dog
liked it.';

my @words = map { s/\.$//; $_; } split ' ', $text;
for ( @words ) {
$word_count{$_}++;
}

for my $word ( sort { $word_count{$b} <=> $word_count{$a} } keys
%word_count ) {
print "$word => $word_count{$word}\n";
}

__OUTPUT__
the => 3
dog => 2
jumped => 1
over => 1
it => 1
liked => 1
lazy => 1
brown => 1
fox => 1
quick => 1

You may also want to sort on the word itself when the counts are equal,
so that the output order is consistent from run to run.
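
A minimal sketch of that tie-break, assuming the counts are already in a hash. The helper name sorted_words is made up for illustration and is not from the thread. The comparison falls through to an alphabetical cmp only when the two counts are equal:

```perl
use strict;
use warnings;

# Hypothetical helper: order words by descending count, then alphabetically.
sub sorted_words {
    my ($count) = @_;    # hash reference: word => count
    return sort { $count->{$b} <=> $count->{$a} or $a cmp $b } keys %$count;
}

my %word_count = ( the => 3, dog => 2, fox => 1, brown => 1 );
print "$_ => $word_count{$_}\n" for sorted_words( \%word_count );
```

With this ordering, words that share a count always print in the same sequence, regardless of hash ordering.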

Mr_Noob · Dec 26, 2007

Thanks a lot for your answers.
I gave your script a try and it works perfectly!
However, I'd like $text to be read from a file, but I can't find
a way to do so:

#!/usr/bin/perl -w
use strict; use warnings;
my %word_count;
my $text = system ("cat /Users/test/Desktop/mytext.txt");
my @words = map { s/\.$//; $_; } split ' ', $text;
for ( @words ) {
$word_count{$_}++;
}
for my $word ( sort { $word_count{$b} <=> $word_count{$a} } keys
%word_count ) {
print "$word => $word_count{$word}\n";
}

any idea?

thank you again

nolo contendere · Dec 26, 2007

Mr_Noob said:
Thanks a lot for your answers.
I gave your script a try and it works perfectly!
However, I'd like $text to be read from a file, but I can't find
a way to do so:

#!/usr/bin/perl -w
use strict; use warnings;
my %word_count;
my $text = system ("cat /Users/test/Desktop/mytext.txt");

instead of what you have above, create a function to "slurp" the file
into a scalar variable.

sub slurp_file {

##------------------------------------------------------------------
## Reads contents of a text file into a string
##
## INPUTS: 1) Filename
##
## OUTPUTS: 1) String which contains contents of input file
##
my ( $filename ) = @_;
open my $fh, '<', $filename or die "can't open $filename: $!";
my $text = do { local $/; <$fh> };
close $fh;
return $text;
}

then, just call it:

my $text = slurp_file( '/Users/test/Desktop/mytext.txt' );
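
For context on why the original attempt did not work: system() returns the command's exit status, not its output, so $text held a number while cat printed the file straight to the terminal. The sketch below puts the pieces from this thread together. The count_words helper and the use of a command-line argument for the file name are assumptions added for illustration:

```perl
use strict;
use warnings;

# Same slurp approach as the post above, repeated so the sketch runs on its own.
sub slurp_file {
    my ($filename) = @_;
    open my $fh, '<', $filename or die "can't open $filename: $!";
    my $text = do { local $/; <$fh> };    # undef $/ reads the whole file at once
    close $fh;
    return $text;
}

# Hypothetical helper wrapping the counting loop from earlier in the thread.
sub count_words {
    my ($text) = @_;
    my %word_count;
    $word_count{$_}++ for map { s/\.$//; $_ } split ' ', $text;
    return \%word_count;
}

# Usage (assumption: the file name is passed on the command line).
if (@ARGV) {
    my $count = count_words( slurp_file( $ARGV[0] ) );
    for my $word ( sort { $count->{$b} <=> $count->{$a} } keys %$count ) {
        print "$word => $count->{$word}\n";
    }
}
```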

Mr_Noob · Dec 26, 2007

Yes! Perfect ! thanks a lot !

Jürgen Exner · Dec 26, 2007

nolo contendere said:
instead of what you have above, create a function to "slurp" the file
into a scalar variable.

Would you mind explaining your reasoning, please?
You are processing the file/string in a totally linear manner (no going
back). Therefore I don't see any reason to slurp in the whole file in one
piece instead of just processing it line by line.

jue

nolo contendere · Dec 26, 2007

Jürgen Exner said:
Would you mind explaining your reasoning, please?
You are processing the file/string in a totally linear manner (no going
back). Therefore I don't see any reason to slurp in the whole file in one
piece instead of just processing it line by line.

Well, it was laziness really. I was giving the OP a fish instead of
teaching how to fish. The code I delivered was the easiest way (not
the best, I agree) I could think of to plug into the existing code,
and that fulfilled the OP's need.

It doesn't really matter for small amounts of data. I wouldn't do it
the same way if the nature of the problem were different.

Mr_Noob · Dec 26, 2007

Well, I still have a little problem. Here is a sample of my output:

dog => 5
cat => 3
dog, => 2
...

How can I avoid the distinction between a word and a word followed by
a comma?

nolo contendere · Dec 26, 2007

Mr_Noob said:
Well, I still have a little problem. Here is a sample of my output:

dog => 5
cat => 3
dog, => 2
...

How can I avoid the distinction between a word and a word followed by
a comma?

in this line:

my @words = map { s/\.$//; $_; } split ' ', $text;

...I strip a trailing period from each word. You can change this to strip
trailing commas as well, either with an alternation in the regex, with a
character class, or with a separate s/// statement.
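
A minimal sketch of the character-class variant, assuming only trailing periods and commas need to go. The helper name words_without_trailing_punct is made up for illustration. The extra grep drops tokens that consisted of punctuation only:

```perl
use strict;
use warnings;

# Hypothetical helper: split on whitespace, then strip trailing '.' and ','.
sub words_without_trailing_punct {
    my ($text) = @_;
    # [.,]+$ removes one or more trailing periods or commas from each token;
    # grep { length } discards tokens that are empty after stripping
    return grep { length } map { s/[.,]+$//; $_ } split ' ', $text;
}

my @words = words_without_trailing_punct('dog, cat. dog');
print "@words\n";
```

Other punctuation (semicolons, quotes, parentheses) would need to be added to the class in the same way.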

Uri Guttman · Dec 26, 2007

N> Yes! Perfect ! thanks a lot !

not so perfect. it is slow and doesn't support various useful options.
check out File::Slurp on cpan and you won't have to cut/paste that
code. and nolo, you should use it too.

uri

Uri Guttman · Dec 26, 2007

nc> Well, it was laziness really. I was giving the OP a fish instead of
nc> teaching how to fish. The code I delivered was the easiest way (not
nc> the best, I agree) I could think of to plug into the existing code,
nc> and that fulfilled the OP's need.

that was easier than use File::Slurp?

uri

nolo contendere · Dec 26, 2007

>> >instead of what you have above, create a function to "slurp" the file
>> >into a scalar variable.
>>
>> Mind to explain your reasoning, please?
>> You are processing the file/string in a totally linear manner (no going
>> back). Therefore I don't see any reason to slurp in the whole file in one
>> piece instead of just processing it line by line.
>>
>> jue

nc> Well, it was laziness really. I was giving the OP a fish instead of
nc> teaching how to fish. The code I delivered was the easiest way (not
nc> the best, I agree) I could think of to plug into the existing code,
nc> and that fulfilled the OP's need.

Uri Guttman said:
that was easier than use File::Slurp?

Uri, no, that was not easier than use File::Slurp, but I like to avoid
using modules for trivial tasks if I can, mainly due to bureaucratic
restrictions imposed by sysadmins, etc. If the OP had total control
over his environment, installing tested and optimized modules would be
the preferred solution.
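
One possible way to reconcile the two positions, sketched under the assumption that File::Slurp may or may not be installed: load it at run time and fall back to the hand-rolled version otherwise. The helper name read_whole_file is made up for illustration. read_file is the File::Slurp function that returns the file contents:

```perl
use strict;
use warnings;

# Hypothetical helper: read a whole file into a string, using File::Slurp
# when it is installed and a core-only fallback otherwise.
sub read_whole_file {
    my ($filename) = @_;

    # require inside eval, so a missing module is not fatal
    if ( eval { require File::Slurp; 1 } ) {
        return scalar File::Slurp::read_file($filename);
    }

    # fallback: the same slurp idiom used earlier in the thread
    open my $fh, '<', $filename or die "can't open $filename: $!";
    my $text = do { local $/; <$fh> };
    close $fh;
    return $text;
}
```

The calling code stays the same on every machine, whether or not the sysadmins allow CPAN modules.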

Jürgen Exner · Dec 26, 2007

nolo contendere said:
my @words = map { s/\.$//; $_; } split ' ', $text;

...I strip any words of a period. you can change this to strip them of
commas as well, either with a logical 'or', or a character class, or
with a separate s/// statement.

Is there a specific reason why you are using this awful map and s///
instead of just split()ting at non-word characters?

my @words = split /\W+/, $text;

jue

nolo contendere · Dec 26, 2007

Jürgen Exner said:
Is there a specific reason why you are using this awful map and s///
instead of just split()ting at non-word characters?

my @words = split /\W+/, $text;

That works very well, except when dealing with my ex-wife and her
cohorts.

Jürgen Exner · Dec 26, 2007

nolo contendere said:
Well, it was laziness really. I was giving the OP a fish instead of
teaching how to fish. The code I delivered was the easiest way (not
the best, I agree) I could think of to plug into the existing code,
and that fulfilled the OP's need.

A valid reason, although I don't quite agree. Wrapping this piece of code into a
while (my $text = <FH>) {
...
}
loop would have been even easier than defining a new sub. At least IMO.
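
A minimal sketch of that line-by-line approach, assuming an already opened filehandle and the \W+ split suggested earlier. The helper name count_words_from_handle is made up for illustration. Only one line is held in memory at a time:

```perl
use strict;
use warnings;

# Hypothetical helper: count words from an open filehandle, one line at a time.
sub count_words_from_handle {
    my ($fh) = @_;
    my %word_count;
    while ( my $line = <$fh> ) {
        # grep { length } drops the empty leading field that split returns
        # when a line starts with a non-word character
        $word_count{$_}++ for grep { length } split /\W+/, $line;
    }
    return \%word_count;
}

# Usage (assumption: the file name is passed on the command line).
if (@ARGV) {
    open my $fh, '<', $ARGV[0] or die "can't open $ARGV[0]: $!";
    my $count = count_words_from_handle($fh);
    close $fh;
    print "$_ => $count->{$_}\n"
        for sort { $count->{$b} <=> $count->{$a} } keys %$count;
}
```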

jue

Ben Morrow · Dec 27, 2007

Quoth nolo contendere said:
That works very well, except when dealing with my ex-wife and her
cohorts.

So use your own definition of 'word' (\w is not a good idea in this case
anyway, as it includes '_'):

my @words = split /[^[:alnum:]-]+/, $text;
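
As an alternative sketch, the same idea can be written as a match instead of a split: describe what a word is and collect every match. This is a swapped-in technique, not what the post above uses. The helper name extract_words is made up, and the rule "alphanumerics, optionally joined by single hyphens" is an assumption about what should count as a word:

```perl
use strict;
use warnings;

# Hypothetical helper: return every "word" in the text, where a word is a run
# of alphanumerics, optionally joined by single hyphens (e.g. "ex-wife").
sub extract_words {
    my ($text) = @_;
    # in list context, a /g match without capture groups returns all matches
    return $text =~ /[[:alnum:]]+(?:-[[:alnum:]]+)*/g;
}

my @words = extract_words('my ex-wife and her_cohorts -- done');
print join( '|', @words ), "\n";
```

Matching avoids the empty fields that split can produce, and a stray "--" or a leading hyphen never ends up as part of a word.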

Ben

Ted Zlatanov · Dec 28, 2007

nc> Uri, no, that was not easier than use File::Slurp, but I like to avoid
nc> using modules for trivial tasks if I can, mainly due to bureaucratic
nc> restrictions imposed by sysadmins, etc. If the OP had total control
nc> over his environment, installing tested and optimized modules would be
nc> the preferred solution.

Slurping files is not trivial; it only looks that way. Look at
File::Slurp to see how complicated it is when done right.

As far as CPAN goes, I often hear the complaint about bureaucracy getting
in the way. Is there something other than CPAN::AutoINC (which just
calls CPAN to install the missing modules) that will do run-time
retrieval of the modules, put them in a temporary place, and load them?
For pure Perl modules that would work well, especially if a local mirror
was used. I looked on CPAN but couldn't find something like this.

Ted

nolo contendere · Dec 28, 2007

nc> Uri, no, that was not easier than use File::Slurp, but I like to avoid
nc> using modules for trivial tasks if I can, mainly due to bureaucratic
nc> restrictions imposed by sysadmins, etc. If the OP had total control
nc> over his environment, installing tested and optimized modules would be
nc> the preferred solution.

Ted Zlatanov said:
Slurping files is not trivial; it only looks that way. Look at
File::Slurp to see how complicated it is when done right.

Hmm, I suppose I should amend my earlier statement to the effect that
slurping files for the most common cases is trivial. Uri himself
stated this 4 years ago (http://www.perl.com/pub/a/2003/11/21/slurp.html):

Traditional Slurping

Perl has always supported slurping files with minimal code. Slurping
of a file to a list of lines is trivial, just call the <> operator
in a list context:

my @lines = <FH> ;

and slurping to a scalar isn't much more work. Just set the built in
variable $/ (the input record separator) to the undefined value and
read in the file with <>:

open( my $fh, $file ) or die "sudden flaming death\n"

Ted Zlatanov said:
As far as CPAN goes, I often hear the complaint about bureaucracy getting
in the way. Is there something other than CPAN::AutoINC (which just
calls CPAN to install the missing modules) that will do run-time
retrieval of the modules, put them in a temporary place, and load them?
For pure Perl modules that would work well, especially if a local mirror
was used. I looked on CPAN but couldn't find something like this.

I'm unaware of anything like that, Ted. This seems messy though,
particularly when one considers that there are multiple environments to
think of, and a user won't always have access to certain directory
structures, or those structures won't even exist in the prod
environment. Then there are the permission issues, etc. Seems much
simpler just to use the trivial self-rolled code.

Ted Zlatanov · Dec 28, 2007

nc> Hmm, I suppose I should amend my earlier statement to the effect that
nc> Slurping files for the most common cases is trivial.

Agreed. Know your inputs and you'll know what's right.

nc> I'm unaware of anything like that, Ted. This seems messy though,
nc> particularly when one considers that there are multiple environments
nc> to think of, and a user won't always have access to certain
nc> directory structures, or those structures won't even exist in the
nc> prod environment. Then there are the permission issues, etc. Seems
nc> much simpler just to use the trivial self-rolled code.

These are the things that Perl makes easy, though. File::Temp for
instance will work in most cases to give you temporary storage (or
IO::Scalar in a pinch, to do I/O to a scalar). I think it's an
interesting idea and I could swear it's been implemented already (but
Google and CPAN searches didn't turn anything up).

Ted
