Count the number of occurrences of each different word of a text

Mr_Noob

Hi all

I have a text that contains nearly 30000 words.
I'd like to know the number of occurrences of each word of this text
and sort the result in descending order.
Is this easily feasible using Perl? If so, how? (I am a huge noob..)

thanks in advance
 
Jürgen Exner

Mr_Noob said:
I have a text that contains nearly 30000 words.
I'd like to know the number of occurrences of each word of this text
and sort the result in descending order.
Is this easily feasible using Perl?

Very simple actually.

If so, how? (I am a huge noob..)

Just split() the string into words and count each word pattern in a hash,
using the word pattern as the key and the counter as the value.
Then sort the keys of that hash into an array by order of their values.
And then print them.

my $unhelpfulFAQ =
' How can I count the number of occurrences of a substring within a
string?
There are a number of ways, with varying efficiency. If you want
a count of a certain single character (X) within a string, you
can use the "tr///" function like so:

$string = "ThisXlineXhasXsomeXxsXinXit";
$count = ($string =~ tr/X//);
print "There are $count X characters in the string";

This is fine if you are just looking for a single character.
However, if you are trying to count multiple character
substrings within a larger string, "tr///" wont work. What you
can do is wrap a while() loop around a global pattern match. For
example, lets count negative integers:

$string = "-9 55 48 -2 23 -76 4 14 -44";
while ($string =~ /-\d+/g) { $count++ }
print "There are $count negative numbers in the string";';

my %count;
for (split /\W+/, $unhelpfulFAQ) {
    next unless length;    # the leading space yields one empty field
    $count{$_}++;
}
my @sorted = sort {
    $count{$b} <=> $count{$a}
} keys %count;
for (@sorted) {
    print "$_: \t $count{$_}\n";
}
 
nolo contendere

Hi all

I have a text that contains nearly 30000 words.
I'd like to know the number of occurrences of each word of this text
and sort the result in descending order.
Is this easily feasible using Perl? If so, how? (I am a huge noob..)

use strict; use warnings;

my %word_count;
my $text = 'the quick brown fox jumped over the lazy dog. the dog liked it.';

my @words = map { s/\.$//; $_; } split ' ', $text;
for ( @words ) {
    $word_count{$_}++;
}

for my $word ( sort { $word_count{$b} <=> $word_count{$a} } keys %word_count ) {
    print "$word => $word_count{$word}\n";
}

__OUTPUT__
the => 3
dog => 2
jumped => 1
over => 1
it => 1
liked => 1
lazy => 1
brown => 1
fox => 1
quick => 1


You may also want to sort on the word itself when the counts are tied,
so the output order is consistent from run to run.
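
A minimal sketch of that tie-break, assuming the counts are already in %word_count (the sample data below just mirrors the output above). The comparison falls through to a string compare on the word only when the two counts are equal:

```perl
use strict;
use warnings;

# Sample counts mirroring the output above (illustrative data only).
my %word_count = (
    the   => 3, dog  => 2, jumped => 1, over => 1, it    => 1,
    liked => 1, lazy => 1, brown  => 1, fox  => 1, quick => 1,
);

# Primary key: count, descending. Tie-break: the word itself, ascending.
my @ordered = sort {
    $word_count{$b} <=> $word_count{$a}
        or $a cmp $b
} keys %word_count;

print "$_ => $word_count{$_}\n" for @ordered;
```

Without the tie-break, words with equal counts come out in whatever order the hash happens to return its keys, which can differ between runs.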
 
Mr_Noob

Thanks a lot for your answers.
I gave your script a try and it works perfectly!
However, I'd like $text to be read from a file, but I can't find
a way to do so:

#!/usr/bin/perl -w
use strict; use warnings;
my %word_count;
my $text = system ("cat /Users/test/Desktop/mytext.txt");
my @words = map { s/\.$//; $_; } split ' ', $text;
for ( @words ) {
$word_count{$_}++;
}
for my $word ( sort { $word_count{$b} <=> $word_count{$a} } keys
%word_count ) {
print "$word => $word_count{$word}\n";
}

Any ideas?

Thank you again.
 
nolo contendere

Thanks a lot for your answers.
I gave your script a try and it works perfectly!
However, I'd like $text to be read from a file, but I can't find
a way to do so:

#!/usr/bin/perl -w
use strict; use warnings;
my %word_count;
my $text = system ("cat /Users/test/Desktop/mytext.txt");

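system() returns the command's exit status, not its output, so $text
never receives the file contents. Instead of what you have above, create
a function to "slurp" the file into a scalar variable.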

sub slurp_file {
    ##------------------------------------------------------------------
    ## Reads contents of a text file into a string
    ##
    ## INPUTS: 1) Filename
    ##
    ## OUTPUTS: 1) String which contains contents of input file
    ##
    my ( $filename ) = @_;
    open my $fh, '<', $filename or die "can't open $filename: $!";
    my $text = do { local $/; <$fh> };    # undef $/ reads the whole file at once
    close $fh;
    return $text;
}

then, just call it:

my $text = slurp_file( '/Users/test/Desktop/mytext.txt' );
 
Mr_Noob

Yes! Perfect! Thanks a lot!
 
Jürgen Exner

nolo contendere said:
instead of what you have above, create a function to "slurp" the file
into a scalar variable.

Would you mind explaining your reasoning, please?
You are processing the file/string in a totally linear manner (no going
back). Therefore I don't see any reason to slurp in the whole file in one
piece instead of just processing it line by line.

jue
 
nolo contendere

Would you mind explaining your reasoning, please?
You are processing the file/string in a totally linear manner (no going
back). Therefore I don't see any reason to slurp in the whole file in one
piece instead of just processing it line by line.

jue

Well, it was laziness really. I was giving the OP a fish instead of
teaching how to fish. The code I delivered was the easiest way (not
the best, I agree) I could think of to plug into the existing code,
and that fulfilled the OP's need.

It doesn't really matter for small amounts of data. I wouldn't do it
the same way if the nature of the problem were different.
 
Mr_Noob

Yes! Perfect ! thanks a lot !

Well, I still have a little problem. Here is a sample of my output:

dog => 5
cat => 3
dog, => 2
...

How can I avoid the distinction between a word and a word followed by
a comma?
 
nolo contendere

Well, I still have a little problem. Here is a sample of my output:

dog => 5
cat => 3
dog, => 2
...

How can I avoid the distinction between a word and a word followed by
a comma?

In this line:

my @words = map { s/\.$//; $_; } split ' ', $text;

...I strip a trailing period from each word. You can change this to strip
commas as well, either with alternation in the regex, with a character
class, or with a separate s/// statement.
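
As a rough sketch of the character-class option (the sample text is made up, and folding case with lc is an extra assumption that nobody in the thread asked for):

```perl
use strict;
use warnings;

my $text = 'The dog barked. The cat, the dog, and the bird left.';

my %word_count;
for my $word ( split ' ', $text ) {
    $word =~ s/[.,]+$//;          # character class: drop trailing periods and commas
    next if $word eq '';          # skip tokens that were punctuation only
    $word_count{ lc $word }++;    # assumption: "The" and "the" are the same word
}

for my $word ( sort { $word_count{$b} <=> $word_count{$a} or $a cmp $b } keys %word_count ) {
    print "$word => $word_count{$word}\n";
}
```

Adding more characters to the class (for example `[.,;:!?]`) extends the same idea to other trailing punctuation.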
 
Uri Guttman

N> Yes! Perfect ! thanks a lot !

not so perfect. it is slow and doesn't support various useful options.
check out File::Slurp on cpan and you won't have to cut/paste that
code. and nolo, you should use it too.

uri
 
Uri Guttman

nc> Well, it was laziness really. I was giving the OP a fish instead of
nc> teaching how to fish. The code I delivered was the easiest way (not
nc> the best, I agree) I could think of to plug into the existing code,
nc> and that fulfilled the OP's need.

that was easier than use File::Slurp?

uri
 
nolo contendere

  >> >instead of what you have above, create a function to "slurp" the file
  >> >into a scalar variable.
  >>
  >> Mind to explain your reasoning, please?
  >> You are processing the file/string in a totally linear manner (no going
  >> back). Therefore I don't see any reason to slurp in the whole file in one
  >> piece instead of just processing it line by line.
  >>
  >> jue

  nc> Well, it was laziness really. I was giving the OP a fish instead of
  nc> teaching how to fish. The code I delivered was the easiest way (not
  nc> the best, I agree) I could think of to plug into the existing code,
  nc> and that fulfilled the OP's need.

that was easier than use File::Slurp?

Uri, no, that was not easier than use File::Slurp, but I like to avoid
using modules for trivial tasks if I can, mainly due to bureaucratic
restrictions imposed by sysadmins, etc. If the OP had total control
over his environment, installing tested and optimized modules would be
the preferred solution.
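
One possible middle ground, sketched here under the assumption that the script may run on machines both with and without File::Slurp: try to load the module at run time and fall back to the core-only idiom if it is missing. The name slurp_any is hypothetical; read_file is the function File::Slurp exports for this purpose.

```perl
use strict;
use warnings;

# Hypothetical helper: return the whole file as one string.
# Uses File::Slurp when it is installed, otherwise the core-only idiom.
sub slurp_any {
    my ($filename) = @_;

    # require inside eval fails quietly if the module is not available
    if ( eval { require File::Slurp; 1 } ) {
        return scalar File::Slurp::read_file($filename);
    }

    open my $fh, '<', $filename or die "can't open $filename: $!";
    local $/;    # undefined record separator: one read returns everything
    my $text = <$fh>;
    close $fh;
    return $text;
}

# Usage (path taken from the thread):
# my $text = slurp_any('/Users/test/Desktop/mytext.txt');
```

Whether the extra branch is worth it depends on the environment; for a one-off script the plain open/local $/ version is likely enough.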
 
Jürgen Exner

nolo contendere said:
my @words = map { s/\.$//; $_; } split ' ', $text;

...I strip any words of a period. you can change this to strip them of
commas as well, either with a logical 'or', or a character class, or
with a separate s/// statement.

Is there a specific reason why you are using this awful map and s///
instead of just split()ting on non-word characters?

my @words = split /\W+/, $text;

jue
 
nolo contendere

Is there a specific reason why you are using this awful map and s///
instead of just split()ting on non-word characters?

        my @words = split /\W+/, $text;

That works very well, except when dealing with my ex-wife and her
cohorts.
 
Jürgen Exner

nolo contendere said:
Well, it was laziness really. I was giving the OP a fish instead of
teaching how to fish. The code I delivered was the easiest way (not
the best, I agree) I could think of to plug into the existing code,
and that fulfilled the OP's need.

A valid reason, although I don't quite agree. Wrapping this piece of code into a
while (my $text = <FH>) {
...
}
loop would have been even easier than defining a new sub. At least IMO.
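
A sketch of what that line-by-line version might look like. The helper name count_words_fh is made up for illustration, and the demo reads from an in-memory filehandle so it runs without a file on disk:

```perl
use strict;
use warnings;

# Hypothetical helper: count words from an open filehandle one line at
# a time, so the whole file is never held in memory at once.
sub count_words_fh {
    my ($fh) = @_;
    my %word_count;
    while ( my $line = <$fh> ) {
        for my $word ( split /\W+/, $line ) {
            next if $word eq '';    # leading punctuation yields an empty first field
            $word_count{$word}++;
        }
    }
    return \%word_count;
}

# Demo on an in-memory handle. For a real file, open it instead with:
#   open my $fh, '<', $filename or die "can't open $filename: $!";
my $sample = "the quick brown fox\njumped over the lazy dog.\nthe dog liked it.\n";
open my $fh, '<', \$sample or die "can't open in-memory handle: $!";
my $counts = count_words_fh($fh);
close $fh;

for my $word ( sort { $counts->{$b} <=> $counts->{$a} or $a cmp $b } keys %$counts ) {
    print "$word => $counts->{$word}\n";
}
```

For a 30000-word text the memory difference is negligible, as noted earlier in the thread; the loop mainly matters for much larger inputs.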

jue
 
Ben Morrow

Quoth nolo contendere said:
That works very well, except when dealing with my ex-wife and her
cohorts.

So use your own definition of 'word' (\w is not a good idea in this case
anyway, as it includes '_')

my @words = split /[^[:alnum:]-]/, $text;
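
To make the difference concrete, here is a small comparison on a made-up sentence. Two details are additions to the pattern as posted, not part of it: the `+` quantifier, so that a run of separators does not produce empty strings, and the grep, which drops tokens such as a bare "--" that contain no letter or digit.

```perl
use strict;
use warnings;

my $text = 'my ex-wife said: foo_bar, twice -- ex-wife!';

# \W+ treats '-' as a separator and '_' as part of a word.
my @by_w = split /\W+/, $text;

# The custom class keeps '-' inside words and treats '_' as a separator.
my @by_alnum = grep { /[[:alnum:]]/ } split /[^[:alnum:]-]+/, $text;

print "\\W+:    @by_w\n";
print "custom: @by_alnum\n";
```

With \W+ the hyphenated word is counted as "ex" and "wife"; with the custom class it stays "ex-wife", while "foo_bar" is split in two.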

Ben
 
Ted Zlatanov

nc> Uri, no, that was not easier than use File::Slurp, but I like to avoid
nc> using modules for trivial tasks if I can, mainly due to bureaucratic
nc> restrictions imposed by sysadmins, etc. If the OP had total control
nc> over his environment, installing tested and optimized modules would be
nc> the preferred solution.

Slurping files is not trivial, it only looks that way :) Look at
File::Slurp to see how complicated it is when done right.

As far as CPAN goes, I often hear the complaint about bureaucracy getting
in the way. Is there something other than CPAN::AutoINC (which just
calls CPAN to install the missing modules) that will do run-time
retrieval of the modules, put them in a temporary place, and load them?
For pure Perl modules that would work well, especially if a local mirror
was used. I looked on CPAN but couldn't find something like this.

Ted
 
nolo contendere

nc> Uri, no, that was not easier than use File::Slurp, but I like to avoid
nc> using modules for trivial tasks if I can, mainly due to bureaucratic
nc> restrictions imposed by sysadmins, etc. If the OP had total control
nc> over his environment, installing tested and optimized modules would be
nc> the preferred solution.

Slurping files is not trivial, it only looks that way :)  Look at
File::Slurp to see how complicated it is when done right.

Hmm, I suppose I should amend my earlier statement to the effect that
slurping files for the most common cases is trivial. Uri himself
stated this 4 years ago
(http://www.perl.com/pub/a/2003/11/21/slurp.html):

Traditional Slurping

Perl has always supported slurping files with minimal code. Slurping
of a file to a list of lines is trivial, just call the <> operator
in a list context:


my @lines = <FH> ;

and slurping to a scalar isn't much more work. Just set the built in
variable $/ (the input record separator) to the undefined value and
read in the file with <>:

open( my $fh, $file ) or die "sudden flaming death\n"

As far as CPAN goes, I often hear the complaint about bureaucracy getting
in the way.  Is there something other than CPAN::AutoINC (which just
calls CPAN to install the missing modules) that will do run-time
retrieval of the modules, put them in a temporary place, and load them?
For pure Perl modules that would work well, especially if a local mirror
was used.  I looked on CPAN but couldn't find something like this.

Ted

I'm unaware of anything like that, Ted. This seems messy though,
particularly when one considers that there are multiple environments to
think of, and a user won't always have access to certain directory
structures, or those structures won't even exist in the prod
environment. Then there are the permission issues, etc. Seems much
simpler just to use the trivial self-rolled code.
 
Ted Zlatanov

nc> Hmm, I suppose I should amend my earlier statement to the effect that
nc> Slurping files for the most common cases is trivial.

Agreed. Know your inputs and you'll know what's right :)

nc> I'm unaware of anything like that, Ted. This seems messy though,
nc> particularly when one considers that there are multiple environments
nc> to think of, and a user won't always have access to certain
nc> directory structures, or those structures won't even exist in the
nc> prod environment. Then there are the permission issues, etc. Seems
nc> much simpler just to use the trivial self-rolled code.

These are the things that Perl makes easy, though. File::Temp for
instance will work in most cases to give you temporary storage (or
IO::Scalar in a pinch, to do I/O to a scalar). I think it's an
interesting idea and I could swear it's been implemented already (but
Google and CPAN searches didn't turn anything up).

Ted