Find length of files


Michael Preminger

Hello!

I have a huge directory, for which I need the word-count of all files
(like wc -w *, and then put all the lengths into a database)

Is there a smart way to do it in perl? (apart from wc * > file and then
open the file..)

Thanks

Michael
 

John Bokma

Michael said:
Hello!

I have a huge directory, for which I need the word-count of all files
(like wc -w *, and then put all the lengths into a database)

Is there a smart way to do it in perl? (apart from wc * > file and then
open the file..)

foreach file:
    open the file,
    put each word in a hash, like $words{$word}++;
this way you get the count of each word.
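
A rough sketch of that idea, assuming the file names are given on the
command line (just one possible reading of the suggestion above):

#!/usr/bin/perl
use strict;
use warnings;

my %words;                          # word => number of times it occurs
while (my $line = <>) {
    $words{$_}++ for split ' ', $line;
}
print "$_ $words{$_}\n" for sort keys %words;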
 

A. Sinan Unur

Rusty said:
The wordcount is not the number of times a word occurs, John.
It's the number of words in a file.

Just a minor point: You _can_ obtain the number of words in a file by
adding the number of occurrences of each word in the file.
 

John Bokma

Rusty said:
The wordcount is not the number of times a word occurs, John.
It's the number of words in a file.

Ah, ok, then do something like:

$words_in_file{$filename}++;

for every word you find :-D.
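
A rough sketch of what that per-file tally might look like, again
assuming the file names come from the command line:

#!/usr/bin/perl
use strict;
use warnings;

my %words_in_file;                  # filename => word count
while (my $line = <>) {
    my @words = split ' ', $line;
    $words_in_file{$ARGV} += @words;
}
print "$_: $words_in_file{$_}\n" for sort keys %words_in_file;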
Rusty said:
Using wc will probably be faster than the perl way (unless there's a
perl command for doing wordcounts I'm not aware of),

Uhm, I guess for every file you have to fork, so I doubt it.

Hard to say, without benchmarking.
 

John Bokma

Purl said:
That syntax produces three outputs, line count, word
count and number of characters. Some say number of
characters is file size but I am not sure if this
includes file headers or not. I have never compared
character count to file size. Perhaps they are the
same measure.

Guess so: " -c, --bytes, --chars
print the byte counts"

(man wc)

File headers are of course something a program like wc can hardly
guess at; they depend on the specification of the file format.
 

Tad McClellan

Michael Preminger said:
I have a huge directory, for which I need the word-count of all files
(like wc -w * , and then put all length into a database)

Is there a smart way to do it in perl?


Yes and no, depending on the definition of "smart". :)

(apart from wc * > file and then
open the file..)


I don't like shelling-out for things that are easily done
in native Perl.

Perhaps you can adapt this "wc -w" workalike one-liner to your purposes?

perl -ane '$c+=@F; print("$c $ARGV\n"), $c=0 if eof(ARGV)' *


or suitable for a Real Program:

my $cnt = 0;
while ( <> ) {
    my @words = split;
    $cnt += @words;
    if ( eof(ARGV) ) {
        printf "$cnt $ARGV\n";
        $cnt = 0;
    }
}
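
If the counts also have to end up in a database, as the original post
asks, something along these lines might do; the table name, schema and
the SQLite driver here are assumptions, not anything given in the
thread:

#!/usr/bin/perl
use strict;
use warnings;
use DBI;

# hypothetical table: word_counts(filename TEXT, words INTEGER)
my $dbh = DBI->connect('dbi:SQLite:dbname=counts.db', '', '',
                       { RaiseError => 1, AutoCommit => 1 });
$dbh->do('CREATE TABLE IF NOT EXISTS word_counts (filename TEXT, words INTEGER)');
my $sth = $dbh->prepare('INSERT INTO word_counts (filename, words) VALUES (?, ?)');

my $cnt = 0;
while ( <> ) {
    my @words = split;
    $cnt += @words;
    if ( eof(ARGV) ) {
        $sth->execute($ARGV, $cnt);     # one row per file
        $cnt = 0;
    }
}
$dbh->disconnect;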
 

Joe Smith

Purl said:
That syntax produces three outputs, line count, word
count and number of characters. Some say number of
characters is file size but I am not sure if this
includes file headers or not. I have never compared
character count to file size. Perhaps they are the
same measure.

What do you mean by file headers?
On the systems I work with, files do not have any headers.
The character count is exactly equal to the file size.
-Joe
 

Anno Siegel

Tad McClellan said:
Yes and no, depending on the definition of "smart". :)




I don't like shelling-out for things that are easily done
in native Perl.

Perhaps you can adapt this "wc -w" workalike one-liner to your purposes?

perl -ane '$c+=@F; print("$c $ARGV\n"), $c=0 if eof(ARGV)' *


or suitable for a Real Program:

my $cnt = 0;
while ( <> ) {
    my @words = split;
    $cnt += @words;
    if ( eof(ARGV) ) {
        printf "$cnt $ARGV\n";
        $cnt = 0;
    }
}

Alternatively, tr/// can be used if speed is an issue but space isn't.

my $cnt;
for ( do { local $/; <> } ) {
    tr/\n\t / /s;    # replace sequences of white space with single blanks
    $cnt = tr/ //;   # count blanks
}

Because split() ignores trailing white space but tr/// doesn't, the
tr/// count may be one higher than the split() count, but that's
small stuff :)

Anno
 

Rusty Phillips

John said:
Uhm, I guess for every file you have to fork, so I doubt it.

You only have to run wc once ("wc -w *"), so there should only be one
fork. Because wc is a compiled program designed especially for this
purpose, it is hopefully faster than perl at fetching and reading all
of the files - enough so to overcome the penalty incurred by forking
once (probably - a benchmark is needed to be sure).

In addition, it makes the perl coding simpler. You don't have to
bother with globbing, opening and closing all the files, or scanning
through them. All you have to do is parse the
output from the wc command.
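
A rough sketch of that parsing step, assuming the usual "count filename"
output format of wc (plus a trailing "total" line when several files are
given):

#!/usr/bin/perl
use strict;
use warnings;

my %count_for;                              # filename => word count
open my $wc, '-|', 'wc', '-w', glob('*')
    or die "Cannot run wc: $!";
while (my $line = <$wc>) {
    my ($count, $name) = $line =~ /^\s*(\d+)\s+(.*)$/ or next;
    next if $name eq 'total';               # skip the summary line
    $count_for{$name} = $count;
}
close $wc;

print "$_: $count_for{$_}\n" for sort keys %count_for;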
 

Anno Siegel

Purl Gurl said:
You need to add one to your final count. The first
word of a file will not be counted when a space
counting method is used.

Depends.

If there is white space after the last word, but none before the
first one, every word is followed by one or more white-space characters,
so the count is correct. Most texts read from a file are like that.

If white space precedes the first word, the count is one high, if none
follows the last word, it's one low. So the correction is

my $wordcount = tr/ // - /^[^\w]/ + /\w$/;
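
For illustration, the correction applied to a few made-up strings,
building on the tr/// normalization above (strict/warnings are left out
because the failed-match value is deliberately used as 0):

for my $text ("  two words", "two words  ", "two words") {
    local $_ = $text;                 # work on a copy so tr/// can modify it
    tr/\n\t / /s;                     # squeeze runs of white space to single blanks
    my $wordcount = tr/ // - /^[^\w]/ + /\w$/;
    print "$wordcount words in \"$text\"\n";   # prints 2 each time
}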

Anno
 

John Bokma

Rusty said:
You only have to run wc once ("wc -w *"), so there should only be one
fork. Because wc is a compiled program designed especially for this

Ah, indeed. Getting rusty ;-)
 
