Find length of files

Michael Preminger · Jun 23, 2004

Hello!

I have a huge directory, for which I need the word-count of all files
(like wc -w * , and then put all length into a database)

Is there a smart way to do it in perl? (apart from wc * > file and then
open the file..)

Thanks

Michael

John Bokma · Jun 23, 2004

Michael said:
Hello!

I have a huge directory, for which I need the word-count of all files
(like wc -w * , and then put all length into a database)

Is there a smart way to do it in perl? (apart from wc * > file and then
open the file..)

foreach file
open file,
put each word in a hash, like $words{$word}++;
this way you get the count of each word.
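
The outline above could be sketched as follows. This is only an illustration: the helper name word_frequencies is made up here, and "word" is assumed to mean any run of non-whitespace characters.

```perl
use strict;
use warnings;

# Hypothetical helper: count how often each word occurs on a filehandle.
sub word_frequencies {
    my ($fh) = @_;
    my %words;
    while (my $line = <$fh>) {
        # split ' ' splits on runs of whitespace and skips leading blanks
        $words{$_}++ for split ' ', $line;
    }
    return \%words;
}

# Small demonstration on an in-memory "file"
my $text = "the cat saw the dog\n";
open(my $fh, '<', \$text) or die "Cannot open in-memory file: $!";
my $freq = word_frequencies($fh);
close $fh;
print "$_: $freq->{$_}\n" for sort keys %$freq;
```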

Rusty Phillips · Jun 23, 2004

The wordcount is not the number of times a word occurs, John.
It's the number of words in a file.

Using wc will probably be faster than the perl way (unless there's a
perl command for doing wordcounts I'm not aware of), but you don't
need to open a file.
Just use a pipe.

http://www.devdaily.com/perl/edu/articles/pl010004.shtml
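
A minimal sketch of the pipe approach, assuming a Unix-like system where the list form of a piped open is available. The helper name read_command_lines is invented for this example; it simply returns whatever the command prints, one element per line.

```perl
use strict;
use warnings;

# Hypothetical helper: run a command and return its output lines.
# The list form of open avoids passing file names through a shell.
sub read_command_lines {
    my (@command) = @_;
    open(my $pipe, '-|', @command)
        or die "Cannot run $command[0]: $!";
    chomp(my @lines = <$pipe>);
    close($pipe)
        or warn "$command[0] exited with status $?\n";
    return @lines;
}

# Usage: pass the files to count on the command line.
if (@ARGV) {
    print "$_\n" for read_command_lines('wc', '-w', @ARGV);
}
```

For a really huge directory the argument list handed to wc may exceed the system limit, so the file names might have to be passed in batches.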

A. Sinan Unur · Jun 24, 2004

Rusty Phillips said:
The wordcount is not the number of times a word occurs, John.
It's the number of words in a file.

Just a minor point: You _can_ obtain the number of words in a file by
adding the number of occurrences of each word in the file.
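
As a small illustration of that point (the hash contents and the name total_words are made up for the example), summing the per-word counts gives the total:

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Total number of words, given a hash ref of word => occurrences.
# The leading 0 keeps the result defined for an empty hash.
sub total_words {
    my ($freq) = @_;
    return sum(0, values %$freq);
}

my %words = (the => 2, cat => 1, sat => 1);
print total_words(\%words), "\n";   # 2 + 1 + 1
```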

John Bokma · Jun 24, 2004

Rusty said:
The wordcount is not the number of times a word occurs, John.
It's the number of words in a file.

Ah, ok, then do something like:

$words_in_file{$filename}++;

for every word you find :-D.

Using wc will probably be faster than the perl way (unless there's a
perl command for doing wordcounts I'm not aware of),

Uhm, I guess for every file you have to fork, so I doubt it.

Hard to say, without benchmarking.
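
One way to check would be the core Benchmark module. The sketch below is only a starting point: it assumes wc is on the PATH, it measures the one-fork-per-file case, and the sub names, the sample text and the iteration count are arbitrary choices made for the example. Benchmark reports CPU time, so the figures need a careful reading where a child process is involved.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);
use File::Temp qw(tempfile);

# Count words in one file with plain Perl.
sub count_native {
    my ($file) = @_;
    open(my $fh, '<', $file) or die "Cannot open $file: $!";
    my $count = 0;
    while (my $line = <$fh>) {
        my @words = split ' ', $line;
        $count += @words;
    }
    close $fh;
    return $count;
}

# Count words in one file by forking wc (assumed to be installed).
sub count_wc {
    my ($file) = @_;
    open(my $pipe, '-|', 'wc', '-w', $file) or die "Cannot run wc: $!";
    my $output = <$pipe>;
    close $pipe;
    my ($count) = $output =~ /(\d+)/;
    return $count;
}

# Build a throwaway sample file so the script is self-contained.
my ($sample_fh, $sample_name) = tempfile(UNLINK => 1);
print {$sample_fh} "the quick brown fox\n" x 1000;
close $sample_fh;

cmpthese(200, {
    native => sub { count_native($sample_name) },
    wc     => sub { count_wc($sample_name) },
});
```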

John Bokma · Jun 24, 2004

Purl said:
That syntax produces three outputs, line count, word
count and number of characters. Some say number of
characters is file size but I am not sure if this
includes file headers or not. I have never compared
character count to file size. Perhaps they are the
same measure.

Guess so: " -c, --bytes, --chars
print the byte counts"

(man wc)

File headers are of course a thing that's hard to guess for a program
like wc, and depends on the file contents specification.

Tad McClellan · Jun 24, 2004

Michael Preminger said:
I have a huge directory, for which I need the word-count of all files
(like wc -w * , and then put all length into a database)

Is there a smart way to do it in perl?

Yes and no, depending on the definition of "smart".

(apart from wc * > file and then
open the file..)

I don't like shelling-out for things that are easily done
in native Perl.

Perhaps you can adapt this "wc -w" workalike one-liner to your purposes?

perl -ane '$c+=@F; print("$c $ARGV\n"), $c=0 if eof(ARGV)' *

or suitable for a Real Program:

my $cnt = 0;
while ( <> ) {
    my @words = split;
    $cnt += @words;
    if ( eof(ARGV) ) {
        # print, not printf: a '%' in a file name would be read as a format
        print "$cnt $ARGV\n";
        $cnt = 0;
    }
}
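
Since the goal is to load the counts into a database, a variation that collects them in a hash may be handier than printing. This is a sketch only: the name word_counts is made up, the database step is left out, and file names are assumed to contain nothing that the two-argument open behind <> treats specially.

```perl
use strict;
use warnings;

# Hypothetical helper: return (file name => word count) for the given files.
sub word_counts {
    my @files = @_;
    return unless @files;             # with no names, <> would read STDIN

    # Start every file at 0 so that empty files are reported as well.
    my %count_for = map { $_ => 0 } @files;

    local @ARGV = @files;             # <> reads the files named in @ARGV
    while (my $line = <>) {
        my @words = split ' ', $line;
        $count_for{$ARGV} += @words;  # $ARGV is the file being read
    }
    return %count_for;
}

# Usage: perl wordcounts.pl file1 file2 ...
my %count_for = word_counts(@ARGV);
print "$count_for{$_} $_\n" for sort keys %count_for;
```

Each key/value pair can then be handed to whatever insert statement the database needs.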

Joe Smith · Jun 24, 2004

Purl said:
That syntax produces three outputs, line count, word
count and number of characters. Some say number of
characters is file size but I am not sure if this
includes file headers or not. I have never compared
character count to file size. Perhaps they are the
same measure.

What do you mean by file headers?
On the systems I work with, files do not have any headers.
The character count is exactly equal to the file size.
-Joe
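
This is easy to check from Perl. The sketch below (the name byte_count is invented here) reads a file in raw mode and compares the number of bytes read with the size reported by -s. Note that wc -c counts bytes; in a multi-byte encoding the number of characters (wc -m) can be smaller.

```perl
use strict;
use warnings;

# Hypothetical helper: number of bytes actually read from a file.
sub byte_count {
    my ($file) = @_;
    open(my $fh, '<:raw', $file) or die "Cannot open $file: $!";
    local $/;                        # slurp mode: read the whole file at once
    my $content = <$fh>;
    close $fh;
    return defined $content ? length $content : 0;
}

# Usage: compare the bytes read with the size from the file system.
for my $file (@ARGV) {
    my $read = byte_count($file);
    my $size = -s $file;
    printf "%s: read %d bytes, size %d bytes\n", $file, $read, $size;
}
```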

Anno Siegel · Jun 24, 2004

Tad McClellan said:
Yes and no, depending on the definition of "smart".

I don't like shelling-out for things that are easily done
in native Perl.

Perhaps you can adapt this "wc -w" workalike one-liner to your purposes?

perl -ane '$c+=@F; print("$c $ARGV\n"), $c=0 if eof(ARGV)' *

or suitable for a Real Program:

my $cnt = 0;
while ( <> ) {
    my @words = split;
    $cnt += @words;
    if ( eof(ARGV) ) {
        print "$cnt $ARGV\n";
        $cnt = 0;
    }
}

Alternatively, tr/// can be used if speed is an issue but space isn't.

my $cnt;
for ( do { local $/; <> } ) {
    tr/\n\t / /s; # replace sequences of white space with single blanks
    $cnt = tr/ //; # count blanks
}

Because split() ignores trailing white space but tr/// doesn't, the
tr/// count may be one higher than the split() count, but that's
small stuff.

Anno

Rusty Phillips · Jun 24, 2004

John Bokma said:
Uhm, I guess for every file you have to fork, so I doubt it.

You only have to run wc once ("wc -w *"), so there should only be one
fork. Because wc is a compiled program designed especially for this
purpose, it is hopefully faster than perl at fetching and reading all
of the files quickly - enough so to overcome the penalty incurred by
forking once (probably - need a benchmark to be sure).

In addition, it makes the perl coding simpler. You don't have to
bother with globbing and opening and closing the multiple files it needs,
or with scanning through the files. All you have to do is parse the
output from the wc command.
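
Parsing that output could look roughly like this. The sketch assumes the usual "count name" layout with a final "total" line when several files are given; the helper name parse_wc_output is made up, and a file that is really called "total" would be skipped by mistake.

```perl
use strict;
use warnings;

# Hypothetical helper: turn lines of "wc -w" output into (name => count).
sub parse_wc_output {
    my (@lines) = @_;
    my %count_for;
    for my $line (@lines) {
        # optional leading blanks, the count, white space, then the name
        next unless $line =~ /^\s*(\d+)\s+(.*)$/;
        my ($count, $name) = ($1, $2);
        next if $name eq 'total';     # summary line, not a file
        $count_for{$name} = $count;
    }
    return %count_for;
}

# Example with made-up output lines
my %count_for = parse_wc_output(
    '  12 a.txt',
    '   7 my file.txt',
    '  19 total',
);
print "$_ => $count_for{$_}\n" for sort keys %count_for;
```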

Anno Siegel · Jun 24, 2004

Purl Gurl said:
You need to add one to your final count. The first
word of a file will not be counted when a space
counting method is used.

Depends.

If there is white space after the last word, but none before the
first one, every word is followed by one or more white-space characters,
so the count is correct. Most texts read from a file are like that.

If white space precedes the first word, the count is one high, if none
follows the last word, it's one low. So the correction is

my $wordcount = tr/ // - /^\s/ + /\S\z/;

Anno
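
Put together as a function, the blank-counting method with this kind of correction might look like the sketch below. The name count_words_tr is made up; the code tests for white space at either end rather than for word characters, so punctuation at the start or end of the text does not upset the count. Only newline, tab and space are treated as white space, as in the tr/// discussed above.

```perl
use strict;
use warnings;

# Hypothetical helper: count words by counting blanks after squeezing
# every run of newline/tab/space into a single blank.
sub count_words_tr {
    my ($text) = @_;
    return 0 unless $text =~ /\S/;           # empty or all white space

    (my $squeezed = $text) =~ tr/\n\t / /s;  # squeeze runs to one blank
    my $blanks = ($squeezed =~ tr/ //);      # count the blanks

    my $leading    = $squeezed =~ /^ /  ? 1 : 0;  # blank before first word
    my $no_trailer = $squeezed =~ / \z/ ? 0 : 1;  # no blank after last word
    return $blanks - $leading + $no_trailer;
}

# Compare with the split() count for a few sample strings.
for my $sample ("a b c\n", " a b", "a", "  a  \n\n b\t") {
    my @words = split ' ', $sample;
    printf "tr: %d  split: %d\n", count_words_tr($sample), scalar @words;
}
```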

John Bokma · Jun 24, 2004

Rusty said:
You only have to run wc once ("wc -w *"), so there should only be one
fork. Because wc is a compiled program designed especially for this

Ah, indeed. Getting rusty ;-)
