Find length of files

Discussion in 'Perl Misc' started by Michael Preminger, Jun 23, 2004.

  1. Hello!

    I have a huge directory, for which I need the word-count of all files
    (like wc -w * , and then put all length into a database)

    Is there a smart way to do it in perl? (apart from wc * > file and then
    open the file..)

    Thanks

    Michael
    Michael Preminger, Jun 23, 2004
    #1

  2. Michael Preminger

    John Bokma Guest

    Michael Preminger wrote:

    > Hello!
    >
    > I have a huge directory, for which I need the word-count of all files
    > (like wc -w * , and then put all length into a database)
    >
    > Is there a smart way to do it in perl? (apart from wc * > file and then
    > open the file..)


    foreach file
    open file,
    put each word in a hash, like $words{$word}++;
    this way you get the count of each word.

    --
    John MexIT: http://johnbokma.com/mexit/
    personal page: http://johnbokma.com/
    Experienced Perl programmer available: http://castleamber.com/
    Happy Customers: http://castleamber.com/testimonials.html
    John Bokma, Jun 23, 2004
    #2

  3. The wordcount is not the number of times a word occurs, John.
    It's the number of words in a file.

    Using wc will probably be faster than the perl way (unless there's a
    perl command for doing wordcounts I'm not aware of), but you don't
    need to open a file.
    Just use a pipe.

    http://www.devdaily.com/perl/edu/articles/pl010004.shtml
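The pipe approach mentioned above could look roughly like this. It is a minimal sketch, not code from the linked article: the helper name `wc_counts` is made up for illustration, and it assumes a `wc` on the PATH that prints one "count name" line per file plus a final "total" line when given several files. File names containing newlines are not handled.

```perl
use strict;
use warnings;

# Run "wc -w" once over all files and return { filename => word count }.
# wc_counts is a hypothetical helper name, not something from the thread.
sub wc_counts {
    my @files = @_;
    return {} unless @files;    # with no arguments wc would read STDIN

    # List-form pipe open: no shell is involved, so odd file names are safe.
    open my $wc, '-|', 'wc', '-w', '--', @files
        or die "cannot run wc: $!";
    chomp( my @lines = <$wc> );
    close $wc;

    # With more than one file, wc appends a "total" line; drop it.
    pop @lines if @files > 1;

    my %count;
    for my $line (@lines) {
        my ( $n, $name ) = $line =~ /^\s*(\d+)\s+(.*)$/ or next;
        $count{$name} = $n;
    }
    return \%count;
}

if (@ARGV) {
    my $count = wc_counts(@ARGV);
    print "$count->{$_} $_\n" for sort keys %$count;
}
```

The returned hash can then be handed to whatever database code is in use; that part is left out here.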
    Rusty Phillips, Jun 23, 2004
    #3
  4. Rusty Phillips wrote:

    > The wordcount is not the number of times a word occurs, John.
    > It's the number of words in a file.


    Just a minor point: You _can_ obtain the number of words in a file by
    adding the number of occurrences of each word in the file.
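As a small illustration of that point, here is a sketch that builds a per-word frequency hash and then sums its values. The helper name `word_frequencies` and the sample text are made up for the example, and "word" is taken to mean a whitespace-separated token.

```perl
use strict;
use warnings;
use List::Util qw(sum0);

# Count how often each whitespace-separated word occurs in a string.
sub word_frequencies {
    my ($text) = @_;
    my %freq;
    $freq{$_}++ for split ' ', $text;
    return \%freq;
}

my $text  = "the cat saw the other cat\n";
my $freq  = word_frequencies($text);
my $total = sum0 values %$freq;    # 2 + 2 + 1 + 1 = 6 words
print "$total\n";
```

If only the total is needed, the hash is extra work and memory; a plain counter does the same job.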

    --
    A. Sinan Unur
    (reverse each component for email address)
    A. Sinan Unur, Jun 24, 2004
    #4
  5. Michael Preminger

    John Bokma Guest

    Rusty Phillips wrote:

    > The wordcount is not the number of times a word occurs, John.
    > It's the number of words in a file.


    Ah, ok, then do something like:

    $words_in_file{$filename}++;

    for every word you find :-D.

    > Using wc will probably be faster than the perl way (unless there's a
    > perl command for doing wordcounts I'm not aware of),


    Uhm, I guess for every file you have to fork, so I doubt it.

    Hard to say, without benchmarking.

    John Bokma, Jun 24, 2004
    #5
  6. Michael Preminger

    John Bokma Guest

    Purl Gurl wrote:

    > That syntax produces three outputs, line count, word
    > count and number of characters. Some say number of
    > characters is file size but I am not sure if this
    > includes file headers or not. I have never compared
    > character count to file size. Perhaps they are the
    > same measure.


    Guess so: " -c, --bytes, --chars
    print the byte counts"

    (man wc)

    File headers are of course a thing that's hard to guess for a program
    like wc, and depends on the file contents specification.

    John Bokma, Jun 24, 2004
    #6
  7. Michael Preminger wrote:

    > I have a huge directory, for which I need the word-count of all files
    > (like wc -w * , and then put all length into a database)
    >
    > Is there a smart way to do it in perl?



    Yes and no, depending on the definition of "smart". :)


    > (apart from wc * > file and then
    > open the file..)



    I don't like shelling-out for things that are easily done
    in native Perl.

    Perhaps you can adapt this "wc -w" workalike one-liner to your purposes?

    perl -ane '$c+=@F; print("$c $ARGV\n"), $c=0 if eof(ARGV)' *


    or suitable for a Real Program:

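    my $cnt = 0;
    while ( <> ) {
        my @words = split;               # split $_ on white space
        $cnt += @words;                  # array in numeric context = word count
        if ( eof(ARGV) ) {               # end of the current file
            print "$cnt $ARGV\n";        # print, so a '%' in a file name is harmless
            $cnt = 0;
        }
    }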
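To get the counts into a data structure rather than onto STDOUT (for instance before loading them into a database), the same loop can be wrapped in a subroutine. This is only a sketch: `word_counts` is a hypothetical name, it opens each file explicitly instead of relying on `<>`, and the database step is deliberately left out.

```perl
use strict;
use warnings;

# Return { filename => number of whitespace-separated words }.
# word_counts is a hypothetical name chosen for this sketch.
sub word_counts {
    my @files = @_;
    my %count;
    for my $file (@files) {
        next unless -f $file;    # skip directories and other non-files
        my $fh;
        unless ( open $fh, '<', $file ) {
            warn "skipping $file: $!\n";
            next;
        }
        my $n = 0;
        while ( my $line = <$fh> ) {
            my @words = split ' ', $line;
            $n += @words;
        }
        close $fh;
        $count{$file} = $n;
    }
    return \%count;
}

my $count = word_counts(@ARGV);
print "$count->{$_} $_\n" for sort keys %$count;
```

For a very large directory, passing names from `opendir`/`readdir` or `glob` inside the program avoids any shell limit on the length of `*`.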


    --
    Tad McClellan SGML consulting
    Perl programming
    Fort Worth, Texas
    Tad McClellan, Jun 24, 2004
    #7
  8. Michael Preminger

    Joe Smith Guest

    Purl Gurl wrote:

    >>(apart from wc * > file and then open the file..)

    >
    > That syntax produces three outputs, line count, word
    > count and number of characters. Some say number of
    > characters is file size but I am not sure if this
    > includes file headers or not. I have never compared
    > character count to file size. Perhaps they are the
    > same measure.


    What do you mean by file headers?
    On the systems I work with, files do not have any headers.
    The character count is exactly equal to the file size.
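A quick way to check this on a given system is to compare the size reported by `-s` with the number of bytes actually read from the file. The sketch below writes its own sample file; the sample content is arbitrary, and the comparison is about bytes, so in a multibyte encoding a character count could differ.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write a small sample file, then compare its size on disk with the
# number of bytes read back from it.
my ( $fh, $path ) = tempfile( UNLINK => 1 );
binmode $fh;
print {$fh} "two words\n";
close $fh or die "close failed: $!";

open my $in, '<:raw', $path or die "cannot open $path: $!";
my $data = do { local $/; <$in> };    # slurp the whole file
close $in;

my $size  = -s $path;        # size reported by the file system
my $bytes = length $data;    # bytes actually read
print "size=$size bytes=$bytes\n";
```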
    -Joe
    Joe Smith, Jun 24, 2004
    #8
  9. Michael Preminger

    Anno Siegel Guest

    Tad McClellan wrote in comp.lang.perl.misc:
    > Michael Preminger wrote:
    >
    > > I have a huge directory, for which I need the word-count of all files
    > > (like wc -w * , and then put all length into a database)
    > >
    > > Is there a smart way to do it in perl?

    >
    >
    > Yes and no, depending on the definition of "smart". :)
    >
    >
    > > (apart from wc * > file and then
    > > open the file..)

    >
    >
    > I don't like shelling-out for things that are easily done
    > in native Perl.
    >
    > Perhaps you can adapt this "wc -w" workalike one-liner to your purposes?
    >
    > perl -ane '$c+=@F; print("$c $ARGV\n"), $c=0 if eof(ARGV)' *
    >
    >
    > or suitable for a Real Program:
    >
    > my $cnt = 0;
    > while ( <> ) {
    >     my @words = split;
    >     $cnt += @words;
    >     if ( eof(ARGV) ) {
    >         print "$cnt $ARGV\n";
    >         $cnt = 0;
    >     }
    > }


    Alternatively, tr/// can be used if speed is an issue but space isn't.

    my $cnt;
    for ( do { local $/; <> } ) {
        tr/\n\t / /s; # replace sequences of white space with single blanks
        $cnt = tr/ //; # count blanks
    }

    Because split() ignores trailing white space but tr/// doesn't, the
    tr/// count may be one higher than the split() count, but that's
    small stuff :)
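To see where the two methods agree and where they differ, here is a small comparison on in-memory strings. The sub names `count_split` and `count_tr` and the sample strings are made up for the sketch; only newline, tab and blank are treated as white space on the tr/// side, as in the snippet above.

```perl
use strict;
use warnings;

# Word count via split: leading and trailing white space are ignored.
sub count_split {
    my ($text) = @_;
    my @words = split ' ', $text;
    return scalar @words;
}

# Word count via tr///: squeeze white-space runs to single blanks,
# then count the blanks.
sub count_tr {
    my ($text) = @_;
    $text =~ tr/\n\t / /s;
    return $text =~ tr/ //;
}

# Trailing newline only, no trailing white space, leading blank.
for my $sample ( "a b c\n", "a b c", " a b c\n" ) {
    printf "split=%d tr=%d\n", count_split($sample), count_tr($sample);
}
```

With these three samples the tr/// count matches split() for the first, is one lower for the second and one higher for the third.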

    Anno
    Anno Siegel, Jun 24, 2004
    #9
  10. > Uhm, I guess for every file you have to fork, so I doubt it.
    >

    You only have to run wc once ("wc -w *"), so there should only be one
    fork. Because wc is a compiled program designed especially for this
    purpose, it is hopefully faster than perl at fetching and reading all
    of the files - enough so to outweigh the cost of forking once
    (probably - need a benchmark to be sure).

    In addition, it makes the perl coding simpler. You don't have to
    bother with globbing and opening and closing the multiple files it needs,
    or with scanning through the files. All you have to do is parse the
    output from the wc command.
    Rusty Phillips, Jun 24, 2004
    #10
  11. Michael Preminger

    Anno Siegel Guest

    Purl Gurl wrote in comp.lang.perl.misc:
    > Anno Siegel wrote:
    >
    > > > Michael Preminger wrote:

    >
    > (snipped)
    >
    > > > > I have a huge directory, for which I need the word-count of all files
    > > > > (like wc -w * , and then put all length into a database)

    >
    > > Because split() ignores trailing white space but tr/// doesn't, the
    > > tr/// count may be one higher than the split() count, but that's
    > > small stuff :)

    >
    >
    > You need to add one to your final count. The first
    > word of a file will not be counted when a space
    > counting method is used.


    Depends.

    If there is white space after the last word, but none before the
    first one, every word is followed by one or more white-space characters,
    so the count is correct. Most texts read from a file are like that.

    If white space precedes the first word, the count is one high, if none
    follows the last word, it's one low. So the correction is

    my $wordcount = tr/ // - /^[^\w]/ +/\w$/;
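The correction can be wrapped in a subroutine and checked against the edge cases described above. This is a sketch with one deliberate variation: instead of the `\w` tests it looks for a leading blank and a non-blank last character, so that text ending in punctuation (such as "end.") is also adjusted. The name `count_words_tr` is hypothetical, and only newline, tab and blank count as white space.

```perl
use strict;
use warnings;

# Blank-counting word count with the leading/trailing correction applied.
# count_words_tr is a hypothetical name chosen for this sketch.
sub count_words_tr {
    my ($text) = @_;
    $text =~ tr/\n\t / /s;             # squeeze white-space runs to one blank
    my $count = $text =~ tr/ //;       # count the blanks
    $count-- if $text =~ /^ /;         # leading white space: one too high
    $count++ if $text =~ /[^ ]$/;      # no trailing white space: one too low
    return $count;
}

for my $sample ( "a b c\n", "a b c", " a b c\n", " a b c", "the end." ) {
    printf "%d\n", count_words_tr($sample);
}
```

The first four samples give 3 and the last gives 2, which matches what split() would report for them.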

    Anno
    Anno Siegel, Jun 24, 2004
    #11
  12. Michael Preminger

    John Bokma Guest

    John Bokma, Jun 24, 2004
    #12