script too slow - sometimes hangs

  • Thread starter The King of Pots and Pans

The King of Pots and Pans

I wrote the following perl script to wander through my hard drive
(from current working directory) and tell me how many of which file
types I have. It works for small directory hierarchies. It seems to
freeze up on some arbitrary file when trundling through a large
directory hierarchy. Not sure why.

It uses the 'file' shell command, and is extremely slow! How could
this script accomplish the same goal faster?

Here is a sample of the output:

[steve@sol testcode]$ ./classify_files.pl

Classifying all files recursively...
Found 902 files.
436 - ASCII C program text
252 - C++ program text
85 - directory
38 - ASCII text
15 - ASCII C++ program text
14 - ASCII English text
13 - ASCII C program text, with CRLF line terminators
9 - a /usr/bin/perl -w script text executable
6 - ASCII text, with CRLF line terminators
4 - a /usr/bin/perl script text executable
4 - Bourne-Again shell script text executable
3 - ASCII make commands text
2 - ASCII C program text, with very long lines
2 - ISO-8859 C program text, with CRLF line terminators
2 - ASCII text, with very long lines, with CRLF line terminators
2 - ASCII C++ program text, with CRLF line terminators
2 - ASCII English text, with CRLF line terminators
2 - character Computer Graphics Metafile
2 - ASCII English text, with very long lines
1 - data
(etc ... the rest removed for brevity)

I am a Perl neophyte ... I don't write much of it but would like to
get better. There are certainly much better ways to do this, which is
why I am asking. Thanks. Here's the Perl script. How can it be made
faster?

#!/usr/bin/perl -w

use strict;
use Cwd;

my %ftype;
my $total_files = 0;

sub classify_file
{
    my $output = `file $_`;

    # ignore these ones
    if($output =~ /broken symbolic link/ or
       $output =~ /symbolic link to/ or
       $output =~ /can\'t stat/)
    {
        return;
    }

    # remove filename
    $output =~ s/.+: //;

    # increment value for this key
    ++$ftype{$output};
    ++$total_files;
}

sub recurse_dir
{
    # only get cwd, faster that way
    my $cwd = getcwd();

    # read all files in current dir
    foreach(<*>)
    {
        # recurse into directories
        if(-d $_)
        {
            chdir("$cwd/$_");
            recurse_dir($_);
            chdir($cwd);
        }

        # perform the 'file' shell command
        classify_file("$cwd/$_");
    }
}

print "\nClassifying all files recursively...";
recurse_dir(getcwd());
print "\nFound $total_files files.";
print "\n";

# print in descending order by value
foreach my $key (sort { $ftype{$b} <=> $ftype{$a} } (keys %ftype))
{
    print "$ftype{$key} - $key";
}
 

Tad McClellan

The King of Pots and Pans said:
I wrote the following perl script to wander through my hard drive
(from current working directory) and tell me how many of which file
types I have. It works for small directory hierarchies. It seems to
freeze up on some arbitrary file when trundling through a large
directory hierarchy. Not sure why.


Probably a circular symref, easily fixed by using a module that
avoids cycles instead of attempting to write it yourself.

How can it be
faster?


I don't know.

I expect file(1) is where the time is spent. Have you found a
module that will detect file types instead of shelling out?

(like maybe File::Type)
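[For what it's worth, a minimal sketch of that idea, assuming File::Type is installed from CPAN (it is not a core module). Its checktype_filename() reads magic numbers in-process, so there is no fork/exec of file(1) per file, which is where the original script spends its time:]

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Type;    # CPAN module, not in core -- install it first

my $ft = File::Type->new();    # build one checker object, reuse it for every file

# checktype_filename() inspects the file's magic bytes in-process,
# avoiding a separate file(1) process per file.
my $mime = $ft->checktype_filename($0);
print "$mime\n";
```

[Note the trade-off: File::Type reports MIME types, which are coarser than the prose descriptions file(1) gives, so the output categories would change.]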

#!/usr/bin/perl -w


use warnings; # better than -w if you have a modern perl

$output =~ /can\'t stat/)


Single quotes are not special in regexes/strings, no back slash needed.

sub recurse_dir


use File::Find;

chdir("$cwd/$_");
recurse_dir($_);
chdir($cwd);


If the chdir's fail you will never know it because you are
not checking the return values:

chdir($cwd) or die "could not cd to '$cwd' $!";

(but that whole issue will go away when you change to File::Find)
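[To make that concrete, here is an untested sketch of the same script rebuilt around File::Find (a core module). It still shells out to file(1) once per entry, so it mainly fixes the hang -- find() does not follow symbolic links unless you ask it to, so circular symrefs can't trap it -- rather than the speed:]

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Find;    # core module: handles the recursion, skips symlink cycles by default

my %ftype;
my $total_files = 0;

find(sub {
    return if -l;    # skip symlinks entirely, like the original's regex checks
    my $output = `file "$File::Find::name"`;
    return if $output =~ /can't stat/;
    $output =~ s/^.*?: //;    # strip the leading "filename: " prefix
    chomp $output;
    ++$ftype{$output};
    ++$total_files;
}, '.');

print "Found $total_files files.\n";
print "$ftype{$_} - $_\n"
    for sort { $ftype{$b} <=> $ftype{$a} } keys %ftype;
```

[For the speed problem you'd still want to stop running file(1) once per file -- either via a module as above, or by feeding file(1) many names per invocation.]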
 

James Willmore

I wrote the following perl script to wander through my hard drive
(from current working directory) and tell me how many of which file
types I have. It works for small directory hierarchies. It seems to
freeze up on some arbitrary file when trundling through a large
directory hierarchy. Not sure why.

[ ... ]

`perldoc File::Find`
It's easier to recurse through directories with File::Find than to
manage it yourself with Cwd and chdir.

Just a suggestion :)
HTH

--
Jim

Copyright notice: all code written by the author in this post is
released under the GPL. http://www.gnu.org/licenses/gpl.txt
for more information.

a fortune quote ...
Computer programmers do it byte by byte
 
