script too slow - sometimes hangs

  • Thread starter The King of Pots and Pans

The King of Pots and Pans

I wrote the following perl script to wander through my hard drive
(from current working directory) and tell me how many of which file
types I have. It works for small directory hierarchies. It seems to
freeze up on some arbitrary file when trundling through a large
directory hierarchy. Not sure why.

It uses the 'file' shell command, and is extremely slow! How could
this script accomplish the same goal faster?

Here is a sample of the output:

[steve@sol testcode]$ ./classify_files.pl

Classifying all files recursively...
Found 902 files.
436 - ASCII C program text
252 - C++ program text
85 - directory
38 - ASCII text
15 - ASCII C++ program text
14 - ASCII English text
13 - ASCII C program text, with CRLF line terminators
9 - a /usr/bin/perl -w script text executable
6 - ASCII text, with CRLF line terminators
4 - a /usr/bin/perl script text executable
4 - Bourne-Again shell script text executable
3 - ASCII make commands text
2 - ASCII C program text, with very long lines
2 - ISO-8859 C program text, with CRLF line terminators
2 - ASCII text, with very long lines, with CRLF line terminators
2 - ASCII C++ program text, with CRLF line terminators
2 - ASCII English text, with CRLF line terminators
2 - character Computer Graphics Metafile
2 - ASCII English text, with very long lines
1 - data
(etc ... the rest removed for brevity)

I am a Perl neophyte ... I don't write much of it but would like to
get better. There are certainly much better ways to do this, which is
why I am asking. Thanks. Here's the Perl script. How can it be made
faster?

#!/usr/bin/perl -w

use strict;
use Cwd;

my %ftype;
my $total_files = 0;

sub classify_file
{
    my $output = `file $_`;

    # ignore these ones
    if($output =~ /broken symbolic link/ or
       $output =~ /symbolic link to/ or
       $output =~ /can\'t stat/)
    {
        return;
    }

    # remove filename
    $output =~ s/.+: //;

    # increment value for this key
    ++$ftype{$output};
    ++$total_files;
}

sub recurse_dir
{
    # only get cwd, faster that way
    my $cwd = getcwd();

    # read all files in current dir
    foreach(<*>)
    {
        # recurse into directories
        if(-d $_)
        {
            chdir("$cwd/$_");
            recurse_dir($_);
            chdir($cwd);
        }

        # perform the 'file' shell command
        classify_file("$cwd/$_");
    }
}

print "\nClassifying all files recursively...";
recurse_dir(getcwd());
print "\nFound $total_files files.";
print "\n";

# print in descending order by value
foreach my $key (sort { $ftype{$b} <=> $ftype{$a} } (keys %ftype))
{
    print "$ftype{$key} - $key";
}
 

Tad McClellan

The King of Pots and Pans said:
I wrote the following perl script to wander through my hard drive
(from current working directory) and tell me how many of which file
types I have. It works for small directory hierarchies. It seems to
freeze up on some arbitrary file when trundling through a large
directory hierarchy. Not sure why.


Probably a circular symref, easily fixed by using a module that
avoids cycles instead of attempting to write it yourself.

How can it be
faster?


I don't know.

I expect file(1) is where the time is spent. Have you found a
module that will detect file types instead of shelling out?

(like maybe File::Type)
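[For what it's worth, a minimal sketch of that idea, assuming File::Type is installed from CPAN (it is not a core module). Its checktype_filename() reads magic numbers in-process, so there is no fork/exec of file(1) per file, which is where the original script spends its time:]

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Type;    # CPAN module, not in core -- install it first

my $ft = File::Type->new();    # build one checker object, reuse it for every file

# checktype_filename() inspects the file's magic bytes in-process,
# avoiding a separate file(1) process per file.
my $mime = $ft->checktype_filename($0);
print "$mime\n";
```

[Note the trade-off: File::Type reports MIME types, which are coarser than the prose descriptions file(1) gives, so the output categories would change.]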

#!/usr/bin/perl -w


use warnings; # better than -w if you have a modern perl

$output =~ /can\'t stat/)


Single quotes are not special in regexes/strings, no back slash needed.

sub recurse_dir


use File::Find;

chdir("$cwd/$_");
recurse_dir($_);
chdir($cwd);


If the chdir's fail you will never know it because you are
not checking the return values:

chdir($cwd) or die "could not cd to '$cwd' $!";

(but that whole issue will go away when you change to File::Find)
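[To make that concrete, here is an untested sketch of the same script rebuilt around File::Find (a core module). It still shells out to file(1) once per entry, so it mainly fixes the hang -- find() does not follow symbolic links unless you ask it to, so circular symrefs can't trap it -- rather than the speed:]

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Find;    # core module: handles the recursion, skips symlink cycles by default

my %ftype;
my $total_files = 0;

find(sub {
    return if -l;    # skip symlinks entirely, like the original's regex checks
    my $output = `file "$File::Find::name"`;
    return if $output =~ /can't stat/;
    $output =~ s/^.*?: //;    # strip the leading "filename: " prefix
    chomp $output;
    ++$ftype{$output};
    ++$total_files;
}, '.');

print "Found $total_files files.\n";
print "$ftype{$_} - $_\n"
    for sort { $ftype{$b} <=> $ftype{$a} } keys %ftype;
```

[For the speed problem you'd still want to stop running file(1) once per file -- either via a module as above, or by feeding file(1) many names per invocation.]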
 

James Willmore

I wrote the following perl script to wander through my hard drive
(from current working directory) and tell me how many of which file
types I have. It works for small directory hierarchies. It seems to
freeze up on some arbitrary file when trundling through a large
directory hierarchy. Not sure why.

[ ... ]

`perldoc File::Find`
It's easier to recurse through directories with File::Find than to
manage it yourself with Cwd and chdir.

Just a suggestion :)
HTH

--
Jim

Copyright notice: all code written by the author in this post is
released under the GPL. http://www.gnu.org/licenses/gpl.txt
for more information.

a fortune quote ...
Computer programmers do it byte by byte
 
