Speeding up glob?

Discussion in 'Perl Misc' started by Jim, Apr 25, 2005.

  1. Jim

    Jim Guest

    Hi

    I have a very simple perl program that runs _very_ slowly. Here's my
    code:

    #!/usr/local/bin/perl
    #
    # script to keep only a week's worth of files
    #
    use File::stat;

    $time = time;

    # get list of all files in the backup directory
    @files = glob ("/backup/output.log*");

    unless (@files[0]) {
        print "No files to process\n";
        exit;
    }

    while (<@files>) {
        $filename = $_;
        $st = stat($_);

        $mod_time = $time - $st->mtime;

        # if file edit time is greater than x days, delete the file
        # 1440 minutes in a day
        # 86400 seconds in a day
        # 604800 seconds in a week
        # 2419200 seconds in a month
        # 7257600 seconds in 90 days

        if ($mod_time > 7257600) {
            print "Deleting file $filename\n";
            unlink ($filename);
        }
        else {
            # do nothing
        }
    }

    There are several thousand files (~21K) in this directory and many
    thousands of those files fit the criteria to delete. It takes a really
    long time to run this program. What's the holdup? Is it glob? My OS
    (Solaris 8)? IO? Any way to speed this up? Thanks.

    Jim
     
    Jim, Apr 25, 2005
    #1

  2. Jim

    Mark Clements Guest

    Jim wrote:
    > Hi
    >
    > I have a very simple perl program that runs _very_ slowly. Here's my
    > code:
    >
    > #!/usr/local/bin/perl
    > #
    > # script to keep only a weeks worth of files
    > #

    You need to run with strict and warnings turned on. Please read the
    posting guidelines (subject "Posting Guidelines for
    comp.lang.perl.misc"), which are posted regularly.

    > use File::stat;
    >
    > $time = time;
    >
    > # get list of all files in the backup directory
    > @files = glob ("/backup/output.log*");
    >
    > unless (@files[0]) {
    > print "No files to process\n";
    > exit;
    > }
    >
    > while (<@files>) {
    > $filename = $_;
    > $st = stat($_);

    <snip>

    > There are several thousand files (~21K) in this directory and many
    > thousands of those files fit the criteria to delete. It takes a really
    > long time to run this program. What's the holdup? Is it glob? My OS
    > (Solaris 8)? IO? Any way to speed this up? Thanks.

    Solaris doesn't (or didn't - I stand open to correction) perform very
    well with large directories on ufs. How long does, say, ls take to
    complete in this directory?

    Secondly, you can benchmark your programs using a number of different
    methods to work out where the bottlenecks are. Check out

    Benchmark::Timer
    Devel::DProf
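
    For example, a minimal sketch of timing the two phases separately with
    Benchmark::Timer (the tag names here are arbitrary):

    use strict;
    use warnings;
    use Benchmark::Timer;

    my $t = Benchmark::Timer->new;

    $t->start('glob');
    my @files = glob('/backup/output.log*');
    $t->stop('glob');

    $t->start('stat+unlink');
    # ... the stat/unlink loop goes here ...
    $t->stop('stat+unlink');

    print $t->report;    # mean wall-clock time per tag

    For a whole-program profile instead, run the script under the profiler
    and inspect the output with dprofpp:

    perl -d:DProf script.pl && dprofpp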

    regards,

    Mark
     
    Mark Clements, Apr 25, 2005
    #2

  3. Jim

    J. Gleixner Guest

    Jim wrote:

    > There are several thousand files (~21K) in this directory and many
    > thousands of those files fit the criteria to delete. It takes a really
    > long time to run this program. What's the holdup? Is it glob? My OS
    > (Solaris 8)? IO? Any way to speed this up? Thanks.


    Using File::Find or readdir, and processing & unlinking each file as it
    passes your test, would probably be better. It's similar to reading and
    processing each line of a file one at a time, compared to slurping in
    the entire file and then iterating through each line.

    The "fastest" option would be to just use find (man find), and you'll
    probably need to use xargs as well (man xargs).
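
    A minimal sketch of the readdir version, assuming the directory,
    filename pattern, and 90-day cutoff from the original post:

    #!/usr/local/bin/perl
    use strict;
    use warnings;

    my $dir    = '/backup';
    my $cutoff = 90 * 86400;    # seconds; adjust to the real retention
    my $now    = time;

    opendir my $dh, $dir or die "Can't open $dir: $!";
    while (defined(my $name = readdir $dh)) {
        next unless $name =~ /^output\.log/;
        my $path = "$dir/$name";
        my @st   = stat $path or do { warn "stat $path: $!"; next };
        if ($now - $st[9] > $cutoff) {    # $st[9] is mtime
            print "Deleting file $path\n";
            unlink $path or warn "Can't unlink $path: $!";
        }
    }
    closedir $dh;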
     
    J. Gleixner, Apr 25, 2005
    #3
  4. Jim

    Guest

    Jim <> wrote:
    > Hi
    >
    > I have a very simple perl program that runs _very_ slowly. Here's my
    > code:
    >

    ....
    > @files = glob ("/backup/output.log*");
    >
    > while (<@files>) {


    You are double globbing. Don't do that.
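
    (Anything inside <> that is not a filehandle is handed to glob(), so

    while (<@files>) { ... }

    is roughly equivalent to

    while (defined($_ = glob("@files"))) { ... }

    meaning all ~21K names get joined into one giant pattern and pushed
    back through the glob machinery, while a plain

    foreach my $filename (@files) { ... }

    just walks the list already in memory.)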

    Xho

    --
    -------------------- http://NewsReader.Com/ --------------------
    Usenet Newsgroup Service $9.95/Month 30GB
     
    , Apr 25, 2005
    #4
  5. Jim

    Tintin Guest

    "Jim" <> wrote in message
    news:...
    > Hi
    >
    > I have a very simple perl program that runs _very_ slowly. Here's my
    > code:
    <snip>
    > There are several thousand files (~21K) in this directory and many
    > thousands of those files fit the criteria to delete. It takes a really
    > long time to run this program. What's the holdup? Is it glob? My OS
    > (Solaris 8)? IO? Any way to speed this up? Thanks.


    The bottleneck is mostly going to be OS & I/O, but you could try

    find /backup -name "output.log*" -mtime +7 | xargs rm -f
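
    If it has to stay inside Perl, the same pipeline can be shelled out to
    (a sketch; note that find's -mtime +7 means "more than 7 days old",
    which matches the script's "week's worth" comment rather than its
    90-day constant):

    system(q{find /backup -name "output.log*" -mtime +7 | xargs rm -f}) == 0
        or warn "find/xargs pipeline failed: $?";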
     
    Tintin, Apr 25, 2005
    #5
  6. Jim

    peter pilsl Guest

    Jim wrote:
    > Hi
    >
    > I have a very simple perl program that runs _very_ slowly. Here's my
    > code:
    <snip>
    > There are several thousand files (~21K) in this directory and many
    > thousands of those files fit the criteria to delete. It takes a really
    > long time to run this program. What's the holdup? Is it glob? My OS
    > (Solaris 8)? IO? Any way to speed this up? Thanks.
    >
    > Jim



    just for comparison:

    I just wrote a small script that creates 20k empty files, stats each
    one, and deletes them again. It's pretty fast on my machine: Linux
    2.4.x on an Athlon 1800XP with 1 GB RAM and IDE disks in software RAID
    level 1, with loads of daemons running on it. So definitely not a
    machine with fast I/O.


    # time ./p.pl create
    0.18user 0.71system 0:00.88elapsed 100%CPU (0avgtext+0avgdata 0maxresident)k
    0inputs+0outputs (357major+76minor)pagefaults 0swaps

    # time ./p.pl delete
    0.12user 1.18system 0:01.29elapsed 100%CPU (0avgtext+0avgdata 0maxresident)k
    0inputs+0outputs (364major+820minor)pagefaults 0swaps

    So it's not the globbing itself, but maybe the double-globbing, as Xho
    already pointed out!


    Try the following on your machine:


    #!/usr/bin/perl -w
    use strict;

    # create 20001 empty files named x0 .. x20000
    if ($ARGV[0] =~ /create/) {
        foreach (0..20000) {
            open (FH, ">x$_") or die "open x$_: $!";
            close FH;
        }
    }

    # stat each one and unlink it again
    if ($ARGV[0] =~ /delete/) {
        my @files = glob("x*");
        foreach (@files) {
            stat($_);
            unlink($_);
        }
    }


    best,
    peter

    --
    http://www.goldfisch.at/know_list
     
    peter pilsl, Apr 25, 2005
    #6
  7. Jim

    Ala Qumsieh Guest

    Jim wrote:

    > I have a very simple perl program that runs _very_ slowly. Here's my
    > code:


    > unless (@files[0]) {


    This works. But you probably meant:

    unless (@files) {

    or
    unless ($files[0]) {

    Type this for more info on the diff between $files[0] and @files[0]:

    perldoc -q 'difference.*\$array'
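
    In short, @files[0] is a one-element array slice while $files[0] is a
    plain scalar element; with warnings enabled Perl points this out:

    use warnings;
    my @files = ('a.log', 'b.log');
    my $ok  = $files[0];    # scalar element - what was meant
    my $odd = @files[0];    # one-element slice; warns:
                            # "Scalar value @files[0] better written as $files[0]"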

    > while (<@files>) {


    This is doing much more work than you think it is. Change it to:

    foreach (@files) {

    --Ala
     
    Ala Qumsieh, Apr 25, 2005
    #7
  8. Jim

    Tad McClellan Guest

    Jim <> wrote:


    > unless (@files[0]) {



    You should always enable warnings when developing Perl code!


    > # if file edit time is greater than x days, delete the file
    > # 1440 minutes in a day
    > # 86400 seconds in a day
    > # 604800 seconds in a week
    > # 2419200 seconds in a month



    You do not need "in a week" nor "in a month".

    You already have how many in a day, multiply by 90 to get how
    many are in 90 days.


    > # 7257600 seconds in 90 days



    Wrong answer... (86400 * 90 = 7776000; 7257600 is 84 days, i.e. 12 weeks.)


    > if ($mod_time > 7257600) {



    if ($mod_time > 60 * 60 * 24 * 90) {


    Perl will constant-fold it for you.
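
    For readability the figure can also be named (a sketch; the constant
    name is invented here):

    use constant SECONDS_PER_90_DAYS => 60 * 60 * 24 * 90;    # 7_776_000

    if ($mod_time > SECONDS_PER_90_DAYS) {
        print "Deleting file $filename\n";
        unlink $filename;
    }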


    > There are several thousand files (~21K) in this directory



    Then the largest bottleneck is probably the OS and filesystem,
    not the programming language (though your algorithm seems
    sub-optimal too).


    > It takes a really
    > long time to run this program. What's the holdup?



    There are several thousand files (~21K) in that directory.


    --
    Tad McClellan SGML consulting
    Perl programming
    Fort Worth, Texas
     
    Tad McClellan, Apr 25, 2005
    #8
  9. Jim

    Joe Smith Guest

    Jim wrote:

    > I have a very simple perl program that runs _very_ slowly.


    You posted this question earlier and have already gotten an
    answer. Why are you not accepting the answers already given?

    > while (<@files>) {


    Big error right there. '<' and '>' are *not* appropriate here.

    > $st = stat($_);
    > $mod_time = $time - $st->mtime;
    > # 1440 minutes in a day


    Why are you doing it that way? Have you not heard of -M?

    if (-M $_ > 7) { print "File $_ is older than 7.000 days\n"; }
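
    Spelling the whole test out with -M, which reports a file's age in days
    relative to script start time (a sketch; swap 7 for 84 or 90 depending
    on what the cutoff is really meant to be):

    foreach my $file (@files) {
        if (-M $file > 7) {
            print "Deleting file $file\n";
            unlink $file or warn "Can't unlink $file: $!";
        }
    }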

    -Joe
     
    Joe Smith, Apr 26, 2005
    #9
  10. Jim

    Jim Guest

    In article <ITcbe.1259$>,
    says...

    >
    > Type this for more info on the diff between $files[0] and @files[0]:
    >
    > perldoc -q 'difference.*\$array'
    >
    > > while (<@files>) {

    >
    > This is doing much more work than you think it is. Change it to:
    >
    > foreach (@files) {
    >



    Changing my while to a foreach has sped up the program considerably.
    Thanks to those for the help.

    Jim
     
    Jim, Apr 26, 2005
    #10
  11. Jim

    Jim Guest

    In article <>, says...
    > Jim wrote:
    >
    > > I have a very simple perl program that runs _very_ slowly.

    >
    > You posted this question earlier and have already gotten an
    > answer. Why are you not accepting the answers already given?
    >
    > > while (<@files>) {

    >
    > Big error right there. '<' and '>' are *not* appropriate here.
    >
    > > $st = stat($_);
    > > $mod_time = $time - $st->mtime;
    > > # 1440 minutes in a day

    >
    > Why are you doing it that way? Have you not heard of -M?
    >
    > if (-M $_ > 7) { print "File $_ is older than 7.000 days\n"; }
    >
    > -Joe
    >

    I hadn't posted this question earlier. This is the first time I've
    posted it. I hope. Unless I'm losing my mind. :)

    The foreach speeds things up noticeably. I'm using that now.

    No, I have not heard of -M(). Thanks for pointing it out to me.

    Jim
     
    Jim, Apr 26, 2005
    #11
  12. Jim

    Jim Guest

    In article <>,
    says...
    > You already have how many in a day, multiply by 90 to get how
    > many are in 90 days.
    >
    >
    > > # 7257600 seconds in 90 days

    >
    >
    > Wrong answer...


    I would argue that it's not the "wrong answer", just a different answer
    than you would've used. Six of one, half dozen of the other it seems to
    me.


    >
    > > if ($mod_time > 7257600) {

    >
    >
    > if ($mod_time > 60 * 60 * 24 * 90) {
    >
    >
    > Perl will constant-fold it for you.
    >
    >
    > > There are several thousand files (~21K) in this directory

    >
    >
    > Then the largest bottleneck is probably the OS and filesystem,
    > not the programming language (though your algorithm seems
    > sub-optimal too).
    >
    >
    > > It takes a really
    > > long time to run this program. What's the holdup?

    >
    >
    > There are several thousand files (~21K) in that directory.


    Using a foreach instead of my while (the "double globbing" issue, I
    believe it was called) sped things up noticeably.

    Thanks for your help.

    Jim
     
    Jim, Apr 26, 2005
    #12
  13. Jim

    Big and Blue Guest

    > Jim wrote:
    >.....
    >> There are several thousand files (~21K) in this directory


    Which is a cause of your problem. And it will cause you problems every
    time you try to access any of the files too.

    If you had 21000 documents you wouldn't throw them all into one drawer
    and expect to find one quickly. You'd arrange them by some category and
    put each category into a separate drawer. File systems have a directory
    hierarchy for just such an arrangement. If you use that facility you
    will find your code runs much faster.
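
    For these logs, that might mean something like one subdirectory per
    month (a sketch; the layout is invented):

    use POSIX qw(strftime);

    my $subdir = '/backup/' . strftime('%Y-%m', localtime);    # e.g. /backup/2005-04
    mkdir $subdir unless -d $subdir;
    # then write new logs as "$subdir/output.log.$$" or similar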



    --
    Just because I've written it doesn't mean that
    either you or I have to believe it.
     
    Big and Blue, Apr 27, 2005
    #13
  14. Jim

    peter pilsl Guest

    Big and Blue wrote:
    >
    > Which is a cause of your problem. And it will cause you problems
    > every time you try to access any of the files too.
    >
    > If you had 21000 documents you wouldn't throw them all into one
    > drawer and expect to find one quickly. You'd arrange them by some
    > category and put each category into a separate drawer. File system
    > have directory hierarchy for just such an arrangement. If you use the
    > facility you will find your code runs much faster.


    21000 files in one folder is not a big deal for modern filesystems.
    Like I wrote in my posting:

    creating 21k files takes about 1 second on my old server; getting the
    stat of each and unlinking it takes about 2 seconds.

    best,
    peter


    --
    http://www.goldfisch.at/know_list
     
    peter pilsl, Apr 27, 2005
    #14
  15. Jim

    Guest

    Big and Blue <> wrote:
    > > Jim wrote:
    > >.....
    > >> There are several thousand files (~21K) in this directory

    >
    > Which is a cause of your problem.


    Actually, that wasn't the cause of his problems.

    >
    > If you had 21000 documents you wouldn't throw them all into one
    > drawer and expect to find one quickly.


    Lots of things are done on computers differently than they are done
    by hand.

    > You'd arrange them by some
    > category and put each category into a separate drawer.


    Assuming that there are different categories to arrange them into.
    Maybe the name of the file is the optimal level of categorization
    that exists. In which case I would break them into different drawers only
    for the artificial reason that drawers only have a certain physical
    capacity.

    > File system have
    > directory hierarchy for just such an arrangement.


    Good file systems will let you use the natural arrangement rather
    than an artificial one. And if that means 20,000 files in a directory,
    so be it. On a good file system, it makes a negligible difference. Even
    on a bad file system, I suspect it makes far, far less difference than
    the double-globbing issue does.

    Xho

    --
    -------------------- http://NewsReader.Com/ --------------------
    Usenet Newsgroup Service $9.95/Month 30GB
     
    , Apr 27, 2005
    #15
  16. Jim

    Anno Siegel Guest

    Jim <> wrote in comp.lang.perl.misc:
    > In article <ITcbe.1259$>,
    > says...
    >
    > >
    > > Type this for more info on the diff between $files[0] and @files[0]:
    > >
    > > perldoc -q 'difference.*\$array'
    > >
    > > > while (<@files>) {

    > >
    > > This is doing much more work than you think it is. Change it to:
    > >
    > > foreach (@files) {
    > >

    >
    >
    > Changing my while to a foreach has sped up the program considerably.
    > Thanks to those for the help.


    You're missing the point. The speed difference between while and foreach
    is marginal. Globbing all the filenames (again) in

    while ( <@files> ) {

    is what kills it.

    Anno
     
    Anno Siegel, Apr 29, 2005
    #16
