Speeding up glob?


Jim

Hi

I have a very simple perl program that runs _very_ slowly. Here's my
code:

#!/usr/local/bin/perl
#
# script to keep only a weeks worth of files
#
use File::stat;

$time = time;

# get list of all files in the backup directory
@files = glob ("/backup/output.log*");

unless (@files[0]) {
    print "No files to process\n";
    exit;
}

while (<@files>) {
    $filename = $_;
    $st = stat($_);

    $mod_time = $time - $st->mtime;

    # if file edit time is greater than x days, delete the file
    # 1440 minutes in a day
    # 86400 seconds in a day
    # 604800 seconds in a week
    # 2419200 seconds in a month
    # 7257600 seconds in 90 days

    if ($mod_time > 7257600) {
        print "Deleting file $filename\n";
        unlink ($filename);
    }
    else {
        #do nothing
    }
}

There are several thousand files (~21K) in this directory and many
thousands of those files fit the criteria to delete. It takes a really
long time to run this program. What's the holdup? Is it glob? My OS
(Solaris 8)? IO? Any way to speed this up? Thanks.

Jim
 

Mark Clements

Jim said:
Hi

I have a very simple perl program that runs _very_ slowly. Here's my
code:

#!/usr/local/bin/perl
#
# script to keep only a weeks worth of files
#
You need to run with strict and warnings turned on. Please read the
posting guidelines (subject "Posting Guidelines for
comp.lang.perl.misc"), which are posted regularly.
use File::stat;

$time = time;

# get list of all files in the backup directory
@files = glob ("/backup/output.log*");

unless (@files[0]) {
print "No files to process\n";
exit;
}

while (<@files>) {
$filename = $_;
$st = stat($_);
There are several thousand files (~21K) in this directory and many
thousands of those files fit the criteria to delete. It takes a really
long time to run this program. What's the holdup? Is it glob? My OS
(Solaris 8)? IO? Any way to speed this up? Thanks.
Solaris doesn't (or didn't - I stand open to correction) perform very
well with large directories on ufs. How long does, e.g., ls take to
complete in this directory?

Secondly, you can benchmark your programs using a number of different
methods to work out where the bottlenecks are. Check out

Benchmark::Timer
Devel::DProf
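
For example, a minimal Benchmark::Timer sketch to see how long the glob
itself takes (the tag name and path are just placeholders):

#!/usr/local/bin/perl
use strict;
use warnings;
use Benchmark::Timer;

# Time only the glob, to see whether it is really the slow part.
my $t = Benchmark::Timer->new();

$t->start('glob');
my @files = glob('/backup/output.log*');
$t->stop('glob');

print scalar(@files), " files matched\n";
print $t->report;    # elapsed time for the 'glob' tag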

regards,

Mark
 

J. Gleixner

Jim said:
There are several thousand files (~21K) in this directory and many
thousands of those files fit the criteria to delete. It takes a really
long time to run this program. What's the holdup? Is it glob? My OS
(Solaris 8)? IO? Any way to speed this up? Thanks.

Using File::Find or readdir, and processing & unlinking each file as it
passes your test, would probably be better. It's similar to reading &
processing each line of a file, compared to slurping in the entire file
and then iterating through each line.

The "fastest" option would be to just use find (man find), and you'll
probably need to use xargs as well (man xargs).
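
A rough readdir sketch along those lines (untested; it assumes the same
/backup directory and the 90-day cutoff from the original script):

#!/usr/local/bin/perl
use strict;
use warnings;

my $dir    = '/backup';
my $cutoff = 90 * 24 * 60 * 60;    # 90 days, in seconds
my $now    = time;

opendir my $dh, $dir or die "Cannot open $dir: $!";
while (my $name = readdir $dh) {
    next unless $name =~ /^output\.log/;    # same files the glob matched
    my $path  = "$dir/$name";
    my $mtime = (stat $path)[9];            # element 9 of stat() is mtime
    next unless defined $mtime;
    if ($now - $mtime > $cutoff) {
        print "Deleting file $path\n";
        unlink $path or warn "Could not delete $path: $!\n";
    }
}
closedir $dh;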
 

xhoster

Jim said:
Hi

I have a very simple perl program that runs _very_ slowly. Here's my
code:
....
@files = glob ("/backup/output.log*");

while (<@files>) {

You are double globbing. Don't do that.
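
In other words, <@files> joins the already-expanded names back into a
glob pattern and runs each of them through glob() again; iterating the
list directly avoids all of that. A minimal before/after sketch:

use strict;
use warnings;

my @files = glob('/backup/output.log*');

# Slow: re-globs every one of the (~21K) names a second time.
# while (<@files>) { print "$_\n" }

# Fast: just walk the list that glob() already returned.
foreach my $filename (@files) {
    print "$filename\n";
}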

Xho
 

Tintin

Jim said:
Hi

I have a very simple perl program that runs _very_ slowly. Here's my
code:

#!/usr/local/bin/perl
#
# script to keep only a weeks worth of files
#
use File::stat;

$time = time;

# get list of all files in the backup directory
@files = glob ("/backup/output.log*");

unless (@files[0]) {
print "No files to process\n";
exit;
}

while (<@files>) {
$filename = $_;
$st = stat($_);

$mod_time = $time - $st->mtime;

# if file edit time is greater than x days, delete the file
# 1440 minutes in a day
# 86400 seconds in a day
# 604800 seconds in a week
# 2419200 seconds in a month
# 7257600 seconds in 90 days

if ($mod_time > 7257600) {
print "Deleting file $filename\n";
unlink ($filename);
}
else {
#do nothing
}
}

There are several thousand files (~21K) in this directory and many
thousands of those files fit the criteria to delete. It takes a really
long time to run this program. What's the holdup? Is it glob? My OS
(Solaris 8)? IO? Any way to speed this up? Thanks.

The bottleneck is mostly going to be OS & I/O, but you could try

find /backups -name "output.log*" -mtime +7 | xargs rm -f
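
If you'd rather stay in Perl, a roughly equivalent one-liner (a sketch
only; the original script used a 90-day cutoff, so pick whichever age
you actually want):

perl -e 'unlink grep { -M $_ > 90 } glob "/backup/output.log*"'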
 

peter pilsl

Jim said:
Hi

I have a very simple perl program that runs _very_ slowly. Here's my
code:

#!/usr/local/bin/perl
#
# script to keep only a weeks worth of files
#
use File::stat;

$time = time;

# get list of all files in the backup directory
@files = glob ("/backup/output.log*");

unless (@files[0]) {
print "No files to process\n";
exit;
}

while (<@files>) {
$filename = $_;
$st = stat($_);

$mod_time = $time - $st->mtime;

# if file edit time is greater than x days, delete the file
# 1440 minutes in a day
# 86400 seconds in a day
# 604800 seconds in a week
# 2419200 seconds in a month
# 7257600 seconds in 90 days

if ($mod_time > 7257600) {
print "Deleting file $filename\n";
unlink ($filename);
}
else {
#do nothing
}
}

There are several thousand files (~21K) in this directory and many
thousands of those files fit the criteria to delete. It takes a really
long time to run this program. What's the holdup? Is it glob? My OS
(Solaris 8)? IO? Any way to speed this up? Thanks.

Jim


just for comparison:

I just wrote a small script that creates 20k empty files, gets the
stat of each file, and deletes them again.
It's pretty fast on my machine: Linux 2.4.x on an Athlon 1800XP with 1GB
of RAM and IDE disks in software RAID level 1, with loads of daemons
running on it. So definitely not a machine with fast I/O.


# time ./p.pl create
0.18user 0.71system 0:00.88elapsed 100%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (357major+76minor)pagefaults 0swaps

# time ./p.pl delete
0.12user 1.18system 0:01.29elapsed 100%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (364major+820minor)pagefaults 0swaps

So it's not the globbing itself, but maybe the double globbing, as Xho
pointed out already!


Try the following on your machine:


#!/usr/bin/perl -w
use strict;

if ($ARGV[0] =~ /create/) {
    foreach (0..20000) {
        open (FH, ">x$_"); close FH;
    }
}

if ($ARGV[0] =~ /delete/) {
    my @files = glob ("x*");
    foreach (@files) {
        stat($_);
        unlink($_);
    }
}


best,
peter
 

Ala Qumsieh

Jim said:
I have a very simple perl program that runs _very_ slowly. Here's my
code:
unless (@files[0]) {

This works. But you probably meant:

unless (@files) {

or
unless ($files[0]) {

Type this for more info on the diff between $files[0] and @files[0]:

perldoc -q 'difference.*\$array'
while (<@files>) {

This is doing much more work than you think it is. Change it to:

foreach (@files) {

--Ala
 

Tad McClellan

unless (@files[0]) {


You should always enable warnings when developing Perl code!

# if file edit time is greater than x days, delete the file
# 1440 minutes in a day
# 86400 seconds in a day
# 604800 seconds in a week
# 2419200 seconds in a month


You do not need "in a week" nor "in a month".

You already have how many in a day, multiply by 90 to get how
many are in 90 days.

# 7257600 seconds in 90 days


Wrong answer...

if ($mod_time > 7257600) {


if ($mod_time > 60 * 60 * 24 * 90) {


Perl will constant-fold it for you.
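
For the record, the arithmetic is easy to check; the script's literal
7257600 works out to 84 days, not 90:

perl -le 'print 60*60*24*90; print 7257600/86400'
7776000
84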

There are several thousand files (~21K) in this directory


Then the largest bottleneck is probably the OS and filesystem,
not the programming language (though your algorithm seems
sub-optimal too).

It takes a really
long time to run this program. What's the holdup?


There are several thousand files (~21K) in that directory.
 

Joe Smith

Jim said:
I have a very simple perl program that runs _very_ slowly.

You posted this question earlier and have already gotten an
answer. Why are you not accepting the answers already given?
while (<@files>) {

Big error right there.
$st = stat($_);
$mod_time = $time - $st->mtime;
# 1440 minutes in a day

Why are you doing it that way? Have you not heard of -M?

if (-M $_ > 7) { print "File $_ is older than 7.000 days\n"; }
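
A fuller sketch of the loop along those lines, using -M (untested; it
keeps the /backup path and the 90-day cutoff from the original script):

#!/usr/local/bin/perl
use strict;
use warnings;

# -M returns the file's age in days, so no File::stat or time() math is needed.
my @files = glob('/backup/output.log*');

foreach my $file (@files) {
    if (-M $file > 90) {
        print "Deleting file $file\n";
        unlink $file or warn "Could not delete $file: $!\n";
    }
}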

-Joe
 

Jim

Type this for more info on the diff between $files[0] and @files[0]:

perldoc -q 'difference.*\$array'
while (<@files>) {

This is doing much more work than you think it is. Change it to:

foreach (@files) {


Changing my while to a foreach has sped up the program considerably.
Thanks to those for the help.

Jim
 

Jim

You posted this question earlier and have already gotten an
answer. Why are you not accepting the answers already given?




Why are you doing it that way? Have you not heard of -M?

if (-M $_ > 7) { print "File $_ is older than 7.000 days\n"; }

-Joe
I hadn't posted this question earlier. This is the first time I've
posted it. I hope. Unless I'm losing my mind. :)

The foreach speeds things up noticeably. I'm using that now.

No, I have not heard of -M(). Thanks for pointing it out to me.

Jim
 

Jim

You already have how many in a day, multiply by 90 to get how
many are in 90 days.




Wrong answer...

I would argue that it's not the "wrong answer", just a different answer
than you would've used. Six of one, half dozen of the other it seems to
me.

if ($mod_time > 60 * 60 * 24 * 90) {


Perl will constant-fold it for you.




Then the largest bottleneck is probably the OS and filesystem,
not the programming language (though your algorithm seems
sub-optimal too).




There are several thousand files (~21K) in that directory.

Using a foreach instead of my while loop ("double globbing", I believe
it was called) sped things up noticeably.

Thanks for your help.

Jim
 

Big and Blue

Jim said:
There are several thousand files (~21K) in this directory
Which is a cause of your problem. And it will cause you problems every
time you try to access any of the files too.

If you had 21000 documents you wouldn't throw them all into one drawer
and expect to find one quickly. You'd arrange them by some category and
put each category into a separate drawer. File systems have directory
hierarchies for just such an arrangement. If you use that facility you
will find your code runs much faster.
 

peter pilsl

Big said:
Which is a cause of your problem. And it will cause you problems
every time you try to access any of the files too.

If you had 21000 documents you wouldn't throw them all into one
drawer and expect to find one quickly. You'd arrange them by some
category and put each category into a separate drawer. File systems
have directory hierarchies for just such an arrangement. If you use
that facility you will find your code runs much faster.

21000 files in one folder is not a big deal for modern filesystems.
Like I wrote in my posting:

creating 21k files takes about 1 second on my old server; getting the
stat of each and unlinking it takes about 2 seconds.

best,
peter
 

xhoster

Big and Blue said:
Which is a cause of your problem.

Actually, that wasn't the cause of his problems.
If you had 21000 documents you wouldn't throw them all into one
drawer and expect to find one quickly.

Lots of things are done on computers differently than they are done
by hand.
You'd arrange them by some
category and put each category into a separate drawer.

Assuming that there are different categories to arrange them into.
Maybe the name of the file is the optimal level of categorization
that exists. In which case I would break them into different drawers only
for the artificial reason that drawers only have a certain physical
capacity.
File systems have
directory hierarchies for just such an arrangement.

Good file systems will let you use the natural arrangement rather
than an artificial one. And if that means 20,000 files in a directory,
so be it. On a good file system, it makes a negligible difference. Even
on a bad file system, I suspect it makes far, far less difference than the
double globbing issue does.

Xho
 

Anno Siegel

Jim said:
Type this for more info on the diff between $files[0] and @files[0]:

perldoc -q 'difference.*\$array'
while (<@files>) {

This is doing much more work than you think it is. Change it to:

foreach (@files) {


Changing my while to a foreach has sped up the program considerably.
Thanks to those for the help.

You're missing the point. The speed difference between while and foreach
is marginal. Globbing all the filenames (again) in

while ( <@files> ) {

is what kills it.

Anno
 
