Reading a directory, newest files first

jordilin

When I read a huge directory with opendir,

opendir(DIR, "dirname");
my $file;
while ($file = readdir(DIR)) {
    # whatever...
}

it loads the oldest ones first. I would like the newest files first,
instead of the oldest. Since I am only interested in the newest files,
this takes a lot of time, as the directory is really huge. I am
talking about thousands and thousands of files. I need to process only
the files that are less than two hours old; I am not interested in
those older than two hours. I know a file's age because I check the
modification time with stat.
Any idea?
Thanks in advance
 
xhoster

jordilin said:
When I read a huge directory with opendir,

opendir(DIR, "dirname");
my $file;
while ($file = readdir(DIR)) {
    # whatever...
}

it loads the oldest ones first. I would like the newest files first,
instead of the oldest.

That is completely up to your OS and your file system. Perl just provides
a fairly simple conduit for their behavior to reach you.
Since I am only interested in the newest files,
this takes a lot of time, as the directory is really huge. I am
talking about thousands and thousands of files. I need to process only
the files that are less than two hours old; I am not interested in
those older than two hours. I know a file's age because I check the
modification time with stat.
Any idea?

Come up with a better directory structure; one that doesn't involve
keeping thousands and thousands of files in one directory that has to
be scanned over and over again. Or have whatever puts the files into
that directory also write a log, or create a symbolic link in another
directory pointing to each new file; the link can then be deleted once
it is two hours old or so.
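
For illustration, a minimal sketch of the symlink idea, assuming the
producer script can be modified (the paths, names, and the two-hour
sweep are all hypothetical):

use File::Basename 'basename';

# In the producer, right after creating a new file:
my $new = '/data/incoming/some_file';              # hypothetical path
symlink $new, '/data/recent/' . basename($new)
    or warn "symlink failed: $!";

# A periodic sweeper prunes links older than two hours, so the
# consumer only ever has to scan the small /data/recent directory:
opendir my $dh, '/data/recent' or die "Cannot open /data/recent: $!";
for my $link ( grep { !/^\.\.?$/ } readdir $dh ) {
    my $mtime = ( lstat "/data/recent/$link" )[9];  # mtime of the link itself
    unlink "/data/recent/$link" if time - $mtime > 2*60*60;
}
closedir $dh;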

It's possible that your OS and your file system provide other tools for
inspecting very large directories more efficiently, but I rather doubt it.

Xho

 
Gunnar Hjalmarsson

jordilin said:
When I read a huge directory with opendir,

opendir(DIR, "dirname");
my $file;
while ($file = readdir(DIR)) {
    # whatever...
}

it loads the oldest ones first. I would like the newest files first,
instead of the oldest. Since I am only interested in the newest files,
this takes a lot of time,

How much time is that?
as the directory is
really huge. I am talking about thousands and thousands of files. I
need to process only the files that are less than two hours old; I am
not interested in those older than two hours.

You may want to use grep() to assign to an array the files you are
interested in.

my @files = grep -M $_ <= 2/24, readdir DIR;
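
As a fuller, self-contained sketch of that grep() approach (the
directory path is a placeholder; the -f test fills the stat buffer,
which -M _ then reuses):

my $dir = 'dirname';                       # placeholder path
opendir my $dh, $dir or die "Cannot open $dir: $!";
# Keep only plain files modified within the last two hours (2/24 day):
my @files = grep { -f "$dir/$_" && -M _ <= 2/24 } readdir $dh;
closedir $dh;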
 
John W. Krahn

jordilin said:
When I read a huge directory with opendir,
opendir(DIR, "dirname");

You should *always* verify that the directory opened successfully:

opendir DIR, 'dirname' or die "Cannot open 'dirname': $!";

my $file;
while ($file = readdir(DIR)) {
    # whatever...
}
it loads the oldest ones first.

No, it reads the file names in the order that they are stored in the
directory. It is just a coincidence that the older ones appear before
the newer ones. :)
I would like the newest files first, instead of the oldest.

Then you will have to sort them yourself.

perldoc -f sort
Since I am only interested in the newest files,
this takes a lot of time, as the directory is really huge. I am
talking about thousands and thousands of files. I need to process only
the files that are less than two hours old; I am not interested in
those older than two hours. I know a file's age because I check the
modification time with stat.
any idea?

The only thing you can do is read all the file names in the directory
and stat() each one.
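
For illustration, a minimal newest-first sort over stat()'s mtime
field, in the Schwartzian transform style that perldoc -f sort
describes ($dir is a placeholder):

opendir my $dh, $dir or die "Cannot open $dir: $!";
my @newest_first =
    map  { $_->[0] }                         # unwrap the file name
    sort { $b->[1] <=> $a->[1] }             # largest mtime (newest) first
    map  { [ $_, (stat "$dir/$_")[9] ] }     # pair each name with its mtime
    grep { !/^\.\.?$/ } readdir $dh;
closedir $dh;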



John
 
jordilin

How much time is that?


You may want to use grep() to assign to an array the files you are
interested in.

my @files = grep -M $_ <= 2/24, readdir DIR;

To grab the files from two hours ago till now, I have to process each
file to check its modification time. Obviously, if the while loop
checks the oldest files first, it can take more than 10 minutes to
reach the files I am interested in. This directory has a huge number
of files.
 
Gunnar Hjalmarsson

jordilin said:
When I read a huge directory with opendir,

opendir(DIR, "dirname");
my $file;
while ($file = readdir(DIR)) {
    # whatever...
}

it loads the oldest ones first. I would like the newest files first,
instead of the oldest. Since I am only interested in the newest files,
this takes a lot of time, as the directory is really huge. I am
talking about thousands and thousands of files. I need to process only
the files that are less than two hours old; I am not interested in
those older than two hours.

Maybe you should let the system do the desired sorting. On *nix that
might be:

chomp( my @files = qx(ls -t $dir) );
foreach my $file (@files) {
    last if -M "$dir/$file" > 2/24;
    print "$file\n";
}
 
Jürgen Exner

jordilin said:
To grab the files that are from two hours ago till now, I have to
process each file to check the modification time.

Yes. That is what the -M does.
Obviously, if the
while checks the oldest files first, it can take more than 10 minutes
to arrive for those files I am interested in.

That is exactly why Gunnar suggested not using a while() loop but
grep() in the first place.

jue
 
jordilin

Maybe you should let the system do the desired sorting. On *nix that
might be:

chomp( my @files = qx(ls -t $dir) );
foreach my $file (@files) {
    last if -M "$dir/$file" > 2/24;
    print "$file\n";
}

With this code, and taking into account that the directory is huge,
memory usage would be a problem, as we are going to build a huge array
@files, and the Unix server is a very important one. I don't know if
the same could be achieved by means of a while loop. The real problem
is having to process many files before arriving at the interesting
ones; the solution would be to read the newest ones first. I think
there is no solution. We either have to slurp all the file names into
an array (which is going to take time and memory), or process the
whole directory through a while loop (one file at a time) until we get
to the proper files, which is going to take a lot of time as well.
 
jordilin

Yes. That is what the -M does.


That is exactly why Gunnar suggested not using a while() loop but
grep() in the first place.

jue

Yeah, it seems that this would be a solution.
 
Gunnar Hjalmarsson

jordilin said:
With this code, and taking into account that the directory is huge,

How big is "huge"?
memory usage would be a problem, as we are going to build a huge array
@files, and the Unix server is a very important one. I don't know if
the same could be achieved by means of a while loop. The real problem
is having to process many files before arriving at the interesting ones.

With the above suggestion you wouldn't _process_ any files but the
interesting ones; you'd just store their names in an array.
The solution would be reading the newest ones first.

And that's what the -t option achieves...
I think there is no solution.
??

We have, either to slurp all the files into an array (which
is going to take time and memory), or process the whole directory
through a while (one file at a time) till we get the proper files,
which in this case is going to take a lot of time as well.

Have you measured the time for various options? You may want to study
the Benchmark module.
 
xhoster

Gunnar Hjalmarsson said:
And that's what the -t option achieves...

No, the -t option tells ls to *present* the newest ones first, not to
read them first. To present them in that order, it first needs to read all
of the directory entries in whatever order the file system deigns to
deliver them, stat them all, and sort the results based on time. There is
no reason to think that ls is going to be meaningfully faster about this
than perl will.

Xho

 
Gunnar Hjalmarsson

No, the -t option tells ls to *present* the newest ones first, not to
read them first. To present them in that order, it first needs to read all
of the directory entries in whatever order the file system deigns to
deliver them, stat them all, and sort the results based on time.

So far I agree, but ...
There is no reason to think that ls is going to be meaningfully
faster about this than perl will.

... my benchmark (see below) indicates otherwise. The difference seems
to increase when the directory size increases.

$ cat sortdir.pl
use Benchmark 'cmpthese';
my $dir = '/usr/lib';
cmpthese -5, {
    Linux => sub {
        chomp( my @files = qx(ls -t $dir) );
    },
    Perl => sub {
        chdir $dir;
        opendir( my $DH, '.' );
        my @files = map  { $_->[0] }
                    sort { $a->[1] <=> $b->[1] }
                    map  { [ $_, -M ] }
                    grep substr($_, 0, 1) ne '.', readdir $DH;
    },
};

$ perl sortdir.pl
        Rate  Perl Linux
Perl   174/s    --  -75%
Linux  693/s  297%    --
 
Juha Laiho

jordilin said:
Yeah, it seems that this would be a solution.

If you're concerned about memory use, you wouldn't use this -- it will
implicitly _first_ load all the directory entries into memory, and
will start filtering on modification dates only after the directory
scan has completed.

A small benchmark, to run stat() on all files in a directory (a directory
with one million files, names 000000 .. 999999):

$ time perl -e 'opendir($DH,"."); while ($f=readdir($DH)) { stat($f); }; closedir($DH);'

real 8m0.728s
user 0m2.367s
sys 0m21.523s

While running this, the memory usage (according to "top") was rather
constant 7MB.


Another way to do the same - which would behave like the "grep" example:

$ time perl -e 'opendir($DH,"."); foreach $f (readdir($DH)) { stat($f); }; closedir($DH);'

real 11m9.247s
user 0m1.957s
sys 0m21.953s

With this approach, the memory usage initially climbed to reach approx. 50MB,
and remained there until the completion.
 
Peter J. Holzer

With this code, and taking into account that the directory is huge,
memory usage would be a problem, as we are going to build a huge array
@files, and the Unix server is a very important one.

That would be easily remedied by reading from a pipe. But I don't think
Gunnar's suggestion is really faster. It needs to read the directory
and stat all the files (which takes the same time as your code), *then*
it needs to sort them (which takes additional time), *then* your code
needs to read the sorted list.
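
For illustration, a minimal sketch of the pipe variant, which holds
only one file name in memory at a time ($dir is a placeholder):

open my $ls, '-|', 'ls', '-t', $dir or die "Cannot run ls: $!";
while ( my $file = <$ls> ) {
    chomp $file;
    last if -M "$dir/$file" > 2/24;   # ls -t is newest-first, so stop here
    print "$file\n";
}
close $ls;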
I don't know if the same could be achieved by means of a while loop.
The real problem is having to process many files before arriving at
the interesting ones; the solution would be to read the newest ones
first. I think there is no solution.

The solution, as Xho suggested, is to come up with a better directory
structure.

If you can't do that:

Is there any way you can deduce the age of the files from the file name?
If you can avoid stat'ing all these files it will be a lot faster. You
don't need the exact age - if you can determine, from the filename
alone, that a file is surely older than two hours you don't have to stat
it.
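
As a purely hypothetical illustration: if the names embedded a
timestamp, say log-20070312-1405, a fixed-width string comparison
would replace the stat entirely:

use POSIX 'strftime';

# The name a file created exactly two hours ago would carry
# (hypothetical naming scheme):
my $cutoff = strftime 'log-%Y%m%d-%H%M', localtime( time - 2*60*60 );

opendir my $dh, $dir or die "Cannot open $dir: $!";
# Lexical 'ge' works because the timestamp fields are fixed-width:
my @recent = grep { /^log-\d{8}-\d{4}/ && $_ ge $cutoff } readdir $dh;
closedir $dh;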

Do these files get written once, or are they constantly updated? If it's
the former, you can cache their last-modified-dates. Reading them from
a file or memcached is likely to be a lot faster than a stat.
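
A rough sketch of such a cache using Storable (the cache location and
the write-once assumption are mine, not the OP's):

use Storable qw(retrieve nstore);

my $cache_file = '/var/tmp/mtime_cache';   # hypothetical location
my $cache = -e $cache_file ? retrieve($cache_file) : {};

opendir my $dh, $dir or die "Cannot open $dir: $!";
for my $f ( grep { !/^\.\.?$/ } readdir $dh ) {
    # Write-once assumption: stat only files we haven't seen before.
    $cache->{$f} = ( stat "$dir/$f" )[9] unless exists $cache->{$f};
}
closedir $dh;
nstore $cache, $cache_file;

my $cutoff = time - 2*60*60;
my @recent = grep { $cache->{$_} >= $cutoff } keys %$cache;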

hp
 
Peter J. Holzer

There is no reason to think that ls is going to be meaningfully
faster about this than perl will.

... my benchmark (see below) indicates otherwise. The difference seems
to increase when the directory size increases.

$ cat sortdir.pl
use Benchmark 'cmpthese';
my $dir = '/usr/lib';
cmpthese -5, {
    Linux => sub {
        chomp( my @files = qx(ls -t $dir) );
    },
    Perl => sub {
        chdir $dir;
        opendir( my $DH, '.' );
        my @files = map  { $_->[0] }
                    sort { $a->[1] <=> $b->[1] }
                    map  { [ $_, -M ] }
                    grep substr($_, 0, 1) ne '.', readdir $DH;
    },
};

$ perl sortdir.pl
        Rate  Perl Linux
Perl   174/s    --  -75%
Linux  693/s  297%    --

Your benchmark isn't valid: You are processing the complete directory
several hundred times per second, which indicates that it fits
completely into the buffer cache. After the first time you are measuring
mostly the processing time of ls and perl, not disk accesses.
Jordilin wrote that it takes about 10 minutes to process the directory
just once, which indicates that it either doesn't fit into the cache, or
that it is evicted from the cache between runs (which is quite likely on
a busy system), so he does have to access the disk for every file.

hp
 
Mark Clements

jordilin said:
When I read a huge directory with opendir,

opendir(DIR, "dirname");
my $file;
while ($file = readdir(DIR)) {
    # whatever...
}

it loads the oldest ones first. I would like the newest files first,
instead of the oldest. Since I am only interested in the newest files,
this takes a lot of time, as the directory is really huge. I am
talking about thousands and thousands of files. I need to process only
the files that are less than two hours old; I am not interested in
those older than two hours. I know a file's age because I check the
modification time with stat.
Any idea?
Thanks in advance

Maybe you could do something with File::Monitor, although I know nothing
of its efficiency with large directories. FAM may also be worth a look.
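
For what it's worth, a sketch along the lines of File::Monitor's
synopsis (this is from memory; the files => 1 option and the delta's
files_created accessor should be checked against the module's
documentation):

use File::Monitor;

my $monitor = File::Monitor->new;
$monitor->watch( { name => $dir, files => 1 } );  # track files in $dir
$monitor->scan;   # the first scan only establishes a baseline

# ... on each later pass, only the changes are reported:
for my $change ( $monitor->scan ) {
    print "$_\n" for $change->files_created;
}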

Mark
 
Gunnar Hjalmarsson

Peter said:
There is no reason to think that ls is going to be meaningfully
faster about this than perl will.
... my benchmark (see below) indicates otherwise. The difference seems
to increase when the directory size increases.

$ cat sortdir.pl
use Benchmark 'cmpthese';
my $dir = '/usr/lib';
cmpthese -5, {
    Linux => sub {
        chomp( my @files = qx(ls -t $dir) );
    },
    Perl => sub {
        chdir $dir;
        opendir( my $DH, '.' );
        my @files = map  { $_->[0] }
                    sort { $a->[1] <=> $b->[1] }
                    map  { [ $_, -M ] }
                    grep substr($_, 0, 1) ne '.', readdir $DH;
    },
};

$ perl sortdir.pl
        Rate  Perl Linux
Perl   174/s    --  -75%
Linux  693/s  297%    --

Your benchmark isn't valid: You are processing the complete directory
several hundred times per second, which indicates that it fits
completely into the buffer cache. After the first time you are measuring
mostly the processing time of ls and perl, not disk accesses.

And that's what we were discussing, so I can't see that the benchmark
wouldn't be valid.
 
Peter J. Holzer

Peter said:
xhoster wrote:
There is no reason to think that ls is going to be meaningfully
faster about this than perl will.
... my benchmark (see below) indicates otherwise. The difference seems
to increase when the directory size increases. [...]
$ perl sortdir.pl
        Rate  Perl Linux
Perl   174/s    --  -75%
Linux  693/s  297%    --

Your benchmark isn't valid: You are processing the complete directory
several hundred times per second, which indicates that it fits
completely into the buffer cache. After the first time you are measuring
mostly the processing time of ls and perl, not disk accesses.

And that's what we were discussing,

If you have been discussing this, you totally missed the OP's problem.

The OP has to read a directory which - and I repeat this - takes more
than ten *minutes* to read. Your benchmark reads the directory in 5.7
(perl) or 1.4 (ls) *milliseconds*. That's a difference of more than
four orders of magnitude!

That tells us that the OP reads from a cold cache, while you read from
a hot cache: in your case CPU time is the dominant factor, so ls (being
written in C) will be faster. In the OP's case disk access time will be
the dominant factor, and any CPU usage advantage of ls will be
completely insignificant (and indeed ls may take more time, because it
has to sort the files *after* having stat'ed them).

The only way to speed up this program is to reduce the number of disk
accesses. Xho already suggested the way with the most potential: change
the directory structure - having directories with hundreds of thousands
or millions of files is not a good idea, even on filesystems with tree-
or hash-structured directories. I asked whether another approach is
feasible: estimating the age from the filename, so you don't have to
stat each file.

There is a third one, but I don't think it works in pure perl, because
you need access to information readdir doesn't deliver: read the
directory, sort by inode number, then stat the files in that order.
This doesn't reduce the number of stat calls, but it reduces the number
of disk seeks, which may provide a major speedup (there's a patch for
mutt which uses this technique for maildirs, and it's really a lot
faster for large mailboxes).
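
One possible workaround, sketched under the assumption of GNU ls,
where -i prints inode numbers and -U lists entries in raw directory
order:

# Get inode numbers without pure-Perl access to d_ino:
open my $ls, '-|', 'ls', '-iU', $dir or die "Cannot run ls: $!";
my @entries;
while (<$ls>) {
    my ( $ino, $name ) = split ' ', $_, 2;
    chomp $name;
    push @entries, [ $ino, $name ];
}
close $ls;

# Stat in ascending inode order to keep disk seeks mostly sequential:
for my $e ( sort { $a->[0] <=> $b->[0] } @entries ) {
    my $mtime = ( stat "$dir/$e->[1]" )[9];
    # ... process files with $mtime >= time - 2*60*60 ...
}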

hp
 
