most efficient way to get number of files in a directory


guba

Hello,

I am searching for the most efficient way to get the number of
files in a directory (up to 10^6 files). I will use the number as
a stop condition of a generation process, so the method must be
applied many times during this process. Therefore it must be
efficient, and opendir is not an option.

I am thinking about the bash command "ls | wc -l",
but I don't know how to get the result into a Perl variable.

Thank you very much for any help!
 

Jürgen Exner

I am searching for the most efficient way to get the number of
files in a directory (up to 10^6 files). I will use the number as
a stop condition of a generation process, so the method must be
applied many times during this process. Therefore it must be
efficient, and opendir is not an option.

opendir() or glob() would have been my first suggestion. But you will
have to run your own benchmark tests; I doubt that anyone has ever
investigated performance in such a scenario before.
I am thinking about the bash command "ls | wc -l",
but I don't know how to get the result into a Perl variable.

Use backticks:
my $captured = `ls | wc -l`;

Of course, whether launching two external processes and initiating
IPC is indeed faster than using Perl's built-in functions remains
to be tested.
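A quick sketch of that comparison (the temp directory and its three
files are invented purely for illustration; with 10^6 files the fork
overhead would matter far more):

```shell
# Compare a pure-Perl count (opendir/readdir) against shelling out
# to `ls | wc -l`. The temp directory is made up for the example.
dir=$(mktemp -d)
touch "$dir/a" "$dir/b" "$dir/c"

# readdir in list context, filtering out . and ..;
# grep in scalar context returns the match count
perl_count=$(perl -e '
    opendir my $dh, $ARGV[0] or die "opendir: $!";
    my $n = grep { $_ ne "." && $_ ne ".." } readdir $dh;
    print $n;
' "$dir")

# Two forked processes plus a pipe
ls_count=$(ls "$dir" | wc -l)

echo "perl=$perl_count ls=$ls_count"
rm -rf "$dir"
```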

jue
 

Uri Guttman

JE> opendir() or glob() would have been my first suggestion. But you will
JE> have to run your own benchmark tests, I doubt that anyone has ever
JE> investigated performance in such a scenario before.

how would opendir be slower than any other method (perl, shell, ls, glob
or other)? they ALL must do a system call to opendir underneath as that
is the only normal way to read a dir (you can 'open' a dir as a file but
then you have to parse it out yourself which can be painful).

JE> Of course, if launching two external processes and initiating IPC is
JE> indeed faster than using Perl's built-in functions has to be tested.

i can't see how they would ever be faster unless they can buffer the
dirnames better than perl's opendir can (when assigning to an
array). the fork overhead should easily lose out in this case but i
won't benchmark it with 10k files in a dir! :)

uri
 

Dr.Ruud

I am searching for the most efficient way to get the number of
files in a directory (up to 10^6 files). I will use the number as
a stop condition of a generation process, so the method must be
applied many times during this process. Therefore it must be
efficient, and opendir is not an option.

I am thinking about the bash command "ls | wc -l",
but I don't know how to get the result into a Perl variable.

Why have so many files in a directory? You could create them in
subdirectories named after the first few characters of the filename.

Or maybe you are looking for a database solution?

Or add a byte to a metafile each time a new file is created, and
check the size of that file?
 

Wanna-Be Sys Admin

Jürgen Exner said:
opendir() or glob() would have been my first suggestion. But you will
have to run your own benchmark tests, I doubt that anyone has ever
investigated performance in such a scenario before.

Hmm, I've not looked, so you might be right, but I'd think someone
probably had benchmarked the results before. Then again, maybe
you're right: the number of files in the directory itself is
ridiculously large, so someone may not have bothered and used a
better directory structure for the files instead. Daily, I see this
as a common issue with clients, asking why their FTP program doesn't
show files after the 2000th one, and whether we can modify FTP to
allow the listing of 10-20K files. That's when the education has to
begin for the client.
 

John Bokma

Dr.Ruud said:
Why have so many files in a directory? You could create them in
subdirectories named after the first few characters of the filename.

I've used the first few characters of the MD5 hex digest of the
filename; depending on how the files are named [1], this might
distribute the files more evenly.

[1] E.g. if a lot of files start with "the", you might end up with
a lot of files in the "the" directory.
 

Ilya Zakharevich

Hmm, I've not looked, so you might be right, but I'd think someone
probably had benchmarked the results before. Then again, maybe
you're right: the number of files in the directory itself is
ridiculously large, so someone may not have bothered and used a
better directory structure for the files instead. Daily, I see this
as a common issue with clients, asking why their FTP program doesn't
show files after the 2000th one, and whether we can modify FTP to
allow the listing of 10-20K files. That's when the education has to
begin for the client.

???? Just upgrade the server to use some non-brain-damaged
filesystem. 100K files in a directory should not be a big deal...
E.g., AFAIK, with HPFS386 even 1M files would not be much
user-noticeable.

Ilya

P.S. Of course, if one uses some brain-damaged API (like POSIX,
which AFAIK does not allow a "merged" please_do_readdir_and_stat()
call), this may significantly slow things down even with
average-intelligence FSes...
 

Wanna-Be Sys Admin

Ilya said:
???? Just upgrade the server to use some non-brain-damaged
filesystem. 100K files in a directory should not be a big deal...
E.g., AFAIK, with HPFS386 even 1M files would not be much
user-noticeable.


A lot of systems I have to fix things on are not ones I make the
call for. ext3 is about as good as it gets, which is fine, but...
Anyway, this is also about programs users are limited to by
management, such as pure-ftpd, where it becomes a resource issue if
it has to list 20K+ files in each directory. But I do understand
what you're getting at.
 

Martijn Lievaart

I am thinking about the bash command "ls | wc -l" but I don't know how
to get this in a perl variable.

Perl's opendir is better, but if you use ls, you probably want to
use its unsorted flag.
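POSIX ls has -f for this: it skips the sort pass, but it also
implies -a, so "." and ".." land in the count. A sketch (the temp
directory is invented for the example):

```shell
# Unsorted listing: -f avoids sorting the entries before printing,
# but the raw line count then includes "." and "..".
dir=$(mktemp -d)
touch "$dir/a" "$dir/b" "$dir/c"

total=$(ls -f "$dir" | wc -l)   # 3 files + . + ..
count=$((total - 2))

echo "$count"
rm -rf "$dir"
```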

M4
 

smallpond

Hello,

I am searching for the most efficient way to get the number of
files in a directory (up to 10^6 files). I will use the number as
a stop condition of a generation process, so the method must be
applied many times during this process. Therefore it must be
efficient, and opendir is not an option.

I am thinking about the bash command "ls | wc -l",
but I don't know how to get the result into a Perl variable.

Thank you very much for any help!


What file system and OS?
 
