most efficient way to get number of files in a directory

Discussion in 'Perl Misc' started by guba@vi-anec.de, Jan 3, 2010.

  1. Guest

    Hello,

    I am searching for the most efficient way to get the number of files
    in a directory (up to 10^6 files). I will use the number as a stop
    condition of a generation process, so the method must be applied a
    lot of times during this process. Therefore it must be efficient,
    and opendir is not the right choice.

    I am thinking about the bash command "ls | wc -l",
    but I don't know how to get its output into a Perl variable.

    Thank you very much for any help!
     
    , Jan 3, 2010
    #1

  2. "" <> wrote:
    >I am searching for the most efficient way to get the number of files
    >in a directory (up to 10^6 files). I will use the number as a stop
    >condition of a generation process, so the method must be applied a
    >lot of times during this process. Therefore it must be efficient,
    >and opendir is not the right choice.


    opendir() or glob() would have been my first suggestion. But you will
    have to run your own benchmark tests; I doubt that anyone has ever
    investigated performance in such a scenario before.

    >I am thinking about the bash command "ls | wc -l",
    >but I don't know how to get its output into a Perl variable.


    Use backticks (note that the result includes a trailing newline):
    my $captured = `ls | wc -l`;
    chomp $captured;

    Of course, whether launching two external processes and initiating
    IPC is indeed faster than using Perl's built-in functions has to be
    tested.
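
    An untested sketch of such a test, using the core Benchmark module
    (the directory name is just an example):

        use strict;
        use warnings;
        use Benchmark qw(cmpthese);

        my $dir = shift || '.';    # example directory

        cmpthese(-3, {
            # count entries via opendir/readdir (sees dotfiles)
            opendir => sub {
                opendir my $dh, $dir or die "opendir: $!";
                my $n = grep { $_ ne '.' && $_ ne '..' } readdir $dh;
                closedir $dh;
            },
            # count whatever glob() expands to (skips dotfiles)
            glob => sub {
                my $n = () = glob "$dir/*";
            },
            # shell out, as suggested above
            backticks => sub {
                my $n = `ls $dir | wc -l`;
            },
        });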

    jue
     
    Jürgen Exner, Jan 3, 2010
    #2

  3. Uri Guttman Guest

    >>>>> "JE" == Jürgen Exner <> writes:

    JE> "" <> wrote:
    >> I am searching for the most efficient way to get the number of files
    >> in a directory (up to 10^6 files). I will use the number as a stop
    >> condition of a generation process, so the method must be applied a
    >> lot of times during this process. Therefore it must be efficient,
    >> and opendir is not the right choice.


    JE> opendir() or glob() would have been my first suggestion. But you will
    JE> have to run your own benchmark tests; I doubt that anyone has ever
    JE> investigated performance in such a scenario before.

    how would opendir be slower than any other method (perl, shell, ls, glob
    or other)? they ALL must do a system call to opendir underneath, as that
    is the only normal way to read a dir (you can 'open' a dir as a file but
    then you have to parse it out yourself, which can be painful).

    JE> Of course, whether launching two external processes and initiating
    JE> IPC is indeed faster than using Perl's built-in functions has to be
    JE> tested.

    i can't see how they would ever be faster unless they can buffer the
    dirnames better than perl's opendir can (when assigning to an
    array). the fork overhead should easily lose out in this case, but i
    won't benchmark it with 10k files in a dir! :)
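
    for what it's worth, a readdir loop never has to store the names at
    all. an untested sketch:

        use strict;
        use warnings;

        # count entries one at a time; no list is ever built
        sub count_files {
            my ($dir) = @_;
            opendir my $dh, $dir or die "opendir $dir: $!";
            my $count = 0;
            while (defined(my $entry = readdir $dh)) {
                $count++ unless $entry eq '.' or $entry eq '..';
            }
            closedir $dh;
            return $count;
        }

        print count_files('.'), "\n";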

    uri

    --
    Uri Guttman ------ -------- http://www.sysarch.com --
    ----- Perl Code Review, Architecture, Development, Training, Support ------
    --------- Gourmet Hot Cocoa Mix ---- http://bestfriendscocoa.com ---------
     
    Uri Guttman, Jan 3, 2010
    #3
  4. Dr.Ruud Guest

    wrote:

    > I am searching for the most efficient way to get the number of files
    > in a directory (up to 10^6 files). I will use the number as a stop
    > condition of a generation process, so the method must be applied a
    > lot of times during this process. Therefore it must be efficient,
    > and opendir is not the right choice.
    >
    > I am thinking about the bash command "ls | wc -l",
    > but I don't know how to get its output into a Perl variable.


    Why have so many files in a directory? You could create them in
    subdirectories named after the first few characters of the filename.

    Or maybe you are looking for a database solution?

    Or add a byte to a metafile each time a new file is created, and
    check the size of that file?
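
    For example (an untested sketch; the metafile name is made up):

        use strict;
        use warnings;

        my $meta = 'counter.dat';    # hypothetical metafile

        # after creating each new file, append one byte:
        open my $fh, '>>', $meta or die "open $meta: $!";
        print {$fh} '1';
        close $fh;

        # the stop condition then only needs the file size:
        my $count = -s $meta;
        print "created $count files so far\n";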

    --
    Ruud
     
    Dr.Ruud, Jan 3, 2010
    #4
  5. Jürgen Exner wrote:

    > opendir() or glob() would have been my first suggestion. But you will
    > have to run your own benchmark tests, I doubt that anyone has ever
    > investigated performance in such a scenario before.


    Hmm, I've not looked, so you might be right. I'd think someone had
    probably benchmarked the results before, but then again, maybe
    you're right: the number of files in the directory is ridiculously
    large, so people may not have bothered and used a better directory
    structure for the files instead. Daily, I see this as a common issue
    with clients, who ask why their FTP program doesn't show files after
    the 2000th one, and whether we can modify FTP to allow the listing
    of 10-20K files. That's when the education has to begin for the
    client.
    --
    Not really a wanna-be, but I don't know everything.
     
    Wanna-Be Sys Admin, Jan 3, 2010
    #5
  6. John Bokma Guest

    "Dr.Ruud" <> writes:

    > wrote:
    >
    >> I am searching for the most efficient way to get the number of files
    >> in a directory (up to 10^6 files). I will use the number as a stop
    >> condition of a generation process, so the method must be applied a
    >> lot of times during this process. Therefore it must be efficient,
    >> and opendir is not the right choice.
    >>
    >> I am thinking about the bash command "ls | wc -l",
    >> but I don't know how to get its output into a Perl variable.

    >
    > Why have so many files in a directory? You could create them in
    > subdirectories named after the first few characters of the filename.


    I've used the first few characters of the md5 hex digest of the
    filename; depending on how the files are named [1], this might
    distribute the files more evenly.

    [1] E.g. if a lot of filenames start with "the", you might end up
    with a lot of files in the "the" directory.
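
    Something along these lines (an untested sketch; the filename is an
    example):

        use strict;
        use warnings;
        use Digest::MD5 qw(md5_hex);
        use File::Path qw(make_path);

        my $name   = 'the-example.txt';            # example filename
        # the first two hex chars of the digest pick the bucket
        my $bucket = substr md5_hex($name), 0, 2;
        make_path($bucket);                        # create dir if needed
        print "store as $bucket/$name\n";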

    --
    John Bokma

    Read my blog: http://johnbokma.com/
    Hire me (Perl/Python): http://castleamber.com/
     
    John Bokma, Jan 3, 2010
    #6
  7. On 2010-01-03, Wanna-Be Sys Admin <> wrote:
    > Jürgen Exner wrote:
    >
    >> opendir() or glob() would have been my first suggestion. But you will
    >> have to run your own benchmark tests; I doubt that anyone has ever
    >> investigated performance in such a scenario before.

    >
    > Hmm, I've not looked, so you might be right. I'd think someone had
    > probably benchmarked the results before, but then again, maybe
    > you're right: the number of files in the directory is ridiculously
    > large, so people may not have bothered and used a better directory
    > structure for the files instead. Daily, I see this as a common issue
    > with clients, who ask why their FTP program doesn't show files after
    > the 2000th one, and whether we can modify FTP to allow the listing
    > of 10-20K files. That's when the education has to begin for the
    > client.


    ???? Just upgrade the server to use some non-brain-damaged
    filesystem. 100K files in a directory should not be a big deal...
    E.g., AFAIK, with HPFS386 even 1M files would not be much
    user-noticeable.

    Ilya

    P.S. Of course, if one uses some brain-damaged API (like POSIX, which
    AFAIK does not allow a "merged" please_do_readdir_and_stat()
    call), this may significantly slow things down even with
    average-intelligence FSes...
     
    Ilya Zakharevich, Jan 4, 2010
    #7
  8. Ilya Zakharevich wrote:

    > On 2010-01-03, Wanna-Be Sys Admin <> wrote:
    >> Jürgen Exner wrote:
    >>
    >>> opendir() or glob() would have been my first suggestion. But you
    >>> will have to run your own benchmark tests; I doubt that anyone has
    >>> ever investigated performance in such a scenario before.

    >>
    >> Hmm, I've not looked, so you might be right. I'd think someone had
    >> probably benchmarked the results before, but then again, maybe
    >> you're right: the number of files in the directory is ridiculously
    >> large, so people may not have bothered and used a better directory
    >> structure for the files instead. Daily, I see this as a common
    >> issue with clients, who ask why their FTP program doesn't show
    >> files after the 2000th one, and whether we can modify FTP to allow
    >> the listing of 10-20K files. That's when the education has to
    >> begin for the client.

    >
    > ???? Just upgrade the server to use some non-brain-damaged
    > filesystem. 100K files in a directory should not be a big deal...
    > E.g., AFAIK, with HPFS386 even 1M files would not be much
    > user-noticeable.
    >



    A lot of systems I have to fix things on are not ones I make the
    call for. ext3 is about as good as it gets, which is fine, but...
    Anyway, this is also about programs users are limited to by
    management, such as pure-ftpd, where it becomes a resource issue if
    it has to list 20K+ files in each directory. But I do understand
    what you're getting at.
    --
    Not really a wanna-be, but I don't know everything.
     
    Wanna-Be Sys Admin, Jan 4, 2010
    #8
  9. On Sun, 03 Jan 2010 14:46:50 -0800, wrote:

    > I am thinking about the bash command "ls | wc -l" but I don't know how
    > to get this in a perl variable.


    Perl's opendir is better, but if you use ls, you probably want to
    use its unsorted flag.
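
    With GNU ls that is -U (list entries in directory order, without
    sorting), e.g.:

        my $count = `ls -U | wc -l`;
        chomp $count;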

    M4
     
    Martijn Lievaart, Jan 4, 2010
    #9
  10. smallpond Guest

    On Jan 3, 5:46 pm, "" <> wrote:
    > Hello,
    >
    > I am searching for the most efficient way to get the number of files
    > in a directory (up to 10^6 files). I will use the number as a stop
    > condition of a generation process, so the method must be applied a
    > lot of times during this process. Therefore it must be efficient,
    > and opendir is not the right choice.
    >
    > I am thinking about the bash command "ls | wc -l",
    > but I don't know how to get its output into a Perl variable.
    >
    > Thank you very much for any help!



    What file system and OS?
     
    smallpond, Jan 4, 2010
    #10
