Iterating over files of a huge directory

Discussion in 'Python' started by Gilles Lenfant, Dec 17, 2012.

  1. Hi,

    I have googled but did not find an efficient solution to my problem. My customer provides a directory with a huuuuge list of files (flat, potentially 100000+) and I cannot reasonably use os.listdir(this_path) unless creating a big memory footprint.

    So I'm looking for an iterator that yields the file names of a directory and does not make a giant list of what's in.

    i.e :

    for filename in enumerate_files(some_directory):
        # My cooking...
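    For readers landing on this thread later: the lazy iteration asked for above can be sketched with os.scandir(), which did not exist when this was posted; it arrived in Python 3.5 (PEP 471) as a direct outcome of the scandir/betterwalk work discussed further down. The names `enumerate_files` and the demo files are hypothetical.

```python
import os
import tempfile

def enumerate_files(path):
    """Lazily yield names of regular files in *path*, one at a time."""
    # os.scandir: Python 3.5+ (PEP 471); the with-form needs 3.6+.
    with os.scandir(path) as entries:
        for entry in entries:
            if entry.is_file():
                yield entry.name

# Quick demo on a throwaway directory:
with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.txt", "b.txt"):
        open(os.path.join(tmp, name), "w").close()
    print(sorted(enumerate_files(tmp)))  # ['a.txt', 'b.txt']
```

    Because the function is a generator, memory use stays constant no matter how many entries the directory holds.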

    Many thanks in advance.
    --
    Gilles Lenfant
     
    Gilles Lenfant, Dec 17, 2012
    #1

  2. On Tue, Dec 18, 2012 at 2:28 AM, Gilles Lenfant
    <> wrote:
    > Hi,
    >
    > I have googled but did not find an efficient solution to my problem. My customer provides a directory with a huuuuge list of files (flat, potentially 100000+) and I cannot reasonably use os.listdir(this_path) unless creating a big memory footprint.
    >
    > So I'm looking for an iterator that yields the file names of a directory and does not make a giant list of what's in.


    Sounds like you want os.walk. But... a hundred thousand files? I know
    the Zen of Python says that flat is better than nested, but surely
    there's some kind of directory structure that would make this
    marginally manageable?

    http://docs.python.org/3.3/library/os.html#os.walk
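    A minimal sketch of this suggestion (the demo tree and file names are hypothetical): os.walk yields one (dirpath, dirnames, filenames) triple per directory, so the outer loop is lazy across directories, though each `filenames` value is still a per-directory list built by os.listdir underneath.

```python
import os
import tempfile

# Hypothetical demo tree: one file at the top, one in a subdirectory.
root = tempfile.mkdtemp()
os.mkdir(os.path.join(root, "sub"))
open(os.path.join(root, "top.txt"), "w").close()
open(os.path.join(root, "sub", "nested.txt"), "w").close()

# One (dirpath, dirnames, filenames) triple per directory visited.
found = []
for dirpath, dirnames, filenames in os.walk(root):
    for name in filenames:
        found.append(os.path.join(dirpath, name))

print(sorted(os.path.basename(p) for p in found))  # ['nested.txt', 'top.txt']
```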

    ChrisA
     
    Chris Angelico, Dec 17, 2012
    #2

  3. Gilles Lenfant

    Tim Golden Guest

    On 17/12/2012 15:41, Chris Angelico wrote:
    > On Tue, Dec 18, 2012 at 2:28 AM, Gilles Lenfant
    > <> wrote:
    >> Hi,
    >>
    >> I have googled but did not find an efficient solution to my
    >> problem. My customer provides a directory with a huuuuge list of
    >> files (flat, potentially 100000+) and I cannot reasonably use
    >> os.listdir(this_path) unless creating a big memory footprint.
    >>
    >> So I'm looking for an iterator that yields the file names of a
    >> directory and does not make a giant list of what's in.

    >
    > Sounds like you want os.walk. But... a hundred thousand files? I
    > know the Zen of Python says that flat is better than nested, but
    > surely there's some kind of directory structure that would make this
    > marginally manageable?
    >
    > http://docs.python.org/3.3/library/os.html#os.walk


    Unfortunately all of the built-in functions (os.walk, glob.glob,
    os.listdir) rely on the os.listdir functionality which produces a list
    first even if (as in glob.iglob) it later iterates over it.

    There are external functions to iterate over large directories in both
    Windows & Linux. I *think* the OP is on *nix from his previous posts, in
    which case someone else will have to produce the Linux-speak for this.
    If it's Windows, you can use the FindFilesIterator in the pywin32 package.

    TJG
     
    Tim Golden, Dec 17, 2012
    #3
  4. Gilles Lenfant

    marduk Guest

    On Mon, Dec 17, 2012, at 10:28 AM, Gilles Lenfant wrote:
    > Hi,
    >
    > I have googled but did not find an efficient solution to my problem. My
    > customer provides a directory with a huuuuge list of files (flat,
    > potentially 100000+) and I cannot reasonably use os.listdir(this_path)
    > unless creating a big memory footprint.
    >
    > So I'm looking for an iterator that yields the file names of a directory
    > and does not make a giant list of what's in.
    >
    > i.e :
    >
    > for filename in enumerate_files(some_directory):
    > # My cooking...
    >



    You could try using opendir[1] which is a binding to the posix call. I
    believe that it returns an iterator (file-like) of the entries in the
    directory.

    [1] http://pypi.python.org/pypi/opendir/
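    A rough sketch of the same opendir/readdir idea via ctypes follows. Everything here is an assumption for illustration: the `Dirent` layout matches glibc on x86-64 Linux and is not portable, and `iter_dir` is a hypothetical name; the opendir package linked above wraps the same calls properly.

```python
import ctypes
import ctypes.util

class Dirent(ctypes.Structure):
    # glibc/x86-64 struct dirent layout (an assumption, not portable).
    _fields_ = [
        ("d_ino", ctypes.c_ulong),       # inode number
        ("d_off", ctypes.c_long),        # offset to next dirent
        ("d_reclen", ctypes.c_ushort),   # length of this record
        ("d_type", ctypes.c_ubyte),      # entry type (DT_REG, DT_DIR, ...)
        ("d_name", ctypes.c_char * 256), # null-terminated entry name
    ]

_libc = ctypes.CDLL(ctypes.util.find_library("c") or None, use_errno=True)
_libc.opendir.restype = ctypes.c_void_p
_libc.readdir.restype = ctypes.POINTER(Dirent)

def iter_dir(path):
    """Yield entry names one at a time, never building a full list."""
    handle = _libc.opendir(path.encode())
    if not handle:
        raise OSError(ctypes.get_errno(), "opendir failed", path)
    try:
        while True:
            entry = _libc.readdir(ctypes.c_void_p(handle))
            if not entry:  # NULL pointer: end of directory
                break
            name = entry.contents.d_name.decode()
            if name not in (".", ".."):
                yield name
    finally:
        _libc.closedir(ctypes.c_void_p(handle))
```

    The d_type field is also how such an iterator can report file-vs-directory without a separate stat() call per entry, which is the other point raised in this thread.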
     
    marduk, Dec 17, 2012
    #4
  5. On 17 December 2012 15:28, Gilles Lenfant <> wrote:
    > I have googled but did not find an efficient solution to my problem. My customer provides a directory with a huuuuge list of files (flat, potentially 100000+) and I cannot reasonably use os.listdir(this_path) unless creating a big memory footprint.
    >
    > So I'm looking for an iterator that yields the file names of a directory and does not make a giant list of what's in.
    >
    > i.e :
    >
    > for filename in enumerate_files(some_directory):
    > # My cooking...


    In the last couple of months there has been a lot of discussion (on
    python-list or python-dev - not sure) about creating a library to more
    efficiently iterate over the files in a directory. The result so far
    is this library on github:
    https://github.com/benhoyt/betterwalk

    It says there that
    """
    Somewhat relatedly, many people have also asked for a version of
    os.listdir() that yields filenames as it iterates instead of returning
    them as one big list.

    So as well as a faster walk(), BetterWalk adds iterdir_stat() and
    iterdir(). They're pretty easy to use, but see below for the full API
    docs.
    """

    Does that code work for you? If so, I imagine the author would be
    interested to get some feedback on how well it works.

    Alternatively, perhaps consider calling an external utility.
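    One hedged sketch of that external-utility route on *nix (assumes GNU find on PATH; `find_files` and the demo names are hypothetical): stream find's output line by line, so Python holds only one name at a time.

```python
import os
import subprocess
import tempfile

def find_files(path):
    """Yield file paths from `find`, reading its stdout lazily."""
    proc = subprocess.Popen(
        ["find", path, "-maxdepth", "1", "-type", "f"],  # -maxdepth is GNU find
        stdout=subprocess.PIPE,
    )
    try:
        for line in proc.stdout:  # one line at a time, never a full list
            yield line.rstrip(b"\n").decode()
    finally:
        proc.stdout.close()
        proc.wait()

# Quick demo on a throwaway directory:
tmp = tempfile.mkdtemp()
for name in ("x.dat", "y.dat"):
    open(os.path.join(tmp, name), "w").close()
print(sorted(os.path.basename(p) for p in find_files(tmp)))  # ['x.dat', 'y.dat']
```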


    Oscar
     
    Oscar Benjamin, Dec 17, 2012
    #5
  6. On Monday, December 17, 2012 at 4:52:19 PM UTC+1, Oscar Benjamin wrote:
    > On 17 December 2012 15:28, Gilles Lenfant <...> wrote:
    >
    > In the last couple of months there has been a lot of discussion (on
    > python-list or python-dev - not sure) about creating a library to more
    > efficiently iterate over the files in a directory. The result so far
    > is this library on github:
    > https://github.com/benhoyt/betterwalk
    >
    > It says there that
    > """
    > Somewhat relatedly, many people have also asked for a version of
    > os.listdir() that yields filenames as it iterates instead of returning
    > them as one big list.
    >
    > So as well as a faster walk(), BetterWalk adds iterdir_stat() and
    > iterdir(). They're pretty easy to use, but see below for the full API
    > docs.
    > """
    >
    > Does that code work for you? If so, I imagine the author would be
    > interested to get some feedback on how well it works.
    >
    > Alternatively, perhaps consider calling an external utility.


    Many thanks for this pointer Oscar.

    "betterwalk" is exactly what I was looking for, particularly iterdir(...) and iterdir_stat(...).
    I'll take a deeper look at betterwalk and provide (hopefully successful) feedback.

    Cheers
    --
    Gilles Lenfant
     
    Gilles Lenfant, Dec 17, 2012
    #6
  8. Gilles Lenfant

    Paul Rudin Guest

    Chris Angelico <> writes:

    > On Tue, Dec 18, 2012 at 2:28 AM, Gilles Lenfant
    > <> wrote:
    >> Hi,
    >>
    >> I have googled but did not find an efficient solution to my
    >> problem. My customer provides a directory with a huuuuge list of
    >> files (flat, potentially 100000+) and I cannot reasonably use
    >> os.listdir(this_path) unless creating a big memory footprint.
    >>
    >> So I'm looking for an iterator that yields the file names of a
    >> directory and does not make a giant list of what's in.

    >
    > Sounds like you want os.walk.


    But doesn't os.walk call listdir() and that creates a list of the
    contents of a directory, which is exactly the initial problem?

    > But... a hundred thousand files? I know the Zen of Python says that
    > flat is better than nested, but surely there's some kind of directory
    > structure that would make this marginally manageable?
    >


    Sometimes you have to deal with things other people have designed, so
    the directory structure is not something you can control. I've run up
    against exactly the same problem and made something in C that
    implemented an iterator.

    It would probably be better if listdir() made an iterator rather than a
    list.
     
    Paul Rudin, Dec 17, 2012
    #8
  9. Gilles Lenfant

    MRAB Guest

    On 2012-12-17 17:27, Paul Rudin wrote:
    > Chris Angelico <> writes:
    >
    >> On Tue, Dec 18, 2012 at 2:28 AM, Gilles Lenfant
    >> <> wrote:
    >>> Hi,
    >>>
    >>> I have googled but did not find an efficient solution to my
    >>> problem. My customer provides a directory with a huuuuge list of
    >>> files (flat, potentially 100000+) and I cannot reasonably use
    >>> os.listdir(this_path) unless creating a big memory footprint.
    >>>
    >>> So I'm looking for an iterator that yields the file names of a
    >>> directory and does not make a giant list of what's in.

    >>
    >> Sounds like you want os.walk.

    >
    > But doesn't os.walk call listdir() and that creates a list of the
    > contents of a directory, which is exactly the initial problem?
    >
    >> But... a hundred thousand files? I know the Zen of Python says that
    >> flat is better than nested, but surely there's some kind of directory
    >> structure that would make this marginally manageable?
    >>

    >
    > Sometimes you have to deal with things other people have designed, so
    > the directory structure is not something you can control. I've run up
    > against exactly the same problem and made something in C that
    > implemented an iterator.
    >

    <Off topic>
    Years ago I had to deal with an in-house application that was written
    using a certain database package. The package stored each predefined
    query in a separate file in the same directory.

    I found that if I packed all the predefined queries into a single file
    and then called an external utility to extract the desired query from
    the file every time it was needed into a file for the package to use,
    not only did it save a significant amount of disk space (hard disks
    were a lot smaller then), I also got a significant speed-up!

    It wasn't as bad as 100000 in one directory, but it was certainly too
    many...
    </Off topic>
    > It would probably be better if listdir() made an iterator rather than a
    > list.
    >
     
    MRAB, Dec 17, 2012
    #9
  10. On 12/17/2012 09:52 AM, Oscar Benjamin wrote:
    > In the last couple of months there has been a lot of discussion (on
    > python-list or python-dev - not sure) about creating a library to more
    > efficiently iterate over the files in a directory. The result so far
    > is this library on github:
    > https://github.com/benhoyt/betterwalk


    This is very useful to know about; thanks.

    I actually wrote something very similar on my own (I wanted to get
    information about whether each directory entry was a file, directory,
    symlink, etc. without separate stat() calls). I'm guessing that the
    library you linked is more mature than mine (I only have a Linux
    implementation at present, for instance) so I'm happy to see that I
    could probably switch to something better... and even happier that it
    sounds like it's aiming for inclusion in the standard library.


    (Also just for the record and anyone looking for other posts, I'd guess
    said discussion was on Python-dev. I don't look at even remotely
    everything on python-list (there's just too much), but I do skim most
    subject lines and I haven't noticed any discussion on it before now.)

    Evan




     
    Evan Driscoll, Dec 17, 2012
    #10
  11. On 17 December 2012 18:40, Evan Driscoll <> wrote:
    > On 12/17/2012 09:52 AM, Oscar Benjamin wrote:
    >> https://github.com/benhoyt/betterwalk

    >
    > This is very useful to know about; thanks.
    >
    > I actually wrote something very similar on my own (I wanted to get
    > information about whether each directory entry was a file, directory,
    > symlink, etc. without separate stat() calls).


    The initial goal of betterwalk seemed to be the ability to do os.walk
    with fewer stat calls. I think the information you want is part of
    what betterwalk finds "for free" from the underlying OS iteration
    (without the need to call stat()) but I'm not sure.

    > (Also just for the record and anyone looking for other posts, I'd guess
    > said discussion was on Python-dev. I don't look at even remotely
    > everything on python-list (there's just too much), but I do skim most
    > subject lines and I haven't noticed any discussion on it before now.)


    Actually, it was python-ideas:
    http://thread.gmane.org/gmane.comp.python.ideas/17932
    http://thread.gmane.org/gmane.comp.python.ideas/17757
     
    Oscar Benjamin, Dec 17, 2012
    #11
  12. On 12/17/2012 01:50 PM, Oscar Benjamin wrote:
    > On 17 December 2012 18:40, Evan Driscoll <> wrote:
    >> On 12/17/2012 09:52 AM, Oscar Benjamin wrote:
    >>> https://github.com/benhoyt/betterwalk

    >>
    >> This is very useful to know about; thanks.
    >>
    >> I actually wrote something very similar on my own (I wanted to get
    >> information about whether each directory entry was a file, directory,
    >> symlink, etc. without separate stat() calls).

    >
    > The initial goal of betterwalk seemed to be the ability to do os.walk
    > with fewer stat calls. I think the information you want is part of
    > what betterwalk finds "for free" from the underlying OS iteration
    > (without the need to call stat()) but I'm not sure.


    Yes, that's my impression as well.


    >> (Also just for the record and anyone looking for other posts, I'd guess
    >> said discussion was on Python-dev. I don't look at even remotely
    >> everything on python-list (there's just too much), but I do skim most
    >> subject lines and I haven't noticed any discussion on it before now.)

    >
    > Actually, it was python-ideas:
    > http://thread.gmane.org/gmane.comp.python.ideas/17932
    > http://thread.gmane.org/gmane.comp.python.ideas/17757


    Thanks again for the pointers; I'll have to go through that thread. It's
    possible I can contribute something; it sounds like at least at one
    point the implementation was ctypes-based and is sometimes slower, and I
    have both a (now-defunct) C implementation and my current Cython module.
    Ironically I haven't actually benchmarked mine. :)

    Evan


     
    Evan Driscoll, Dec 17, 2012
    #12
  13. Gilles Lenfant

    Terry Reedy Guest

    On 12/17/2012 10:28 AM, Gilles Lenfant wrote:
    > Hi,
    >
    > I have googled but did not find an efficient solution to my problem.
    > My customer provides a directory with a huuuuge list of files (flat,
    > potentially 100000+) and I cannot reasonably use
    > os.listdir(this_path) unless creating a big memory footprint.


    Is it really big enough to be a real problem? See below.

    > So I'm looking for an iterator that yields the file names of a
    > directory and does not make a giant list of what's in.
    >
    > i.e :
    >
    > for filename in enumerate_files(some_directory): # My cooking...


    See http://bugs.python.org/issue11406
    As I said there, I personally think (and still do) that listdir should
    have been changed in 3.0 to return an iterator rather than a list.
    Developers whose opinions carry more weight than mine disagree, on the
    basis that no application has the millions of directory entries needed
    to make space a real issue. They also claim that time is a wash either
    way.

    As for space, 100000 entries x 100 bytes/entry (generous guess at
    average) = 10,000,000 bytes, no big deal with gigabyte memories. So the
    logic goes. A smaller example from my machine with 3.3.

    from sys import getsizeof

    def seqsize(seq):
        "Get size of flat sequence and contents"
        return sum((getsizeof(item) for item in seq), getsizeof(seq))

    import os
    d = os.listdir()
    print(seqsize([1,2,3]), len(d), seqsize(d))
    # 172 45 3128

    The size per entry is relatively short because the two-level directory
    prefix for each path is only about 15 bytes. By using 3.3 rather than
    3.0-3.2, the all-ascii-char unicode paths only take 1 byte per char
    rather than 2 or 4.

    If you disagree with the responses on the issue, after reading them,
    post one yourself with real numbers.

    --
    Terry Jan Reedy
     
    Terry Reedy, Dec 17, 2012
    #13
