Faster os.walk()

Discussion in 'Python' started by fuzzylollipop, Apr 20, 2005.

  1. I am trying to get the number of bytes used by files in a directory.
    I am using a large directory ( lots of stuff checked out of multiple
    large cvs repositories ) and there is lots of wasted time doing
    multiple os.stat() on dirs and files from different methods.
     
    fuzzylollipop, Apr 20, 2005
    #1
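For reference, the baseline being discussed looks something like the following (a minimal sketch; the function name is illustrative). Each os.path.getsize() call is an extra os.stat() on top of the stat calls os.walk() already performs internally:

```python
import os

def tree_size_walk(top):
    """Total bytes of all files under top, the straightforward way.

    os.path.getsize() does an os.stat() per file, on top of the
    stat calls os.walk() makes while classifying entries.
    """
    total = 0
    for dirpath, dirnames, filenames in os.walk(top):
        for name in filenames:
            try:
                total += os.path.getsize(os.path.join(dirpath, name))
            except OSError:
                pass  # vanished or unreadable; skip it
    return total
```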

  2. Peter Hansen

    Laszlo Zsolt Nagy wrote:
    > fuzzylollipop wrote:
    >
    >> I am trying to get the number of bytes used by files in a directory.
    >> I am using a large directory ( lots of stuff checked out of multiple
    >> large cvs repositories ) and there is lots of wasted time doing
    >> multiple os.stat() on dirs and files from different methods.
    >>
    >>

    > Do you need a precise value, or are you satisfied with approximations too?
    > Under which operating system? The 'du' command can be your friend.


    How can "du" find the sizes without doing os.stat() on each file?
     
    Peter Hansen, Apr 20, 2005
    #2

  3. fuzzylollipop wrote:

    >I am trying to get the number of bytes used by files in a directory.
    >I am using a large directory ( lots of stuff checked out of multiple
    >large cvs repositories ) and there is lots of wasted time doing
    >multiple os.stat() on dirs and files from different methods.
    >
    >

    Do you need a precise value, or are you satisfied with approximations too?
    Under which operating system? The 'du' command can be your friend.

    man du

    Best,

    Laci 2.0



    --
    _________________________________________________________________
    Laszlo Nagy web: http://designasign.biz
    IT Consultant mail:

    Python forever!
     
    Laszlo Zsolt Nagy, Apr 20, 2005
    #3
  4. du is faster than my code that does the same thing in Python; it is
    highly optimized at the OS level.

    that said, I profiled spawning an external process to call du and over
    the large number of times I need to do this it is actually slower to
    execute du externally than my os.walk() implementation.

    du does not return the value I need anyway; I need file bytes only, not
    the raw blocks consumed, which is what du reports. Also, I need to
    filter out some files and dirs.

    After extensive profiling I found out that, the way os.walk() is
    implemented, it calls os.stat() on the dirs and files multiple times,
    and that is where all the time is going.

    I guess I need something like the statcache module, but that is
    deprecated and probably wouldn't fix my problem anyway. I only walk the
    dir once and then cache all the byte counts; it is the multiple
    os.stat() calls that os.walk() kicks off internally, via isdir() and
    getsize() and what not, that cost the time.

    just wanted to check and see if anyone had already solved this problem.
     
    fuzzylollipop, Apr 20, 2005
    #4
  5. How about rerouting stdout/err and "popen()-ing" something like

    /bin/find -name '*' -exec
    a_script_or_cmd_that_does_what_i_want_with_the_file {} \;

    ?

    Regards,

    Philippe




    fuzzylollipop wrote:

    > du is faster than my code that does the same thing in Python; it is
    > highly optimized at the OS level.
    >
    > that said, I profiled spawning an external process to call du and over
    > the large number of times I need to do this it is actually slower to
    > execute du externally than my os.walk() implementation.
    >
    > du does not return the value I need anyway; I need file bytes only, not
    > the raw blocks consumed, which is what du reports. Also, I need to
    > filter out some files and dirs.
    >
    > After extensive profiling I found out that, the way os.walk() is
    > implemented, it calls os.stat() on the dirs and files multiple times,
    > and that is where all the time is going.
    >
    > I guess I need something like the statcache module, but that is
    > deprecated and probably wouldn't fix my problem anyway. I only walk the
    > dir once and then cache all the byte counts; it is the multiple
    > os.stat() calls that os.walk() kicks off internally, via isdir() and
    > getsize() and what not, that cost the time.
    >
    > just wanted to check and see if anyone had already solved this problem.
     
    Philippe C. Martin, Apr 20, 2005
    #5
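Philippe's suggestion can be sketched with one child process for the whole tree rather than one per file (assumes GNU find and its -printf extension; the function name is made up):

```python
import subprocess

def tree_size_find(top):
    """Total bytes of regular files under top via a single GNU find call.

    -type f skips directories; -printf '%s\n' prints each file's size.
    """
    out = subprocess.run(
        ["find", top, "-type", "f", "-printf", "%s\n"],
        capture_output=True, text=True, check=True,
    ).stdout
    return sum(int(line) for line in out.split())
```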
  6. Kent Johnson

    fuzzylollipop wrote:
    > After extensive profiling I found out that, the way os.walk() is
    > implemented, it calls os.stat() on the dirs and files multiple times,
    > and that is where all the time is going.


    os.walk() is pretty simple; you could copy it and make your own version that calls os.stat() just
    once for each item. The dirnames and filenames lists it yields could be lists of (name,
    os.stat(path)) tuples, so you would have the sizes available.

    Kent
     
    Kent Johnson, Apr 20, 2005
    #6
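A minimal sketch of Kent's suggestion (names are illustrative): a walk that lstat()s each entry exactly once and yields the stat result alongside the name, so callers never have to re-stat:

```python
import os
import stat

def stat_walk(top):
    """Like os.walk(), but yields (name, stat_result) pairs.

    Each entry is lstat()-ed exactly once; lstat() avoids
    following symlinks into other trees.
    """
    dirs, files = [], []
    for name in os.listdir(top):
        st = os.lstat(os.path.join(top, name))
        if stat.S_ISDIR(st.st_mode):
            dirs.append((name, st))
        else:
            files.append((name, st))
    yield top, dirs, files
    for name, st in dirs:
        yield from stat_walk(os.path.join(top, name))

def tree_bytes(top):
    """Total bytes, with one stat per entry in the whole tree."""
    return sum(st.st_size
               for _top, _dirs, files in stat_walk(top)
               for _name, st in files)
```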
  7. fuzzylollipop <> wrote:
    > I am trying to get the number of bytes used by files in a directory.
    > I am using a large directory ( lots of stuff checked out of multiple
    > large cvs repositories ) and there is lots of wasted time doing
    > multiple os.stat() on dirs and files from different methods.


    I presume you are saying that os.walk() has to stat() each file to
    see whether it is a directory or not, and that you are stat()-ing each
    file to count its bytes?

    If you want to just get away with the one stat() you'll have to
    re-implement os.walk yourself.

    Another trick for speeding up lots of stats is to chdir() to the
    directory you are processing, and then just use the leafnames in
    stat(). The OS then doesn't have to spend ages parsing lots of paths.

    However, even if you implement both of the above, I don't reckon you'll
    see a lot of improvement, given that decent OSes have a very good cache
    for stat results, and that parsing file names is very quick too,
    compared to Python.

    --
    Nick Craig-Wood <> -- http://www.craig-wood.com/nick
     
    Nick Craig-Wood, Apr 20, 2005
    #7
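Nick's chdir() trick, sketched for a single directory (the function name is illustrative; note that chdir() is process-global, so this is not thread-safe):

```python
import os
import stat

def dir_size_chdir(path):
    """Total bytes of regular files directly inside path.

    After chdir(), each lstat() gets a bare leaf name, so the
    kernel never has to parse the long path prefix.
    """
    old = os.getcwd()
    os.chdir(path)
    try:
        total = 0
        for name in os.listdir("."):
            st = os.lstat(name)
            if stat.S_ISREG(st.st_mode):
                total += st.st_size
        return total
    finally:
        os.chdir(old)  # always restore the working directory
```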
  8. If you're trying to track changes to files (e.g. by comparing
    current size with previously recorded size), fam might obviate a lot of
    filesystem traversal.

    http://python-fam.sourceforge.net/
     
    Lonnie Princehouse, Apr 20, 2005
    #8
  9. ding, ding, ding, we have a winner.

    One of the guys on the team did just this: he re-implemented the
    os.walk() logic and embedded the S_IFDIR, S_IFMT and S_IFREG checks
    directly into the traversal code.

    This is all going to run on Unix or Linux machines in production, so
    this is not a big deal.
    All in all we went from 64k+ function calls for 7070 files/dirs to 1
    PER dir/file.

    The new code is a little more than twice as fast.

    Huge improvement!
     
    fuzzylollipop, Apr 21, 2005
    #9
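The thread doesn't include the winning code, but a traversal that reads the file type straight out of st_mode with S_IFMT, one lstat() per entry, with the filtering folded in, might look like this (a sketch; the names are made up):

```python
import os
from stat import S_IFMT, S_IFDIR, S_IFREG

def tree_size_direct(top, skip=frozenset()):
    """Total bytes of regular files under top, one lstat() per entry.

    The file type comes straight from st_mode via S_IFMT; names in
    `skip` (files or dirs) are filtered out of the traversal.
    """
    total = 0
    for name in os.listdir(top):
        if name in skip:
            continue
        st = os.lstat(os.path.join(top, name))
        fmt = S_IFMT(st.st_mode)
        if fmt == S_IFDIR:
            total += tree_size_direct(os.path.join(top, name), skip)
        elif fmt == S_IFREG:
            total += st.st_size
    return total
```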