Optimizing tips for os.listdir


Thomas

I'm doing this:

[os.path.join(path, p) for p in os.listdir(path) if \
os.path.isdir(os.path.join(path, p))]

to get a list of folders in a given directory, skipping all plain
files. When used on folders with lots of files, it takes a rather long
time to finish. Since it's just a listdir, filtering out the plain
files, and a couple of joins, I didn't think it would take so long.

Is there a faster way of doing stuff like this?

Best regards,
Thomas
 

Nick Craig-Wood

Thomas said:
I'm doing this:

[os.path.join(path, p) for p in os.listdir(path) if \
os.path.isdir(os.path.join(path, p))]

to get a list of folders in a given directory, skipping all plain
files. When used on folders with lots of files, it takes a rather long
time to finish. Since it's just a listdir, filtering out the plain
files, and a couple of joins, I didn't think it would take so long.

How many files, what OS and what filing system?

Under a Unix-based OS the above will translate to one
opendir()/readdir()/closedir() sequence and one stat() for each file.
There isn't a quicker way in terms of system calls, AFAIK.
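
Spelled out, that comprehension is doing the equivalent of the sketch
below, which just makes the per-entry stat() visible (same logic,
untested here):

import os, stat

def listdirs(path):
    dirs = []
    for name in os.listdir(path):   # one opendir/readdir/closedir sweep
        full = os.path.join(path, name)
        try:
            st = os.stat(full)      # one stat() per directory entry
        except OSError:
            continue                # entry vanished in the meantime
        if stat.S_ISDIR(st.st_mode):
            dirs.append(full)
    return dirs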

However, some filing systems handle lots of files in a directory better
than others. E.g. reiserfs is much better than ext2/3 for this purpose
(ext3 has a dirhash module in the works to fix this, though).

E.g. on my Linux box, running ext3, with various numbers of files in a
directory:

/usr/lib/python2.3/timeit.py -s 'import os; path="."' \
'[os.path.join(path, p) for p in os.listdir(path) if \
os.path.isdir(os.path.join(path, p))]'

Files     Time

10        3.01e+02 usec per loop
100       2.74e+03 usec per loop
1000      2.73e+04 usec per loop
10000     2.76e+05 usec per loop
100000    2.81e+06 usec per loop

Which is pretty linear... much more so than I expected!

The above timings ignore the effect of caching - will the directory
you are enumerating be hot in the cache?

Something similar may apply under Windows but I don't know ;-)
 

Peter Otten

Thomas said:
I'm doing this:

[os.path.join(path, p) for p in os.listdir(path) if \
os.path.isdir(os.path.join(path, p))]

to get a list of folders in a given directory, skipping all plain
files. When used on folders with lots of files, it takes a rather long
time to finish. Since it's just a listdir, filtering out the plain
files, and a couple of joins, I didn't think it would take so long.

Is there a faster way of doing stuff like this?

If you are on Windows, Michael Peuser reported a significant speed-up
from using win32api.FindFiles() instead of the portable calls:

http://groups.google.com/[email protected]
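
For the archives, the idea looks roughly like this -- an untested
sketch assuming the pywin32 package, where FindFiles() returns
WIN32_FIND_DATA tuples with the attribute mask at index 0 and the
file name at index 8:

import os
import win32api, win32con

def listdirs_win32(path):
    # One FindFirstFile/FindNextFile sweep returns each name together
    # with its attributes, so no per-entry stat() is needed.
    dirs = []
    for info in win32api.FindFiles(os.path.join(path, '*')):
        attrs, name = info[0], info[8]
        if attrs & win32con.FILE_ATTRIBUTE_DIRECTORY and name not in ('.', '..'):
            dirs.append(os.path.join(path, name))
    return dirs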

Peter
 

Bengt Richter

Thomas said:
I'm doing this:

[os.path.join(path, p) for p in os.listdir(path) if \
os.path.isdir(os.path.join(path, p))]
You ought to be able to gain a little by hoisting the os.path.xxx
attribute lookups for join and isdir out of the loop. E.g. (not tested):

opj = os.path.join; oisd = os.path.isdir
[opj(path, p) for p in os.listdir(path) if oisd(opj(path, p))]

But it seems like you are asking the OS to chase through full paths on
every isdir operation, rather than just making its current working
directory the directory you are interested in and testing names there.
E.g. (untested):

savedir = os.getcwd()   # remember where we were
os.chdir(path)          # so isdir() sees bare names, no path walking
dirs = [opj(path, p) for p in os.listdir('.') if oisd(p)]
os.chdir(savedir)       # restore the original working directory

I'd be curious to know how much difference the above would make.
Nick Craig-Wood said:
How many files, what OS and what filing system?

Under a Unix-based OS the above will translate to one
opendir()/readdir()/closedir() sequence and one stat() for each file.
There isn't a quicker way in terms of system calls, AFAIK.
Except I would think chdir could help there too?
However, some filing systems handle lots of files in a directory better
than others. E.g. reiserfs is much better than ext2/3 for this purpose
(ext3 has a dirhash module in the works to fix this, though).

E.g. on my Linux box, running ext3, with various numbers of files in a
directory:

/usr/lib/python2.3/timeit.py -s 'import os; path="."' \
'[os.path.join(path, p) for p in os.listdir(path) if \
os.path.isdir(os.path.join(path, p))]'
path="." might be a special case in some of that though. I would
try a long absolute path for comparison. (What did the OP have as
actual use case?)
Files     Time

10        3.01e+02 usec per loop
100       2.74e+03 usec per loop
1000      2.73e+04 usec per loop
10000     2.76e+05 usec per loop
100000    2.81e+06 usec per loop

Which is pretty linear... much more so than I expected!

The above timings ignore the effect of caching - will the directory
you are enumerating be hot in the cache?
Even if so, I doubt the OS finds it via a hash of the full path rather
than checking that every element of the path exists and is a directory.
I would think that could be a dangerous shortcut, whereas chdir and
using the cwd should be fast and safe, and most likely guarantees the
content is cached.

Just guessing, though ;-)
Nick Craig-Wood said:
Something similar may apply under Windows but I don't know ;-)

Regards,
Bengt Richter
 

Brian Beck

Thomas said:
I'm doing this:

[os.path.join(path, p) for p in os.listdir(path) if \
os.path.isdir(os.path.join(path, p))]

I don't have much time to experiment, but I came up with a method using
os.walk:

import os

def childdirs(path):
    for root, dirs, files in os.walk(path):
        return [os.path.join(root, name) for name in dirs]

This appears to be about the same speed as your code -- for all I know,
os.walk could be doing exactly the same thing under the hood. But
perhaps someone could use this and come up with a one-liner -- as of
now I can't think of a way to do it properly without wrapping it in a
function (though see the sketch below).
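
One possibility (untested): os.walk yields the top directory first, so
pulling just the first item off the generator collapses the function
into a single expression, assuming path is already bound:

# same result as childdirs(path); .next() grabs os.walk's first
# (root, dirs, files) tuple, and [1] is the list of subdirectory names
[os.path.join(path, d) for d in os.walk(path).next()[1]]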
 

Peter Hansen

Bengt said:
But it seems like you are asking the OS to chase through full paths on
every isdir operation, rather than just making its current working
directory the directory you are interested in and testing names there.
E.g. (untested):

savedir = os.getcwd()
os.chdir(path)
dirs = [opj(path, p) for p in os.listdir('.') if oisd(p)]
os.chdir(savedir)

One aspect of this which needs mentioning is that it could cause
incorrect behaviour in other parts of the application if there are
multiple threads relying on the current working directory... the CWD
is global to the process, AFAIK.
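
Even in single-threaded code it's probably worth doing the restore in a
try/finally, so an exception partway through doesn't leave the process
stranded in the wrong directory. A minimal sketch of the same approach:

import os

def listdirs_via_chdir(path):
    savedir = os.getcwd()
    os.chdir(path)
    try:
        # bare names, so isdir() needn't walk the full path each time
        return [os.path.join(path, p) for p in os.listdir('.')
                if os.path.isdir(p)]
    finally:
        os.chdir(savedir)   # restored even if listdir()/isdir() raises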

-Peter
 

G. S. Hayes

Nick Craig-Wood said:
Thomas said:
[os.path.join(path, p) for p in os.listdir(path) if \
os.path.isdir(os.path.join(path, p))]

to get a list of folders in a given directory, skipping all plain
files. When used on folders with lots of files, it takes a rather long
time to finish. Since it's just a listdir, filtering out the plain
files, and a couple of joins, I didn't think it would take so long.
How many files, what OS and what filing system?

Under a Unix-based OS the above will translate to one
opendir()/readdir()/closedir() sequence and one stat() for each file.
There isn't a quicker way in terms of system calls, AFAIK.

Under Linux, readdir() returns a struct dirent that has a d_type
member indicating the file type (DT_DIR for directories) so you can
avoid calling stat() on each file. I thought some BSD systems did
this as well.

I don't see how to get at this information from Python without making
the extra syscalls.
 

Kjetil Torgrim Homme

[Nick Craig-Wood]:
[Thomas]:
I'm doing this:

[os.path.join(path, p) for p in os.listdir(path) \
if os.path.isdir(os.path.join(path, p))]

Under a Unix-based OS the above will translate to one
opendir()/readdir()/closedir() sequence and one stat() for each file.
There isn't a quicker way in terms of system calls, AFAIK.

there's a well-known trick for Unix file systems [1]:

if a directory has a link count of exactly 2, there are no
subdirectories. If there were a subdirectory, it would contain a ".."
entry referring to its parent, raising the parent's link count to 3.

(it's not clear to me whether this is helpful for the OP, though.)


[1] the behaviour is not mandated by POSIX. E.g., a mounted ISO 9660
file system will not necessarily obey this "rule". GNU find relies on
it unless you explicitly specify "-noleaf", so it works pretty well in
practice.
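
In Python the test is a single stat() on the parent directory -- a
minimal sketch, subject to the same caveat:

import os

def might_have_subdirs(path):
    # On traditional Unix file systems a directory's link count is
    # 2 + number of subdirectories: one link from its name in the
    # parent, one from its own '.', and one from each child's '..'.
    return os.stat(path).st_nlink > 2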
 

Nick Craig-Wood

Bengt Richter said:
You ought to be able to gain a little by hoisting the os.path.xxx
attribute lookups for join and isdir out of the loop. E.g. (not tested):

opj = os.path.join; oisd = os.path.isdir
[opj(path, p) for p in os.listdir(path) if oisd(opj(path, p))]

But it seems like you are asking the OS to chase through full paths on
every isdir operation, rather than just making its current working
directory the directory you are interested in and testing names there.
E.g. (untested):

savedir = os.getcwd()
os.chdir(path)
dirs = [opj(path, p) for p in os.listdir('.') if oisd(p)]
os.chdir(savedir)

With 1000 files in the directory:

# 1) Original using '.'
/usr/lib/python2.3/timeit.py -s 'import os; path="."' \
'[os.path.join(path, p) for p in os.listdir(path) if os.path.isdir(os.path.join(path, p))]'
10 loops, best of 3: 2.69e+04 usec per loop

# 2) Original with long path
/usr/lib/python2.3/timeit.py -s 'import os;
path="/tmp/a/b/c/d/e/f/g/h/i/j/k/l/m/n/o/p/q/r/s/t/u/v/w/x/y/z"' \
'[os.path.join(path, p) for p in os.listdir(path) if os.path.isdir(os.path.join(path, p))]'
10 loops, best of 3: 4.16e+04 usec per loop

# 3) Using cwd
/usr/lib/python2.3/timeit.py -s 'import os' \
'[os.path.join(path, p) for p in os.listdir(".") if os.path.isdir(p)]'
100 loops, best of 3: 1.85e+04 usec per loop
Bengt Richter said:
Even if so, I doubt the OS finds it via a hash of the full path rather
than checking that every element of the path exists and is a directory.
I would think that could be a dangerous shortcut

It is. Linux will look through each path entry. However, they will be
hot in the dcache, so it doesn't take much time -- hence the relatively
small difference between 1) and 2).

I expect the main difference between 1) and 3) is that it contains one
fewer os.path.join() call:

/usr/lib/python2.3/timeit.py -s 'import os;' 'os.path.join("a", "b")'
100000 loops, best of 3: 7.34 usec per loop

It's executed 1000 times above, which is 7340 usec. The difference
between 1) and 3) is 8400 usec -- pretty close!
 

G. S. Hayes

G. S. Hayes said:
Under Linux, readdir() returns a struct dirent that has a d_type
member indicating the file type (DT_DIR for directories) so you can
avoid calling stat() on each file. I thought some BSD systems did
this as well.

Off-topic since it's not really Python-related (though I guess Python
might want to consider exposing this functionality in a portable way
eventually):

As a quick follow-up: with 10000 files on my machine it takes about
twice as long to use stat to get this information as to access the
d_type field. And it costs an extra 10000 syscalls (the d_type version
makes about 93 syscalls total, mostly standard program startup/shutdown
costs like mapping in shared libs, flushing output on exit, etc.).

On the other hand, they both execute in under a second. So for most
programs the difference in speed is probably negligible, and the
programming cost of portably choosing which method you want to use
probably isn't worth it in general (I could maybe see it for
specialized applications).
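
A note for readers of this archive: Python did eventually expose this
portably. os.scandir(), added in Python 3.5 by PEP 471, returns
DirEntry objects whose is_dir() is answered from readdir()'s d_type
where the OS provides it, so the common case needs no per-entry stat():

import os

def childdirs(path):
    # DirEntry.is_dir() is answered from d_type where available,
    # avoiding a stat() call per entry
    return [entry.path for entry in os.scandir(path) if entry.is_dir()]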
 
