walking a directory with very many files

tom · Jun 14, 2009

i can traverse a directory using os.listdir() or os.walk(). but if a
directory has a very large number of files, these methods produce very
large objects taking a lot of memory.

in other languages one can avoid generating such an object by walking
a directory as a liked list. for example, in c, perl or php one can
use opendir() and then repeatedly readdir() until getting to the end
of the file list. it seems this could be more efficient in some
applications.

is there a way to do this in python? i'm relatively new to the
language. i looked through the documentation and tried googling but
came up empty.

Tim Golden · Jun 14, 2009

If you're on Windows, you can use the win32file.FindFilesIterator
function from the pywin32 package. (Which wraps the Win32 API
FindFirstFile / FindNextFile pattern).

TJG

tom · Jun 14, 2009

Tim Golden said:
If you're on Windows, you can use the win32file.FindFilesIterator
function from the pywin32 package. (Which wraps the Win32 API
FindFirstFile / FindNextFile pattern).

thanks, tim.

however, i'm not using windows. freebsd and os x.

Tim Golden · Jun 14, 2009

tom said:
thanks, tim.

however, i'm not using windows. freebsd and os x.

Presumably, if Perl etc. can do it then it should be simple
enough to drop into ctypes and call the same library code, no?
(I'm not a BSD / OS X person, I'm afraid, so perhaps this isn't
so easy...)

TJG

Andre Engels · Jun 14, 2009

What kind of directories are those that just a list of files would
result in a "very large" object? I don't think I have ever seen
directories with more than a few thousand files...

Terry Reedy · Jun 15, 2009

You did not specify version. In Python3, os.walk has become a generator
function. So, to answer your question, use 3.1.

tjr

MRAB · Jun 15, 2009

Christian said:
Some time ago we had a discussion about turning os.listdir() into a
generator. No conclusion was agreed on. We also thought about exposing
the functions opendir(), readdir(), closedir() and friends but as far as
I know and as far as I've checked the C code in Modules/posixmodule.c
none of the functions has been added.

Perhaps if there's a generator it should be called iterdir(). Or would
it be unPythonic to have listdir() and iterdir()? Probably.
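
A minimal sketch of what such a generator could look like. It assumes Python 3.5 or later, where os.scandir() was added to the standard library (well after this thread); scandir returns a lazy iterator over directory entries, much like the opendir()/readdir() loop the OP describes. The name iterdir() is only borrowed from the suggestion above and is not an os function.

```python
import os
from itertools import islice


def iterdir(path="."):
    """Yield entry names in `path` one at a time instead of building a list."""
    # os.scandir() returns a lazy iterator; using it as a context manager
    # (Python 3.6+) closes the underlying directory handle promptly.
    with os.scandir(path) as entries:
        for entry in entries:
            yield entry.name


# Look at the first five names only; no Python list of the whole
# directory is ever created.
for name in islice(iterdir("."), 5):
    print(name)
```

Like readdir(), this omits nothing but "." and "..", and the order of names is whatever the filesystem returns.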

Lawrence D'Oliveiro · Jun 15, 2009

I suppose it depends how well-liked it is. Nerdy lists may work better, but
they tend not to be liked.

Andre Engels said:
What kind of directories are those that just a list of files would
result in a "very large" object? I don't think I have ever seen
directories with more than a few thousand files...

I worked on an application system which, at one point, routinely dealt with
directories containing hundreds of thousands of files. But even that kind of
directory contents only adds up to a few megabytes.

Tim Chase · Jun 15, 2009

Terry Reedy said:
You did not specify version. In Python3, os.walk has become a generator
function. So, to answer your question, use 3.1.

Since at least 2.4, os.walk has itself been a generator.
However, the contents of the directory (the 3rd element of the
yielded tuple) is a list produced by listdir() instead of a
generator. Unless listdir() has been changed to a generator
instead of a list (which other respondents seem to indicate has
not been implemented), this doesn't address the OP's issue of
"lots of files in a single directory".

-tkc
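
For illustration, a hedged sketch of a walk that does not hand back per-directory file lists. It assumes Python 3.6 or later and os.scandir(), neither of which existed when this was written, and the function name iter_files is made up for the example. It is not a drop-in replacement for os.walk, whose interface promises lists.

```python
import os


def iter_files(top):
    """Yield paths of non-directory entries under `top`, one at a time."""
    # Only directories still waiting to be visited are kept in memory;
    # file names are yielded as soon as they are read.
    pending = [top]
    while pending:
        current = pending.pop()
        with os.scandir(current) as entries:
            for entry in entries:
                if entry.is_dir(follow_symlinks=False):
                    pending.append(entry.path)
                else:
                    yield entry.path
```

A caller would simply write "for path in iter_files(some_dir): ..." and can stop early without the rest of a huge directory being materialised.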

Steven D'Aprano · Jun 15, 2009

Andre Engels said:
What kind of directories are those that just a list of files would
result in a "very large" object? I don't think I have ever seen
directories with more than a few thousand files...

You haven't looked very hard

$ pwd
/home/steve/.thumbnails/normal
$ ls | wc -l
33956

And I periodically delete thumbnails, to prevent the number of files
growing to hundreds of thousands.

Hrvoje Niksic · Jun 15, 2009

Terry Reedy said:
You did not specify version. In Python3, os.walk has become a
generator function. So, to answer your question, use 3.1.

os.walk has been a generator function all along, but that doesn't help
OP because it still uses os.listdir internally. This means that it
both creates huge lists for huge directories, and holds on to those
lists until the iteration over the directory (and all subdirectories)
is finished.

In fact, os.walk is not suited for this kind of memory optimization
because yielding a *list* of files (and a separate list of
subdirectories) is specified in its interface. This hasn't changed in
Python 3.1:

    dirs, nondirs = [], []
    for name in names:
        if isdir(join(top, name)):
            dirs.append(name)
        else:
            nondirs.append(name)

    if topdown:
        yield top, dirs, nondirs

Hrvoje Niksic · Jun 15, 2009

Nick Craig-Wood said:
Here is a ctypes generator listdir for unix-like OSes.

ctypes code scares me with its duplication of the contents of system
headers. I understand its use as a proof of concept, or for hacks one
needs right now, but can anyone seriously propose using this kind of
code in a Python program? For example, this seems much more
"Linux-only", or possibly even "32-bit-Linux-only", than "unix-like":

Diez B. Roggisch · Jun 15, 2009

tom said:
i can traverse a directory using os.listdir() or os.walk(). but if a
directory has a very large number of files, these methods produce very
large objects taking a lot of memory.

If we assume the number of files to be a million (which certainly qualifies
as one of the larger directory sizes one encounters...), and the average
filename length to be 20, you'd end up with 20 megs of data.

Is that really a problem on today's multi-gigabyte machines? And we are
talking about a rather freakish case here.

Diez
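
The arithmetic above can be checked with a rough back-of-the-envelope sketch. The 20 MB figure counts only the characters; a Python list of strings also pays a per-object overhead, so the real footprint is larger. The helper name below is invented for the example, and the estimate relies on sys.getsizeof as reported by CPython, so the exact number depends on the interpreter version and platform.

```python
import struct
import sys


def estimate_listing_bytes(n_files, name_len):
    """Return (raw character bytes, rough size of a list of such names)."""
    raw = n_files * name_len
    sample = "x" * name_len
    # Each name costs one string object plus one pointer slot in the list.
    per_name = sys.getsizeof(sample) + struct.calcsize("P")
    total = sys.getsizeof([]) + n_files * per_name
    return raw, total


raw, total = estimate_listing_bytes(1000000, 20)
print("raw characters: %.0f MB" % (raw / 1e6))
print("estimated list: %.0f MB" % (total / 1e6))
```

Even with the overhead included, the result stays far below what a multi-gigabyte machine can hold, which is the point being made.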

Terry Reedy · Jun 15, 2009

Christian said:
I'm sorry to inform you that Python 3.x still returns a list, not a
generator.

<class 'generator'>

However, it is a generator of directory tuples that include a filename
list produced by listdir, rather than a generator of filenames
themselves, as I was thinking. I wish listdir had been changed in 3.0
along with map, filter, and range, but I made no effort and hence cannot
complain.

tjr

Mike Kazantsev · Jun 16, 2009

Terry Reedy said:
<class 'generator'>

However, it is a generator of directory tuples that include a filename
list produced by listdir, rather than a generator of filenames
themselves, as I was thinking. I wish listdir had been changed in 3.0
along with map, filter, and range, but I made no effort and hence cannot
complain.

Why? We have itertools.imap, itertools.ifilter and xrange already.

--
Mike Kazantsev // fraggod.net

Hrvoje Niksic · Jun 16, 2009

Nick Craig-Wood said:
It can be done properly with gccxml though which converts structures
into ctypes definitions.

That sounds interesting.

That said the dirent struct is specified by POSIX so if you get the
correct types for all the individual members then it should be
correct everywhere. Maybe ;-)

AFAIK POSIX specifies the names and types of the members, but not
their order in the structure, nor alignment.
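
To make the portability worry concrete, here is a sketch of a ctypes readdir() loop. It is not the code posted earlier in the thread. The field order and sizes in Dirent are an assumption copied from the usual Linux layout, which is exactly the kind of header duplication being criticised; on FreeBSD or OS X the structure differs and this would read garbage, so the function refuses to run there.

```python
import ctypes
import os
import sys


class Dirent(ctypes.Structure):
    # ASSUMPTION: member order and sizes as commonly found on Linux.
    # POSIX does not fix this layout, so it is not portable.
    _fields_ = [
        ("d_ino", ctypes.c_ulong),
        ("d_off", ctypes.c_long),
        ("d_reclen", ctypes.c_ushort),
        ("d_type", ctypes.c_ubyte),
        ("d_name", ctypes.c_char * 256),
    ]


def iter_dir_names(path):
    """Yield names in `path` via opendir()/readdir()/closedir()."""
    if not sys.platform.startswith("linux"):
        raise NotImplementedError("dirent layout is assumed for Linux only")
    libc = ctypes.CDLL(None, use_errno=True)
    libc.opendir.argtypes = [ctypes.c_char_p]
    libc.opendir.restype = ctypes.c_void_p
    libc.readdir.argtypes = [ctypes.c_void_p]
    libc.readdir.restype = ctypes.POINTER(Dirent)
    libc.closedir.argtypes = [ctypes.c_void_p]
    libc.closedir.restype = ctypes.c_int

    handle = libc.opendir(os.fsencode(path))
    if not handle:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err), path)
    try:
        while True:
            entry = libc.readdir(handle)
            if not entry:  # NULL: end of directory (read errors are not distinguished)
                break
            name = entry.contents.d_name  # bytes up to the first NUL
            if name not in (b".", b".."):
                yield os.fsdecode(name)
    finally:
        libc.closedir(handle)
```

Generating the structure definition from the system headers, as suggested with gccxml, would remove the hand-copied layout but not the need to get it right per platform.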

thebjorn · Jun 16, 2009

Steven D'Aprano said:
You haven't looked very hard

$ pwd
/home/steve/.thumbnails/normal
$ ls | wc -l
33956

And I periodically delete thumbnails, to prevent the number of files
growing to hundreds of thousands.

Steven

Not proud of this, but...:

[django] www4:~/datakortet/media$ ls bfpbilder|wc -l
174197

all .jpg files between 40 and 250KB with the path stored in a database
field... *sigh*

Oddly enough, I'm relieved that others have had similar folder sizes
(I've been waiting for this to burst to the top of my list for a while
now).

Bjorn

Lawrence D'Oliveiro · Jun 17, 2009

thebjorn said:
Not proud of this, but...:

[django] www4:~/datakortet/media$ ls bfpbilder|wc -l
174197

all .jpg files between 40 and 250KB with the path stored in a database
field... *sigh*

Why not put the images themselves into database fields?

Oddly enough, I'm relieved that others have had similar folder sizes ...

One of my past projects had 400000-odd files in a single folder. They were
movie frames, to allow assembly of movie sequences on demand.

Mike Kazantsev · Jun 17, 2009

For both scenarios:
Why not use hex representation of md5/sha1-hashed id as a path,
arranging them like /path/f/9/e/95ea4926a4 ?

That way, you won't have to deal with many-files-in-path problem, and,
since there's thousands of them anyway, name readability shouldn't
matter.

In fact, on modern filesystems it doesn't matter whether you are accessing
/path/f9e95ea4926a4 with a million files in /path or /path/f/9/e/95ea
with only a hundred of them in each path. The former case (all-in-one-path)
would even outperform the latter with ext3 or reiserfs by a small
margin.
Sadly, that's not the case with filesystems like FreeBSD ufs2 (at least
in the sixth branch), so it's better to play safe and create subdirs if the
app might be run on different machines than to keep everything in one
path.

--
Mike Kazantsev // fraggod.net
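
A small sketch of the hashed-path layout described above, under a few assumptions: SHA-1 is used as the hash (md5 would work the same way), three one-character directory levels are taken from the front of the hex digest, and the rest of the digest becomes the file name. The function name and the sample id are invented for the example.

```python
import hashlib
from pathlib import PurePosixPath


def sharded_path(root, file_id, levels=3):
    """Map an id to root/h0/h1/h2/<rest of hex digest>."""
    digest = hashlib.sha1(file_id.encode("utf-8")).hexdigest()
    # One directory per leading hex character, then the remainder as the name.
    parts = list(digest[:levels]) + [digest[levels:]]
    return PurePosixPath(root, *parts)


print(sharded_path("/path", "frame-000123"))
```

With three hex levels there are 16**3 = 4096 leaf directories, so a million files average out to a few hundred per directory.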

Lie Ryan · Jun 17, 2009

Mike said:
For both scenarios:
Why not use hex representation of md5/sha1-hashed id as a path,
arranging them like /path/f/9/e/95ea4926a4 ?

That way, you won't have to deal with many-files-in-path problem, and,
since there's thousands of them anyway, name readability shouldn't
matter.

In fact, on modern filesystems it doesn't matter whether you are accessing
/path/f9e95ea4926a4 with a million files in /path or /path/f/9/e/95ea
with only a hundred of them in each path. The former case (all-in-one-path)
would even outperform the latter with ext3 or reiserfs by a small
margin.
Sadly, that's not the case with filesystems like FreeBSD ufs2 (at least
in the sixth branch), so it's better to play safe and create subdirs if the
app might be run on different machines than to keep everything in one
path.

It might not matter for the filesystem, but the file explorer (and ls)
would still suffer. Subfolder structure would be much better, and much
easier to navigate manually when you need to.
