Walking a directory with very many files


Mike Kazantsev

It might not matter to the filesystem, but the file explorer (and ls)
would still suffer. A subfolder structure would be much better, and much
easier to navigate manually when you need to.

It's an insane idea to navigate any structure with hash-based names
and hundreds of thousands of files *manually*: "What do we have here?
Hashies?" ;)

 

Lawrence D'Oliveiro

Quoting the earlier exchange:

Not proud of this, but...:

[django] www4:~/datakortet/media$ ls bfpbilder|wc -l
174197

all .jpg files between 40 and 250KB with the path stored in a
database field... *sigh*

Why not put the images themselves into database fields?

Oddly enough, I'm relieved that others have had similar folder
sizes ...

One of my past projects had 400000-odd files in a single folder. They
were movie frames, to allow assembly of movie sequences on demand.

For both scenarios:
Why not use the hex representation of an md5/sha1-hashed id as a path,
arranging them like /path/f/9/e/95ea4926a4?

That way, you won't have to deal with the many-files-per-directory
problem ...
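
A minimal sketch of that layout in Python (hashlib is in the stdlib;
the store_path name, the three-level fan-out, and the example id are
illustrative assumptions, not something specified in the thread):

import os, hashlib

def store_path(root, key):
    # Hash the id and fan files out over nested single-character
    # directories, so no single directory holds too many entries.
    digest = hashlib.sha1(key).hexdigest()
    return os.path.join(root, digest[0], digest[1], digest[2], digest[3:])

print store_path('/path', 'some-file-id')
# -> something like /path/f/9/e/95ea4926a4...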

Why is that a problem?
 

Mike Kazantsev

Why is that a problem?

So you can os.listdir them?
Don't ask me what for, though, since that's the original question.
Also, not every filesystem still in use handles this situation
efficiently; see my original post.

 

Mike Kazantsev

Why should you have a problem os.listdir'ing lots of files?

I shouldn't, and I don't ;)

 

Mike Kazantsev

Then why did you suggest that there was a problem being able to os.listdir
them?

I didn't; the OP did, and that's what the topic "walking a directory
with many files" is about.
I wonder whether you're unable to read past the first line, trying to
make some point, or just some kind of alternatively-gifted (i.e.
brain-handicapped) person who keeps interpreting posts without their
context like that.

 

Asun Friere

What kind of directories are these, where a mere list of files would
result in a "very large" object? I don't think I have ever seen
a directory with more than a few thousand files...


(asun@lucrezia:~/pit/lsa/act:5)$ ls -1 | wc -l
142607

There, you've seen one with 142 thousand now! :p
 

Lie Ryan

Mike said:
It's an insane idea to navigate any structure with hash-based names
and hundreds of thousands of files *manually*: "What do we have here?
Hashies?" ;)

Like... when you're trying to debug code that raises an error on a
specific file...

Yeah, it might be possible to just mv the file from outside, but not
being able to enter a directory just because you've got too many files
in it is kind of silly.
 

Lawrence D'Oliveiro

I didn't; the OP did ...

Then why did you reply to my question "Why is that a problem?" with "So that
you can os.listdir them?", if you didn't think there was a problem (see
above)?
 

Lawrence D'Oliveiro

Ethan said:
He didn't ...

He replied to my question "Why is that a problem?" with "So you can
os.listdir them?". Why reply with an explanation of why it's a problem if
you don't think it's a problem?
 

Lawrence D'Oliveiro

Lie Ryan said:
Yeah, it might be possible to just mv the file from outside, but not
being able to enter a directory just because you've got too many files
in it is kind of silly.

Sounds like a problem with your file/directory-manipulation tools.
 

Mike Kazantsev

Then why did you reply to my question "Why is that a problem?" with
"So that you can os.listdir them?", if you didn't think there was a
problem (see above)?

Why do you think that if I didn't suggest there is a problem, I think
there is no problem?

I do think there might be such a problem, and even I may have to face
it someday. So, out of sheer curiosity as to how much more ridiculous
this topic can get, I'll try to rephrase and extend what I wrote in the
first place:


Why would you want to listdir them?
I can imagine at least one simple scenario: you had some nasty crash
and you want to check that every file has a corresponding, valid db
record.

What's the problem with listdir if there are 10^x of them?
Well, imagine that the db record also holds the file modification time
(say, the files are some kind of cache), so not only do you need to
compare the listdir results with the db, you also have to os.stat every
file, and some filesystems will do that very slowly with so many of
them in one place.
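
A rough sketch of that check (the db_records dict stands in for
whatever the real database holds; it and the check_cache_dir name are
made up for illustration):

import os

def check_cache_dir(path, db_records):
    # db_records: dict mapping filename -> expected mtime.
    on_disk = set(os.listdir(path))        # the potentially huge list
    missing = set(db_records) - on_disk    # in db, not on disk
    orphans = on_disk - set(db_records)    # on disk, not in db
    stale = []
    for name in on_disk & set(db_records):
        # One stat() per file -- the part that crawls on filesystems
        # that handle huge directories badly.
        mtime = os.stat(os.path.join(path, name)).st_mtime
        if mtime != db_records[name]:
            stale.append(name)
    return missing, orphans, stale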


Now, I think I made this point in the first answer, no?

Of course you can make it more ridiculous with your
I-can-talk-away-any-problem-I-can't-see-or-solve approach, by asking
"why would you want to use such filesystems?", "why do you have to use
FreeBSD?", "why do you have to work for such an employer?", "why do you
have to eat?" etc., but you know, sometimes it's easier and better for
the project/work just to solve a problem than to talk everyone else out
of it just because you don't like an otherwise acceptable solution.

 

Lie Ryan

Lawrence said:
Sounds like a problem with your file/directory-manipulation tools.

Try an `ls` on a folder with 10000+ files.

See how long it takes to print all the filenames.

OK, now pipe ls to less, and take three days to browse through all the
filenames to locate the file you want to see.

The file manipulation tool may not have a problem with it; it's the
user who would have a hard time sorting through the huge number of
files.

Even with glob and grep, some types of queries are just too difficult,
or it is plain silly to write a full-fledged one-time-use program just
to locate a few files.
 

Lawrence D'Oliveiro

Why do you think that if I didn't suggest there is a problem, I think
there is no problem?

It wasn't that you didn't suggest there was a problem, but that you
suggested a "solution" as though there were a problem.

Why would you want to listdir them?

It's a common need, to find out what's in a directory.

I can imagine at least one simple scenario: you had some nasty crash
and you want to check that every file has a corresponding, valid db
record.

But why would that be relevant to this case?
 

Lawrence D'Oliveiro

Lie Ryan said:
try an `ls` on a folder with 10000+ files.

See how long is needed to print all the files.

As I've mentioned elsewhere, I had scripts routinely dealing with
directories containing around 400,000 files.

OK, now pipe ls to less, and take three days to browse through all the
filenames to locate the file you want to see.

Sounds like you're approaching the issue with a GUI-centric mentality,
which is completely hopeless at dealing with this sort of situation.
 

Steven D'Aprano

Lawrence said:
Sounds like you're approaching the issue with a GUI-centric mentality,
which is completely hopeless at dealing with this sort of situation.

Piping the output of ls to less is a GUI-centric mentality?
 

rkl

I can traverse a directory using os.listdir() or os.walk(), but if a
directory has a very large number of files, these methods produce very
large objects taking up a lot of memory.

In other languages one can avoid generating such an object by walking
a directory as a linked list. For example, in C, Perl or PHP one can
use opendir() and then repeatedly readdir() until getting to the end
of the file list. It seems this could be more efficient in some
applications.

Is there a way to do this in Python? I'm relatively new to the
language. I looked through the documentation and tried googling, but
came up empty.

I might be a little late with my comment here.

David Beazley, in his PyCon 2008 presentation "Generator Tricks
for Systems Programmers", had this very elegant example of handling an
unlimited number of files:


import os, fnmatch

def gen_find(filepat, top):
    """gen_find(filepat, top) - find matching files in a directory tree,
    starting the search from top

    expects: a file pattern as a string, and a directory path as a string
    yields: a sequence of filenames (including paths)
    """
    for path, dirlist, filelist in os.walk(top):
        for name in fnmatch.filter(filelist, filepat):
            yield os.path.join(path, name)


for file in gen_find('*.py', '/'):
    print file
 

Tim Golden

rkl said:
I might be a little late with my comment here.

David Beazley, in his PyCon 2008 presentation "Generator Tricks
for Systems Programmers", had this very elegant example of handling an
unlimited number of files:


David Beazley's generator stuff is definitely worth recommending.
I think the issue here is that anything which ultimately uses
os.listdir (and os.walk does) is bound by the fact that it will
create a long list of every file before handing it back. Certainly
there are techniques (someone posted a ctypes wrapper for opendir;
I recommended FindFirstFile/FindNextFile on Windows) which could be
applied, but those are all outside the stdlib.

TJG
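
For reference, a rough sketch of such a ctypes wrapper for Linux (the
dirent field layout below matches x86-64 glibc, and the iterdir name
is made up; other platforms need different definitions, so treat this
as illustrative only):

import ctypes

class dirent(ctypes.Structure):
    # Field layout for x86-64 Linux/glibc; other platforms differ.
    _fields_ = [
        ("d_ino", ctypes.c_ulong),
        ("d_off", ctypes.c_ulong),
        ("d_reclen", ctypes.c_ushort),
        ("d_type", ctypes.c_ubyte),
        ("d_name", ctypes.c_char * 256),
    ]

libc = ctypes.CDLL("libc.so.6")
libc.opendir.argtypes = [ctypes.c_char_p]
libc.opendir.restype = ctypes.c_void_p
libc.readdir.argtypes = [ctypes.c_void_p]
libc.readdir.restype = ctypes.POINTER(dirent)
libc.closedir.argtypes = [ctypes.c_void_p]

def iterdir(path):
    # Yield names one at a time, never building the whole list.
    handle = libc.opendir(path)
    if not handle:
        raise OSError("cannot open directory %r" % path)
    try:
        while True:
            entry = libc.readdir(handle)
            if not entry:               # NULL pointer: end of directory
                break
            name = entry.contents.d_name
            if name not in (".", ".."):
                yield name
    finally:
        libc.closedir(handle)

for name in iterdir("."):
    print name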
 
