Limitation of os.walk

K

kj

I want implement a function that walks through a directory tree
and performs an analsysis of all the subdirectories found. The
task has two essential requirements that, AFAICT, make it impossible
to use os.walk for this:

1. I need to be able to prune certain directories from being visited.

2. The analysis on each directory can be performed only after it
has been performed on all its subdirectories.

Unless I'm missing something, to do (1), os.walk must be run with
topdown=True, whereas to do (2) it must be run with topdown=False.

Is there a work around that I'm missing?

TIA!

~K

PS: I never understood why os.walk does not support hooks for key
events during such a tree traversal.
 
T

Tim Chase

I want implement a function that walks through a directory tree
and performs an analsysis of all the subdirectories found. The
task has two essential requirements that, AFAICT, make it impossible
to use os.walk for this:

1. I need to be able to prune certain directories from being visited.

2. The analysis on each directory can be performed only after it
has been performed on all its subdirectories.

Unless I'm missing something, to do (1), os.walk must be run with
topdown=True, whereas to do (2) it must be run with topdown=False.

I don't think there's a way to coerce os.walk into doing what you
want. That said, the core source for os.walk() is a whole 23
lines of code, it's easy enough to just clone it and add what you
need...
PS: I never understood why os.walk does not support hooks for key
events during such a tree traversal.

including hooks for calling pre/post hooks.

-tkc
 
K

kj

In said:
That said, the core source for os.walk() is a whole 23
lines of code, it's easy enough to just clone it and add what you
need...

Thanks, that was a good idea.

~K
 
T

Terry Reedy

I want implement a function that walks through a directory tree
and performs an analsysis of all the subdirectories found. The
task has two essential requirements that, AFAICT, make it impossible
to use os.walk for this:

1. I need to be able to prune certain directories from being visited.

2. The analysis on each directory can be performed only after it
has been performed on all its subdirectories.

Unless I'm missing something, to do (1), os.walk must be run with
topdown=True, whereas to do (2) it must be run with topdown=False.

Is there a work around that I'm missing?

(I was going to say, 'Copy the code from os.py and revise to suit' --
but I see this has been posted since I wrote it ;-)
PS: I never understood why os.walk does not support hooks for key
events during such a tree traversal.

Either 1) it is intentionally simple, with the expectation that people
would write there own code for more complicated uses or 2) no one has
submitted a 'full-featured' version or 3) both.

If os.walk were rewritten, it should be as an iterator (generator).
Directory entry and exit functions could still be added as params.

Terry Jan Reedy
 
T

Tim Chase

Either 1) it is intentionally simple, with the expectation that people
would write there own code for more complicated uses or 2) no one has
submitted a 'full-featured' version or 3) both.

If os.walk were rewritten, it should be as an iterator (generator).
Directory entry and exit functions could still be added as params.

It *is* an iterator/generator. However, I suspect you mean that
it should slurp the dirs/files iteratively instead of using
listdir() as was discussed on c.l.p a few months back.

The patch to os.py would be fairly simple, something like
--------------------------------------------
--- /usr/lib/python2.5/os.py
+++ ~/tmp/os.py
@@ -220,7 +220,7 @@

__all__.extend(["makedirs", "removedirs", "renames"])

-def walk(top, topdown=True, onerror=None):
+def walk(top, topdown=True, onerror=None, pre=None, post=None):
"""Directory tree generator.

For each directory in the directory tree rooted at top
(including top
@@ -296,15 +296,19 @@
else:
nondirs.append(name)

+ if pre is not None:
+ top, dirs, nondirs = pre(top, dirs, nondirs)
if topdown:
yield top, dirs, nondirs
for name in dirs:
path = join(top, name)
if not islink(path):
- for x in walk(path, topdown, onerror):
+ for x in walk(path, topdown, onerror, pre=pre,
post=post):
yield x
if not topdown:
yield top, dirs, nondirs
+ if post is not None:
+ post(top, dirs, nondirs)

--------------------------------------------

which would allow you to do things like

def pre(top, dirs, nondirs):
dirs = [d for d in dirs
if d not in ('RCS', '.svn', '.hg', '.git')]
return top, dirs, nondirs
def post(top, dirs, nondirs):
complex_process(top)
for top, dirs, nondirs in my.walk(PATH, pre=pre, post=post):
do_stuff(...)

I suspect if I thought about it much longer, only one would
really be needed, the other accommodated by the "topdown" parameter.

-tkc
 
K

kj

It *is* an iterator/generator. However, I suspect you mean that
it should slurp the dirs/files iteratively instead of using
listdir() as was discussed on c.l.p a few months back.

Thanks for mentioning this thread. Very interesting stuff. Apropos
the implementability of an iterative listdir, I wonder if some
variation of glob.iglob() would fit the bill. (Maybe it's too
slow, though.)
I suspect if I thought about it much longer, only one would
really be needed, the other accommodated by the "topdown" parameter.

Yeah, I think one only needs a post hook. The fact that it's a
generator obviates need for a pre hook, since the yield returns
control to the calling function right around where the pre-hook
would run anyway. For the same reason, the post hook is needed
only for the case topdown=True.

Another useful callback would be one that replaced listdir in the
generation of the current directory's contents.

~K
 
K

kj

Either 1) it is intentionally simple, with the expectation that people
would write there own code for more complicated uses or 2) no one has
submitted a 'full-featured' version or 3) both.

I hope it's not (1): I want the language I use to include more
"batteries" not fewer.

<petpeeve>It seems that a similar "simplicity argument" was invoked
to strip the cmp option from sort in Python 3. Grrrr. Simplicity
is great, but when the drive for it starts causing useful functionality
to be thrown out, then it is going too far. Yes, I know that it
is possible to do everything that sort's cmp option does through
clever tricks with key, but grokking and coding this maneuver
requires a *lot* more Python-fu than cmp did, which makes this
functionality a lot less accessible to beginners that the intrinsic
complexity of the problem warrants. And for what? To get rid of
an *option* that could be easily disregarded by anyone who found
it too "complex"? It makes no sense to me.</petpeeve>
 
P

Patrick Maupin

<petpeeve>It seems that a similar "simplicity argument" was invoked
to strip the cmp option from sort in Python 3.  Grrrr.  Simplicity
is great, but when the drive for it starts causing useful functionality
to be thrown out, then it is going too far.  Yes, I know that it
is possible to do everything that sort's cmp option does through
clever tricks with key, but grokking and coding this maneuver
requires a *lot* more Python-fu than cmp did, which makes this
functionality a lot less accessible to beginners that the intrinsic
complexity of the problem warrants.  And for what?  To get rid of
an *option* that could be easily disregarded by anyone who found
it too "complex"? It makes no sense to me.</petpeeve>

I didn't follow the discussion on cmp and Python 3, but I would assume
one reason it was removed was for performance. cmp is very slow
compared to key, and if there should be one obvious way to do it, it
should probably be the way that runs faster. Also, while in many
cases cmp can be easier to use than key, it would be possible for a
naive person with a complicated data structure to come up with a cmp
function that, e.g. sorted A > B, B > C, and C > A. It's impossible
to come up with a key function that will break sort in that fashion.
So, while I understand the frustration of having a favorite tool
removed, I have no trouble believing that there were good reasons for
the removal.

Regards,
Pat
 
S

Steven D'Aprano

In <[email protected]> Terry Reedy



I hope it's not (1): I want the language I use to include more
"batteries" not fewer.

How many more? Ten? Twenty? A hundred? Ten thousand? Ten million? Where
do you draw the line?

Every extra battery is (say) a hundred extra lines of code, which means
fifty thousand more potential bugs. It means more burden on the
maintainers, and more burden on people learning the language. More
batteries can make it more, not less, difficult to code:

# Python 1.1
"I need a collection of values, should I use a list, a tuple or a dict?"

vs

# Python 5000:
"I need a collection of values, should I use a list, a tuple, a dict, an
array, a bag, a SortedBag, a table, a SortedTable, a NamedTuple, a
Record, a stack, a queue, a deque, a linked list, a HashTable, a doubly-
linked list, a set, a frozenset, a skip list, a jump list, a binary tree,
a B tree, a B+ tree, a SortedList, a red-black tree, an A* tree, a B*
tree, a SortedList, a StringList, a CaseInsensitiveSortedStringList, a
CaseInsensitiveRedBlackTree, a HashSet, a CaseInsensitiveHashSet, an
ArrayList, a ConcurrentQueue, a ConcurrentStack, a KeyedCollection, an
EnumerableBag, a MultiSet, a StringMultiSet, a SortedSet, a
DoubleEndedQueue, a Buffer, a CircularQueue, a heap, a binary search
tree, an AssociatedArray, a Vector, a SparseArray, an XOR-linked list, a
weight-balanced tree, a RandomizedBinarySearchTree, a ThreadedBinaryTree,
an AVL tree, a splay tree, a rope, a binomial heap, a fibonacci heap, a
trie, a B-trie, a judy array, an and/or tree, an enfilade, a treap, a
dancing tree, a queap, or a fusion tree?" (Whew!)


There's nothing wrong with the existence of all these data structures. If
you need one, you can use it. But they don't all need to be included in
the standard library. That just increases the cognitive load on
programmers without really adding that much benefit. I've seen people
stress over the choice of a tuple or a list.


<petpeeve>It seems that a similar "simplicity argument" was invoked to
strip the cmp option from sort in Python 3. Grrrr. Simplicity is
great, but when the drive for it starts causing useful functionality to
be thrown out, then it is going too far. Yes, I know that it is
possible to do everything that sort's cmp option does through clever
tricks with key,

Not that clever. In general, key-based sorting is simpler than cmp-based
sorting. In addition, the use of a key function is a basic technique
which every coder should have in their mental toolbox, and it is
applicable to more than just sorting. E.g. max() and min() also support
key functions in Python 2.6 and better. On the other hand a cmp function
is specific to sorting, and nothing but sorting.


but grokking and coding this maneuver requires a *lot*
more Python-fu than cmp did, which makes this functionality a lot less
accessible to beginners that the intrinsic complexity of the problem
warrants.

I disagree that key-based sorting requires any more Python-fu than cmp
sorting. I believe they're equivalent. There are cmp functions you can
write which aren't (easily?) convertible to key, but most of them (all?)
aren't useful or sensible. E.g. they rely on side-effects, don't define a
sensible sort order, or break stability of sorting:
text = "here is a number of words in some order".split()

sorted(text, cmp=lambda a,b: cmp(a.upper(), b.lower())) ['order', 'some', 'in', 'words', 'of', 'number', 'a', 'is', 'here']

sorted(text[2:] + text[:2], cmp=lambda a,b: cmp(a.upper(), b.lower()))
['is', 'here', 'order', 'some', 'in', 'words', 'of', 'number', 'a']


Key-based sorting doesn't let you do this, but since I can't think of any
application where you want the sort order to be dependent on the initial
order, I believe this is a good thing. It is much harder to write buggy
sorts using key than with cmp.


And for what? To get rid of an *option* that could be easily
disregarded by anyone who found it too "complex"? It makes no sense to
me.</petpeeve>


It's not that cmp is too complex, but that it's mere existence adds
complexity to the list data type, it is usually slow and inefficient, and
it can introduce subtle sorting bugs.


In the past, I've suggested that if people really believe that there is a
use-case for sorting with a comparison function, they should fork it from
the built-in list object and release it on PyPI as a third-party package.
I still believe this.
 
T

Terry Reedy

Yes, I was thinking of that thread.
Thanks for mentioning this thread. Very interesting stuff. Apropos
the implementability of an iterative listdir, I wonder if some
variation of glob.iglob() would fit the bill. (Maybe it's too
slow, though.)


Yeah, I think one only needs a post hook. The fact that it's a
generator obviates need for a pre hook, since the yield returns
control to the calling function right around where the pre-hook
would run anyway. For the same reason, the post hook is needed
only for the case topdown=True.

Once you have determined and tested the minimal needed change for
greater functionality, you could either
a) post a suggestion and the revised os.walk to python-ideas
b) submit a feature request to the tracker and attach the revised
function and, if possible, a diff patch
c) both.

I have no idea of the response.

Terry Jan Reedy
 
A

Aahz

[nitpicking one specific point]

On the other hand a cmp function is specific to sorting, and nothing
but sorting.

Not quite. cmp() is useful any time you have an expensive comparison
operation and you need to take three different codepaths depending on
the comparison result. I don't know what the current code does, but I
used it in the first cut Decimal() class.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,769
Messages
2,569,582
Members
45,060
Latest member
BuyKetozenseACV

Latest Threads

Top