os.path.walk not pruning descent tree (and I'm not happy with that behavior?)


Joe Ardent

Good day, everybody! From what I can tell from the archives, this is
everyone's favorite method from the standard lib, and everyone loves
answering questions about it. Right? :)

Anyway, my question regards the way that the visit callback modifies
the names list. Basically, my simple example is:

##############################
import os
import re

def listUndottedDirs( d ):
    dots = re.compile( r'\.' )

    def visit( arg, dirname, names ):
        for f in names:
            if dots.match( f ):
                i = names.index( f )
                del names[i]
            else:
                print "%s: %s" % ( dirname, f )

    os.path.walk( d, visit, None )
###############################

Basically, I don't want to visit any hidden subdirs (this is a unix
system), nor am I interested in dot-files. If I call the function
like, "listUndottedDirs( '/usr/home/ardent' )", however, EVEN THOUGH
IT IS REMOVING DOTTED DIRS AND FILES FROM names, it will recurse into
the dotted directories; eg, if I have ".kde3/" in that directory, it
will begin listing the contents of /usr/home/ardent/.kde3/ . Here's
what the documentation says about this method:

"The visit function may modify names to influence the set of
directories visited below dirname, e.g. to avoid visiting certain
parts of the tree. (The object referred to by names must be modified
in place, using del or slice assignment.)"

So... What am I missing? Any help would be greatly appreciated.
 
Peter Otten

Joe said:
Good day, everybody! From what I can tell from the archives, this is
everyone's favorite method from the standard lib, and everyone loves
answering questions about it. Right? :)

I don't know what to make of the smiley, so I'll be explicit: use os.walk()
instead of os.path.walk().

Your problem is that you are deleting items from a list while iterating over
it:

# WRONG
>>> names = [".alpha", ".beta", "gamma"]
>>> for name in names:
...     if name.startswith("."):
...         del names[names.index(name)]
...
>>> names
['.beta', 'gamma']

Here's one way to avoid that mess:

>>> names = [".alpha", ".beta", "gamma"]
>>> names[:] = [name for name in names if not name.startswith(".")]
>>> names
['gamma']

The slice [:] on the left side is necessary to change the list in-place.
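
A small sketch of why ".beta" survives the loop, and of what the slice assignment buys you. The `visited`, `skipped`, `original` and `same_object` names are illustrative additions, not part of the session above.

```python
names = [".alpha", ".beta", "gamma"]
original = names          # second reference, standing in for the caller's view
visited = []

for name in names:
    visited.append(name)
    if name.startswith("."):
        # Removing an item shifts later items left while the loop index moves on.
        names.remove(name)

# Which entries did the loop never look at?
skipped = [n for n in [".alpha", ".beta", "gamma"] if n not in visited]

# The fix: build the filtered list first, then write it back into the same object.
names[:] = [name for name in names if not name.startswith(".")]
same_object = names is original
```

Here `skipped` ends up holding ".beta": deleting ".alpha" moved it into the slot the loop had already passed. After the slice assignment, `original` and `names` are still one list, which is the property a callback relies on.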

Peter
 
Gabriel Genellina

Good day, everybody! From what I can tell from the archives, this is
everyone's favorite method from the standard lib, and everyone loves
answering questions about it. Right? :)

Well, in fact, the preferred (and easier) way is to use os.walk - but
os.path.walk is fine too.
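
For comparison, a minimal sketch of the same listing written with os.walk. The function name `list_undotted` and the choice to return pairs instead of printing are assumptions made for illustration; the pruning itself relies on os.walk's default top-down mode, where the `dirnames` list may be edited in place.

```python
import os

def list_undotted(root):
    """Return (dirpath, name) pairs, skipping dot-files and never
    descending into dot-directories. Hypothetical helper for illustration."""
    found = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Slice assignment edits the very list os.walk consults for descent.
        dirnames[:] = sorted(d for d in dirnames if not d.startswith("."))
        visible_files = sorted(f for f in filenames if not f.startswith("."))
        for name in dirnames + visible_files:
            found.append((dirpath, name))
    return found
```

Calling `list_undotted('/usr/home/ardent')` would then report nothing from inside ".kde3/", because that directory is dropped from `dirnames` before os.walk decides where to go next.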

There is nothing wrong with os.path.walk - you are iterating over the names
list *and* removing elements from it at the same time, and that's not
good... Some ways to avoid it:

- iterate over a copy (the [:] is important):

for fname in names[:]:
    if fname[:1]=='.':
        names.remove(fname)

- iterate backwards:

for i in range(len(names)-1, -1, -1):
    fname = names[i]
    if fname[:1]=='.':
        names.remove(fname)

- collect first and remove later:

to_be_deleted = [fname for fname in names if fname[:1]=='.']
for fname in to_be_deleted:
    names.remove(fname)

- filter and reassign in place (the [:] is important):

names[:] = [fname for fname in names if fname[:1]!='.']

(Notice that I haven't used a regular expression, and the remove method)
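
The four approaches listed above can be checked against each other with a short sketch. The `prune_*` function names and the sample list are made up for illustration, and the backwards variant here deletes by index (`del names[i]`), which is a choice of this sketch.

```python
def prune_copy(names):
    # Iterate over a copy so removals do not disturb the loop.
    for fname in names[:]:
        if fname[:1] == '.':
            names.remove(fname)

def prune_backwards(names):
    # Walk indices from the end; deleting never shifts unvisited items.
    for i in range(len(names) - 1, -1, -1):
        if names[i][:1] == '.':
            del names[i]

def prune_collect(names):
    # Decide first, mutate afterwards.
    to_be_deleted = [fname for fname in names if fname[:1] == '.']
    for fname in to_be_deleted:
        names.remove(fname)

def prune_slice(names):
    # Replace the contents of the same list object in one step.
    names[:] = [fname for fname in names if fname[:1] != '.']

sample = ['.kde3', '.profile', 'src', '.cache', 'notes.txt']
results = []
for prune in (prune_copy, prune_backwards, prune_collect, prune_slice):
    names = list(sample)
    alias = names                 # what a caller would still be holding
    prune(names)
    results.append((prune.__name__, names, alias is names))
```

Each entry in `results` records the pruned list and whether the caller's reference still points at it, so all four can be compared on both the outcome and the in-place requirement.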
 
Maric Michaud

I'm really sorry for all those private mails; Thunderbird is awfully
stupid when dealing with mailing-list folders.


Gabriel Genellina wrote:
On Sun, 27 May 2007 22:39:32 -0300, Joe Ardent <[email protected]> wrote:


- iterate backwards:

for i in range(len(names)-1, -1, -1):
    fname = names[i]
    if fname[:1]=='.':
        names.remove(fname)


This is not really about iterating backward; it is about iterating over the
index of each element instead of over the elements themselves (which has to
start from the end). In fact, this code is both inefficient and contains a
subtle bug: names.remove() searches the whole list and deletes the first
item that compares equal, so if two objects in the list compare equal, you
will remove the wrong one.

It should be:

for i in range(len(names)-1, -1, -1):
    if names[i][:1]=='.':
        del names[i]
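
To make the "wrong one" case concrete, here is a tiny sketch using two values that compare equal but are still distinguishable. The list contents are an illustrative assumption; names from a single directory listing are normally unique, so this is about the general pitfall and may never bite in the os.path.walk case.

```python
items = [1, 1.0]               # 1 == 1.0, yet the two entries have different types

by_value = list(items)
by_value.remove(by_value[1])   # searches from the front, so it drops the int at index 0

by_index = list(items)
del by_index[1]                # drops exactly the entry at index 1

survivor_by_value = type(by_value[0]).__name__
survivor_by_index = type(by_index[0]).__name__
```

Asking to remove the item found at index 1 leaves that very item behind when done by value, while deletion by index removes the intended entry and skips the linear search as well.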

- filter and reassign in place

Seems the best here.
(the [:] is important):

Not so. Unless "names" is referenced in another namespace, simple
assignment is enough.
names[:] = [fname for fname in names if fname[:1]!='.']

(Notice that I haven't used a regular expression, and the remove method)
 
Gabriel Genellina

Gabriel Genellina wrote:
- iterate backwards:

for i in range(len(names)-1, -1, -1):
    fname = names[i]
    if fname[:1]=='.':
        names.remove(fname)


This is not really about iterating backward; it is about iterating over the
index of each element instead of over the elements themselves (which has to
start from the end). In fact, this code is both inefficient and contains a
subtle bug: if two objects in the list compare equal, you will remove the
wrong one.

It should be:

for i in range(len(names)-1, -1, -1):
    if names[i][:1]=='.':
        del names[i]


Yes, sure, this is what I should have written. Thanks for the correction!
- filter and reassign in place

Seems the best here.
(the [:] is important):

Not so. Unless "names" is referenced in another namespace, simple
assignment is enough.

But this is exactly the case: the visit function is called from inside the
os.path.walk code, and you have to modify the names parameter in place for
the caller to notice it (and skip the undesired files and folders).
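
A rough sketch of that caller/callback relationship. `mini_walk` is a hypothetical stand-in that walks a nested dict instead of the file system; it is not the real os.path.walk code, only the same shape: the walker builds a list, hands it to the callback, then recurses over whatever is left in its own list.

```python
def mini_walk(tree, path, visit):
    """Hypothetical walker: a dict is a directory, None is a file."""
    names = sorted(tree)
    visit(path, names)            # the callback may edit `names` in place
    for name in names:            # the walker only ever looks at its own list
        child = tree[name]
        if isinstance(child, dict):
            mini_walk(child, path + "/" + name, visit)

tree = {".kde3": {"config": None}, "src": {"main.py": None}, "notes.txt": None}

def make_rebinding_visit(seen):
    def visit(path, names):
        seen.append(path)
        # Plain assignment rebinds the local name; the walker's list is untouched.
        names = [n for n in names if not n.startswith(".")]
    return visit

def make_in_place_visit(seen):
    def visit(path, names):
        seen.append(path)
        # Slice assignment rewrites the list object the walker is holding.
        names[:] = [n for n in names if not n.startswith(".")]
    return visit

seen_rebinding, seen_in_place = [], []
mini_walk(tree, "~", make_rebinding_visit(seen_rebinding))
mini_walk(tree, "~", make_in_place_visit(seen_in_place))
```

With plain assignment the walker still descends into "~/.kde3"; with slice assignment it only reaches "~" and "~/src".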
 
