os.walk walks too much

Marcello Pietrobon

Hello,
I am using Python 2.3
I desire to walk a directory without recursion

this only partly works:
def walk_files() :
    for root, dirs, files in os.walk(top, topdown=True):
        for filename in files:
            print( "file:" + os.path.join(root, filename) )
        for dirname in dirs:
            dirs.remove( dirname )
because it skips all the subdirectories but one.

this *does not* work at all
def walk_files() :
    for root, dirs, files in os.walk(top, topdown=True):
        for filename in files:
            print( "file:" + os.path.join(root, filename) )
        dirs = []

This is surprising to me.
Is this a glitch ?

How should I implement this ?
Maybe it would be good to put it in the os.walk documentation ?

Cheers,
Marcello
 
Josef Meile

Hi,

Marcello said:
Hello,
I am using Python 2.3
I desire to walk a directory without recursion

this only partly works:
def walk_files() :
    for root, dirs, files in os.walk(top, topdown=True):
        for filename in files:
            print( "file:" + os.path.join(root, filename) )
        for dirname in dirs:
            dirs.remove( dirname )

I don't know what this walk function does, but anyway, I
think one problem here is that you are iterating over a
variable that you are changing later. There was a similar
message weeks ago and the solution was to copy the list
and remove the elements of this duplicate. I don't remember
which was the function used to copy the list, but I'm sure
that you can't use:
dirs2=dirs
because both names then refer to the same list object.
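
A small sketch of the difference, written for a current Python 3 interpreter rather than the 2.3 used in this thread; the names `alias` and `snapshot` are only illustrative. Plain assignment gives a second name for the same list, while a full slice builds a new list that can safely drive the loop:

```python
dirs = ["a", "b", "c"]

alias = dirs        # plain assignment: a second name for the same list object
snapshot = dirs[:]  # full slice: a new list holding the same items (shallow copy)

# Iterate over the snapshot so removals from dirs cannot disturb the loop.
for name in snapshot:
    dirs.remove(name)

print(alias is dirs)  # True
print(dirs)           # []
print(snapshot)       # ['a', 'b', 'c']
```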

Regards,
Josef
 
Peter Otten

Marcello said:
I am using Python 2.3
I desire to walk a directory without recursion

this only partly works:
def walk_files() :
    for root, dirs, files in os.walk(top, topdown=True):
        for filename in files:
            print( "file:" + os.path.join(root, filename) )
        for dirname in dirs:
            dirs.remove( dirname )
because it skips all the subdirectories but one.

This is *bad*. If you want to change a list while you iterate over it, use a
copy (there may be worse side effects than you have seen):

        for dirname in dirs[:]:

Marcello said:
this *does not* work at all
def walk_files() :
    for root, dirs, files in os.walk(top, topdown=True):
        for filename in files:
            print( "file:" + os.path.join(root, filename) )
        dirs = []

You are rebinding dirs to a newly created list, leaving the old one (to
which os.walk() still holds a reference) unaltered. Using

        dirs[:] = []

instead should work as desired.
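
To make the slice-assignment fix concrete, here is a minimal sketch for a current Python 3 interpreter. The helper name `top_level_files` and the throwaway directory tree are made up for the illustration; only `os.walk` and the `dirs[:] = []` idiom come from the discussion.

```python
import os
import tempfile

def top_level_files(top):
    """Yield paths of the files directly inside top, pruning os.walk in place."""
    for root, dirs, files in os.walk(top, topdown=True):
        for filename in files:
            yield os.path.join(root, filename)
        # Slice assignment empties the very list os.walk holds a reference to,
        # so there are no subdirectories left for it to descend into.
        dirs[:] = []

# Throwaway tree: top/a.txt and top/sub/b.txt
with tempfile.TemporaryDirectory() as top:
    os.mkdir(os.path.join(top, "sub"))
    for rel in ("a.txt", os.path.join("sub", "b.txt")):
        with open(os.path.join(top, rel), "w") as fh:
            fh.write("x")
    found = [os.path.basename(p) for p in top_level_files(top)]

print(found)  # ['a.txt']
```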

Here's what I do:

def walk_files(root, recursive=False):
    for path, dirs, files in os.walk(root):
        for fn in files:
            yield os.path.join(path, fn)
            if not recursive:
                break

Peter
 
Edward C. Jones

Marcello said:
Hello,
I am using Python 2.3
I desire to walk a directory without recursion

I am not sure what this means. Do you want to iterate over the
non-directory files in directory top? For this job I would use:

def walk_files(top):
    names = os.listdir(top)
    for name in names:
        # listdir() returns bare names, so join with top before testing
        if os.path.isfile(os.path.join(top, name)):
            yield name

this only partly works:
def walk_files() :
    for root, dirs, files in os.walk(top, topdown=True):
        for filename in files:
            print( "file:" + os.path.join(root, filename) )
        for dirname in dirs:
            dirs.remove( dirname )
because it skips all the subdirectories but one.

Replace
        for dirname in dirs:
            dirs.remove( dirname )
with
        for i in range(len(dirs)-1, -1, -1):
            del dirs[i]
to make it work. Run

seq = [0,1,2,3,4,5]
for x in seq:
    seq.remove(x)
print seq

to see the problem. If you are iterating through a list selectively
removing members, you should iterate in reverse. Never change the
positions in the list of elements that have not yet been reached by the
iterator.
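
A short sketch of selective removal in both directions, for a current Python 3 interpreter. The two helper functions and the "drop even numbers" rule are hypothetical, chosen only to make the skipped element visible:

```python
def drop_even_forward(items):
    """Buggy: removing while iterating forward skips the element after each removal."""
    items = list(items)
    for x in items:
        if x % 2 == 0:
            items.remove(x)
    return items

def drop_even_reverse(items):
    """Safe: deleting at index i only shifts positions that were already visited."""
    items = list(items)
    for i in range(len(items) - 1, -1, -1):
        if items[i] % 2 == 0:
            del items[i]
    return items

print(drop_even_forward([2, 4, 6, 7]))  # [4, 7]  (the 4 was never examined)
print(drop_even_reverse([2, 4, 6, 7]))  # [7]
```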

this *does not* work at all
def walk_files() :
    for root, dirs, files in os.walk(top, topdown=True):
        for filename in files:
            print( "file:" + os.path.join(root, filename) )
        dirs = []

There is a subtle point in the documentation.

"When topdown is true, the caller can modify the dirnames list in-place
(perhaps using del or slice assignment), and walk() will only recurse
into the subdirectories whose names remain in dirnames; ..."

The key word is "in-place". "dirs = []" does not change "dirs" in-place.
It replaces "dirs" with a different list. Either use "del"
        for i in range(len(dirs)-1, -1, -1):
            del dirs[i]
as I did above or use "slice assignment"
        dirs[:] = []
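
The same in-place rule also allows pruning only some subdirectories. The following is a sketch under assumptions, for a current Python 3 interpreter: the helper `walk_pruned`, its `skip` parameter, and the directory names are invented for the example.

```python
import os
import tempfile

def walk_pruned(top, skip):
    """Yield file paths under top without descending into directories named in skip."""
    for root, dirs, files in os.walk(top, topdown=True):
        # In-place update: os.walk only visits the names that remain in dirs.
        dirs[:] = [d for d in dirs if d not in skip]
        for filename in files:
            yield os.path.join(root, filename)

# Throwaway tree: top/t.txt, top/keep/k.txt, top/skip/s.txt
with tempfile.TemporaryDirectory() as top:
    for sub in ("keep", "skip"):
        os.mkdir(os.path.join(top, sub))
    for rel in ("t.txt", os.path.join("keep", "k.txt"), os.path.join("skip", "s.txt")):
        with open(os.path.join(top, rel), "w") as fh:
            fh.write("x")
    names = sorted(os.path.basename(p) for p in walk_pruned(top, {"skip"}))

print(names)  # ['k.txt', 't.txt']
```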
 
Marcello Pietrobon

Thank you everybody for all the answers.
They all have been useful :)

I have only two questions regarding Peter Otten's answer

1)
What is the difference between

for dirname in dirs:
    dirs.remove( dirname )

and

for dirname in dirs[:]:
    dirs.remove( dirname )

( I understand and agree that there are better ways, and at least a reverse iterator should be used )



2)

def walk_files(root, recursive=False):
    for path, dirs, files in os.walk(root):
        for fn in files:
            yield os.path.join(path, fn)
            if not recursive:
                break

seems not correct to me:

because I tend to assimilate yield to a very special return statement
so I think the following is correct

def walk_files(root, recursive=False):
    for path, dirs, files in os.walk(root):
        for fn in files:
            yield os.path.join(path, fn)
        if not recursive:
            break


is that right ?

Thank you very much,
Marcello



 
Peter Otten

Marcello said:
What is the difference between

for dirname in dirs:
    dirs.remove( dirname )

and

for dirname in dirs[:]:
    dirs.remove( dirname )

dirs[:] makes a slice containing all elements, i.e. a shallow copy of the
complete list, so the loop is not affected by changes to the original:

>>> dirs = ["alpha", "beta", "gamma"]
>>> dirs == dirs[:] # equal
True
>>> dirs is dirs[:] # but not the same list
False

Marcello said:
def walk_files(root, recursive=False):
    for path, dirs, files in os.walk(root):
        for fn in files:
            yield os.path.join(path, fn)
            if not recursive:
                break

seems not correct to me:

because I tend to assimilate yield to a very special return statement
so I think the following is correct

def walk_files(root, recursive=False):
    for path, dirs, files in os.walk(root):
        for fn in files:
            yield os.path.join(path, fn)
        if not recursive:
            break

is that right ?

Oops, of course you're right.
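
For anyone who wants to check the behaviour, here is a sketch for a current Python 3 interpreter with the `break` at the level of the outer loop, so that every file of the first directory is yielded before the walk stops. The temporary tree and file names are made up for the test.

```python
import os
import tempfile

def walk_files(root, recursive=False):
    for path, dirs, files in os.walk(root):
        for fn in files:
            yield os.path.join(path, fn)
        # Outer-loop level: stop only after all files of the top directory
        # have been yielded.
        if not recursive:
            break

# Throwaway tree: top/a.txt, top/b.txt, top/sub/c.txt
with tempfile.TemporaryDirectory() as top:
    os.mkdir(os.path.join(top, "sub"))
    for rel in ("a.txt", "b.txt", os.path.join("sub", "c.txt")):
        with open(os.path.join(top, rel), "w") as fh:
            fh.write("x")
    flat = sorted(os.path.basename(p) for p in walk_files(top))
    deep = sorted(os.path.basename(p) for p in walk_files(top, recursive=True))

print(flat)  # ['a.txt', 'b.txt']
print(deep)  # ['a.txt', 'b.txt', 'c.txt']
```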

Peter
 
Steve Lamb

Peter said:
dirs[:] makes a slice containing all elements, i.e. a shallow copy of the
complete list, so the loop is not affected by changes to the original:

>>> dirs = ["alpha", "beta", "gamma"]
>>> dirs == dirs[:] # equal
True
>>> dirs is dirs[:] # but not the same list
False

Better way to make it crystal clear.

{grey@teleute:~} python
Python 2.3.3 (#2, Jan 13 2004, 00:47:05)
[GCC 3.3.3 20040110 (prerelease) (Debian)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> real = [1, 2, 3]
>>> copy = real[:]
>>> real
[1, 2, 3]
>>> copy
[1, 2, 3]
>>> for x in real:
...     print(x)
...     real.remove(x)
...
1
3
>>> real
[2]
>>> real = [1, 2, 3]
>>> for x in real[:]: # IE, same as using copy
...     print(x)
...     real.remove(x)
...
1
2
3
>>> real
[]

To the original poster: changing the list you're iterating over goes wrong
because the index doesn't move along with the data in it. So in the first
loop 1 and 3 are printed and 2 is left. Why? Assign indexes to the data. This
is most likely not how Python does it internally, but it is good enough to
show what happened.

[1, 2, 3]
0 1 2

First run through x is one but it got it from the first index, 0. So x is
1. Then you remove 1 from the data set so now it looks like this.

[2, 3]
0 1

So now Python moves on in the loop, it grabs the next index which is 1.
However, since you've changed the list that index now points to 3, not 2. It
grabs 3, prints it then removes it. So now we're left with:

[2]
0

Since it's already done 0 the loop ends. By using a copy you're using the
copy to preserve the indexing to the data while you manipulate the data. Hope
this clears it up. :)
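
The index walk-through above can be replayed with an explicit counter. This is only a model of the loop for illustration (the names `seen` and `index` are made up), not a claim about the interpreter's actual implementation:

```python
real = [1, 2, 3]
seen = []
index = 0

# Model of "for x in real:": fetch real[index], run the body, advance index.
while index < len(real):
    x = real[index]
    seen.append(x)
    real.remove(x)  # everything after x shifts one slot to the left
    index += 1      # but the index still moves forward, so one item is skipped

print(seen)  # [1, 3]
print(real)  # [2]
```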
 
Marcello Pietrobon

Hi Steve,

I thought intuitively something like that, but your help has been...
helpful ! :)

Can I ask you one more thing ?

It is surprising to me that in

for x in real[:]

dirs[:] creates a copy of dirs

while

dirs[:] = [] - empty the original list
and
dirs = [] - empty a copy of the original list

I understand ( I think ) the concept of slicing, but this is still
surprising to me.
Like to say that when I do

for x in real[:]

this is not using slicing

While
dirs[:] = []
is using slicing


Maybe I am just making a big mess in my mind.
It looks like assignments in Python and C++ are pretty different


Cheers,
Marcello
 
Jeff Epler

dirs[:] creates a copy of dirs

This creates a new list which contains the same items as the list named
by dirs
dirs[:] = [] - empty the original list

This changes the items in the list named by dirs. It replaces (mutates)
the range named on the left-hand of = with the items on the right-hand.
and
dirs = [] - empty a copy of the original list

This makes dirs name a different list than it did before, but the value
of the list that dirs named a moment ago is unchanged.

In the case of os.walk (or anywhere you do something by mutating an item
passed in) you have to change the items in a particular list ("mutate
the list") , not change the list a particular local name refers to.
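
A minimal sketch of that distinction for a current Python 3 interpreter; the function names `rebind` and `mutate` are invented for the example. The caller plays the role of os.walk here: it keeps its own reference to the list and only sees changes made to that list.

```python
def rebind(names):
    # The local name now refers to a brand-new list; the caller's list is untouched.
    names = []

def mutate(names):
    # Slice assignment replaces the contents of the list the caller passed in.
    names[:] = []

caller_list = ["x", "y"]

rebind(caller_list)
after_rebind = list(caller_list)

mutate(caller_list)
after_mutate = list(caller_list)

print(after_rebind)  # ['x', 'y']
print(after_mutate)  # []
```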

Jeff
 
Steve Lamb

Can I ask you one more thing ?

Sure. However I am a Python neophyte who happens to have a few years
experience so take everything I say with a large heaping of salt. :)

It is surprising to me that in

Ah, took me a minute to see what you were saying.

for x in real[:]
dirs[:] creates a copy of dirs

Well, "creating a copy" is the shorthand. What both of these are doing is
"output the values from the array x from y to z." Since y and z are not
specified you get the whole array (or string, or any other sliceable
object).

dirs[:] = [] - empty the original list

This is "assign the list given to the range of x from y to z". A better way
to see it would be to do this:

>>> foo = [1, 2, 3, 4]
>>> foo
[1, 2, 3, 4]
>>> foo[1:2] = [3, 2, 5]
>>> foo
[1, 3, 2, 5, 3, 4]

Hmmm, ok, even I'm scratching my head at that since I expected 1, 3, 2, 5,
4. Erm, but you get the idea. :)
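
A short sketch of why the result has six items, for a current Python 3 interpreter (the second list, `bar`, is added only for comparison). The slice `1:2` covers just index 1, so only the 2 is replaced and the original 3 and 4 stay; getting 1, 3, 2, 5, 4 needs the slice `1:3`.

```python
foo = [1, 2, 3, 4]
foo[1:2] = [3, 2, 5]  # slice 1:2 covers only index 1 (the value 2)
print(foo)            # [1, 3, 2, 5, 3, 4]

bar = [1, 2, 3, 4]
bar[1:3] = [3, 2, 5]  # slice 1:3 covers indexes 1 and 2 (the values 2 and 3)
print(bar)            # [1, 3, 2, 5, 4]
```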
and
dirs = [] - empty a copy of the original list

This is because here you're assigning the name to a new object.

So in order...

for x in real[:] - iterate over the results of the slice of real from y to z.

foo = dirs[:] - Assign foo to the results of the slice of dirs from y to z.

dirs[:] = [] - Assign an empty array to the area of dirs defined by slice y
to z.

dirs = [] - Assign the name dirs to a new, empty array.

Where most people get hung up is the difference between strings, which are
immutable, and lists/dictionaries, etc. which are mutable. :)

I understand ( I think ) the concept of slicing, but this is still
surprising to me. Like to say that when I do

for x in real[:]
this is not using slicing

Yes, it is. Take foo from above...
1075943980
foo points to object 1075943980.
1075943980
foo still points to object 1075943980.
1075943308
However this is a different object, 1075943308.

So in the above example it is using a slice. real[:] is returning a slice
and it is that object which x is iterating over. Just because that slice
doesn't have a name assigned to it doesn't mean it doesn't exist. :)
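
A minimal sketch of the same check for a current Python 3 interpreter, comparing `id()` values instead of printing them, since the actual numbers differ from run to run. The names `same`, `sliced` and `visited` are made up for the example.

```python
foo = [1, 2, 3, 4]
same = foo       # another name for the same object
sliced = foo[:]  # a new list object built from the slice

print(id(same) == id(foo))    # True
print(id(sliced) == id(foo))  # False

# The unnamed slice in a for loop is a separate object too,
# so emptying foo inside the loop does not cut the loop short.
visited = []
for x in foo[:]:
    visited.append(x)
    foo.remove(x)

print(visited)  # [1, 2, 3, 4]
print(foo)      # []
```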
While
dirs[:] = []
is using slicing

Well, it is using it in a different manner. Above you're using slicing to
tell Python what to return. Here you're using slicing to tell Python what to
replace.
Maybe I am just making a big mess in my mind.
It looks like assignments in Python and C++ are pretty different

Never touched C++ so I cannot say. :)
 
