newb question: file searching

jaysherby

Here's my code:

def getFileList():
    import os
    imageList = []
    for dirpath, dirnames, filenames in os.walk(os.getcwd()):
        for filename in filenames:
            for dirname in dirnames:
                if not dirname.startswith('.'):
                    if filename.lower().endswith('.jpg') and not filename.startswith('.'):
                        imageList.append(os.path.join(dirpath, filename))
    return imageList

I've adapted it around all the much appreciated suggestions. However,
I'm running into two very peculiar logical errors. First, I'm getting
repeated entries. That's no good. One image, one entry in the list.
The other is that if I run the script from my Desktop folder, it won't
find any files, and I make sure to have lots of jpegs in the Desktop
folder for the test. Can anyone figure this out?
 
jaysherby

Something's really not reliable in my logic. I say this because if I
change the extension to .png then a file in a hidden directory (one that
starts with '.') shows up! The most frustrating part is that there are
.jpg files in the very same directory that don't show up when it
searches for jpegs.

I tried os.walk('.') and it works, so I'll be using that instead.


I'm new at Python and I need a little advice. Part of the script I'm
trying to write needs to be aware of all the files of a certain
extension in the script's path and all sub-directories. Can someone
set me on the right path to what modules and calls to use to do that?
You'd think that it would be a fairly simple proposition, but I can't
find examples anywhere. Thanks.
 
Gabriel Genellina

At Tuesday 8/8/2006 21:11, (e-mail address removed) wrote:

Here's my code:

def getFileList():
    import os
    imageList = []
    for dirpath, dirnames, filenames in os.walk(os.getcwd()):
        for filename in filenames:
            for dirname in dirnames:
                if not dirname.startswith('.'):
                    if filename.lower().endswith('.jpg') and not filename.startswith('.'):
                        imageList.append(os.path.join(dirpath, filename))
    return imageList

I've adapted it around all the much appreciated suggestions. However,
I'm running into two very peculiar logical errors. First, I'm getting
repeated entries. That's no good. One image, one entry in the list.

That's because of the double iteration. dirnames and filenames are
two distinct, complementary lists. (If a directory entry is a
directory it goes into dirnames; if it's a file it goes into
filenames.) So you have to process them one after another.

def getFileList():
    import os
    imageList = []
    for dirpath, dirnames, filenames in os.walk(os.getcwd()):
        for filename in filenames:
            if filename.lower().endswith('.jpg') and not filename.startswith('.'):
                imageList.append(os.path.join(dirpath, filename))
        # Prune hidden directories in place so os.walk does not descend into them.
        for i in reversed(range(len(dirnames))):
            if dirnames[i].startswith('.'):
                del dirnames[i]
    return imageList


reversed() because you need to modify dirnames in-place, so it's
better to process the list backwards.
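
For anyone reading along later, here is a small sketch of why the nested loops misbehave. It does not touch the filesystem: nested_matches is a hypothetical helper that replays the original two inner loops for a single directory, given made-up dirnames and filenames lists. Each matching file is appended once per visible subdirectory, and a folder with no visible subdirectories contributes nothing, which may be what happened in the Desktop test.

```python
def nested_matches(dirnames, filenames):
    """Replay the nested loops from the original getFileList for one directory."""
    found = []
    for filename in filenames:
        for dirname in dirnames:
            if not dirname.startswith('.'):
                if filename.lower().endswith('.jpg') and not filename.startswith('.'):
                    found.append(filename)
    return found

# Two visible subdirectories: the single image is appended twice.
print(nested_matches(['a', 'b'], ['cat.jpg']))   # ['cat.jpg', 'cat.jpg']

# No subdirectories: the inner loop never runs, so nothing is found.
print(nested_matches([], ['cat.jpg']))           # []
```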



Gabriel Genellina
Softlab SRL





 
jaysherby

I've narrowed down the problem. All the problems start when I try to
eliminate the hidden files and directories. Is there a better way to
do this?
 
jaysherby

That worked perfectly. Thank you. That was exactly what I was looking
for. However, can you explain to me what the following code actually
does?

reversed(range(len(dirnames)))


 
Justin Azoff

I've narrowed down the problem. All the problems start when I try to
eliminate the hidden files and directories. Is there a better way to
do this?

Well you almost have it, but your problem is that you are trying to do
too many things in one function. (I bet I am starting to sound like a
broken record :)) The four distinct things you are doing are:

* getting a list of all files in a tree
* combining a file's directory with its name to give the full path
* ignoring hidden directories
* matching files based on their extension

If you split each of those things into its own function, you will
end up with smaller, easier-to-test pieces and separate, reusable
functions.

The core function would be basically what you already have:

import os

def get_files(directory, include_hidden=False):
    """Return an expanded list of files for a directory tree
    optionally not ignoring hidden directories"""
    for path, dirs, files in os.walk(directory):
        for fn in files:
            full = os.path.join(path, fn)
            yield full

        if not include_hidden:
            remove_hidden(dirs)

and remove_hidden is a short, but tricky function since the directory
list needs to be edited in place:

def remove_hidden(dirlist):
    """For a list containing directory names, remove
    any that start with a dot"""

    dirlist[:] = [d for d in dirlist if not d.startswith('.')]

at this point, you can play with get_files on its own, and test
whether or not the include_hidden parameter works as expected.
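
One way to try that out is sketched below, assuming Python 3. It rebuilds the two functions from above and runs them against a throwaway tree in a temporary directory. The touch helper and the file layout (a.jpg, pics/b.jpg, .cache/c.jpg) are made up for the test and are not part of the original posts.

```python
import os
import tempfile

def remove_hidden(dirlist):
    """Drop names starting with a dot, editing the list in place."""
    dirlist[:] = [d for d in dirlist if not d.startswith('.')]

def get_files(directory, include_hidden=False):
    """Yield full paths of files under directory (same logic as above)."""
    for path, dirs, files in os.walk(directory):
        for fn in files:
            yield os.path.join(path, fn)
        if not include_hidden:
            remove_hidden(dirs)

def touch(*parts):
    """Hypothetical helper: create an empty file, making parent dirs as needed."""
    target = os.path.join(*parts)
    os.makedirs(os.path.dirname(target), exist_ok=True)
    open(target, 'w').close()

with tempfile.TemporaryDirectory() as root:
    touch(root, 'a.jpg')
    touch(root, 'pics', 'b.jpg')
    touch(root, '.cache', 'c.jpg')

    # Collect results while the temporary tree still exists.
    visible = sorted(os.path.relpath(p, root) for p in get_files(root))
    everything = sorted(os.path.relpath(p, root)
                        for p in get_files(root, include_hidden=True))

print(visible)      # files outside hidden directories only
print(everything)   # also includes the file under .cache
```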

For the final step, I'd use an approach that pulls out the extension
itself, and checks to see if it is in a list (or better, a set) of
allowed extensions. Globbing (*.foo) works as well, but if you are only
ever matching on the extension, I believe this will work better.

def get_files_by_ext(directory, ext_list, include_hidden=False):
    """Return an expanded list of files for a directory tree
    where the file ends with one of the extensions in ext_list"""
    ext_list = set(ext_list)

    for fn in get_files(directory, include_hidden):
        _, ext = os.path.splitext(fn)
        ext = ext[1:]  # remove dot
        if ext.lower() in ext_list:
            yield fn

notice at this point we still haven't said anything about images! The
task of finding files by extension is pretty generic, so it shouldn't
be concerned about the actual extensions.
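
To make the extension check concrete, here is a short sketch of what os.path.splitext returns for a few made-up file names. extension_of is a hypothetical helper that mirrors the two lines inside get_files_by_ext; the sample names and the allowed set are only illustrations.

```python
import os

def extension_of(path):
    """Return the lowercased extension without its leading dot ('' if none)."""
    _, ext = os.path.splitext(path)
    return ext[1:].lower()

allowed = {'jpg', 'jpeg', 'gif', 'png', 'bmp'}

for name in ('photo.JPG', 'archive.tar.gz', '.hidden', 'notes'):
    ext = extension_of(name)
    # Show the name, its extracted extension, and whether it would match.
    print(name, repr(ext), ext in allowed)
```

Only the last dotted part counts as the extension, and a name that is nothing but a leading dot plus text (like .hidden) is treated as having no extension at all, so dotfiles never match by accident.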

once that works, you can simply do

def get_images(directory, include_hidden=False):
    image_exts = ('jpg','jpeg','gif','png','bmp')
    return get_files_by_ext(directory, image_exts, include_hidden)

Hope this helps :)
 
jaysherby

I do appreciate the advice, but I've got a 12 line function that does
all of that. And it works! I just wish I understood a particular line
of it.

def getFileList(*extensions):
    import os
    imageList = []
    for dirpath, dirnames, files in os.walk('.'):
        for filename in files:
            name, ext = os.path.splitext(filename)
            if ext.lower() in extensions and not filename.startswith('.'):
                imageList.append(os.path.join(dirpath, filename))
        for dirname in reversed(range(len(dirnames))):
            if dirnames[dirname].startswith('.'):
                del dirnames[dirname]

    return imageList

print(getFileList('.jpg', '.gif', '.png'))

The line I don't understand is:
reversed(range(len(dirnames)))

 
Justin Azoff

I do appreciate the advice, but I've got a 12 line function that does
all of that. And it works! I just wish I understood a particular line
of it.

You miss the point. The functions I posted, up until get_files_by_ext
which is the equivalent of your getFileList, total 17 actual lines.
The 5 extra lines give 3 extra features. Maybe in a while when you
need to do a similar file search you will realize why my way is better.

[snip]
The line I don't understand is:
reversed(range(len(dirnames)))

This is why I wrote and documented a separate remove_hidden function:
it can be tricky. If you broke it up into multiple lines and added
print statements, it would be clear what it does.

l = len(dirnames) # l is the number of elements in dirnames, e.g. 6
r = range(l) # r contains the numbers 0,1,2,3,4,5
rv = reversed(r) # rv contains the numbers 5,4,3,2,1,0

The problem arises from how to remove elements from a list as you are
going through it. If you delete element 0, element 1 then becomes
element 0, and funny things happen. That particular solution is
relatively simple: it just deletes elements from the end instead. That
complicated expression arises because Python doesn't have "normal"
(C-style, index-counting) for loops. The version of remove_hidden I
wrote is simpler, but relies on the even more obscure lst[:] construct
for replacing a list's contents in place. Both of them accomplish the
same thing though, so if you wanted, you should be able to replace
those 3 lines with just

dirnames[:] = [d for d in dirnames if not d.startswith('.')]
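
A quick sketch of why the [:] matters, with no os.walk involved. Here walker_view is a made-up name standing in for the reference that os.walk keeps to its own dirnames list, and is_visible is a hypothetical helper. Plain assignment only rebinds the local name, while slice assignment changes the list that both names refer to.

```python
def is_visible(name):
    """True for names that do not start with a dot."""
    return not name.startswith('.')

# 'walker_view' stands in for the reference os.walk keeps to its own list.
walker_view = ['.git', 'src', '.cache', 'docs']

dirnames = walker_view
dirnames = [d for d in dirnames if is_visible(d)]     # rebinding: builds a new list
after_rebinding = list(walker_view)                   # the walker's list is unchanged

dirnames = walker_view
dirnames[:] = [d for d in dirnames if is_visible(d)]  # slice assignment: same list object
after_slice_assignment = list(walker_view)

print(after_rebinding)          # ['.git', 'src', '.cache', 'docs']
print(after_slice_assignment)   # ['src', 'docs']
print(dirnames is walker_view)  # True
```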
 
John Machin

I do appreciate the advice, but I've got a 12 line function that does
all of that. And it works! I just wish I understood a particular line
of it.

def getFileList(*extensions):
    import os
    imageList = []
    for dirpath, dirnames, files in os.walk('.'):
        for filename in files:
            name, ext = os.path.splitext(filename)
            if ext.lower() in extensions and not filename.startswith('.'):
                imageList.append(os.path.join(dirpath, filename))
        for dirname in reversed(range(len(dirnames))):
            if dirnames[dirname].startswith('.'):
                del dirnames[dirname]

    return imageList

print(getFileList('.jpg', '.gif', '.png'))

The line I don't understand is:
reversed(range(len(dirnames)))

For a start, change "dirname" to "dirindex" (without changing
"dirnames"!) in that line and the next two lines -- this may help your
understanding.

The purpose of that loop is to eliminate from dirnames any entries
which start with ".". This needs to be done in-situ -- concocting a new
list and binding the name "dirnames" to it won't work.
The safest understandable way to delete entries from a list while
iterating over it is to do it backwards.

Doing it forwards doesn't always work; example:

>>> dirnames = ['foo', 'bar', 'zot']
>>> for x in range(len(dirnames)):
...     if dirnames[x] == 'bar':
...         del dirnames[x]
...
Traceback (most recent call last):
  File "<stdin>", line 2, in ?
IndexError: list index out of range
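
Going forwards does not always fail that loudly. As a rough sketch, the variant below iterates over the names themselves instead of indexes and quietly skips an entry rather than raising. Both function names are hypothetical, and each works on a copy so the sample list can be reused.

```python
def remove_hidden_forwards(names):
    """Buggy on purpose: removes entries while iterating forwards over the same list."""
    names = list(names)
    for name in names:
        if name.startswith('.'):
            names.remove(name)   # shifts later entries left, so the next one is skipped
    return names

def remove_hidden_backwards(names):
    """Delete by index from the end, so earlier positions are never disturbed."""
    names = list(names)
    for i in reversed(range(len(names))):
        if names[i].startswith('.'):
            del names[i]
    return names

sample = ['.git', '.cache', 'src']
print(remove_hidden_forwards(sample))    # ['.cache', 'src']
print(remove_hidden_backwards(sample))   # ['src']
```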

HTH,
John
 
jaysherby

I'm sorry. I didn't mean to offend you. I never thought your way was
inferior.

 
