Case insensitive exists()?

Larry Martell

I have the need to check for a file's existence against a string, but I
need to do it case-insensitively. I cannot efficiently get the name of
every file in the dir and compare each with my string using lower(),
as I have 100's of strings to check for, each in a different dir, and
each dir can have 100's of files in it. Does anyone know of an
efficient way to do this? There's no switch for os.path that makes
exists() check case-insensitively, is there?
 
Roy Smith

Larry Martell said:
I have the need to check for a file's existence against a string, but I
need to do it case-insensitively. I cannot efficiently get the name of
every file in the dir and compare each with my string using lower(),
as I have 100's of strings to check for, each in a different dir, and
each dir can have 100's of files in it.

I'm not quite sure what you're asking. Do you need to match the
filename, or find the string in the contents of the file? I'm going to
assume you're asking the former.

One way or another, you need to iterate over all the directories and get
all the filenames in each. The time to do that is going to totally
swamp any processing you do in terms of converting to lower case and
comparing to some set of strings.

I would put all my strings into a set, then use os.walk() to traverse the
directories and for each path os.walk() returns, do "path.lower() in
strings".
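
A minimal sketch of that idea, assuming Python 3. The function name `find_matches` and its parameters are made up for illustration, and it compares each file's basename (not the whole path) against the lowercased set, since os.walk() yields (dirpath, dirnames, filenames) tuples:

```python
import os

def find_matches(root, wanted_names):
    """Yield paths under root whose basename matches a wanted name, ignoring case."""
    # Lowercase the wanted names once so each lookup is a constant-time set test.
    wanted = {name.lower() for name in wanted_names}
    for dirpath, dirnames, filenames in os.walk(root):
        for filename in filenames:
            if filename.lower() in wanted:
                yield os.path.join(dirpath, filename)
```

Note that str.lower() is a simple approach; for non-ASCII names str.casefold() may be a better fit.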
 
Larry Martell

I'm not quite sure what you're asking. Do you need to match the
filename, or find the string in the contents of the file? I'm going to
assume you're asking the former.

Yes, match the file names. E.g. if my match string is "ABC" and
there's a file named "Abc" then it would be a match.

One way or another, you need to iterate over all the directories and get
all the filenames in each. The time to do that is going to totally
swamp any processing you do in terms of converting to lower case and
comparing to some set of strings.

I would put all my strings into a set, then use os.walk() to traverse the
directories and for each path os.walk() returns, do "path.lower() in
strings".

The issue is that I run a database query and get back rows, each with
a file path (each in a different dir). And I have to check to see if
that file exists. Each is a separate search with no correlation to the
others. I have the full path, so I guess I'll have to do dir name on
it, then a listdir then compare each item with .lower with my string
.lower. It's just that the dirs have 100's and 100's of files so I'm
really worried about efficiency.
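
The dirname / listdir / lower() plan described here could look roughly like the following sketch (Python 3 assumed; `find_ignoring_case` is a hypothetical helper name). It tries the exact path first, so the directory is only listed when the stored name is not found as written:

```python
import os

def find_ignoring_case(path):
    """Return the on-disk path whose basename matches ignoring case, or None."""
    if os.path.exists(path):
        return path  # exact match, no directory listing needed
    dirname, basename = os.path.split(path)
    wanted = basename.lower()
    try:
        entries = os.listdir(dirname or ".")
    except OSError:
        return None  # directory missing or unreadable
    for entry in entries:
        if entry.lower() == wanted:
            return os.path.join(dirname, entry)
    return None
```

If a directory holds several names that differ only in case, this returns whichever one os.listdir() happens to report first.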
 
Roy Smith

Larry Martell said:
The issue is that I run a database query and get back rows, each with
a file path (each in a different dir). And I have to check to see if
that file exists. Each is a separate search with no correlation to the
others. I have the full path, so I guess I'll have to do dir name on
it, then a listdir then compare each item with .lower with my string
.lower. It's just that the dirs have 100's and 100's of files so I'm
really worried about efficiency.

Oh, my, this is a much more complicated problem than you originally
described.

Is the whole path case-insensitive, or just the last component? In
other words, if the search string is "/foo/bar/my_file_name", do all of
these paths match?

/FOO/BAR/MY_FILE_NAME
/foo/bar/my_file_name
/FoO/bAr/My_FiLe_NaMe

Can you give some more background as to *why* you're doing this?
Usually, if a system considers filenames to be case-insensitive, that's
something that's handled by the operating system itself.
 
Larry Martell

Oh, my, this is a much more complicated problem than you originally
described.

I try not to bother folks with simple problems ;-)

Is the whole path case-insensitive, or just the last component? In
other words, if the search string is "/foo/bar/my_file_name", do all of
these paths match?

/FOO/BAR/MY_FILE_NAME
/foo/bar/my_file_name
/FoO/bAr/My_FiLe_NaMe

Just the file name (the basename).

Can you give some more background as to *why* you're doing this?
Usually, if a system considers filenames to be case-insensitive, that's
something that's handled by the operating system itself.

I can't say why it's happening. This is a big complicated system with
lots of parts. There's some program that ftp's image files from an
electron microscope and stores them on the file system with crazy
names like:

2O_TOPO_1_2O_2UM_FOV_M1_FX-2_FY4_DX0_DY0_DZ0_SDX10_SDY14_SDZ0_RR1_TR1_Ver1.jpg

And something (perhaps the same program, perhaps a different one)
records this in a database. In some cases the name recorded in the db
has different case in some characters than how it was stored on the
file system, e.g.:

2O_TOPO_1_2O_2UM_Fov_M1_FX-2_FY4_DX0_DY0_DZ0_SDX10_SDY14_SDZ0_RR1_TR1_Ver1.jpg

These only differ in "FOV" vs. "Fov" but that is just one example.

I am writing something that is part of a django app: based on
some web entry from the user, I run a query, get back a list of files,
and have to go retrieve them and serve them up to the browser. My
script was all done and seemed to be working, then today I was informed
it was not serving up all the images. Debugging revealed that it was
this case issue: I was matching with exists(). As I've said, coding a
solution is easy, but I fear it will be too slow. Speed is important
in web apps, and users have high expectations. Guess I'll just have to
try it and see.
 
Chris Angelico

I am writing something that is part of a django app: based on
some web entry from the user, I run a query, get back a list of files,
and have to go retrieve them and serve them up to the browser. My
script was all done and seemed to be working, then today I was informed
it was not serving up all the images. Debugging revealed that it was
this case issue: I was matching with exists(). As I've said, coding a
solution is easy, but I fear it will be too slow. Speed is important
in web apps, and users have high expectations. Guess I'll just have to
try it and see.

Would it be a problem to rename all the files? Then you could simply
lower() the input name and it'll be correct.

ChrisA
 
Steven D'Aprano

I have the need to check for a file's existence against a string, but I
need to do it case-insensitively.

Reading on, I see that your database assumes case-insensitive file names,
while your file system is case-sensitive.

Suggestions:

(1) Move the files onto a case-insensitive file system. Samba, I believe,
can duplicate the case-insensitive behaviour of NTFS even on ext3 or ext4
file systems. (To be pedantic, NTFS can also optionally be
case-sensitive, although that is rarely used.) So if you stick the files on a
samba file share set to case-insensitivity, samba will behave the way you
want. (Although os.path.exists won't, you'll have to use ntpath.exists
instead.)

(2) Normalize the database and the files. Do a one-off run through the
files on disk, lowercasing the file names, followed by a one-off run
through the database, doing the same. (Watch out for ambiguous names like
"Foo" and "FOO".) Then you just need to ensure new files are always named
in lowercase.
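
A rough sketch of the on-disk half of suggestion (2), assuming Python 3 and a case-sensitive file system. `lowercase_filenames` is a hypothetical name; it defaults to a dry run and reports ambiguous names such as "Foo" and "FOO" instead of renaming them. The matching database update is not shown:

```python
import os
from collections import defaultdict

def lowercase_filenames(directory, dry_run=True):
    """Lowercase file names in directory; return (renames, collisions)."""
    # Group every entry by its lowercased name to spot ambiguous names.
    by_lower = defaultdict(list)
    for name in os.listdir(directory):
        by_lower[name.lower()].append(name)

    renames, collisions = [], []
    for lower, names in sorted(by_lower.items()):
        if len(names) > 1:
            collisions.append(sorted(names))  # left untouched for manual review
            continue
        original = names[0]
        source = os.path.join(directory, original)
        if original == lower or not os.path.isfile(source):
            continue  # already lowercase, or not a regular file
        renames.append((original, lower))
        if not dry_run:
            os.rename(source, os.path.join(directory, lower))
    return renames, collisions
```

Running it with dry_run=True first shows what would change and which names need a manual decision.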


Also, keep in mind that just because os.path.exists reports a file exists
*right now*, doesn't mean it will still exist a millisecond later when
you go to use it. Consider avoiding os.path.exists altogether, and just
trying to open the file. (Although I see you still have the problem that
you don't know *which* directory the file will be found in.)
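
As an illustration of "just try to open it", here is a small sketch assuming Python 3 (where a missing file raises FileNotFoundError). `read_image` is a hypothetical helper name:

```python
def read_image(path):
    """Return the file's bytes, or None if the file does not exist."""
    try:
        with open(path, "rb") as f:
            return f.read()
    except FileNotFoundError:
        # The file was never there, or vanished between the query and now.
        return None
```

This avoids the window between checking for the file and using it, though it does nothing about the case mismatch itself.
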

I cannot efficiently get the name of
every file in the dir and compare each with my string using lower(), as
I have 100's of strings to check for, each in a different dir, and each
dir can have 100's of files in it. Does anyone know of an efficient way
to do this? There's no switch for os.path that makes exists() check
case-insensitively, is there?

Try ntpath.exists, although I'm not certain it will do what you want
since it probably assumes the file system is case-insensitive.

It really sounds like you have a hard problem to solve here. I strongly
recommend that you change the problem, by renaming the files, or at least
moving them into a consistent location, rather than have to repeatedly
search multiple directories. Good luck!
 
Grant Edwards

I have the need to check for a file's existence against a string, but I
need to do it case-insensitively. I cannot efficiently get the name of
every file in the dir and compare each with my string using lower(),
as I have 100's of strings to check for, each in a different dir, and
each dir can have 100's of files in it. Does anyone know of an
efficient way to do this? There's no switch for os.path that makes
exists() check case-insensitively, is there?

If you're on Unix, you could use os.popen() to run a find command
using -iname.
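
A sketch of the find approach, assuming Python 3.7+ and a find that supports -iname and -maxdepth (GNU and BSD find do; neither option is in POSIX). It uses subprocess.run() in place of os.popen(), and `find_iname` is a hypothetical name. Keep in mind that -iname takes a glob pattern, so names containing *, ? or [ would need escaping:

```python
import subprocess

def find_iname(directory, filename):
    """Return files directly in directory whose name matches filename, ignoring case."""
    result = subprocess.run(
        # -maxdepth 1 keeps find from descending into subdirectories.
        ["find", directory, "-maxdepth", "1", "-type", "f", "-iname", filename],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.splitlines()
```

Each call spawns a process, so for hundreds of lookups per request this may well cost more than listing the directory from Python.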
 
Oscar Benjamin

I am writing something that is part of a django app: based on
some web entry from the user, I run a query, get back a list of files,
and have to go retrieve them and serve them up to the browser. My
script was all done and seemed to be working, then today I was informed
it was not serving up all the images. Debugging revealed that it was
this case issue: I was matching with exists(). As I've said, coding a
solution is easy, but I fear it will be too slow. Speed is important
in web apps, and users have high expectations. Guess I'll just have to
try it and see.

How long does it actually take to serve an HTTP request? I would expect it to
be orders of magnitude slower than calling os.listdir on a directory
containing hundreds of files.

Here on my Linux system there are 2000+ files in /usr/bin. Calling os.listdir
takes 1.5 milliseconds (warm cache):

$ python -m timeit -s 'import os' 'os.listdir("/usr/bin")'
1000 loops, best of 3: 1.42 msec per loop

Converting those to upper case takes a further .5 milliseconds:

$ python -m timeit -s 'import os' 'map(str.upper, os.listdir("/usr/bin"))'
1000 loops, best of 3: 1.98 msec per loop

Checking a string against that list takes .05 milliseconds:

$ python -m timeit -s 'import os' \
'"WHICH" in map(str.upper, os.listdir("/usr/bin"))'
1000 loops, best of 3: 2.03 msec per loop
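
If the same directory is hit more than once, the listing cost can be paid once per directory. The following is only a sketch, assuming Python 3; `_lowercase_index` and `resolve` are made-up names. (Under Python 3, map() returns a lazy iterator, so the second timing command above would need list(...) around it to measure the same work.)

```python
import os
from functools import lru_cache

@lru_cache(maxsize=None)
def _lowercase_index(directory):
    """Map each lowercased entry name to its on-disk spelling."""
    return {name.lower(): name for name in os.listdir(directory)}

def resolve(path):
    """Return the on-disk path matching path's basename ignoring case, or None."""
    dirname, basename = os.path.split(path)
    try:
        actual = _lowercase_index(dirname or ".").get(basename.lower())
    except OSError:
        return None  # directory missing or unreadable
    if actual is None:
        return None
    return os.path.join(dirname, actual)
```

The cache goes stale when files are added or removed; calling `_lowercase_index.cache_clear()` resets it.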


Oscar
 
Larry Martell

Would it be a problem to rename all the files? Then you could simply
lower() the input name and it'll be correct.

So it turned out that in the django model definition for this object
there was code that was doing some character mapping that was causing
this. That code was added to 'fix' another problem, but the mapping
strings were not qualified enough and it was doing some unintended
mapping. Changing those strings to be more specific fixed my problem.

Thanks to all for the replies.
 
