making a valid file name...

SpreadTooThin

Hi, I'm writing a Python script that creates directories from user
input.
Sometimes the user inputs characters that aren't valid characters for a
file or directory name.
Here are the characters that I consider to be valid characters...

valid = ':./,^0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ '

If I have a string called fname, I want to go through each character in
the filename and, if it is not a valid character, replace it with a
space.

This is what I have:

def fixfilename(fname):
    valid = ':.\\,^0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ '
    chars = list(fname)  # strings are immutable, so edit a list of characters
    for i in range(len(chars)):
        if valid.find(chars[i]) < 0:
            chars[i] = ' '
    return ''.join(chars)

Anyone think of a simpler solution?
 
Jon Clements

If you want to strip 'em:
'lasfjalsfjdlasfjasfdsomethingelse.dat'

If you want to replace them with something, be careful of the regex
string being built (ie a space character).
>>> import re
>>> re.sub(r'[^%s]' % valid, ' ', filename)
' lasfjalsfjdlasfjasfd somethingelse.dat'


Jon.
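
A minimal sketch of both variants Jon describes, written for Python 3. The input filename is a made-up example, and strip_invalid and replace_invalid are hypothetical helper names, not anything from the post. The regex is the one from the post; it is only safe unescaped because this particular whitelist contains no ']', backslash or '-'.

```python
import re

# Whitelist from the original post.
valid = ':./,^0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ '

def strip_invalid(filename):
    """Drop every character that is not in the whitelist."""
    return ''.join(c for c in filename if c in valid)

def replace_invalid(filename, repl=' '):
    """Replace every character that is not in the whitelist with repl."""
    # Same pattern as in the post; fine for this whitelist, but a whitelist
    # containing ']', a backslash or '-' would need escaping first.
    return re.sub(r'[^%s]' % valid, repl, filename)

filename = 'my#file*name?.dat'    # hypothetical input
print(strip_invalid(filename))    # myfilename.dat
print(replace_invalid(filename))  # my file name .dat
```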
 
Dennis Lee Bieber

SpreadTooThin said:
if I have a string called fname I want to go through each character in
the filename and if it is not a valid character, then I want to replace
it with a space.

Anyone think of a simpler solution?

string method: translate()

Initializing the translate table may be painful, but only needs to
be done once (and you could probably use repr() of it to cut&paste so
you don't need to initialize it later).
>>> ttable = [' '] * 256
>>> for c in ":.\\,^0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ":
...     ttable[ord(c)] = c
...
--
Wulfraed Dennis Lee Bieber KD6MOG
(e-mail address removed) (e-mail address removed)
HTTP://wlfraed.home.netcom.com/
(Bestiaria Support Staff: (e-mail address removed))
HTTP://www.bestiaria.com/
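
For readers on Python 3, here is a rough sketch of the same translate() idea. In Python 2 the table was a 256-character string; in Python 3, str.translate() looks each character up by code point in a mapping, so this sketch assumes a small dict subclass (the name WhitelistTable is made up for illustration) that returns a space for anything not whitelisted. The whitelist keeps the backslash used in the table above.

```python
# Whitelist as in the table-building loop (backslash, not forward slash).
VALID = ':.\\,^0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ '

class WhitelistTable(dict):
    """Maps whitelisted code points to themselves and anything else to a space."""

    def __init__(self, valid):
        super().__init__((ord(c), c) for c in valid)

    def __missing__(self, codepoint):
        # str.translate() indexes the table by code point; lookups for
        # characters outside the whitelist end up here.
        return ' '

TABLE = WhitelistTable(VALID)   # built once, reused for every name

def fixfilename(fname):
    return fname.translate(TABLE)

print(fixfilename('report#2024?.txt'))  # report 2024 .txt
```

Because __missing__ supplies the default, the table also handles characters beyond the first 256 code points without having to list them.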
 
Tim Chase

Sometimes the user inputs characters that aren't valid
characters for a file or directory name. Here are the
characters that I consider to be valid characters...

valid = ':./,^0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ '

Just a caveat: colons and slashes can give grief on various
operating systems, and combined with periods they may cause
trouble too.

Anyone think of a simpler solution?


I don't know if it's simpler, but you can use
'this is a test it ain t expen ive.py'

It does use the "it's almost a ternary operator, but not quite"
method concurrently being discussed/lambasted in another thread.
Treat accordingly, with all that may entail. Should be good in
this case though.

If you're doing it on a time-critical basis, it might help to
make "valid" a set, which should have O(1) membership testing,
rather than using the "in" test with a string. I don't know how
well the find() method of a string performs in relationship to
"in" testing of a set. Test and see, if it's important.

-tkc
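
A sketch of a one-liner in the spirit Tim describes, assuming Python 2.5 or later so that a real conditional expression can stand in for the "almost ternary" and/or idiom (presumably `c in valid and c or ' '`). The whitelist is stored as a frozenset, following the suggestion about set membership; whether that is actually faster than a string here is something to measure, not assume.

```python
# Whitelist from the original post, stored as a frozenset for membership tests.
VALID = frozenset(':./,^0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ ')

def fixfilename(fname, valid=VALID):
    """Replace every character outside the whitelist with a space."""
    # Conditional expression instead of the older and/or trick.
    return ''.join(c if c in valid else ' ' for c in fname)

print(fixfilename("it ain't expen$ive.py"))  # it ain t expen ive.py
```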
 
Edgar Matzinger

Hi,

valid = ':./,^0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ '

Since you don't specify the OS platform: these are not all the
characters that may occur in a filename ('[]{}-=", etc.), and '/' is
NOT valid in a file name on a Unix platform. It should be easy to scan
the filename and check every character against the 'valid' string.

HTH, cu l8r, Edgar.
 
Neil Cerutti

If you're doing it on a time-critical basis, it might help to
make "valid" a set, which should have O(1) membership testing,
rather than using the "in" test with a string. I don't know
how well the find() method of a string performs in relationship
to "in" testing of a set. Test and see, if it's important.

The find method of (8-bit) strings is really, really fast. My
guess is that set can't beat it. I tried to beat it recently with
a binary search function. Even after applying psyco find was
still faster (though I could beat the bisect functions by a
little bit by replacing a divide with a shift).
 
Tim Chase

The find method of (8-bit) strings is really, really fast. My
guess is that set can't beat it. I tried to beat it recently with
a binary search function. Even after applying psyco find was
still faster (though I could beat the bisect functions by a
little bit by replacing a divide with a shift).

In "theory" (you know...that little town in west Texas where
everything goes right), a set-membership test should be O(1). A
binary search function would be O(log N). A linear search of a
string for a member should be O(N).

In practice, however, for such small strings as the given
whitelist, the underlying find() operation likely doesn't put a
blip on the radar. If your whitelist were some huge document
that you were searching repeatedly, it could have worse
performance. Additionally, the find() in the underlying C code
is likely about as bare-metal as it gets, whereas the set
membership aspect of things may go through some more convoluted
setup/teardown/hashing and spend a lot more time further from the
processor's op-codes.

And I know that a number of folks have done some hefty
optimization of Python's string-handling abilities. There's
likely a tradeoff point where it's better to use one over the
other depending on the size of the whitelist. YMMV

-tkc
 
Neil Cerutti

Hi,

valid = ':./,^0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ '

not specifying the OS platform, these are not all the
characters that may occur in a filename: '[]{}-=", etc. And '/'
is NOT valid. On a unix platform. And it should be easy to
scan the filename and check every character against the
'valid-string'.

In the interactive fiction world where I come from, a portable
filename is only 8 chars long and matches the regex
[A-Z][A-Z0-9]*, i.e., capital letters and numbers, with no
extension. That way it'll work on old DOS machines and on
Risc-OS. Wait... is there Python for Risc-OS?
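
As a small illustration of that portable-name rule, here is a sketch that checks a candidate name against it. It assumes "only 8 chars long" means at most eight characters, and is_portable is a hypothetical helper name.

```python
import re

# One capital letter, then up to seven capitals or digits: at most 8 characters.
PORTABLE = re.compile(r'[A-Z][A-Z0-9]{0,7}')

def is_portable(name):
    """True if the whole name matches the portable pattern."""
    return PORTABLE.fullmatch(name) is not None

print(is_portable('ZORK1'))       # True
print(is_portable('zork1'))       # False
print(is_portable('ADVENTURE1'))  # False
```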
 
Fredrik Lundh

Matthew said:
import re
badfilename='£"%^"£^"£$^ihgeroighroeig3645^£$^"knovin98u4#346#1461461'
valid=':./,^0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ '
goodfilename=re.sub('[^'+valid+']',' ',badfilename)

to create arbitrary character sets, it's usually best to run the character string through
re.escape() before passing it to the RE engine.

</F>
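
A short sketch of that advice: the whitelist goes through re.escape() before it is placed inside the character class. The helper name make_cleaner and the odd whitelist in the example are made up to show a case where the unescaped pattern would break.

```python
import re

def make_cleaner(valid, repl=' '):
    """Return a function that replaces every character not in valid with repl."""
    # re.escape() neutralises ']', '\\', '-' and '^' so they are taken literally.
    pattern = re.compile('[^' + re.escape(valid) + ']')
    return lambda name: pattern.sub(repl, name)

# Hypothetical whitelist containing ']', '-' and a backslash.
clean = make_cleaner('abc]-\\xyz')
print(clean('a]d-c'))  # a] -c
```

With the whitelist from the original post the escaped and unescaped patterns behave the same, so the escape only starts to matter once the allowed set changes.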
 
bearophileHUGS

Tim Chase:
In practice, however, for such small strings as the given
whitelist, the underlying find() operation likely doesn't put a
blip on the radar. If your whitelist were some huge document
that you were searching repeatedly, it could have worse
performance. Additionally, the find() in the underlying C code
is likely about as bare-metal as it gets, whereas the set
membership aspect of things may go through some more convoluted
setup/teardown/hashing and spend a lot more time further from the
processor's op-codes.

With this specific test (half good, half bad), on Py2.5, on my PC, sets
start to be faster than the string search when the string "good" is
about 5-6 chars long (which suggests sets are quite fast, I presume).

from random import choice, seed
from time import clock

def main(choice=choice):
    seed(1)
    n = 100000

    for good in ("ab", "abc", "abcdef", "abcdefgh",
                 "abcdefghijklmnopqrstuvwxyz"):
        poss = good + good.upper()
        data = [choice(poss) for _ in xrange(n)] * 10
        print "len(good) = ", len(good)

        t = clock()
        for c in data:
            c in good
        print round(clock()-t, 2)

        t = clock()
        sgood = set(good)
        for c in data:
            c in sgood
        print round(clock()-t, 2), "\n"

main()


Bye,
bearophile
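
The benchmark above targets Python 2 (print statements, xrange, time.clock). A rough Python 3 sketch of the same comparison, using timeit, could look like the following. The function name compare and its parameters are assumptions for illustration, and unlike the original, the set is built outside the timed region. No timings are quoted here, since they depend on the machine and the interpreter version.

```python
from random import Random
from timeit import timeit

def compare(good, n=10000, repeats=5):
    """Time `c in good` for a string versus a set of the same characters."""
    rng = Random(1)                      # fixed seed, as in the post
    poss = good + good.upper()           # half of the candidates are members
    data = [rng.choice(poss) for _ in range(n)]
    sgood = set(good)                    # built once, outside the timed loop

    def scan(container):
        for c in data:
            c in container               # membership test only; result discarded

    t_str = timeit(lambda: scan(good), number=repeats)
    t_set = timeit(lambda: scan(sgood), number=repeats)
    return t_str, t_set

for good in ("ab", "abcdef", "abcdefghijklmnopqrstuvwxyz"):
    t_str, t_set = compare(good)
    print("len(good) = %2d  str: %.4fs  set: %.4fs" % (len(good), t_str, t_set))
```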
 
Neil Cerutti

With this specific test (half good half bad), on Py2.5, on my PC, sets
start to be faster than the string search when the string "good" is
about 5-6 chars long (this means set are quite fast, I presume).

On my Python2.4 for Windows, they are often still neck-and-neck
for len(good) = 26. set's disadvantage of having to be
constructed is heavily amortized over 100,000 membership
tests. Without knowing the usage pattern, it'd be hard to choose
between them.
 
