making a valid file name...

SpreadTooThin · Oct 17, 2006

Hi, I'm writing a Python script that creates directories from user
input.
Sometimes the user inputs characters that aren't valid characters for a
file or directory name.
Here are the characters that I consider to be valid characters...

valid = ':./,^0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ '

If I have a string called fname, I want to go through each character in
the filename and, if it is not a valid character, replace it with a
space.

This is what I have:

def fixfilename(fname):
    valid = ':.\,^0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ '
    chars = list(fname)  # strings are immutable, so work on a list of characters
    for i in range(len(chars)):
        if valid.find(chars[i]) < 0:
            chars[i] = ' '
    return ''.join(chars)

Anyone think of a simpler solution?

Jerry · Oct 17, 2006

I would suggest something like string.maketrans
http://docs.python.org/lib/node41.html. I don't remember exactly how
it works, but I think it's something like
'123123.txt'
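
For reference, a minimal sketch of the translate-table idea in current Python, where string.maketrans has become str.maketrans. The names VALID and fix_filename are made up for illustration, and the whitelist is assumed to be the one from the original post; this is not the snippet Jerry had in mind.

```python
# Whitelist taken from the original post (assumption: '/' variant).
VALID = ':./,^0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ '

def fix_filename(fname, valid=VALID):
    """Replace every character of fname that is not in valid with a space."""
    # Only the characters that actually occur in fname need a table entry.
    bad = set(fname) - set(valid)
    table = str.maketrans({ch: ' ' for ch in bad})
    return fname.translate(table)

print(fix_filename('a*b?c.txt'))
```

Building the table per call keeps the sketch short; for a fixed whitelist and byte strings, a table built once is cheaper.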

Jon Clements · Oct 17, 2006

SpreadTooThin said:
Anyone think of a simpler solution?

If you want to strip 'em:
'lasfjalsfjdlasfjasfdsomethingelse.dat'

If you want to replace them with something, be careful of the regex
string being built (i.e. a space character).
import re

re.sub(r'[^%s]' % valid,' ',filename)

' lasfjalsfjdlasfjasfd somethingelse.dat'

Jon.
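
A minimal sketch of the "strip 'em" variant, assuming the whitelist from the original post. The names VALID and strip_invalid are hypothetical, and this is not necessarily the expression that produced the output shown above.

```python
# Whitelist taken from the original post.
VALID = ':./,^0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ '

def strip_invalid(filename, valid=VALID):
    """Drop (rather than replace) every character that is not whitelisted."""
    return ''.join(c for c in filename if c in valid)

print(strip_invalid('some*thing?else.dat'))
```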

Dennis Lee Bieber · Oct 17, 2006

SpreadTooThin said:
if I have a string called fname I want to go through each character in
the filename and if it is not a valid character, then I want to replace
it with a space.

Anyone think of a simpler solution?

string method: translate()

Initializing the translate table may be painful, but only needs to
be done once (and you could probably use repr() of it to cut&paste so
you don't need to initialize it later).

ttable = [' '] * 256
for c in ":.\,^0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ":
    ttable[ord(c)] = c

--
Wulfraed Dennis Lee Bieber KD6MOG
HTTP://wlfraed.home.netcom.com/
HTTP://www.bestiaria.com/
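
A rough Python 3 rendering of the 256-entry table idea, as a sketch only. It assumes the names are byte strings, since bytes.translate() is what takes a 256-byte table today; VALID, TABLE and sanitize_bytes are names invented for this example.

```python
# Same whitelist as in the loop above, written as a bytes literal.
VALID = rb':.\,^0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'

# Start with every byte mapped to a space, then let valid bytes map to themselves.
_table = bytearray(b' ' * 256)
for b in VALID:          # iterating over bytes yields integers 0..255
    _table[b] = b
TABLE = bytes(_table)    # built once, reused for every call

def sanitize_bytes(name):
    """Replace every byte not in VALID with a space."""
    return name.translate(TABLE)

print(sanitize_bytes(b'my*file?.txt'))
```

The table only has to be built once, which is the point made above; after that each call is a single translate().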

Tim Chase · Oct 17, 2006

Sometimes the user inputs characters that aren't valid
characters for a file or directory name. Here are the
characters that I consider to be valid characters...

valid = ':./,^0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ '

Just a caveat, as colons and slashes can give grief on various
operating systems...combined with periods, it may be possible to
cause trouble too...

Anyone think of a simpler solution?

I don't know if it's simpler, but you can use
'this is a test it ain t expen ive.py'

It does use the "it's almost a ternary operator, but not quite"
method concurrently being discussed/lambasted in another thread.
Treat accordingly, with all that may entail. Should be good in
this case though.
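
For readers on a current Python, a minimal sketch of the same per-character idea written with a real conditional expression instead of the "and/or" trick. VALID and fix_filename are hypothetical names, and the whitelist is assumed to be the one from the original post.

```python
# Whitelist taken from the original post.
VALID = ':./,^0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ '

def fix_filename(fname, valid=VALID):
    """Keep whitelisted characters and turn everything else into a space."""
    return ''.join(c if c in valid else ' ' for c in fname)

print(fix_filename("it ain't expen$ive.py"))
```

Unlike "c in valid and c or ' '", the conditional expression has no surprise when the kept value happens to be falsy.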

If you're doing it on a time-critical basis, it might help to
make "valid" a set, which should have O(1) membership testing,
rather than using the "in" test with a string. I don't know how
well the find() method of a string performs in relationship to
"in" testing of a set. Test and see, if it's important.

-tkc

Edgar Matzinger · Oct 17, 2006

Hi,

valid =
':./,^0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ '

Without specifying the OS platform, these are not all the characters
that may occur in a filename: '[]{}-=", etc. And '/' is NOT valid
on a Unix platform. It should be easy to scan the filename and
check every character against the 'valid-string'.

HTH, cu l8r, Edgar.
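
A small sketch of a check along these lines, assuming the input is meant to be a single file or directory name rather than a whole path. is_safe_component is a hypothetical helper, and the rule set is deliberately minimal rather than a complete per-platform list.

```python
import os

def is_safe_component(name):
    """Return True if name looks usable as one file or directory name."""
    # Empty names and the special entries '.' and '..' are never usable.
    if name in ('', '.', '..'):
        return False
    # '/' and NUL cannot appear inside a POSIX file name; also reject
    # whatever separators the current platform uses (e.g. '\\' on Windows).
    forbidden = {'/', '\0', os.sep}
    if os.altsep:
        forbidden.add(os.altsep)
    return not any(ch in forbidden for ch in name)

print(is_safe_component('report.txt'), is_safe_component('a/b'))
```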

Neil Cerutti · Oct 17, 2006

If you're doing it on a time-critical basis, it might help to
make "valid" a set, which should have O(1) membership testing,
rather than using the "in" test with a string. I don't know
how well the find() method of a string performs in relationship
to "in" testing of a set. Test and see, if it's important.

The find method of (8-bit) strings is really, really fast. My
guess is that set can't beat it. I tried to beat it recently with
a binary search function. Even after applying psyco find was
still faster (though I could beat the bisect functions by a
little bit by replacing a divide with a shift).

Tim Chase · Oct 17, 2006

The find method of (8-bit) strings is really, really fast. My
guess is that set can't beat it. I tried to beat it recently with
a binary search function. Even after applying psyco find was
still faster (though I could beat the bisect functions by a
little bit by replacing a divide with a shift).

In "theory" (you know...that little town in west Texas where
everything goes right), a set-membership test should be O(1). A
binary search function would be O(log N). A linear search of a
string for a member should be O(N).

In practice, however, for such small strings as the given
whitelist, the underlying find() operation likely doesn't put a
blip on the radar. If your whitelist were some huge document
that you were searching repeatedly, it could have worse
performance. Additionally, the find() in the underlying C code
is likely about as bare-metal as it gets, whereas the set
membership aspect of things may go through some more convoluted
setup/teardown/hashing and spend a lot more time further from the
processor's op-codes.

And I know that a number of folks have done some hefty
optimization of Python's string-handling abilities. There's
likely a tradeoff point where it's better to use one over the
other depending on the size of the whitelist. YMMV

-tkc

Neil Cerutti · Oct 17, 2006

Hi,

valid =
':./,^0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ '

not specifying the OS platform, these are not all the
characters that may occur in a filename: '[]{}-=", etc. And '/'
is NOT valid. On a unix platform. And it should be easy to
scan the filename and check every character against the
'valid-string'.

In the interactive fiction world where I come from, a portable
filename is only 8 chars long and matches the regex
[A-Z][A-Z0-9]*, i.e., capital letters and numbers, with no
extension. That way it'll work on old DOS machines and on
Risc-OS. Wait... is there Python for Risc-OS?
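
A sketch of that portability rule as a check, assuming "8 chars long" means at most eight characters. is_portable_name is a made-up name for illustration.

```python
import re

# One capital letter followed by up to seven capital letters or digits.
_PORTABLE = re.compile(r'[A-Z][A-Z0-9]{0,7}')

def is_portable_name(name):
    """True if name fits the [A-Z][A-Z0-9]* rule and is at most 8 chars."""
    return _PORTABLE.fullmatch(name) is not None

for candidate in ('ZORK1', 'zork', 'TOOLONGNAME', '1ABC'):
    print(candidate, is_portable_name(candidate))
```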

Fredrik Lundh · Oct 18, 2006

Matthew said:
import re
badfilename='£"%^"£^"£$^ihgeroighroeig3645^£$^"knovin98u4#346#1461461'
valid=':./,^0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ '
goodfilename=re.sub('[^'+valid+']',' ',badfilename)

to create arbitrary character sets, it's usually best to run the character string through
re.escape() before passing it to the RE engine.

</F>
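
A minimal sketch of the re.escape() suggestion. make_sanitizer is a hypothetical helper name; the second whitelist is contrived to contain characters (^, ], -) that would change the meaning of the character class if they were pasted in unescaped.

```python
import re

def make_sanitizer(valid, replacement=' '):
    """Build a function that replaces every character not in valid."""
    # re.escape() neutralises characters such as ^, ] and - that are
    # special inside a [...] character class.
    pattern = re.compile('[^' + re.escape(valid) + ']')
    return lambda text: pattern.sub(replacement, text)

valid = ':./,^0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ '
fix = make_sanitizer(valid)
print(fix('bad*name?.dat'))

tricky = make_sanitizer('^]-ab')
print(tricky('a]b-c^d'))
```

Compiling the pattern once also avoids rebuilding the regex string on every call.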

bearophileHUGS · Oct 19, 2006

Tim Chase:

In practice, however, for such small strings as the given
whitelist, the underlying find() operation likely doesn't put a
blip on the radar. If your whitelist were some huge document
that you were searching repeatedly, it could have worse
performance. Additionally, the find() in the underlying C code
is likely about as bare-metal as it gets, whereas the set
membership aspect of things may go through some more convoluted
setup/teardown/hashing and spend a lot more time further from the
processor's op-codes.

With this specific test (half good, half bad), on Py2.5, on my PC, sets
start to be faster than the string search when the string "good" is
about 5-6 chars long (this means sets are quite fast, I presume).

from random import choice, seed
from time import clock

def main(choice=choice):
    seed(1)
    n = 100000

    for good in ("ab", "abc", "abcdef", "abcdefgh",
                 "abcdefghijklmnopqrstuvwxyz"):
        poss = good + good.upper()
        data = [choice(poss) for _ in xrange(n)] * 10
        print "len(good) = ", len(good)

        t = clock()
        for c in data:
            c in good
        print round(clock()-t, 2)

        t = clock()
        sgood = set(good)
        for c in data:
            c in sgood
        print round(clock()-t, 2), "\n"

main()

Bye,
bearophile
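
For anyone wanting to repeat the comparison on a current interpreter, a rough Python 3 sketch of the same kind of measurement using timeit (time.clock and xrange are gone). compare_membership is a hypothetical name, the sample size is smaller than above, and the numbers will vary by machine and Python version.

```python
import random
import timeit

def compare_membership(good, n=10000, repeats=5):
    """Time `c in good` for a str versus a set built from the same chars."""
    rng = random.Random(1)
    poss = good + good.upper()          # about half the lookups hit, half miss
    data = [rng.choice(poss) for _ in range(n)]
    sgood = set(good)                   # built once, outside the timed loop

    def scan(container):
        for c in data:
            c in container

    # Take the best of several runs to reduce noise.
    t_str = min(timeit.repeat(lambda: scan(good), number=1, repeat=repeats))
    t_set = min(timeit.repeat(lambda: scan(sgood), number=1, repeat=repeats))
    return t_str, t_set

if __name__ == "__main__":
    for good in ("ab", "abc", "abcdef", "abcdefgh",
                 "abcdefghijklmnopqrstuvwxyz"):
        t_str, t_set = compare_membership(good)
        print("len(good) = %2d  str: %.4fs  set: %.4fs"
              % (len(good), t_str, t_set))
```

Note that this sketch leaves set construction out of the timed region, whereas the script above includes it.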

Neil Cerutti · Oct 19, 2006

On my Python2.4 for Windows, they are often still neck-and-neck
for len(good) = 26. set's disadvantage of having to be
constructed is heavily amortized over 100,000 membership
tests. Without knowing the usage pattern, it'd be hard to choose
between them.
