stripping unwanted chars from string

Discussion in 'Python' started by Edward Elliott, May 4, 2006.

  1. I'm looking for the "best" way to strip a large set of chars from a filename
    string (my definition of best usually means succinct and readable). I
    only want to allow alphanumeric chars, dashes, and periods. This is what I
    would write in Perl (bless me father, for I have sinned...):

    $filename =~ tr/\w.-//cd, or equivalently
    $filename =~ s/[^\w.-]//

    I could just use re.sub like the second example, but that's a bit overkill.
    I'm trying to figure out if there's a good way to do the same thing with
    string methods. string.translate seems to do what I want, the problem is
    specifying the set of chars to remove. Obviously hardcoding them all is a
    non-starter.

    Working with chars seems to be a bit of a pain. There's no equivalent of
    the range function, one has to do something like this:

    >>> [chr(x) for x in range(ord('a'), ord('z')+1)]
    ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o',
    'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']

    Do that twice for letters, once for numbers, add in a few others, and I get
    the chars I want to keep. Then I'd invert the set and call translate.
    It's a mess and not worth the trouble. Unless there's some way to expand a
    compact representation of a char list and obtain its complement, it looks
    like I'll have to use a regex.

    Ideally, there would be a mythical charset module that works like this:

    >>> keep = charset.expand (r'\w.-') # or r'a-zA-Z0-9_.-'
    >>> toss = charset.invert (keep)


    Sadly I can find no such beast. Anyone have any insight? As of now,
    regexes look like the best solution.
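
    For illustration, the wished-for behaviour can be approximated in a few lines. This is only a minimal sketch in present-day Python 3 (the thread predates it): expand and invert are hypothetical helper names, the spec syntax handles plain x-y ranges but no backslash classes, and the complement is taken over 7-bit ASCII only, which is an assumption.

```python
def expand(spec):
    """Expand a compact spec such as 'a-zA-Z0-9_.-' into a set of characters.

    Hypothetical helper: 'x-y' is an inclusive range, and a dash at the very
    start or end of the spec is literal. Regex-style classes are not handled.
    """
    chars = set()
    i = 0
    while i < len(spec):
        if i + 2 < len(spec) and spec[i + 1] == '-':
            lo, hi = ord(spec[i]), ord(spec[i + 2])
            chars.update(chr(c) for c in range(lo, hi + 1))
            i += 3
        else:
            chars.add(spec[i])
            i += 1
    return chars


def invert(keep, universe=None):
    """Return the characters of universe (default: 7-bit ASCII) not in keep."""
    if universe is None:
        universe = map(chr, range(128))
    return set(universe) - set(keep)


keep = expand('a-zA-Z0-9_.-')
toss = invert(keep)
# Python 3 spelling of "call translate": the third maketrans argument
# lists the characters to delete.
table = str.maketrans('', '', ''.join(toss))
print('qwe!@#456.--Howzat?'.translate(table))  # qwe456.--Howzat
```

    Characters outside the assumed ASCII universe are not in toss, so this version passes them through untouched.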
     
    Edward Elliott, May 4, 2006
    #1

  2. John Machin

    On 4/05/2006 1:36 PM, Edward Elliott wrote:
    > I'm looking for the "best" way to strip a large set of chars from a filename
    > string (my definition of best usually means succinct and readable). I
    > only want to allow alphanumeric chars, dashes, and periods. This is what I
    > would write in **** (bless me father, for I have sinned...):


    [expletives deleted] and it was wrong anyway (according to your
    requirements);
    using \w would keep '_' which is *NOT* alphanumeric.

    > I could just use re.sub like the second example, but that's a bit overkill.
    > I'm trying to figure out if there's a good way to do the same thing with
    > string methods. string.translate seems to do what I want, the problem is
    > specifying the set of chars to remove. Obviously hardcoding them all is a
    > non-starter.
    >
    > Working with chars seems to be a bit of a pain. There's no equivalent of
    > the range function, one has to do something like this:
    >
    >>>> [chr(x) for x in range(ord('a'), ord('z')+1)]

    > ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o',
    > 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']


    >>> alphabet = 'qwertyuiopasdfghjklzxcvbnm' # Look, Ma, no thought required!! Monkey see, monkey type.
    >>> keepchars = set(alphabet + alphabet.upper() + '1234567890-.')
    >>> fixer = lambda x: ''.join(c for c in x if c in keepchars)
    >>> fixer('qwe!@#456.--Howzat?')
    'qwe456.--Howzat'
    >>>


    >
    > Do that twice for letters, once for numbers, add in a few others, and I get
    > the chars I want to keep. Then I'd invert the set and call translate.
    > It's a mess and not worth the trouble. Unless there's some way to expand a
    > compact representation of a char list and obtain its complement, it looks
    > like I'll have to use a regex.
    >
    > Ideally, there would be a mythical charset module that works like this:
    >
    >>>> keep = charset.expand (r'\w.-') # or r'a-zA-Z0-9_.-'


    Where'd that '_' come from?

    >>>> toss = charset.invert (keep)

    >
    > Sadly I can find no such beast. Anyone have any insight? As of now,
    > regexes look like the best solution.


    I'll leave it to somebody else to dredge up the standard riposte to your
    last sentence :)

    One point on your requirements: replacing unwanted characters instead of
    deleting them may be better -- theoretically possible problems with
    deleting are: (1) duplicates (foo and foo_ become the same) (2) '_'
    becomes '' which is not a valid filename. And a legibility problem: if
    you hate '_' and ' ' so much, why not change them to '-'?
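
    The replacement alternative could be sketched as follows. This is a minimal example in present-day Python 3; the function name replace_unwanted, the choice of '-' as the substitute, and the decision to collapse each run of unwanted characters into one substitute are all illustrative assumptions.

```python
import re


def replace_unwanted(name, replacement='-'):
    # Each run of characters outside letters, digits, '.' and '-' becomes
    # a single replacement character instead of vanishing.
    return re.sub(r'[^A-Za-z0-9.-]+', replacement, name)


print(replace_unwanted('foo'))          # foo
print(replace_unwanted('foo_'))         # foo-
print(replace_unwanted('_'))            # -
print(replace_unwanted('my  file.txt')) # my-file.txt
```

    With replacement, foo and foo_ stay distinct and a lone '_' no longer turns into an empty name, though collisions between, say, foo_ and foo! remain possible.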

    Oh and just in case the fix was accidentally applied to a path:

    keepchars.update(os.sep)
    if os.altsep: keepchars.update(os.altsep)

    HTH,
    John
     
    John Machin, May 4, 2006
    #2

  3. Bryan


    > >>> keepchars = set(alphabet + alphabet.upper() + '1234567890-.')


    or

    >>> keepchars = set(string.letters + string.digits + '-.')
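
    A note for readers on Python 3: string.letters was removed there, and string.ascii_letters is the locale-independent constant that remains. A minimal sketch of the same idea under that assumption (the fixer name is borrowed from the earlier post):

```python
import string

# ASCII letters and digits plus the two punctuation characters from the post.
keepchars = set(string.ascii_letters + string.digits + '-.')


def fixer(name):
    # Keep only the characters that appear in keepchars.
    return ''.join(c for c in name if c in keepchars)


print(fixer('qwe!@#456.--Howzat?'))  # qwe456.--Howzat
```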


    bryan
     
    Bryan, May 4, 2006
    #3
  4. John Machin wrote:
    > [expletives deleted] and it was wrong anyway (according to your
    > requirements);
    > using \w would keep '_' which is *NOT* alphanumeric.


    Actually the Perl is correct; the explanation was the faulty part. When in
    doubt, trust the code. Plus I explicitly allowed _ further down, so the
    mistake should have been fairly obvious.


    > >>> alphabet = 'qwertyuiopasdfghjklzxcvbnm' # Look, Ma, no thought required!! Monkey see, monkey type.


    I won't dignify that with a response. The code that is, I could give a toss
    about the comments. If you enjoy using such verbose, error-prone
    representations in your code, god help anyone maintaining it. Including
    you six months later. Quick, find the difference between these sets at a
    glance:

    'qwertyuiopasdfghjklzxcvbnm'
    'abcdefghijklmnopqrstuvwxyz'
    'abcdefghijklmnopprstuvwxyz'
    'abcdefghijk1mnopqrstuvwxyz'
    'qwertyuopasdfghjklzxcvbnm' # no fair peeking

    And I won't even bring up locales.


    > >>> keepchars = set(alphabet + alphabet.upper() + '1234567890-.')
    > >>> fixer = lambda x: ''.join(c for c in x if c in keepchars)


    Those darn monkeys, always think they're so clever! ;)
    if "you can" == "you should": do(it)
    else: do(not)


    >> Sadly I can find no such beast. Anyone have any insight? As of now,
    >> regexes look like the best solution.

    >
    > I'll leave it to somebody else to dredge up the standard riposte to your
    > last sentence :)


    If the monstrosity above is the best you've got, regexes are clearly the
    better solution. Readable trumps inscrutable any day.


    > One point on your requirements: replacing unwanted characters instead of
    > deleting them may be better -- theoretically possible problems with
    > deleting are: (1) duplicates (foo and foo_ become the same) (2) '_'
    > becomes '' which is not a valid filename.


    Which is why I perform checks for emptiness and uniqueness after the strip.
    I decided long ago that stripping is preferable to replacement here.


    > And a legibility problem: if
    > you hate '_' and ' ' so much, why not change them to '-'?


    _ is allowed. And I do prefer -, but not for legibility. It doesn't
    require me to hit Shift.


    > Oh and just in case the fix was accidentally applied to a path:
    >
    > keepchars.update(os.sep)
    > if os.altsep: keepchars.update(os.altsep)


    Nope, like I said this is strictly a filename. Stripping out path
    components is the first thing I do. But thanks for pointing out these
    common pitfalls for members of our studio audience. Tell him what he's
    won, Johnny! ;)
     
    Edward Elliott, May 4, 2006
    #4
  5. Bryan wrote:
    > >>> keepchars = set(string.letters + string.digits + '-.')


    Now that looks a lot better. Just don't forget the underscore. :)
     
    Edward Elliott, May 4, 2006
    #5
  6. Edward Elliott wrote:
    > Bryan wrote:
    >
    >> >>> keepchars = set(string.letters + string.digits + '-.')

    >
    >
    > Now that looks a lot better. Just don't forget the underscore. :)
    >

    You may also want to have a look at string.translate() and
    string.maketrans()
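
    In present-day Python 3 these live on the string type as str.translate() and str.maketrans(). One way to use them for "delete everything except a whitelist" is a small table object, since str.translate() only needs something indexable by code point. The sketch below is an illustration, not the approach anyone in the thread posted; the class name DeleteAllBut is hypothetical.

```python
import string


class DeleteAllBut:
    """Translation table for str.translate(): deletes anything not in keep."""

    def __init__(self, keep):
        self.keep = frozenset(ord(c) for c in keep)

    def __getitem__(self, codepoint):
        # str.translate() indexes the table with each character's ordinal;
        # None means "delete", an int means "map to this ordinal".
        return codepoint if codepoint in self.keep else None


table = DeleteAllBut(string.ascii_letters + string.digits + '_.-')
print('qwe!@#456.--Howzat?'.translate(table))  # qwe456.--Howzat
```

    Because the table decides per code point, non-ASCII characters are deleted as well, without having to enumerate them up front.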

    --
    bruno desthuilliers
    python -c "print '@'.join(['.'.join([w[::-1] for w in p.split('.')]) for
    p in ''.split('@')])"
     
    bruno at modulix, May 4, 2006
    #6
  7. John Machin

    On 4/05/2006 4:30 PM, Edward Elliott wrote:
    > Bryan wrote:
    >> >>> keepchars = set(string.letters + string.digits + '-.')

    >
    > Now that looks a lot better. Just don't forget the underscore. :)
    >


    *Looks* better than the monkey business. Perhaps I should point out to
    those of the studio audience who are huddled in an ASCII bunker (if any)
    that string.letters provides the characters considered to be alphabetic
    in whatever the locale is currently set to. There is no guarantee that
    the operating system won't permit filenames containing other characters,
    ones that the file's creator would quite reasonably consider to be
    alphabetic. And of course there are languages that have characters that
    one would not want to strip but can scarcely be described as alphanumeric.

    >>> import os
    >>> os.listdir(u'.')
    [u'\xc9t\xe9_et_hiver.doc', u'\u041c\u043e\u0441\u043a\u0432\u0430.txt',
    u'\u5f20\u654f.txt']
    >>> import string
    >>> string.letters
    'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'

    Doing
    import locale; locale.setlocale(locale.LC_ALL, '')
    would make string.letters work (for me) with the first file above, but
    that's all.
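
    For comparison, in present-day Python 3 every str is Unicode and str.isalnum() follows the Unicode character database rather than the locale, so a whitelist can be written without naming an alphabet at all. A minimal sketch, assuming that "alphanumeric in the Unicode sense" is an acceptable definition; the function name strip_filename and the extra parameter are hypothetical:

```python
def strip_filename(name, extra='-._'):
    # str.isalnum() is Unicode-aware in Python 3, so accented, Cyrillic,
    # and CJK letters all count as alphanumeric.
    return ''.join(c for c in name if c.isalnum() or c in extra)


for name in ['\xc9t\xe9_et_hiver.doc',
             '\u041c\u043e\u0441\u043a\u0432\u0430?.txt',
             '\u5f20\u654f!.txt']:
    print(strip_filename(name))
```

    All three filenames from the listing above survive intact apart from the '?' and '!' added here for demonstration.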
     
    John Machin, May 4, 2006
    #7
  8. Edward Elliott <nobody@127.0.0.1> wrote:

    > I'm looking for the "best" way to strip a large set of chars from a filename
    > string (my definition of best usually means succinct and readable). I
    > only want to allow alphanumeric chars, dashes, and periods. This is what I
    > would write in Perl (bless me father, for I have sinned...):
    >
    > $filename =~ tr/\w.-//cd, or equivalently
    > $filename =~ s/[^\w.-]//
    >
    > I could just use re.sub like the second example, but that's a bit overkill.
    > I'm trying to figure out if there's a good way to do the same thing with
    > string methods. string.translate seems to do what I want, the problem is
    > specifying the set of chars to remove. Obviously hardcoding them all is a
    > non-starter.


    (untested code, but the general idea should be correct):

    import string

    class KeepOnly(object):
        allchars = ''.join(chr(i) for i in xrange(256))
        identity = string.maketrans('', '')

        def __init__(self, chars_to_keep):
            # Everything that is not kept gets deleted.
            self.chars_to_delete = self.allchars.translate(
                self.identity, chars_to_keep)

        def __call__(self, some_string):
            return some_string.translate(self.identity,
                                         self.chars_to_delete)
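
    The class above is Python 2 code (xrange, string.maketrans, and the two-argument form of translate on byte strings). A rough Python 3 counterpart, sketched here as an assumption rather than anything posted in the thread, works on bytes, where translate() still accepts a set of bytes to delete:

```python
import string


class KeepOnly:
    # All 256 possible byte values, mirroring allchars in the original.
    allbytes = bytes(range(256))

    def __init__(self, bytes_to_keep):
        # Passing None as the table means "delete only": whatever survives
        # removing the kept bytes is the set to delete later.
        self.bytes_to_delete = self.allbytes.translate(None, bytes_to_keep)

    def __call__(self, some_bytes):
        return some_bytes.translate(None, self.bytes_to_delete)


keep = KeepOnly((string.ascii_letters + string.digits + '_.-').encode('ascii'))
print(keep(b'qwe!@#456.--Howzat?'))  # b'qwe456.--Howzat'
```

    The delete set is computed once in __init__, so each call is a single translate() over the input.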


    Alex
     
    Alex Martelli, May 4, 2006
    #8
