Most direct way to strip unprintable characters out of a string?

Steve Bergman

When sanitizing data coming in from HTML forms, I'm doing this (lifted
from the Python Cookbook):

from string import maketrans, translate, printable
allchars = maketrans('','')
delchars = translate(allchars, allchars, printable)
input_string = translate(input_string, allchars, delchars)

Which is OK. But it seems like there should be a more straightforward way
that I just haven't figured out. Is there?

Thanks,
Steve Bergman
 
George Sakkis

Steve Bergman said:
When sanitizing data coming in from HTML forms, I'm doing this (lifted
from the Python Cookbook):

from string import maketrans, translate, printable
allchars = maketrans('','')
delchars = translate(allchars, allchars, printable)
input_string = translate(input_string, allchars, delchars)

Which is OK. But it seems like there should be a more straightforward way
that I just haven't figured out. Is there?

If by straightforward you mean one-liner, there is:
''.join(c for c in input_string if c in string.printable)

If you care about performance though, string.translate is faster; as always, the best way to decide
on a performance issue is to profile the alternatives on your data and see if it's worth going for
the fastest one at the expense of readability.
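
For readers on Python 3, the one-liner can be wrapped in a small helper. This is only a sketch: the name keep_printable is hypothetical, and "printable" here means membership in string.printable (ASCII letters, digits, punctuation, and whitespace), which is an assumption carried over from the thread rather than a general definition.

```python
import string

# Build the lookup set once; membership tests on a set are cheaper than
# scanning the string.printable string for every character.
_PRINTABLE = set(string.printable)

def keep_printable(text):
    """Return text with every character outside string.printable removed."""
    return ''.join(c for c in text if c in _PRINTABLE)
```

Note that string.printable includes whitespace such as tab, newline, vertical tab, and form feed, so those characters survive the filter.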

George
 
Steve Bergman

George said:
If by straightforward you mean one-liner, there is:
''.join(c for c in input_string if c in string.printable)

If you care about performance though, string.translate is faster; as always, the best way to decide
on a performance issue is to profile the alternatives on your data and see if it's worth going for
the fastest one at the expense of readability.
Thank you for the reply. I was really thinking of some function in the
standard library like:

s = stripUnprintable(s)

When I learned php, I more or less took the route of using whatever I
found that 'worked'. In learning Python, I'm trying to take my time and
learn the 'right' (that's pronounced 'beautiful') way of doing things.

As it stands, I've stashed the string.translate code in a short function
with a comment explaining what it does and how. I mainly didn't want to
use that if there was some trivial built-in that everyone else uses.

Thanks Again,
Steve
 
George Sakkis

Steve Bergman said:
Thank you for the reply. I was really thinking of some function in the
standard library like:

s = stripUnprintable(s)

When I learned php, I more or less took the route of using whatever I
found that 'worked'. In learning Python, I'm trying to take my time and
learn the 'right' (that's pronounced 'beautiful') way of doing things.

As it stands, I've stashed the string.translate code in a short function
with a comment explaining what it does and how. I mainly didn't want to
use that if there was some trivial built-in that everyone else uses.

No, there's no stripUnprintable in a standard module AFAIK, and that's a good thing; if every
little function that anyone might ever want made it into the standard library, the language would be
overwhelming.

Make sure you calculate the unprintable characters only the first time the function is called, not
on every call. Here's a way to encapsulate this in the function itself, without polluting the global
namespace with allchars and delchars:

import string

def stripUnprintable(input_string):
    try:
        filterUnprintable = stripUnprintable.filter
    except AttributeError:  # only the first time it is called
        allchars = string.maketrans('', '')
        delchars = allchars.translate(allchars, string.printable)
        filterUnprintable = stripUnprintable.filter = (
            lambda input: input.translate(allchars, delchars))
    return filterUnprintable(input_string)
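
The code above is written for Python 2, where string.maketrans and the two-argument str.translate exist. A rough Python 3 equivalent is sketched below, assuming the input is already a decoded str; the names strip_unprintable, _NON_PRINTABLE, and _DELETE_TABLE are hypothetical. The table is built once at import time instead of being cached on the function object.

```python
import string

# Characters in the range 0-255 that are not in string.printable.
_NON_PRINTABLE = ''.join(
    chr(i) for i in range(256) if chr(i) not in string.printable)

# With three arguments, str.maketrans maps every character of the third
# argument to None, which str.translate treats as "delete".
_DELETE_TABLE = str.maketrans('', '', _NON_PRINTABLE)

def strip_unprintable(text):
    """Delete code points 0-255 that are not in string.printable."""
    return text.translate(_DELETE_TABLE)
```

One difference from the Python 2 version: code points above 255 are not in the table, so they pass through unchanged. Whether that is desirable depends on the application.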

George
 
Fredrik Lundh

George said:
No, there's no stripUnprintable in a standard module AFAIK, and
that's a good thing; if every little function that anyone might ever want
made it into the standard library, the language would be overwhelming.

...and if there was a stripUnprintable function in the standard library that
was based on C's mostly brain-dead locale model, US programmers
would produce even more web applications that just don't work for non-
US users...

("sanitizing" HTML data by running filters over encoded 8-bit data is hardly
ever the right thing to do...)

</F>
 
Steve Bergman

Fredrik said:
("sanitizing" HTML data by running filters over encoded 8-bit data is hardly
ever the right thing to do...)
I'm very much open to suggestions as to the right way to do this. I'm
working on this primarily as a learning project and security is my
motivation for wanting to strip the unprintables.

Is there a better way? (This is a mod_python app, just for reference.)

Thanks,
Steve
 
Diez B. Roggisch

Steve said:
I'm very much open to suggestions as to the right way to do this. I'm
working on this primarily as a learning project and security is my
motivation for wanting to strip the unprintables.

Is there a better way? (This is a mod_python app, just for reference.)

Deal with encodings properly. If characters are "unprintable", that usually
means you have an encoding mismatch - your output device (usually a
terminal, but a browser is a sort of device too) can't make sense of
certain byte codes - and pukes on you. But these byte codes come from
somewhere, and aren't "random".

So I suggest you read up on the subjects of unicode and encodings - and this
in the context of python, of course :)

BTW: if that HTML were XHTML, it wouldn't be valid if the contents didn't
match the encoding specified in the header - which doesn't mean that
the two never mismatch because of misunderstandings on the programmer's
side.
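
As one possible reading of "deal with encodings properly", the sketch below (Python 3) decodes the submitted bytes with the encoding the form is expected to use, and only then filters at the character level. The function name clean_form_value, the default of UTF-8, and the choice to drop Unicode control characters (category "Cc") while keeping tab, newline, and carriage return are all assumptions made for illustration, not something specified in the thread.

```python
import unicodedata

def clean_form_value(raw_bytes, encoding="utf-8"):
    """Decode raw form bytes, then drop control characters."""
    # Decoding is strict by default, so bytes that do not match the
    # expected encoding raise UnicodeDecodeError instead of slipping through.
    text = raw_bytes.decode(encoding)
    # Filter decoded characters, not encoded bytes: remove control
    # characters (category "Cc") but keep common whitespace.
    return "".join(
        ch for ch in text
        if ch in "\t\n\r" or unicodedata.category(ch) != "Cc")
```

Because the filter runs on decoded text, non-ASCII letters such as accented characters are preserved, which is the case a byte-level string.printable filter would get wrong.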

Diez
 
