Delete all disallowed characters

Abandoned

Hi,
I want to delete all disallowed characters from my text.
I use this function:

def clear(s1=""):
    if s1:
        allowed = [
            u'+', u'0', u'1', u'2', u'3', u'4', u'5', u'6', u'7', u'8', u'9', u' ',
            u'Ş', u'ş', u'Ö', u'ö', u'Ü', u'ü', u'Ç', u'ç', u'İ', u'ı', u'Ğ', u'ğ',
            'A', 'C', 'B', 'E', 'D', 'G', 'F', 'I', 'H', 'K', 'J', 'M', 'L', 'O', 'N',
            'Q', 'P', 'S', 'R', 'U', 'T', 'W', 'V', 'Y', 'X', 'Z', 'a', 'c', 'b',
            'e', 'd', 'g', 'f', 'i', 'h', 'k', 'j', 'm', 'l', 'o', 'n', 'q', 'p',
            's', 'r', 'u', 't', 'w', 'v', 'y', 'x', 'z']
        s1 = "".join(ch for ch in s1 if ch in allowed)
    return s1

My problem is that this function replaces each disallowed character with ""
(nothing), but I want it replaced with " " (a space).
For example:
input: Exam%^^ple
output: Exam ple
That is the output I want, but my code outputs "Example".
How can I do this quickly? The text is very long.
 
Adam Donahue

Something like:

import re

def clear(s, allowed=[], case_sensitive=True):
    flags = ''
    if not case_sensitive:
        flags = '(?i)'
    return re.sub(flags + '[^%s]' % ''.join(allowed), ' ', s)

And call:

clear( '123abcdefgABCdefg321', [ 'a', 'b', 'c' ] )
clear( '123abcdefgABCdefg321', [ 'a', 'b', 'c' ], False )

And so forth. Or just use re directly!

(This implementation is imperfect in that it's possible to hack the
regular expression, and it may break with mismatched '[]' characters,
but the idea is there.)

Adam
 
Michal Bozon

The filter part of a list comprehension does not allow "else",
but a conditional expression can be used in a similar form:

s2 = ""
for ch in s1:
    s2 += ch if ch in allowed else " "

(Maybe this could be written more nicely.)
 
Tim Chase

Any reason your alphabet is oddly entered?

You can speed it up by using a set. You can also tweak your join
to choose a space if the letter isn't one of your allowed letters:

import string
allowed = set(
    string.letters +
    string.digits +
    ' +' +
    u'ŞşÖöÜüÇçİıĞğ')
def clear(s):
    return "".join(
        letter in allowed and letter or " "
        for letter in s)

In Python 2.5, there's a ternary operator syntax something like
the following (which I can't test, as I'm not at a PC with 2.5
installed)

def clear(s):
    return "".join(
        letter
        if letter in allowed
        else " "
        for letter in s)

which some find more readable... I don't particularly care for
either syntax. The latter is 2.5-specific and makes more sense,
but still isn't as readable as I would have liked, while the
former works in versions of Python back to at least 2.2, which I
still have access to, and is a well documented idiom/hack.

-tkc
 
Steven D'Aprano

the list comprehension does not allow "else", but it can be used in a
similar form:

s2 = ""
for ch in s1:
    s2 += ch if ch in allowed else " "

(maybe this could be written more nicely)

Repeatedly adding strings together in this way is about the most
inefficient, slow way of building up a long string. (Although I'm sure
somebody can come up with a worse way if they try hard enough.)

Even though recent versions of CPython have a local optimization that
improves the performance hit of string concatenation somewhat, it is
better to use ''.join() rather than add many strings together:

s2 = []
for ch in s1:
    s2.append(ch if (ch in allowed) else " ")
s2 = ''.join(s2)

Although even that doesn't come close to the efficiency and speed of
string.translate() and string.maketrans(). Try to find a way to use them.

Here is one way, for ASCII characters.

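import string

allowed = "abcdef"
# identity table: a string containing all 256 byte values
all_chars = string.maketrans('', '')
not_allowed = ''.join(c for c in all_chars if c not in allowed)
# map every disallowed character to a space
table = string.maketrans(not_allowed, ' '*len(not_allowed))
new_string = string.translate(old_string, table)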
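
For readers on Python 3, where string.maketrans no longer exists, the same byte-table technique is available as bytes.maketrans and bytes.translate. A rough sketch under that assumption (the names ALLOWED, NOT_ALLOWED, TABLE and clear_bytes are invented for illustration); like the snippet above, it only suits single-byte data such as ASCII:

```python
ALLOWED = b"abcdef"

# Every byte value 0..255 that is not explicitly allowed.
NOT_ALLOWED = bytes(b for b in range(256) if b not in ALLOWED)

# 256-entry table that maps each disallowed byte to a space.
TABLE = bytes.maketrans(NOT_ALLOWED, b" " * len(NOT_ALLOWED))

def clear_bytes(data):
    """Return data with every disallowed byte replaced by a space."""
    return data.translate(TABLE)
```

The table is built once at import time, so each call is a single pass over the data.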
 
Steven D'Aprano

You don't need to make allowed a list. Make it a string, it is easier to
read.

allowed = u'+0123456789 ŞşÖöÜüÇçİıĞğ' \
u'ACBEDGFIHKJMLONQPSRUTWVYXZacbedgfihkjmlonqpsrutwvyxz'

....And my problem this function replace the character to "" but i want
to " "
for example:
input: Exam%^^ple
output: Exam ple


I think the most obvious way is this:

def clear(s):
    allowed = u'+0123456789 ŞşÖöÜüÇçİıĞğ' \
        u'ACBEDGFIHKJMLONQPSRUTWVYXZacbedgfihkjmlonqpsrutwvyxz'
    L = []
    for ch in s:
        if ch in allowed: L.append(ch)
        else: L.append(" ")
    return ''.join(L)


Perhaps a better way is to use a translation table:

def clear(s):
    allowed = u'+0123456789 ŞşÖöÜüÇçİıĞğ' \
        u'ACBEDGFIHKJMLONQPSRUTWVYXZacbedgfihkjmlonqpsrutwvyxz'
    not_allowed = [i for i in range(0x110000) if unichr(i) not in allowed]
    table = dict(zip(not_allowed, u" "*len(not_allowed)))
    return s.translate(table)

Even better is to build the translation table lazily and cache it, so it is
calculated only once, the first time it is needed:

TABLE = None
def build_table():
    global TABLE
    if TABLE is None:
        allowed = u'+0123456789 ŞşÖöÜüÇçİıĞğ' \
            u'ACBEDGFIHKJMLONQPSRUTWVYXZacbedgfihkjmlonqpsrutwvyxz'
        not_allowed = \
            [i for i in range(0x110000) if unichr(i) not in allowed]
        TABLE = dict(zip(not_allowed, u" "*len(not_allowed)))
    return TABLE

def clear(s):
    return s.translate(build_table())


The first time you call clear(), it will take a second or so to build the
translation table, but then it will be very fast.
 
Michal Bozon

(I was wrong, as Tim Chase has shown.)
Repeatedly adding strings together in this way is about the most
inefficient, slow way of building up a long string. (Although I'm sure
somebody can come up with a worse way if they try hard enough.)

Even though recent versions of CPython have a local optimization that
improves the performance hit of string concatenation somewhat, it is
better to use ''.join() rather than add many strings together:

String appending is not tragically slower.
For strings tens of MB long, the difference
in speed for me is a few tens of percent,
so it is not several times slower.

s2 = []
for ch in s1:
    s2.append(ch if (ch in allowed) else " ")
s2 = ''.join(s2)

Although even that doesn't come close to the efficiency and speed of
string.translate() and string.maketrans(). Try to find a way to use them.

Here is one way, for ASCII characters.

allowed = "abcdef"
all = string.maketrans('', '')
not_allowed = ''.join(c for c in all if c not in allowed)
table = string.maketrans(not_allowed, ' '*len(not_allowed))
new_string = string.translate(old_string, table)

Nice, I did not know that string translation exists. But
Abandoned has defined the allowed characters, so making
a translation table for the disallowed characters,
which would cover nearly the complete Unicode character table,
would be inefficient.
 
Zentrader

allowed =
[u'+',u'0',u'1',u'2',u'3',u'4',u'5',u'6',u'7',u'8',u'9',u' ', u'Ş',
u'ş', u'Ö', u'ö', u'Ü', u'ü', u'Ç', u'ç', u'İ', u'ı', u'Ğ', u'ğ', 'A',
'C', 'B', 'E', 'D', 'G', 'F', 'I', 'H', 'K', 'J', 'M', 'L', 'O', 'N',
'Q', 'P', 'S', 'R', 'U', 'T', 'W', 'V', 'Y', 'X', 'Z', 'a', 'c', 'b',
'e', 'd', 'g', 'f', 'i', 'h', 'k', 'j', 'm', 'l', 'o', 'n', 'q', 'p',
's', 'r', 'u', 't', 'w', 'v', 'y', 'x', 'z']

Using ord() may speed things up. If you want to include A through Z,
for example, you can use

ord_chr = ord(chr)  ## convert once
if (ord_chr > 64) and (ord_chr < 91):  ## 'A' through 'Z' (on a U.S. English system)

and won't have to check every letter against an 'include it' string or
list. Lower case "a" through "z" would be a range also, and u'0'
through u'9' should be as well. That would leave a few remaining
characters that may have to be searched for if their code points are
not contiguous.
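
The range check described above might look like the following sketch. It assumes Python 3, and the names EXTRA, is_allowed and clear_ranges are hypothetical; the characters that do not form a contiguous range fall back to a set lookup. Whether this actually beats a plain set lookup would need to be measured.

```python
# Allowed characters that do not form a contiguous range of code points.
EXTRA = set("+ ŞşÖöÜüÇçİıĞğ")

def is_allowed(ch):
    code = ord(ch)  # convert once
    return (
        65 <= code <= 90        # 'A'..'Z'
        or 97 <= code <= 122    # 'a'..'z'
        or 48 <= code <= 57     # '0'..'9'
        or ch in EXTRA
    )

def clear_ranges(s):
    # Keep allowed characters, replace everything else with a space.
    return "".join(ch if is_allowed(ch) else " " for ch in s)
```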
 
Steven D'Aprano

String appending is not tragically slower, for strings long tens of MB,
the speed makes me a difference in few tens of percents, so it is not
several times slower, or so

That is a half-truth.

Because strings are immutable, when you concat two strings Python has to
duplicate both of them. This leads to quadratic-time behaviour, where the
time taken is proportional to the SQUARE of the number of characters.
This rapidly becomes very slow.

*However*, as of Python 2.4, CPython has an optimization that can detect
some types of string concatenation and do them in-place, giving (almost)
linear-time performance. But that depends on:

(1) the implementation: it only works for CPython, not Jython or
IronPython or other Python implementations;

(2) the version: it is an implementation detail introduced in Python 2.4,
and is not guaranteed to remain in future versions;

(3) the specific details of how you concat strings: s=s+t will get the
optimization, but s=t+s or s=s+t1+t2 will not.


In other words: while having that optimization in place is a win, you
cannot rely on it. If you care about portable code, the advice to use
join() still stands.


[snip]
Nice, I did not know that string translation exists, but Abandoned have
defined allowed characters, so making a translation table for the
unallowed characters, which would take nearly complete unicode character
table would be inefficient.


The cost of building the unicode translation table is minimal: about 1.5
seconds ONCE, and it is only a couple of megabytes of data:

>>> allowed = u'+0123456789 ŞşÖöÜüÇçİıĞğ' \
... u'ACBEDGFIHKJMLONQPSRUTWVYXZacbedgfihkjmlonqpsrutwvyxz'
>>> timer = timeit.Timer(
...     'not_allowed = [i for i in range(0x110000) if '
...     'unichr(i) not in allowed]; '
...     'TABLE = dict(zip(not_allowed, u" "*len(not_allowed)))',
...     'from __main__ import allowed')

[18.267689228057861, 16.495684862136841, 16.785034894943237]


The translate method runs about ten times faster than anything you can
write in pure Python. If Abandoned has got as much data as he keeps
saying he has, he will save a lot more than 1.5 seconds by using
translate compared to relatively slow Python code.

On the other hand, if he is translating only small strings, with
different sets of allowed chars each time, then there is no advantage to
using the translate method.

And on the third hand... I can't help but feel that the *right* solution
to Abandoned's problem is to use encode/decode with the appropriate codec.
 
Paul Hankin

The translate method runs about ten times faster than anything you can
write in pure Python. If Abandoned has got as much data as he keeps
saying he has, he will save a lot more than 1.5 seconds by using
translate compared to relatively slow Python code.

String translate runs 10 times faster than pure Python; unicode
translate isn't anywhere near as fast, as it has to look up each
character in the mapping dict.

import timeit

timer = timeit.Timer("a.translate(m)",
    setup="a = u'abc' * 1000; m = dict((x, x) for x in range(256))")

print timer.repeat(3, 10000)

[2.4009871482849121, 2.4191598892211914, 2.3641388416290283]


timer = timeit.Timer("a.translate(m)",
    setup="a = 'abc' * 1000; m = ''.join(chr(x) for x in range(256))")

print timer.repeat(3, 10000)

[0.12261486053466797, 0.12225103378295898, 0.12217879295349121]


Also, the unicode translation dict as given doesn't work on
characters that aren't allowed: it should map ints to ints rather
than ints to strings.

Anyway, there's no need to pay the cost of building a full mapping
dict when most of the entries are the same. Something like this can
work:

from collections import defaultdict

def clear(message):
    allowed = u'abc...'
    clear_translate = defaultdict(lambda: ord(u' '))
    clear_translate.update((c, c) for c in map(ord, allowed))
    return message.translate(clear_translate)
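
The same idea should carry over to Python 3, where str.translate looks each character up by code point. The sketch below is one possible variant, not a drop-in replacement: the names ALLOWED, _SpaceTable, TABLE and clear_translate are invented here. It builds the table once at import time and uses a dict subclass with __missing__, so unknown characters map to a space without the table growing on every miss:

```python
ALLOWED = (
    "+0123456789 ŞşÖöÜüÇçİıĞğ"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
)

class _SpaceTable(dict):
    """Mapping of code points; anything not stored maps to a space."""
    def __missing__(self, code_point):
        return ord(" ")

# Allowed characters map to themselves; built once, reused by every call.
TABLE = _SpaceTable((ord(ch), ord(ch)) for ch in ALLOWED)

def clear_translate(message):
    return message.translate(TABLE)
```

Only the allowed characters are ever stored, which also keeps the "include only what you are told to include" logic explicit.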
 
Zentrader

....And my problem this function replace the character to "" but i
want to " "
for example:
input: Exam%^^ple
output: Exam ple
I want to this output but in my code output "Example"

I don't think anyone has addressed this yet. It would be

if chr in allowed_set:
    output_string += chr
else:
    output_string += " "

This is just a general example of the code to use. You probably would not
use 'output_string +=', but whatever form the implementation takes, you
would use an if/else.

Nice, I did not know that string translation exists, but Abandoned have
defined allowed characters, so making a translation table for the
unallowed characters, which would take nearly complete unicode character
table would be inefficient.

And this is also bad logic. If you use a 'not allowed', then
everything else will be included by default. Any errors in the 'not
allowed' or deviations that use an additional unicode character will
be included by default. You want to have the program include only
what it is told to include IMHO.
 
