Replacing words from strings except 'and' / 'or' / 'and not'

Nico Grubert

Hi there,

Background of this question is:
I want to convert all words <word> except 'and' / 'or' / 'and not' from
a string into '*<word>*'.

Example:
I have the following string:
"test and testing and not perl or testit or example"

I want to convert this string to:
'*test*' and '*testing*' and not '*perl*' or '*testit*' or '*example*'


Any idea how to do this?

Thanks in advance,
Nico
 
Diez B. Roggisch

import sets
KEYWORDS = sets.Set(['and', 'or', 'not'])

query = "test and testing and not perl or testit or example"

def decorate(w):
    if w in KEYWORDS:
        return w
    return "*%s*" % w

query = " ".join([decorate(w.strip()) for w in query.split()])
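For readers on a current Python, where the `sets` module no longer exists, the same idea might be sketched as follows. This is only an illustration, assuming the built-in `frozenset` is an acceptable stand-in for `sets.Set`; the names mirror the post above.

```python
# Python 3 sketch of the approach above. Assumption: the built-in
# frozenset replaces sets.Set (the sets module was removed in Python 3).
KEYWORDS = frozenset(["and", "or", "not"])

def decorate(word):
    """Return keywords unchanged and wrap every other word in asterisks."""
    if word in KEYWORDS:
        return word
    return "*%s*" % word

query = "test and testing and not perl or testit or example"
# split() with no arguments already discards surrounding whitespace.
result = " ".join(decorate(w) for w in query.split())
print(result)
```

Note that this treats every standalone 'not' as a keyword, which is slightly broader than the 'and not' named in the question.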
 
Thomas Guettler

On Thu, 25 Nov 2004 15:43:53 +0100, Nico Grubert wrote:
Hi there,

Background of this question is:
I want to convert all words <word> except 'and' / 'or' / 'and not' from
a string into '*<word>*'.

You can pass a function to re.sub() as the replacement:

import re
ignore=["and", "not", "or"]
test="test and testing and not perl or testit or example"
def repl(match):
    word = match.group(1)
    if word in ignore:
        return word
    else:
        return "*%s*" % word
print re.sub(r'(\w+)', repl, test)

Result: *test* and *testing* and not *perl* or *testit* or *example*
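The snippet above uses the Python 2 print statement. A rough Python 3 equivalent, offered as a sketch rather than as the poster's own code, could look like this; the name `IGNORE` and the use of a set literal are choices made for the illustration.

```python
import re

# Words that must pass through unchanged (a set literal is an assumption
# of this sketch; the original post used a list).
IGNORE = {"and", "or", "not"}

def repl(match):
    """Replacement callback: re.sub() calls this once per match."""
    word = match.group(0)
    return word if word in IGNORE else "*%s*" % word

test = "test and testing and not perl or testit or example"
result = re.sub(r"\w+", repl, test)
print(result)
```

One practical difference from the split-and-join approach: because only `\w+` runs are replaced, punctuation and the original spacing between words are left untouched.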

HTH,
Thomas
 
Jean Brouwers

Just a comment. The w.strip() call in the last line is superfluous in
this particular case. The items in the list returned by the
query.split() call are already stripped. For example, a split() result
looks like this:
['a', 'b', 'c']
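A small check of that claim, written for Python 3. The input string `padded` is made up for the illustration and is not from the original post.

```python
# Hypothetical input with leading, trailing and repeated whitespace.
padded = "  a   b\tc \n"

# With no arguments, split() separates on runs of whitespace and
# never returns empty or padded items.
parts = padded.split()
print(parts)

# Stripping each item therefore changes nothing.
print([p.strip() for p in parts] == parts)

# Splitting on an explicit separator behaves differently and can
# produce empty strings, so strip() is only redundant for bare split().
explicit = padded.split(" ")
print(explicit)
```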


/Jean Brouwers


Diez B. Roggisch said:
import sets
KEYWORDS = sets.Set(['and', 'or', 'not'])

query = "test and testing and not perl or testit or example"

def decorate(w):
    if w in KEYWORDS:
        return w
    return "*%s*" % w

query = " ".join([decorate(w.strip()) for w in query.split()])
 
Mitja

Example:
I have the following string: "test and testing and not perl or testit or
example"

I want to convert this string to:
'*test*' and '*testing*' and not '*perl*' or '*testit*' or '*example*'

A compact, though not very readable, solution:

foo="test and testing and not perl or testit or example"

' '.join([
    ("'*"+w+"*'",w)[w in ('and','or')]
    for w in foo.split()
]).replace("and '*not*'","and not")
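The expression above picks from a two-element tuple by indexing it with a boolean (False is 0, True is 1). As a sketch of the same idea in a slightly more readable form, a conditional expression can replace the tuple indexing; the variable name `quoted` is introduced here only for the example.

```python
foo = "test and testing and not perl or testit or example"

# Quote every word except 'and' / 'or', then undo the quoting of 'not'
# wherever it directly follows 'and', as in the original one-liner.
quoted = " ".join(
    w if w in ("and", "or") else "'*" + w + "*'"
    for w in foo.split()
).replace("and '*not*'", "and not")
print(quoted)
```

Unlike the set-based solutions, this version leaves a 'not' that does not follow 'and' quoted, which matches the 'and not' wording of the question more literally.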
 
Peter Maas

Diez said:
import sets
KEYWORDS = sets.Set(['and', 'or', 'not'])

query = "test and testing and not perl or testit or example"

def decorate(w):
    if w in KEYWORDS:
        return w
    return "*%s*" % w

query = " ".join([decorate(w.strip()) for w in query.split()])

Is there a reason to use sets here? I think lists will do as well.
 
Peter Otten

Peter said:
Diez said:
import sets
KEYWORDS = sets.Set(['and', 'or', 'not'])

query = "test and testing and not perl or testit or example"

def decorate(w):
    if w in KEYWORDS:
        return w
    return "*%s*" % w

query = " ".join([decorate(w.strip()) for w in query.split()])

Is there a reason to use sets here? I think lists will do as well.

Sets represent the concept better, and large lists will significantly slow
down the code (linear vs constant time). Unfortunately, as 2.3's Set is
implemented in Python, you'll have to wait for the 2.4 set builtin to see
the effect for small lists/sets. In the meantime, from a performance point
of view, a dictionary fares best:

$ cat contains.py
from sets import Set

# we need more items than in KEYWORDS above for Set
# to even meet the performance of list :-(
alist = dir([])
aset = Set(alist)
adict = dict.fromkeys(alist)

$ timeit.py -s"from contains import alist, aset, adict" "'not' in alist"
100000 loops, best of 3: 2.21 usec per loop
$ timeit.py -s"from contains import alist, aset, adict" "'not' in aset"
100000 loops, best of 3: 2.2 usec per loop
$ timeit.py -s"from contains import alist, aset, adict" "'not' in adict"
1000000 loops, best of 3: 0.337 usec per loop
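To repeat this kind of measurement on a current interpreter, something along the following lines should work. It is a sketch that assumes Python 3.5 or later (for the `globals` argument of `timeit.timeit`) and uses the built-in `set` in place of `sets.Set`; the numbers it prints will differ from those quoted above and from machine to machine.

```python
import timeit

# Same test data as contains.py above, with the built-in set.
alist = dir([])
aset = set(alist)
adict = dict.fromkeys(alist)

LOOPS = 100000

for name, container in (("list", alist), ("set", aset), ("dict", adict)):
    # 'not' is absent from all three, so the list is scanned to the end.
    seconds = timeit.timeit("'not' in c", globals={"c": container}, number=LOOPS)
    print("%-4s %.3f usec per loop" % (name, seconds / LOOPS * 1e6))
```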

Peter
 
Peter Hansen

Peter said:
Diez said:
import sets
KEYWORDS = sets.Set(['and', 'or', 'not'])
...
def decorate(w):
    if w in KEYWORDS:
        return w
    return "*%s*" % w
Is there a reason to use sets here? I think lists will do as well.

Sets are implemented using dictionaries, so the "if w in KEYWORDS"
part would be O(1) instead of O(n) as with lists...

(I.e. searching a list is a brute-force linear scan, whereas
looking something up in a set is a hash lookup.)
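One way to make the O(n) versus O(1) difference visible without timing anything is to count equality comparisons. The sketch below is an illustration for CPython 3 only: `CountingStr` and `count_comparisons` are hypothetical helpers invented for this example, and it relies on CPython consulting the subclass's `__eq__` during membership tests.

```python
class CountingStr(str):
    """A str subclass that counts how often it is compared for equality."""
    comparisons = 0

    def __eq__(self, other):
        CountingStr.comparisons += 1
        return str.__eq__(self, other)

    # Defining __eq__ would otherwise make the class unhashable.
    __hash__ = str.__hash__


def count_comparisons(container, needle):
    """Return (found, number of __eq__ calls) for 'needle in container'."""
    CountingStr.comparisons = 0
    found = needle in container
    return found, CountingStr.comparisons


words = ["word%d" % i for i in range(1000)]
needle = CountingStr("missing")

# The list compares the needle against every element before giving up.
list_found, list_count = count_comparisons(words, needle)
# The set hashes the needle once and compares only on a hash match.
set_found, set_count = count_comparisons(set(words), needle)

print("list:", list_found, list_count)
print("set: ", set_found, set_count)
```

For a three-element keyword collection the difference is of course tiny, which is what the timing discussion in the rest of the thread is about.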

-Peter
 
Skip Montanaro

Jp> And yet... using sets here is slower in every possible case:
...
Jp> This is a pretty clear example of premature optimization.

I think the set concept is correct. The keywords of interest are best
thought of as an unordered collection. Lists imply some ordering (or at
least that potential). Premature optimization would have been realizing
that scanning a short list of strings was faster than testing for set
membership and choosing to use lists instead of sets.

Skip
 
John Machin

Skip Montanaro said:
Jp> And yet... using sets here is slower in every possible case:
...
Jp> This is a pretty clear example of premature optimization.

I think the set concept is correct. The keywords of interest are best
thought of as an unordered collection. Lists imply some ordering (or at
least that potential). Premature optimization would have been realizing
that scanning a short list of strings was faster than testing for set
membership and choosing to use lists instead of sets.

Skip

Jp scores extra points for pre-maturity by not trying out version 2.4,
by not reading the bit about sets now being built-in, based on dicts,
dicts being one of the timbot's optimise-the-snot-out-of targets ...
herewith some results from a box with a 1.4 GHz Athlon chip running
Windows 2000:

C:\junk>\python24\python \python24\lib\timeit.py -s "from sets import
Set; x = Set(['and', 'or', 'not'])" "None in x"
1000000 loops, best of 3: 1.81 usec per loop

C:\junk>\python24\python \python24\lib\timeit.py -s "from sets import
Set; x = Set(['and', 'or', 'not'])" "None in x"
1000000 loops, best of 3: 1.77 usec per loop

C:\junk>\python24\python \python24\lib\timeit.py -s "x = set(['and',
'or', 'not'])" "None in x"
1000000 loops, best of 3: 0.29 usec per loop

C:\junk>\python24\python \python24\lib\timeit.py -s "x = set(['and',
'or', 'not'])" "None in x"
1000000 loops, best of 3: 0.289 usec per loop

C:\junk>\python24\python \python24\lib\timeit.py -s "x = ['and',
'or', 'not']" "None in x"
1000000 loops, best of 3: 0.804 usec per loop

C:\junk>\python24\python \python24\lib\timeit.py -s "x = ['and',
'or', 'not']" "None in x"
1000000 loops, best of 3: 0.81 usec per loop

C:\junk>\python24\python \python24\lib\timeit.py -s "from sets import
Set; x = Set(['and', 'or', 'not'])" "'and' in x"
1000000 loops, best of 3: 1.69 usec per loop

C:\junk>\python24\python \python24\lib\timeit.py -s "x = set(['and',
'or', 'not'])" "'and' in x"
1000000 loops, best of 3: 0.243 usec per loop

C:\junk>\python24\python \python24\lib\timeit.py -s "x = set(['and',
'or', 'not'])" "'and' in x"
1000000 loops, best of 3: 0.245 usec per loop

C:\junk>\python24\python \python24\lib\timeit.py -s "x = ['and',
'or', 'not']" "'and' in x"
1000000 loops, best of 3: 0.22 usec per loop

C:\junk>\python24\python \python24\lib\timeit.py -s "x = ['and',
'or', 'not']" "'and' in x"
1000000 loops, best of 3: 0.22 usec per loop

C:\junk>\python24\python \python24\lib\timeit.py -s "x = set(['and',
'or', 'not'])" "'not' in x"
1000000 loops, best of 3: 0.257 usec per loop

C:\junk>\python24\python \python24\lib\timeit.py -s "x = ['and',
'or', 'not']" "'not' in x"
1000000 loops, best of 3: 0.34 usec per loop

tee hee ...
 
