replace random matches of regexp

gry

[Python 2.7]
I have a body of text (~1MB) that I need to modify. I need to look
for matches of a regular expression and replace a random selection of
those matches with a new string. There may be several matches on any
line, and a random selection of them should be replaced. The
probability of replacement should be adjustable. Performance is not
an issue. For example, if I have:

SELECT max(PUBLIC.TT.I) AS SEL_0 FROM (SCHM.T RIGHT OUTER JOIN
PUBLIC.TT ON (SCHM.T.I IS NULL)) WHERE (NOT(NOT((power(PUBLIC.TT.F,
PUBLIC.TT.F) = cast(ceil(( SELECT 22 AS SEL_0 FROM
(PUBLIC.TT AS PUBLIC_TT_0 JOIN PUBLIC.TT AS PUBLIC_TT_1 ON (ceil(0.46)
=sin(PUBLIC_TT_1.F))) WHERE ((zeroifnull(PUBLIC_TT_0.I) =
sqrt((0.02 + PUBLIC_TT_1.F))) OR

I might want to replace '(max|min|cos|sqrt|ceil)' with "public.\1", but
only with probability 0.7. I looked and looked for some computed
thing in re's that I could stick an expression into, but could not find
one (for good reasons, I know).
Any ideas how to do this? I would go for simple, even if it's wildly
inefficient, though elegance is always admired...
 
MRAB

re.sub can accept a function as the replacement. It'll call the
function when it finds a match, and the string returned by that
function will be the replacement.

You could write a function which returns either the original substring
which was found or a different substring.
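A minimal sketch of that idea, not taken from the thread: the names `maybe_prefix` and `replace_some` and the `rng` parameter are made up for illustration, and the pattern and "public." prefix are the ones from the question. The replacement function receives the match object and decides per match whether to change it.

```python
import random
import re

# Pattern from the question, with the closing parenthesis added.
PATTERN = re.compile(r"(max|min|cos|sqrt|ceil)")

def maybe_prefix(match, probability, rng=random):
    # Hypothetical helper: prefix the matched name with "public."
    # with the given probability, otherwise return it unchanged.
    if rng.random() < probability:
        return "public." + match.group(1)
    return match.group(0)

def replace_some(text, probability=0.7, rng=random):
    # re.sub calls the function once per match and inserts
    # whatever string it returns.
    return PATTERN.sub(lambda m: maybe_prefix(m, probability, rng), text)

print(replace_some("SELECT max(a), ceil(b), sqrt(c)", probability=0.7))
```

Because random.random() returns a value in [0, 1), a probability of 1.0 replaces every match and 0.0 replaces none, which is a quick way to check the wiring.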
 
André Malo

* gry said:
I might want to replace '(max|min|cos|sqrt|ceil)' with "public.\1", but
only with probability 0.7. I looked and looked for some computed
thing in re's that I could stick an expression into, but could not find
one (for good reasons, I know).
Any ideas how to do this? I would go for simple, even if it's wildly
inefficient, though elegance is always admired...

You can run a re.sub() with a function as replacement value. This function
then either returns the replacement or the original match based on a
weighted random value.

nd
 
Peter Otten

gry said:
I might want to replace '(max|min|cos|sqrt|ceil)' with "public.\1", but
only with probability 0.7. I looked and looked for some computed
thing in re's that I could stick an expression into, but could not find
one (for good reasons, I know).
Any ideas how to do this? I would go for simple, even if it's wildly
inefficient, though elegance is always admired...

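import random
import re

def make_sub(text, probability):
    # Build a replacement function for re.sub that prefixes the match
    # with `text` with the given probability.
    def sub(match):
        if random.random() < probability:
            return text + match.group(1)
        return match.group(1)
    return sub

# `sample` is the text to modify, e.g. the SQL from the question.
print(re.compile("(max|min|cos|sqrt|ceil)").sub(make_sub(r"public.", .7),
                                                 sample))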

or even

def make_sub(text, probability):
    def sub(match):
        if random.random() < probability:
            # Expand backreferences such as \1 in `text` by hand.
            def group_sub(m):
                return match.group(int(m.group(1)))
            return re.compile(r"[\\](\d+)").sub(group_sub, text)
        return match.group(0)
    return sub

print(re.compile("(max|min|cos|sqrt|ceil)").sub(make_sub(r"public.\1", .7),
                                                 sample))
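As a possible simplification of the second version (a sketch, not part of the original post): match objects have an expand() method that performs the same backslash substitution on a template as re.sub does, so the hand-written backreference expansion may not be needed. The `rng` parameter is an addition for illustration only.

```python
import random
import re

def make_sub(template, probability, rng=random):
    # `template` may contain backreferences such as \1 or \g<1>.
    def sub(match):
        if rng.random() < probability:
            # Match.expand substitutes the groups of this match
            # into the template, as re.sub would.
            return match.expand(template)
        # Leave this occurrence untouched.
        return match.group(0)
    return sub

pattern = re.compile(r"(max|min|cos|sqrt|ceil)")

# With probability 1.0 every match is replaced.
print(pattern.sub(make_sub(r"public.\1", 1.0), "sqrt(x) + cos(y)"))
# → public.sqrt(x) + public.cos(y)
```

Passing a seeded random.Random instance as `rng` should make the selection of replaced matches repeatable between runs.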
 
gry

To elaborate (always give an example of the desired output...), I would hope to
get something like:

SELECT public.max(PUBLIC.TT.I) AS SEL_0 FROM (SCHM.T RIGHT OUTER JOIN
PUBLIC.TT ON (SCHM.T.I IS NULL)) WHERE (NOT(NOT((power(PUBLIC.TT.F,
PUBLIC.TT.F) = cast(ceil(( SELECT 22 AS SEL_0 FROM
(PUBLIC.TT AS PUBLIC_TT_0 JOIN PUBLIC.TT AS PUBLIC_TT_1 ON (public.ceil(0.46)
=public.sin(PUBLIC_TT_1.F))) WHERE ((zeroifnull(PUBLIC_TT_0.I) =
public.sqrt((0.02 + PUBLIC_TT_1.F))) OR

Notice that the 'ceil' on the third line did not get changed.
 
