Using a function for regular expression substitution

naugiedoggie · Aug 29, 2010

Hello,

I'm having a problem with using a function as the replacement in
re.sub().

Here is the function:

def normalize(s) :
    return urllib.quote(string.capwords(urllib.unquote(s.group('provider'))))

The purpose of this function is to proper-case the words contained in
a URL query string parameter value. I'm massaging data in web log
files.

In case it matters, the regex pattern looks like this:

provider_pattern = r'(?P<search>Search_Provider)=(?P<provider>[^&]+)'

The call looks like this:

<code>
re.sub(matcher,normalize,line)
</code>

Where line is the log line entry.

What I get back is first the entire line with the normalization of the
parameter value, but missing the parameter; then appended to that
string is the entire line again, with the query parameter back in
place pointing to the normalized string.

<code>
if line.find('Search_Type') != -1 and line.find('Search_Provider') != -1 :
    re.sub(provider_matcher,normalize,line)
    print line,'\n'
</code>

The output of the print is like this:

<code>
'log-entry parameter=value&normalized-string&parameter=value\n
log-entry parameter=value&parameter=normalized-string&parameter=value'
</code>

The goal is to massage the specified entries in the log files and
write the entire log back into a new file. The new file has to be
exactly the same as the old one, with the exception of the entries
I've altered with my function.

No doubt I'm doing something trivially wrong, but I've tried to
reproduce the structure as defined in the documentation.

Thanks.

mp

Roy Smith · Aug 29, 2010

naugiedoggie said:
Hello,

I'm having a problem with using a function as the replacement in
re.sub().

Here is the function:

def normalize(s) :
    return urllib.quote(string.capwords(urllib.unquote(s.group('provider'))))

I read through this entire post, and I'm not quite sure what you're
asking. May I suggest that you need to break this down into smaller
pieces and find a minimal test case.

I'm guessing this is a problem with the regex processing. To prove
that, strip away everything else and verify that part in isolation.
Compile your regex, and match it against a string that you expect it to
match. Then, examine the groups returned by the match object. If
they're not what you expect, then re-post your question, with just this
minimal test code.
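
A minimal sketch of the isolation test suggested here, assuming Python 3. The pattern is the one from the original post; the sample string is made up for the test and is not a real log entry.

```python
import re

# Pattern copied from the original post.
provider_pattern = r'(?P<search>Search_Provider)=(?P<provider>[^&]+)'
matcher = re.compile(provider_pattern)

# Hypothetical log fragment, invented for this test.
sample = 'GET /find?Search_Type=web&Search_Provider=chen%20lab&p=value'

m = matcher.search(sample)
whole_match = m.group(0)   # everything the pattern matched
groups = m.groupdict()     # the named groups only

print(whole_match)
print(groups)
```

Comparing `group(0)` with the named groups shows at a glance how much of the line the pattern consumes, which is the part re.sub() will replace.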

Terry Reedy · Aug 29, 2010

naugiedoggie said:
Hello,

I'm having a problem with using a function as the replacement in
re.sub().

Here is the function:

def normalize(s) :
    return urllib.quote(string.capwords(urllib.unquote(s.group('provider'))))

To debug your problem, I would start with print(s) in the function and
if still not clear, unnest the expression and print intermediate results.
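
One way this debugging advice might look, as a sketch only. It assumes Python 3, where `quote` and `unquote` live in `urllib.parse` rather than `urllib`; the input string is invented for the example.

```python
import re
import string
from urllib.parse import quote, unquote

provider_pattern = r'(?P<search>Search_Provider)=(?P<provider>[^&]+)'

def normalize(s):
    # Same steps as the one-line version, unnested so each result can be printed.
    print('match object:', s)
    raw = s.group('provider')
    print('raw value:   ', raw)
    decoded = unquote(raw)
    print('decoded:     ', decoded)
    capped = string.capwords(decoded)
    print('capitalized: ', capped)
    result = quote(capped)
    print('re-encoded:  ', result)
    return result

# Hypothetical query string, invented for this test.
out = re.sub(provider_pattern, normalize, 'a=1&Search_Provider=chen%20lab&p=value')
print(out)
```

Printing the match object itself shows its span and matched text, which makes it easy to compare what was matched with what the function returns.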

MRAB · Aug 29, 2010

naugiedoggie said:
Hello,

I'm having a problem with using a function as the replacement in
re.sub().

Here is the function:

def normalize(s) :
    return urllib.quote(string.capwords(urllib.unquote(s.group('provider'))))

This normalises the provider and returns only that, and none of the
remainder of the string.

I think you might want this:

def normalize(s):
    return s[ : s.start('provider')] + \
        urllib.quote(string.capwords(urllib.unquote(s.group('provider')))) + \
        s[s.start('provider') : ]

It returns the part before the provider, followed by the normalised
provider, and then the part after the provider.

naugiedoggie said:

The purpose of this function is to proper-case the words contained in
a URL query string parameter value. I'm massaging data in web log
files.

In case it matters, the regex pattern looks like this:

provider_pattern = r'(?P<search>Search_Provider)=(?P<provider>[^&]+)'

The call looks like this:

<code>
re.sub(matcher,normalize,line)
</code>

Where line is the log line entry.

What I get back is first the entire line with the normalization of the
parameter value, but missing the parameter; then appended to that
string is the entire line again, with the query parameter back in
place pointing to the normalized string.

if line.find('Search_Type') != -1 and line.find('Search_Provider') != -1 :

These can be replaced by:

if 'Search_Type' in line and 'Search_Provider' in line:

re.sub(provider_matcher,normalize,line)

re.sub is returning the result, which you're throwing away!

line = re.sub(provider_matcher,normalize,line)
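
A short illustration of this point, assuming Python 3 and a made-up input string: re.sub() leaves its argument alone and hands back a new string, so the result has to be bound to a name.

```python
import re

line = 'a=1&b=2'

# The substitution result exists only in the return value.
returned = re.sub(r'b=\d+', 'b=9', line)

# `line` itself is untouched, because Python strings are immutable.
unchanged = line

# Rebinding the name is what makes the change visible to later code.
line = re.sub(r'b=\d+', 'b=9', line)
```

The same applies to string methods such as `str.replace()`, which also return a new string.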

naugiedoggie · Aug 30, 2010

On 29/08/2010 15:22, naugiedoggie wrote:

I'm having a problem with using a function as the replacement in
re.sub().
Here is the function:
def normalize(s) :
    return urllib.quote(string.capwords(urllib.unquote(s.group('provider'))))

MRAB said:

This normalises the provider and returns only that, and none of the
remainder of the string.

I think you might want this:

def normalize(s):
    return s[ : s.start('provider')] + \
        urllib.quote(string.capwords(urllib.unquote(s.group('provider')))) + \
        s[s.start('provider') : ]

It returns the part before the provider, followed by the normalised
provider, and then the part after the provider.

Hello,

Thanks for the reply.

There must be something basic about the re.sub() function that I'm
missing. The documentation shows this example:

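<code>
>>> def dashrepl(matchobj):
...     if matchobj.group(0) == '-': return ' '
...     else: return '-'
>>> re.sub('-{1,2}', dashrepl, 'pro----gram-files')
'pro--gram files'
>>> re.sub(r'\sAND\s', ' & ', 'Baked Beans And Spam', flags=re.IGNORECASE)
'Baked Beans & Spam'
</code>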

According to the doc, the modifying function takes one parameter, the
MatchObject. The re.sub function takes only a compiled regex object
or a pattern, generates a MatchObject from that object/pattern and
passes the MatchObject to the given function. Notice that in the
examples, the re.sub() returns the entire line, with the changes made.
But the function itself returns only the change. What is happening
for me is that, if I have a line that contains
&Search_Provider=chen&p=value, the processed line ends up with
&Chen&p=value.
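
The documentation example can be instrumented to show what the replacement function is handed. This is a sketch assuming Python 3; the `seen` list is added here purely for illustration and is not part of the documented example.

```python
import re

seen = []  # text handed to the replacement function on each call

def dashrepl(matchobj):
    # group(0) is the whole match, and the return value replaces all of it.
    seen.append(matchobj.group(0))
    if matchobj.group(0) == '-':
        return ' '
    return '-'

result = re.sub('-{1,2}', dashrepl, 'pro----gram-files')
print(seen)    # the pieces that were matched, one per call
print(result)  # the input with each matched piece swapped out
```

The function is called once per match, and whatever it returns stands in for the entire matched text, not for one group inside it. With the Search_Provider pattern the match includes the parameter name and the equals sign, which would explain why they vanish when only the value is returned.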

Now, I did follow up with your suggestion. `s' is actually a
MatchObject (bad param naming on my part, I started out passing a
string into the function and then changed it to a MatchObject, but
didn't change the param name), so I made the following change:

<code>
return line[s.pos : s.start('provider')] + \
    urllib.quote(string.capwords(urllib.unquote(s.group('provider')))) + \
    line[s.end('provider') : ]
</code>

In order to make this work (finally), I had to make the processing
function look like this:

<code>
def processLine(l) :
    global line
    line = l
    provider = getProvider(line)
    if provider == "No Provider" : return line
    scenario = getScenario(line)
    if filter (lambda a: a != None, [getOrg(s,scenario) for s in orgs]) == [] :
        line = re.sub(provider_pattern,normalize,line)
    else :
        line.replace(provider_parameter, org_parameter)
    return line
</code>

And then the call:

<code>
lines = fileReader.readlines()
[ fileWriter.write(l) for l in [processLine(l) for l in lines]]
</code>

Without this complicated gobbledygook, I could not get the correct
result. I hate global vars and I completely do not understand why I
have to go through this twisting and turning to get the desired
result.
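
For comparison, a sketch of the same read, process, write flow without a global, assuming Python 3. The helper names, the "error" rule and the in-memory files are all hypothetical stand-ins; the real getProvider/getScenario logic is not reproduced here.

```python
import io
import re

def process_line(line):
    # Hypothetical rule, for illustration only: upper-case the word "error".
    # The lambda reads everything it needs from the match object it is given.
    return re.sub(r'error', lambda m: m.group(0).upper(), line)

def copy_with_changes(reader, writer):
    """Pass every line from reader through process_line and write it out."""
    for line in reader:
        writer.write(process_line(line))

# In-memory stand-ins for the real log files (made-up content).
source = io.StringIO('disk ok\ndisk error on sda\n')
target = io.StringIO()
copy_with_changes(source, target)
result = target.getvalue()

# The match object also remembers the full text that was searched.
m = re.search(r'error', 'disk error on sda')
searched_text = m.string
```

Because the line travels in as an argument and out as a return value, nothing needs to be shared through module state. If a replacement function ever does need the surrounding text, the match object's `string` attribute carries it.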

[ ... ]

These can be replaced by:

if 'Search_Type' in line and 'Search_Provider' in line:

re.sub is returning the result, which you're throwing away!

line = re.sub(provider_matcher,normalize,line)

I can't count the number of times I have forgotten the meaning of
'returns a string' when reading docs about doing substitutions. In
this case, I had put the `line = ' in and taken it out. And I should
know better, from years of programming in Java, where strings are
immutable and you _always_ get a new, returned string. Should be
second nature.

Thanks for the help, much appreciated.

mp

naugiedoggie · Aug 30, 2010


Hello,

Well, that turned out to be still wrong. I did start getting the
proper param=value back from my `normalize' function, but I got
"extra" data as well.

This works:

<code>
def normalize(s) :
    return s.group('search') + '=' + \
        urllib.quote(string.capwords(urllib.unquote(s.group('provider'))))
</code>

Essentially, the pattern contained two groups, one identifying the
parameter name and one the value. By concat'ing the two back
together, I was able to achieve the desired result.

I suppose the lesson is, the function replaces the entire match rather
than just the specified text captured.
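
A Python 3 sketch of that lesson, with `urllib.parse` standing in for the Python 2 `urllib` calls. The `proper_case` helper, the sample line and the lookbehind variant are additions for illustration, not something proposed in the thread.

```python
import re
import string
from urllib.parse import quote, unquote

# Pattern copied from the original post.
provider_pattern = r'(?P<search>Search_Provider)=(?P<provider>[^&]+)'

def proper_case(value):
    """Decode a URL-encoded value, capitalize each word, and re-encode it."""
    return quote(string.capwords(unquote(value)))

def normalize(m):
    # Rebuild the whole match: parameter name, '=', then the new value.
    return m.group('search') + '=' + proper_case(m.group('provider'))

# Hypothetical query string, invented for this test.
line = 'a=1&Search_Provider=chen%20lab&p=value'
rebuilt = re.sub(provider_pattern, normalize, line)

# Alternative: keep the name in a lookbehind so only the value is matched.
value_only = r'(?<=Search_Provider=)[^&]+'
via_lookbehind = re.sub(value_only, lambda m: proper_case(m.group(0)), line)

print(rebuilt)
print(via_lookbehind)
```

Both variants leave the rest of the line untouched. The first keeps the original pattern and puts the name back by hand; the second narrows the match so there is nothing to put back.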

Thanks.

mp
