text processing problem

Maurice LING

Hi,

I'm looking for a way to do this: I need to scan a text (paragraph or
so) and look for occurrences of "<text-x> (<text-x>)". That is, if the
text just before the open bracket is the same as the text in the
brackets, then I have to delete the brackets, with the text in it.

Does anyone know a way to achieve this?

The closest I've seen is this recipe by Raymond Hettinger
(http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/305306):

"""
>>> s = 'People of [planet], take us to your leader.'
>>> d = dict(planet='Earth')
>>> print(convert_template(s) % d)
People of Earth, take us to your leader.
"""

import re

def convert_template(template, opener='[', closer=']'):
    opener = re.escape(opener)
    closer = re.escape(closer)
    pattern = re.compile(opener + '([_A-Za-z][_A-Za-z0-9]*)' + closer)
    return re.sub(pattern, r'%(\1)s', template.replace('%','%%'))

Cheers
Maurice
 
Matt

Try this:
import re
my_expr = re.compile(r'(\w+) (\(\1\))')
s = "this is (is) a test"
print(my_expr.sub(r'\1', s))
# prints 'this is a test'
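
One caveat worth checking with the pattern above: because nothing anchors the start of `(\w+)`, it can match the tail of a longer word, so "this (is)" loses its bracket even though the preceding word is "this", not "is". The sketch below compares the posted pattern with a variant that adds `\b` (a word boundary). The `anchored` name and the `\b` variant are illustrations, not something proposed in the thread, and the code assumes Python 3 syntax.

```python
import re

# Pattern as posted in the thread.
posted = re.compile(r'(\w+) (\(\1\))')

# Hypothetical variant: \b forces the captured word to start at a word boundary.
anchored = re.compile(r'\b(\w+) (\(\1\))')

text = "this (is) a test"
posted_result = posted.sub(r'\1', text)      # matches the "is" inside "this"
anchored_result = anchored.sub(r'\1', text)  # no whole-word match, left alone

print(repr(posted_result))
print(repr(anchored_result))
print(repr(anchored.sub(r'\1', "this is (is) a test")))
```

Whether the partial-word match matters depends on the input text; for the example sentence in the post, both patterns give the same result.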

M@
 
Maurice LING

Thank you Matt. It works out well. The only thing that gives it a problem
is cases such as "there  (there)", where there is more than one whitespace
character between the word and the same bracketed word...

Cheers
Maurice
 
Matt

Maurice,
I'd HIGHLY suggest purchasing the excellent Mastering Regular
Expressions (http://www.oreilly.com/catalog/regex2/index.html) by Jeff
Friedl. Although it's mostly geared towards Perl, it will answer all
your questions about regular expressions. If you're going to work with
regexes, this is a must-have.

That being said, here's what the new regular expression should be with
a bit of instruction (in the spirit of teaching someone to fish after
giving them a fish ;-) )

my_expr = re.compile(r'(\w+)\s*(\(\1\))')

Note the "\s*" in place of the single space " ". The "\s" means "any
whitespace character" (equivalent to [ \t\n\r\f\v]). The "*" following
it means "0 or more occurrences". So this will now match:

"there (there)"
"there  (there)"
"there(there)"
"there   (there)"
"there\t(there)" (tab)
"there\t\t\t\t\t\t\t\t\t\t\t\t(there)"
etc.
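
As a quick check of the list above, here is a minimal sketch that runs the `\s*` pattern over several spacing variants. The `samples` list is made up for illustration, and the code assumes Python 3 syntax.

```python
import re

my_expr = re.compile(r'(\w+)\s*(\(\1\))')

# Illustrative inputs: one space, two spaces, no space, a tab, a newline.
samples = [
    "there (there)",
    "there  (there)",
    "there(there)",
    "there\t(there)",
    "there\n(there)",
]

# The whitespace is part of the match, so it is removed with the bracket.
results = [my_expr.sub(r'\1', s) for s in samples]

for sample, result in zip(samples, results):
    print(repr(sample), '->', repr(result))
```

Every variant reduces to 'there', since the whitespace before the bracket is consumed by `\s*` and only group 1 is kept.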

Hope that's helpful. Pick up the book!

M@
 
Maurice LING

Thanks again. I've read a number of tutorials on regular expressions, but
it's something that I hardly used in the past, so I've gotten far too rusty.

Before my post, I tried
my_expr = re.compile(r'(\w+) \s* (\(\1\))') instead, but it didn't work,
so I'm a bit stumped...

Thanks again,
Maurice
 
Paul McGuire

Maurice -

Here is a pyparsing treatment of your problem. It is certainly more
verbose, but hopefully easier to follow and later maintain (modifying
valid word characters, for instance). pyparsing implicitly ignores
whitespace, so tabs and newlines within the expression are easily
skipped, without cluttering up the expression definition. The example
also shows how to *not* match "<X> (<X>)" if inside a quoted string (in
case this becomes a requirement).

Download pyparsing at http://pyparsing.sourceforge.net.

-- Paul

from pyparsing import *

LPAR = Literal("(")
RPAR = Literal(")")

# define a word as beginning with an alphabetic character followed by
# zero or more alphanumerics, -, _, ., or $ characters
word = Word(alphas, alphanums+"-_$.")

targetExpr = word.setResultsName("first") + \
             LPAR + word.setResultsName("second") + RPAR

# this will match any 'word ( word )' arrangement, but we want to
# reject matches if the two words aren't the same
def matchWords(s,l,tokens):
    if tokens.first != tokens.second:
        raise ParseException(s,l,"")
    return tokens[0]
targetExpr.setParseAction( matchWords )


testdata = """
This is (is) a match.
This is (isn't) a match.
I.B.M.\t\t\t(I.B.M. ) is a match.
This is also a A.T.T.
(A.T.T.) match.
Paris in "the(the)" Spring( Spring ).
"""
print(testdata)

print(targetExpr.transformString(testdata))

print("\nNow don't process ()'s inside quoted strings...")
targetExpr.ignore(quotedString)
print(targetExpr.transformString(testdata))

Prints out:
This is (is) a match.
This is (isn't) a match.
I.B.M. (I.B.M. ) is a match.
This is also a A.T.T.
(A.T.T.) match.
Paris in "the(the)" Spring( Spring ).


This is a match.
This is (isn't) a match.
I.B.M. is a match.
This is also a A.T.T. match.
Paris in "the" Spring.


Now don't process ()'s inside quoted strings...

This is a match.
This is (isn't) a match.
I.B.M. is a match.
This is also a A.T.T. match.
Paris in "the(the)" Spring.
 
Matt

Maurice,
The reason your regex failed is that you have spaces around the
"\s*". This translates to "one space, followed by zero or more
whitespace characters, followed by one space". So your regex would only
match when the two text elements are separated by at least 2 spaces.
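
A small sketch of that difference, assuming Python 3 syntax. The names `spaced`, `fixed` and `verbose` are made up for illustration. The third pattern is an aside that is not from the thread: with the standard `re.VERBOSE` flag, unescaped whitespace inside the pattern is ignored, so the spaced-out form behaves like the corrected one.

```python
import re

# Maurice's attempt: the literal spaces around \s* must each match a space.
spaced = re.compile(r'(\w+) \s* (\(\1\))')

# Matt's correction: only \s* sits between the word and the bracket.
fixed = re.compile(r'(\w+)\s*(\(\1\))')

# Aside: under re.VERBOSE the pattern's own spaces are ignored.
verbose = re.compile(r'(\w+) \s* (\(\1\))', re.VERBOSE)

for text in ["there (there)", "there  (there)", "there   (there)"]:
    print(repr(text),
          'spaced:', bool(spaced.search(text)),
          'fixed:', bool(fixed.search(text)),
          'verbose:', bool(verbose.search(text)))
```

With a single space only `spaced` fails to match; with two or more spaces all three patterns match.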

This kind of demonstrates why regular expressions can drive you nuts.

I still suggest picking up the book; not because Jeff Friedl drove a
dump truck full of money up to my door, but because it specifically has
a use case like yours. So you get to learn & solve your problem at the
same time!

HTH,
M@
 
Leif K-Brooks

Maurice said:
I'm looking for a way to do this: I need to scan a text (paragraph or
so) and look for occurrences of "<text-x> (<text-x>)". That is, if the
text just before the open bracket is the same as the text in the
brackets, then I have to delete the brackets, with the text in it.


How's this?

import re

bracket_re = re.compile(r'(.*?)\s*\(\1\)')

def remove_brackets(text):
    return bracket_re.sub('\\1', text)
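
A note on how this pattern behaves, as a hedged sketch in Python 3 syntax: since `(.*?)` accepts any text, it handles multi-word phrases, but it can also match an empty string (so a bare "()" is removed) and the tail of a longer word. The `strict_re` variant and the optional `pattern` parameter below are hypothetical additions for comparison, not part of the thread.

```python
import re

# Pattern as posted: any text, optional whitespace, then the same text in brackets.
bracket_re = re.compile(r'(.*?)\s*\(\1\)')

# Hypothetical stricter variant: the repeated text must be one or more
# whole words, starting at a word boundary.
strict_re = re.compile(r'\b(\w+(?:\s+\w+)*)\s*\(\1\)')

def remove_brackets(text, pattern=bracket_re):
    """Drop a bracketed repeat of the text that precedes it."""
    return pattern.sub(r'\1', text)

for text in ["New York (New York) is big", "this (is) a test", "foo () bar"]:
    print(repr(text))
    print('  posted:', repr(remove_brackets(text)))
    print('  strict:', repr(remove_brackets(text, strict_re)))
```

The stricter variant only accepts `\w` characters, so it would not match something like "I.B.M. (I.B.M.)"; which trade-off is right depends on the text being processed.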
 
