text processing problem

Maurice LING · Apr 7, 2005

Hi,

I'm looking for a way to do this: I need to scan a text (paragraph or
so) and look for occurrences of "<text-x> (<text-x>)". That is, if the
text just before the open bracket is the same as the text in the
brackets, then I have to delete the brackets, with the text in it.

Does anyone know of a way to achieve this?

The closest I've seen is
(http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/305306) by
Raymond Hettinger:

"""
>>> s = 'People of [planet], take us to your leader.'
>>> d = dict(planet='Earth')
>>> print(convert_template(s) % d)
People of Earth, take us to your leader.
"""

import re

def convert_template(template, opener='[', closer=']'):
    opener = re.escape(opener)
    closer = re.escape(closer)
    pattern = re.compile(opener + '([_A-Za-z][_A-Za-z0-9]*)' + closer)
    return re.sub(pattern, r'%(\1)s', template.replace('%','%%'))

Cheers
Maurice

Matt · Apr 8, 2005

Try this:
import re
my_expr = re.compile(r'(\w+) \(\1\)')
s = "this is (is) a test"
print(my_expr.sub(r'\1', s))
# prints 'this is a test'

M@
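
One caveat with the pattern above, shown as a small sketch: because "\w+" may start in the middle of a word, a parenthesised word that merely equals the tail of the preceding word is also removed. The "bounded" variant below is an assumption for illustration, not something proposed in the thread; it adds "\b" so the captured text must begin at a word boundary.

```python
import re

# Matt's pattern: a word, one space, then the same word in parentheses.
plain = re.compile(r'(\w+) \(\1\)')

# Hypothetical variant: \b forces the capture to start at a word boundary.
bounded = re.compile(r'\b(\w+) \(\1\)')

intended = plain.sub(r'\1', "this is (is) a test")
suffix_hit = plain.sub(r'\1', "this (is) a test")
suffix_safe = bounded.sub(r'\1', "this (is) a test")

print(intended)     # this is a test
print(suffix_hit)   # this a test  ("is" matched the tail of "this")
print(suffix_safe)  # this (is) a test
```

Whether the suffix case matters depends on the texts being scanned; if it does not, the shorter pattern is fine.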

Maurice LING · Apr 8, 2005


Thank you Matt. It works out well. The only thing that gives it a problem
is a case such as "there  (there)", where there is more than one whitespace
character between the word and the same bracketed word...

Cheers
Maurice

Matt · Apr 8, 2005


Maurice,
I'd HIGHLY suggest purchasing the excellent "Mastering Regular
Expressions" by Jeff Friedl
(http://www.oreilly.com/catalog/regex2/index.html). Although it's mostly
geared towards Perl, it will answer all your questions about regular
expressions. If you're going to work with regexes, this is a must-have.

That being said, here's what the new regular expression should be with
a bit of instruction (in the spirit of teaching someone to fish after
giving them a fish ;-) )

my_expr = re.compile(r'(\w+)\s*\(\1\)')

Note the "\s*" in place of the single space " ". The "\s" means "any
whitespace character" (equivalent to [ \t\n\r\f\v]). The "*" following
it means "0 or more occurrences". So this will now match:

"there (there)" (one space)
"there  (there)" (two spaces)
"there(there)" (no space)
"there      (there)" (many spaces)
"there\t(there)" (tab)
"there\t\t\t\t\t\t\t\t\t\t\t\t(there)"
etc.

Hope that's helpful. Pick up the book!

M@

Maurice LING · Apr 8, 2005


Thanks again. I've read a number of tutorials on regular expressions, but
it's something I hardly used in the past, so I've gotten far too rusty.

Before my post, I tried
my_expr = re.compile(r'(\w+) \s* \(\1\)') instead, but it doesn't work,
so I'm a bit stumped...

Thanks again,
Maurice

Paul McGuire · Apr 8, 2005

Maurice -

Here is a pyparsing treatment of your problem. It is certainly more
verbose, but hopefully easier to follow and later maintain (modifying
valid word characters, for instance). pyparsing implicitly ignores
whitespace, so tabs and newlines within the expression are easily
skipped, without cluttering up the expression definition. The example
also shows how to *not* match "<X> (<X>)" if inside a quoted string (in
case this becomes a requirement).

Download pyparsing at http://pyparsing.sourceforge.net.

-- Paul

from pyparsing import *

LPAR = Literal("(")
RPAR = Literal(")")

# define a word as beginning with an alphabetic character followed by
# zero or more alphanumerics, -, _, ., or $ characters
word = Word(alphas, alphanums+"-_$.")

targetExpr = word.setResultsName("first") + \
             LPAR + word.setResultsName("second") + RPAR

# this will match any 'word ( word )' arrangement, but we want to
# reject matches if the two words aren't the same
def matchWords(s,l,tokens):
    if tokens.first != tokens.second:
        raise ParseException(s,l,"")
    # keep only the first word; the bracketed copy is dropped
    return tokens[0]
targetExpr.setParseAction( matchWords )

testdata = """
This is (is) a match.
This is (isn't) a match.
I.B.M.\t\t\t(I.B.M. ) is a match.
This is also a A.T.T.
(A.T.T.) match.
Paris in "the(the)" Spring( Spring ).
"""
print(testdata)

print(targetExpr.transformString(testdata))

print("\nNow don't process ()'s inside quoted strings...")
targetExpr.ignore(quotedString)
print(targetExpr.transformString(testdata))

Prints out:
This is (is) a match.
This is (isn't) a match.
I.B.M. (I.B.M. ) is a match.
This is also a A.T.T.
(A.T.T.) match.
Paris in "the(the)" Spring( Spring ).

This is a match.
This is (isn't) a match.
I.B.M. is a match.
This is also a A.T.T. match.
Paris in "the" Spring.

Now don't process ()'s inside quoted strings...

This is a match.
This is (isn't) a match.
I.B.M. is a match.
This is also a A.T.T. match.
Paris in "the(the)" Spring.

Matt · Apr 8, 2005


Maurice,
The reason your regex failed is that you have spaces around the
"\s*". This translates to "one space, followed by zero or more
whitespace characters, followed by one space". So your regex would only
match the two text elements when they are separated by at least 2 spaces.

This kind of demonstrates why regular expressions can drive you nuts.

I still suggest picking up the book; not because Jeff Friedl drove a
dump truck full of money up to my door, but because it specifically has
a use case like yours. So you get to learn & solve your problem at the
same time!

HTH,
M@
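
The effect of the literal spaces can be seen in a short sketch. The last pattern uses re.VERBOSE, which is not mentioned in the thread and is offered only as one possible way to keep a pattern spaced out for readability; in that mode, whitespace inside the pattern is ignored.

```python
import re

# Literal spaces on both sides of \s*: at least two spaces are required.
spaced = re.compile(r'(\w+) \s* \(\1\)')

one_space = spaced.sub(r'\1', "there (there)")
two_spaces = spaced.sub(r'\1', "there  (there)")

# With re.VERBOSE, spaces in the pattern are layout only and match nothing.
verbose = re.compile(r'(\w+) \s* \( \1 \)', re.VERBOSE)
verbose_one_space = verbose.sub(r'\1', "there (there)")

print(repr(one_space))          # 'there (there)'  (no match, unchanged)
print(repr(two_spaces))         # 'there'
print(repr(verbose_one_space))  # 'there'
```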

Leif K-Brooks · Apr 8, 2005


How's this?

import re

bracket_re = re.compile(r'(.*?)\s*\(\1\)')

def remove_brackets(text):
    return bracket_re.sub('\\1', text)
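
A quick check of this pattern, as a sketch. Since "(.*?)" can also match the empty string, an empty pair of brackets is removed together with the whitespace before it. The "strict_re" variant is a hypothetical tightening, not part of the post: it requires at least one non-whitespace character and a capture that starts at the beginning of a whitespace-delimited token.

```python
import re

bracket_re = re.compile(r'(.*?)\s*\(\1\)')

def remove_brackets(text):
    return bracket_re.sub(r'\1', text)

# Hypothetical stricter variant: non-empty capture that must start
# either at the beginning of the text or right after whitespace.
strict_re = re.compile(r'(?<!\S)(\S+)\s*\(\1\)')

def remove_brackets_strict(text):
    return strict_re.sub(r'\1', text)

print(remove_brackets("there (there) is"))         # there is
print(remove_brackets("call foo () now"))          # call foo now
print(remove_brackets_strict("call foo () now"))   # call foo () now
print(remove_brackets_strict("I.B.M. (I.B.M.) is a match"))  # I.B.M. is a match
```

Which behaviour is preferable depends on whether empty brackets and partial-word repeats can occur in the input.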
