Help needed: cryptic perl regular expression in python syntax

pekka niiranen

Hi there,

I have a Perl script that uses a dynamically
constructed regular expression in this way:

------perl code starts ----
$result = "";
$key = 'AAA\?01';
$key = quotemeta $key;
$line = ' s^\?AAA\?01^BBB^g; #Comment ';
if ($line =~ /(^\s*)(s|tr)(.)(\\?\??$key\??)\3(.*?)\3(.*)/) {
    $result = $5;
}

# $result should be "BBB".
# \3 matches the same character captured by (.),
# which in this example is ^. So we search for the
# parameter delimited by the first two ^ signs
# and return the one delimited by the second
# and third ^ signs. Using \3 in the regular
# expression allows delimiters other than ^.

------perl code stops ----

How can I construct an equivalent Python regular expression?

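I have tested with a constant regular expression like this:

line = ' s^\\?AAA\\?01^BBB^g; #Comment '
r1 = "(^\s*)(s|tr)(.)(\\\\\?\\\??AAA\\\\\?01)"
re.compile(r1).findall(line)

[(' ', 's', '^', '\\?AAA\\?01')]

Which is fine, but is there a way to join three raw strings
together into another raw string? Like: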

r1 = r'''(^\s*)(s|tr)(.)(\\?\??'''
r2 = r'''\\?\??)\3(.*?)\3(.*)'''
p1 = r1 + key + r2 # p1 should remain raw string too

-pekka-
 
Antoon Pardon

Op 2004-10-19 said:
Which is fine, but is there a way to join 3 raw strings
together into another raw strings? like:

r1 = r'''(^\s*)(s|tr)(.)(\\?\??'''
r2 = r'''\\?\??)\3(.*?)\3(.*)'''
p1 = r1 + key + r2 # p1 should remain raw string too

If I understand correctly, there are no raw strings, just raw string
literals. re.compile just takes a normal string.

Raw string literals just make it easier to write the strings that are
typically used for regular expressions, but the strings themselves
are just ordinary strings.
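
As a small illustration of that point (a sketch, not taken from the original post; the variable names are made up for the example), a raw literal and an ordinary literal with a doubled backslash produce equal string objects:

```python
# A raw literal and an escaped ordinary literal yield the same string value.
raw = r'\b'
plain = '\\b'

print(raw == plain)  # True
print(len(raw))      # 2 (a backslash and the letter b)
print(repr(raw))     # '\\b'
print(raw)           # \b

# Without the r prefix, '\b' is a single backspace character instead.
backspace = '\b'
print(len(backspace))  # 1
```

Once the literal has been evaluated, nothing about the resulting object records whether it was written with the r prefix.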
 
Ville Vainio

pekka> Which is fine, but is there a way to join 3 raw strings
pekka> together into another raw strings? like:

pekka> r1 = r'''(^\s*)(s|tr)(.)(\\?\??'''
pekka> r2 = r'''\\?\??)\3(.*?)\3(.*)'''
pekka> p1 = r1 + key + r2 # p1 should remain raw string too

The term "raw string" only has significance with string literals -
every string object is a "raw string". Backslashes are only
interpreted when converting string literals to in-memory string
objects.
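
A minimal sketch of what this means for the question, assuming the key should be matched literally (hence re.escape, which plays the role of Perl's quotemeta here) and assuming the tail of the pattern follows the Perl original (\??) rather than the r2 shown in the quoted post:

```python
import re

# Sample input from the original post, written as a raw literal.
line = r' s^\?AAA\?01^BBB^g; #Comment '
key = r'AAA\?01'

# Two raw literals; concatenating them with the escaped key gives an
# ordinary str, which is all re.compile() needs.
r1 = r'(^\s*)(s|tr)(.)(\\?\??'
r2 = r'\??)\3(.*?)\3(.*)'
pattern = r1 + re.escape(key) + r2

match = re.match(pattern, line)
result = match.group(5)
print(result)  # BBB

# The \3 backreference lets other separator characters work as well.
other = re.match(pattern, r'tr#AAA\?01#CCC#')
print(other.group(5))  # CCC
```

Nothing has to "remain raw": the backslashes in r1 and r2 are already ordinary characters of the strings by the time they are concatenated.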
 
Pekka Niiranen

Thanks,

I managed to solve my problem with code like this:

[(' ', 's', '^', '\\?AAA\\?01', 'BBB', 'g; #Comment ')]

but what an ugly piece of code...

I was hoping to do without the excess backslashes by using re.escape(),
but to no avail, since the group item '\3' gets misquoted (among other things):
'\\\\\\?\\?\\)\\\x03\\(\\.\\*\\?\\)\\\x03\\(\\.\\*\\)\\/\\)'
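
The \x03 in that output suggests two separate effects, sketched below under that assumption (the small patterns here are illustrative, not the ones from the post). In a non-raw literal '\3' is an octal escape for the character with code 3, and re.escape() applied to a whole pattern turns its grouping and backreference syntax into literal text, so it is probably best applied to the key alone:

```python
import re

# In a non-raw literal, '\3' is the single character with code 3.
print('\3' == '\x03')  # True
# In a raw literal it stays a backslash followed by the digit 3.
print(len(r'\3'))      # 2

# re.escape() on a whole pattern disables its regex syntax...
backref = r'(.)\1'
escaped_whole = re.escape(backref)
print(re.search(backref, 'aa') is not None)        # True: backreference works
print(re.search(escaped_whole, 'aa') is not None)  # False: matches only the literal text

# ...so escape just the part that must be matched literally.
key = r'AAA\?01'
escaped_key = re.escape(key)
print(re.fullmatch(escaped_key, key) is not None)  # True
```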


-pekka-



Antoon said:
Op 2004-10-19 said:
I have tested with constant regular expression like this:

line = ' s^\\?AAA\\?01^BBB^g; #Comment '
r1 = "(^\s*)(s|tr)(.)(\\\\\?\\\??AAA\\\\\?01)"
re.compile(r1).findall(line)

[(' ', 's', '^', '\\?AAA\\?01')]

Which is fine, but is there a way to join 3 raw strings
together into another raw strings? like:

r1 = r'''(^\s*)(s|tr)(.)(\\?\??'''
r2 = r'''\\?\??)\3(.*?)\3(.*)'''
p1 = r1 + key + r2 # p1 should remain raw string too


If I understand correctly there are no raw strings, just raw string
literals. The re.compile uses just a normal string.

raw string literal just make it easier to form a strings that are
typically used for regular expressions but the strings themselves
are just ordinary strings.

1


'\\b'


'\\b'

\b

\b
 
Steven Bethard

Pekka Niiranen said:
I managed to solve my problem with code like this:

[(' ', 's', '^', '\\?AAA\\?01', 'BBB', 'g; #Comment ')]


Could you do something like:
[(' ', 's', '^', '\\?AAA\\?01', 'BBB', 'g; #Comment ')]

Basically, I still use the r'' string so that I don't have to write so many
backslashes, but then I use a %s to insert the "AAA\?01" into the middle of
the expression. Looks at least a little cleaner to me.
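
A sketch of the %s approach described above. The template is the one quoted later in the thread as coming from this post; wrapping the key in re.escape() is an assumption made here so that the key's backslash and question mark are matched literally:

```python
import re

line = r' s^\?AAA\?01^BBB^g; #Comment '
key = r'AAA\?01'

# Raw template with a %s placeholder where the key goes.
template = r'(^\s*)(s|tr)(.)(\\\?%s)\3(.*?)\3(.*)'
pattern = template % re.escape(key)

groups = re.findall(pattern, line)
print(groups)
# [(' ', 's', '^', '\\?AAA\\?01', 'BBB', 'g; #Comment ')]
```

Note that a template used this way must not contain any other bare % characters; a literal percent sign would have to be written as %%.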

Steve
 
Paul McGuire

Here's a more verbose version of Steve Bethard's suggestion. By building
up the regexp from individual parts, it is possible to give each part some
semi-meaningful name, or to attach comments to individual pieces. It also
makes it easier to maintain later. What if you had to support an additional
command besides s and tr, like 'rep'? Just change replaceCmd to read
replaceCmd = r'(s|tr|rep)'. What if you needed to support leading tabs
in addition to leading spaces? Change leadingWhite as needed. For
that matter, just giving the finished regexp the name 'replaceCmdExpr'
gives the reader more of a clue as to what the regexp's purpose is,
as the original code did with extra comments.

I find nearly *all* regexp's to be cryptic, and when I need them, I
usually assemble them in some fashion such as this. David Mertz
proposes a similar style in his very good book, "Text Processing
in Python."

(Some quibble with the practice of aligning '=' signs, but I find it to be a
helpful guide to the eye when declaring a set of related strings such as
these, assuming of course that one edits using a fixed space font.)

So why does the key get prepended with the backslashes and
question marks?

-- Paul
(I'll bet you thought I'd post a pyparsing version. :) Well, in a
certain way, I did.)


import re

line = ' s^\\?AAA\\?01^BBB^g; #Comment '

r1 = r'(^\s*)(s|tr)(.)(\\\?\\??'
key = r'AAA\?01'  # raw literal, so the backslash is kept as-is
r2 = r'\\??)\3(.*?)\3(.*)'
r = r1 + re.escape(key) + r2
print(re.compile(r).findall(line))

# desired regexp, from Steve Bethard's post
# r'(^\s*)(s|tr)(.)(\\\?%s)\3(.*?)\3(.*)'

# build up regexp by parts
key = r'AAA\?01'
leadingWhite = r'(^\s*)'
replaceCmd = r'(s|tr)'
sepChar = r'(.)'
# prepend \'s and ?'s, only the OP knows why...
findString = r'(\\\?\\??%s)' % re.escape(key)
# sepCharRef references the char read by sepChar,
# to support separators other than '^'
sepCharRef = r'\3'
replString = r'(.*?)'
restOfLine = r'(.*)'
replaceCmdExpr = leadingWhite + replaceCmd + \
                 sepChar + findString + sepCharRef + \
                 replString + sepCharRef + restOfLine

matcher = re.compile(replaceCmdExpr)
print(matcher.findall(line))
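
The maintenance point above (adding a 'rep' command) can be sketched by wrapping the same parts in a function. The function name build_replace_cmd_expr and its commands parameter are hypothetical, introduced only for this illustration:

```python
import re

def build_replace_cmd_expr(key, commands=('s', 'tr')):
    """Assemble the regexp from named parts (hypothetical helper)."""
    leading_white = r'(^\s*)'
    # One alternation group built from the list of supported commands.
    replace_cmd = '(' + '|'.join(re.escape(c) for c in commands) + ')'
    sep_char = r'(.)'
    find_string = r'(\\\?\\??%s)' % re.escape(key)
    # References the character captured by sep_char (group 3).
    sep_char_ref = r'\3'
    repl_string = r'(.*?)'
    rest_of_line = r'(.*)'
    return (leading_white + replace_cmd + sep_char + find_string
            + sep_char_ref + repl_string + sep_char_ref + rest_of_line)

# Supporting 'rep' is then a change to one argument.
matcher = re.compile(build_replace_cmd_expr(r'AAA\?01', ('s', 'tr', 'rep')))
m = matcher.match(r'rep^\?AAA\?01^BBB^g')
print(m.group(2), m.group(5))  # rep BBB
```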
 
