regular expression: perl ==> python

les_ander

Hi,
I am so used to Perl's regular expressions that I find it hard
to memorize the functions in Python, so I would appreciate it if
people could tell me some equivalents.

1) In perl:
$line = "The food is under the bar in the barn.";
if ( $line =~ /foo(.*)bar/ ) { print "got <$1>\n"; }

In Python, I don't know how to do this.
How does one capture the $1? (I know it is \1, but it is still not clear
how I can simply print it.)
Thanks
 
Steven Bethard

1) In perl:
$line = "The food is under the bar in the barn.";
if ( $line =~ /foo(.*)bar/ ) { print "got <$1>\n"; }

In Python, I don't know how to do this.

I don't know Perl very well, but I believe this is more or less the
equivalent:
got <d is under the bar in the >

Of course, you can do this in fewer lines if you like:
got <d is under the bar in the >

Steve
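
The code that produced the output above did not survive in this copy of the thread. As a rough sketch only (written for Python 3, so print is a function, and with variable names that are assumptions rather than the poster's), the longer and shorter forms could look like this:

```python
import re

line = "The food is under the bar in the barn."

# Longer form: compile the pattern, search, then read group 1 (Perl's $1).
matcher = re.compile(r"foo(.*)bar")
match = matcher.search(line)
if match:
    print("got <%s>" % match.group(1))

# Shorter form: search directly. re.search returns None when nothing
# matches, so the result is still checked before calling .group().
m = re.search(r"foo(.*)bar", line)
short_result = m.group(1) if m else None
print("got <%s>" % short_result)
```

Because `.*` is greedy, group 1 runs from the first "foo" up to the last "bar" in the line, which is the "bar" inside "barn".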
 
Fredrik Lundh

I am so used to Perl's regular expressions that I find it hard
to memorize the functions in Python, so I would appreciate it if
people could tell me some equivalents.

1) In perl:
$line = "The food is under the bar in the barn.";
if ( $line =~ /foo(.*)bar/ ) { print "got <$1>\n"; }

In Python, I don't know how to do this.
How does one capture the $1? (I know it is \1, but it is still not clear
how I can simply print it.)

in Python, the RE machinery returns match objects, which have methods
that let you dig out more information about the match. "captured groups"
are available via the "group" method:

m = re.search(..., line)
if m:
    print "got", m.group(1)

see the regex howto (or the RE chapter in the library reference) for more
information:

http://www.amk.ca/python/howto/regex/

</F>
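
To make the match-object idea concrete, here is a small sketch (Python 3 syntax; the variable names are illustrative assumptions) of a few of the methods a match object offers beyond group(1):

```python
import re

line = "The food is under the bar in the barn."
m = re.search(r"foo(.*)bar", line)

if m:
    whole = m.group(0)      # the entire matched text
    first = m.group(1)      # first parenthesised group, like Perl's $1
    groups = m.groups()     # tuple holding every captured group
    start, end = m.span(1)  # start/end indices of group 1 inside line
    print("got <%s> at %d:%d" % (first, start, end))
```

The span values index into the original string, so `line[start:end]` gives back the same text as `m.group(1)`.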
 
JZ

On 21 Dec 2004 21:12:09 -0800, (e-mail address removed) wrote:
1) In perl:
$line = "The food is under the bar in the barn.";
if ( $line =~ /foo(.*)bar/ ) { print "got <$1>\n"; }

In Python, I don't know how to do this.
How does one capture the $1? (I know it is \1, but it is still not clear
how I can simply print it.)
Thanks

import re
line = "The food is under the bar in the barn."
if re.search(r'foo(.*)bar',line):
    print 'got %s\n' % _.group(1)
 
Fredrik Lundh

JZ said:
import re
line = "The food is under the bar in the barn."
if re.search(r'foo(.*)bar',line):
    print 'got %s\n' % _.group(1)

Traceback (most recent call last):
  File "jz.py", line 4, in ?
    print 'got %s\n' % _.group(1)
NameError: name '_' is not defined

</F>
 
Doug Holton

Fredrik said:
Traceback (most recent call last):
  File "jz.py", line 4, in ?
    print 'got %s\n' % _.group(1)
NameError: name '_' is not defined

He was using the python interactive prompt, which I suspect you already
knew.
 
JZ

On Wed, 22 Dec 2004 10:27:39 +0100, Fredrik Lundh wrote:
Traceback (most recent call last):
  File "jz.py", line 4, in ?
    print 'got %s\n' % _.group(1)
NameError: name '_' is not defined

I forgot to add: I am using Python 2.3.4/Win32 (from ActiveState.com). The
code works in my interpreter.
 
Fredrik Lundh

JZ said:
I forgot to add: I am using Python 2.3.4/Win32 (from ActiveState.com). The
code works in my interpreter.

only if you type it into the interactive prompt. see:

http://www.python.org/doc/2.4/tut/node5.html#SECTION005110000000000000000

"In interactive mode, the last printed expression is assigned to the variable _.
This means that when you are using Python as a desk calculator, it is somewhat
easier to continue calculations /.../"

the "_" symbol has no special meaning when you run a Python program, so the
"if re.search" construct won't work.

</F>
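
The mechanism behind "_" can be poked at directly. The sketch below is written for Python 3 (where the module is called builtins; in the Python 2 of this thread it was __builtin__) and calls the default display hook by hand, which is roughly what the interactive prompt does for a bare expression statement. Treat it as an illustration under those assumptions, not as a description of what any particular IDE does:

```python
import builtins
import re
import sys

line = "The food is under the bar in the barn."
m = re.search(r"foo(.*)bar", line)

# The interactive prompt hands the value of a bare expression statement
# to the display hook, which prints its repr and stores it in builtins._.
sys.__displayhook__(m)
underscore_is_match = builtins._ is m

# An `if` statement is not an expression statement, so the hook is never
# called for its test; in a script the match must be bound to a name.
if m:
    print("got <%s>" % m.group(1))
```

This is why the "if re.search(...)" form leaves "_" untouched: nothing is ever displayed, so nothing is stored.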
 
JZ

On Wed, 22 Dec 2004 16:55:55 +0100, Fredrik Lundh wrote:
the "_" symbol has no special meaning when you run a Python program,

That's right. So the final code will be:

import re
line = "The food is under the bar in the barn."
found = re.search('foo(.*)bar',line)
if found: print 'got %s\n' % found.group(1)
 
Nick Craig-Wood

1) In perl:
$line = "The food is under the bar in the barn.";
if ( $line =~ /foo(.*)bar/ ) { print "got <$1>\n"; }

In Python, I don't know how to do this.
How does one capture the $1? (I know it is \1, but it is still not clear
how I can simply print it.)
Thanks


Fredrik Lundh said:
Traceback (most recent call last):
  File "jz.py", line 4, in ?
    print 'got %s\n' % _.group(1)
NameError: name '_' is not defined

I've found that a slight irritation in python compared to perl - the
fact that you need to create a match object (rather than relying on
the silver thread of $_ (etc) running through your program ;-)

import re
line = "The food is under the bar in the barn."
m = re.search(r'foo(.*)bar',line)
if m:
    print 'got %s\n' % m.group(1)

This becomes particularly irritating when using if, elif etc, to
match a series of regexps, eg

line = "123123"
m = re.search(r'^(\d+)$', line)
if m:
    print "int",int(m.group(1))
else:
    m = re.search(r'^(\d*\.\d*)$', line)
    if m:
        print "float",float(m.group(1))
    else:
        print "unknown thing", line

The indentation keeps growing which looks rather untidy compared to
the perl

$line = "123123";
if ($line =~ /^(\d+)$/) {
    print "int $1\n";
}
elsif ($line =~ /^(\d*\.\d*)$/) {
    print "float $1\n";
}
else {
    print "unknown thing $line\n";
}

Is there an easy way round this? AFAIK you can't assign a variable in
a compound statement, so you can't use elif at all here and hence the
problem?

I suppose you could use a monstrosity like this, which relies on the
fact that list.append() returns None...

line = "123123"
m = []
if m.append(re.search(r'^(\d+)$', line)) or m[-1]:
    print "int",int(m[-1].group(1))
elif m.append(re.search(r'^(\d*\.\d*)$', line)) or m[-1]:
    print "float",float(m[-1].group(1))
else:
    print "unknown thing", line
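
For readers on a much later Python than the one in this thread: Python 3.8 added assignment expressions (the := operator), which allow exactly this kind of flat if/elif chain. A minimal sketch using the same two patterns, wrapped in a hypothetical classify() helper that is not part of the original post:

```python
import re

def classify(line):
    """Return a (kind, value) pair for a numeric-looking string."""
    # := binds the match object and tests it in the same expression,
    # so each branch stays at the same indentation level.
    if m := re.search(r"^(\d+)$", line):
        return "int", int(m.group(1))
    elif m := re.search(r"^(\d*\.\d*)$", line):
        return "float", float(m.group(1))
    else:
        return "unknown thing", line

print(classify("123123"))
print(classify("12.5"))
print(classify("abc"))
```

None of this was available in Python 2.3 or 2.4, so it does not change the answers given in the thread.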
 
Fredrik Lundh

Nick said:
I've found that a slight irritation in python compared to perl - the
fact that you need to create a match object (rather than relying on
the silver thread of $_ (etc) running through your program ;-)

the old "regex" engine associated the match with the pattern, but that
approach isn't thread safe...

line = "123123"
m = re.search(r'^(\d+)$', line)
if m:
    print "int",int(m.group(1))
else:
    m = re.search(r'^(\d*\.\d*)$', line)
    if m:
        print "float",float(m.group(1))
    else:
        print "unknown thing", line

that's not a very efficient way to match multiple patterns, though. a
much better way is to combine the patterns into a single one, and use
the "lastindex" attribute to figure out which one matched. see

http://effbot.org/zone/xml-scanner.htm

for more on this topic.

</F>
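
As a rough illustration of the combined-pattern idea (not the code from the linked page), the two numeric patterns from the earlier post can be joined into one alternation, with lastindex selecting the converter. The names combined, converters and convert are assumptions made for this sketch, which is written in Python 3 syntax:

```python
import re

# One pattern with one capturing group per alternative. m.lastindex is
# the number of the group that matched: 1 for int, 2 for float.
combined = re.compile(r"^(\d+)$|^(\d*\.\d*)$")
converters = {1: int, 2: float}

def convert(line):
    m = combined.search(line)
    if m is None:
        raise ValueError("unknown thing: %r" % (line,))
    return converters[m.lastindex](m.group(m.lastindex))

print(convert("123123"))
print(convert("123.5"))
```

This works cleanly only while each alternative contains exactly one group; with nested groups the group numbers have to be counted more carefully.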
 
John Machin

Fredrik said:
only if you type it into the interactive prompt. see:

No, it doesn't work at all, anywhere. Did you actually try this?

http://www.python.org/doc/2.4/tut/node5.html#SECTION005110000000000000000

"In interactive mode, the last printed expression is assigned to the variable _.
This means that when you are using Python as a desk calculator, it is somewhat
easier to continue calculations /.../"

In the 3 lines that are executed before the exception, there are *no*
printed expressions.

Python 2.4 (#60, Nov 30 2004, 11:49:19) [MSC v.1310 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
...     print 'got %s\n' % _.group(1)
...
Traceback (most recent call last):
 
Fredrik Lundh

John said:
No, it doesn't work at all, anywhere. Did you actually try this?

the OP claims that it works in his ActiveState install (PythonWin?). maybe he
played with re.search before typing in the commands he quoted; maybe PythonWin
contains some extra hacks?

as I've illustrated earlier, it definitely doesn't work in a script executed by a standard
Python...

</F>
 
John Machin

Fredrik said:
the OP claims that it works in his ActiveState install (PythonWin?). maybe he
played with re.search before typing in the commands he quoted; maybe PythonWin
contains some extra hacks?

as I've illustrated earlier, it definitely doesn't work in a script executed by a standard
Python...

</F>

It is quite possible that the OP played with re.search before
typing in the commands he quoted; however *you* claimed that it [his
quoted commands] worked "only if you type it into the interactive
prompt". It doesn't work, in the unqualified sense that I understood.

Anyway, enough of punch-ups about how many dunces can angle on the hat
of a pun -- I did appreciate your other posting about multiple patterns
and "lastindex"; thanks.
 
Stephen Thorne

Is there an easy way round this? AFAIK you can't assign a variable in
a compound statement, so you can't use elif at all here and hence the
problem?

I suppose you could use a monstrosity like this, which relies on the
fact that list.append() returns None...

line = "123123"
m = []
if m.append(re.search(r'^(\d+)$', line)) or m[-1]:
    print "int",int(m[-1].group(1))
elif m.append(re.search(r'^(\d*\.\d*)$', line)) or m[-1]:
    print "float",float(m[-1].group(1))
else:
    print "unknown thing", line

I wrote a scanner for a recursive descent parser a while back. This is
the pattern I used for using multiple regexps, instead of using an
if/elif/else chain.

import re
patterns = [
    (re.compile('^(\d+)$'),int),
    (re.compile('^(\d+\.\d*)$'),float),
]

def convert(s):
    for regexp, action in patterns:
        m = regexp.match(s)
        if not m:
            continue
        return action(m.group(1))
    raise ValueError, "Invalid input %r, was not a numeric string" % (s,)

if __name__ == '__main__':
    tests = [ ("123123",123123), ("123.123",123.123), ("123.",123.) ]
    for input, expected in tests:
        assert convert(input) == expected

    try:
        convert('')
        convert('abc')
    except:
        pass
    else:
        assert None,"Should Raise on invalid input"


Of course, I wrote the tests first. I used your regexps, but I was
confused as to why you were always using .group(1); I decided to
leave it. I would probably actually send the entire match object to
the action, using something like:
(re.compile('^(\d+)$'), lambda m: int(m.group(1))),
and
return action(m)

but lambdas are going out of fashion. :(

Stephen Thorne
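
Put together, the variant that hands the whole match object to the action might look like the sketch below. It is written in Python 3 syntax (raise with a call instead of the comma form) and uses raw strings for the patterns, both of which are changes from the post rather than something the post itself contains:

```python
import re

# Each entry pairs a compiled pattern with an action that receives the
# whole match object and decides for itself which groups to read.
patterns = [
    (re.compile(r"^(\d+)$"), lambda m: int(m.group(1))),
    (re.compile(r"^(\d+\.\d*)$"), lambda m: float(m.group(1))),
]

def convert(s):
    for regexp, action in patterns:
        m = regexp.match(s)
        if m:
            return action(m)
    raise ValueError("Invalid input %r, was not a numeric string" % (s,))

print(convert("123123"))
print(convert("123."))
```

Passing the match object means an action can use several groups, or named groups, without convert() needing to know about them.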
 
Nick Craig-Wood

Fredrik Lundh said:
that's not a very efficient way to match multiple patterns, though. a
much better way is to combine the patterns into a single one, and use
the "lastindex" attribute to figure out which one matched.

lastindex is useful, yes.

I take your point. However I don't find the below very readable -
making 5 small regexps into 1 big one, plus a game of count the
brackets doesn't strike me as a huge win...

xml = re.compile(r"""
<([/?!]?\w+) # 1. tags
|&(\#?\w+); # 2. entities
|([^<>&'\"=\s]+) # 3. text strings (no special characters)
|(\s+) # 4. whitespace
|(.) # 5. special characters
""", re.VERBOSE)

It's probably faster though, so I give in gracelessly ;-)
 
Fredrik Lundh

Nick said:
I take your point. However I don't find the below very readable -
making 5 small regexps into 1 big one, plus a game of count the
brackets doesn't strike me as a huge win...

if you're doing that a lot, you might wish to create a helper function.

the undocumented sre.Scanner provides a ready-made mechanism for this
kind of RE matching; see

http://aspn.activestate.com/ASPN/Mail/Message/python-dev/1614344

for some discussion.

here's (a slight variation of) the code example they're talking about:

def s_ident(scanner, token): return token
def s_operator(scanner, token): return "op%s" % token
def s_float(scanner, token): return float(token)
def s_int(scanner, token): return int(token)

scanner = sre.Scanner([
    (r"[a-zA-Z_]\w*", s_ident),
    (r"\d+\.\d*", s_float),
    (r"\d+", s_int),
    (r"=|\+|-|\*|/", s_operator),
    (r"\s+", None),
    ])
(['sum', 'op=', 3, 'op*', 'foo', 'op+', 312.5, 'op+', 'bar'], '')

</F>
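
The call that produced the token list above is missing from this copy of the thread. As a sketch only: in current Python 3 the same class is reachable as re.Scanner (still undocumented, so its behaviour is not guaranteed), and the input string below is an assumption chosen to give a token list of the same shape as the one shown:

```python
import re

def s_ident(scanner, token): return token
def s_operator(scanner, token): return "op%s" % token
def s_float(scanner, token): return float(token)
def s_int(scanner, token): return int(token)

# Order matters: the float rule must come before the int rule,
# otherwise "312.50" would be split at the decimal point.
scanner = re.Scanner([
    (r"[a-zA-Z_]\w*", s_ident),
    (r"\d+\.\d*", s_float),
    (r"\d+", s_int),
    (r"=|\+|-|\*|/", s_operator),
    (r"\s+", None),  # None means: match and discard (whitespace)
])

# scan() returns the list of action results plus any unmatched remainder.
tokens, remainder = scanner.scan("sum = 3*foo + 312.50 + bar")
print(tokens, repr(remainder))
```

An empty remainder means the whole input was consumed; a non-empty one is the text starting at the first position where no rule matched.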
 
Nick Craig-Wood

Fredrik Lundh said:
the undocumented sre.Scanner provides a ready-made mechanism for this
kind of RE matching; see

http://aspn.activestate.com/ASPN/Mail/Message/python-dev/1614344

for some discussion.

here's (a slight variation of) the code example they're talking about:

def s_ident(scanner, token): return token
def s_operator(scanner, token): return "op%s" % token
def s_float(scanner, token): return float(token)
def s_int(scanner, token): return int(token)

scanner = sre.Scanner([
    (r"[a-zA-Z_]\w*", s_ident),
    (r"\d+\.\d*", s_float),
    (r"\d+", s_int),
    (r"=|\+|-|\*|/", s_operator),
    (r"\s+", None),
    ])
(['sum', 'op=', 3, 'op*', 'foo', 'op+', 312.5, 'op+', 'bar'], '')

That is very cool - exactly the kind of problem I come across quite
often!

I've found the online documentation (using pydoc) for re / sre in
general to be a bit lacking.

For instance nowhere in

pydoc sre

does it tell you what methods a match object has (or even what type it
is). To find this out you have to look at the HTML documentation.
This is probably what Windows people look at by default but Unix
hackers like me expect everything (or at least a hint) to be in the
man/pydoc pages.

Just noticed in pydoc2.4 a new section

MODULE DOCS
http://www.python.org/doc/current/lib/module-sre.html

Which is at least a hint that you are looking in the wrong place!
...however that page doesn't exist ;-)
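
One way to get the "hint" without leaving the interpreter is to inspect a match object directly. A small sketch in Python 3 syntax (the variable names are illustrative, and the exact list of names varies between Python versions):

```python
import re

m = re.search(r"foo(.*)bar", "The food is under the bar in the barn.")

# The type of the object tells you what to look up in the documentation.
match_type = type(m)

# dir() lists attributes and methods; names starting with "_" are skipped.
public_names = [name for name in dir(m) if not name.startswith("_")]

print(match_type.__name__)
print(public_names)
```

Calling help(m) at the interactive prompt shows the docstrings for the same methods.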
 
