RegEx issues

Mark Tolonen

Sean Brown said:
Using python 2.4.4 on OpenSolaris 2008.11

I have the following string created by opening a url that has the
following string in it:

td[ct] = [[ ... ]];\r\n

The ... above is what I'm interested in extracting which is really a
whole bunch of text. So I think the regex \[\[(.*)\]\]; should do it.
The problem is it appears that python is escaping the \ in the regex
because I see this:
reg = '\[\[(.*)\]\];'
reg
'\\[\\[(.*)\\]\\];'

Now to me that looks like it would match the string - \[\[ ... \]\];

You are viewing the repr of the string:
>>> reg='\[\[(.*)\]\];'
>>> reg
'\\[\\[(.*)\\]\\];'
>>> print reg
\[\[(.*)\]\]; <== these are the chars passed to regex

The backslashes tell the regex engine that the [ and ] are literal.

Which obviously doesn't match anything because there are no literal \ in
the above string. Leaving the \ out of the \[\[ above has re.compile
throw an error because [ is a special regex character. Which is why it
needs to be escaped in the first place.

I am either doing something really wrong, which is very possible, or I've
missed something obvious. Either way, I thought I'd ask why this isn't
working and why it seems to be changing my regex to something else.

Did you try it?
>>> import re
>>> s='td[ct] = [[blah blah]];\r\n'
>>> re.search(reg,s).group(1)
'blah blah'
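
For anyone who wants to reproduce this outside the interactive prompt, here is a minimal sketch of the same check. It assumes nothing beyond the sample string s from this post; the variable names shown_at_prompt and captured are only illustrative, and single-argument print() is used so the script should run on both Python 2 and Python 3.

```python
import re

reg = r'\[\[(.*)\]\];'              # the 13 characters the regex engine receives
s = 'td[ct] = [[blah blah]];\r\n'   # sample string from the post

# Typing a bare "reg" at the prompt displays repr(reg): quotes are added and
# every backslash is doubled for display only.
shown_at_prompt = repr(reg)
print(shown_at_prompt)              # '\\[\\[(.*)\\]\\];'

# print shows the actual characters stored in the string.
print(reg)                          # \[\[(.*)\]\];

captured = re.search(reg, s).group(1)
print(captured)                     # blah blah
```

The string itself holds four backslashes; only its displayed repr holds eight.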

-Mark
 
Steve Holden

Mark said:
Did you try it?
>>> import re
>>> s='td[ct] = [[blah blah]];\r\n'
>>> re.search(reg,s).group(1)
'blah blah'

Beware, though, that by default regex matches are greedy, so if there's
a chance that two [[ ... ]] [[ ... ]] can appear on the same line then
the above pattern will match

... ]] [[ ...
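
A small sketch of the greedy behaviour described above, and of one common way around it (the non-greedy quantifier .*?). The sample line is made up for illustration; it is not taken from Sean's actual page.

```python
import re

# Hypothetical line containing two bracketed items.
line = 'a = [[ one ]]; b = [[ two ]];'

# Greedy: .* runs to the last "]];" on the line, swallowing the middle.
greedy = re.search(r'\[\[(.*)\]\];', line).group(1)

# Non-greedy: .*? stops at the first "]];" it can, so each item is separate.
lazy = re.findall(r'\[\[(.*?)\]\];', line)

print(repr(greedy))   # ' one ]]; b = [[ two '
print(lazy)           # [' one ', ' two ']
```

Whether the non-greedy form is the right fix depends on whether the text between the brackets can itself contain "]];".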

regards
Steve
 
Roy Smith

Sean Brown said:
The problem is it appears that python is escaping the \ in the regex
because I see this:
reg = '\[\[(.*)\]\];'

The first trick of working with regexes in Python is to *always* use raw
strings. Instead of

reg = '\[\[(.*)\]\];'

you want

reg = r'\[\[(.*)\]\];'

In this case, I think it ends up not mattering, but it's one less thing to
worry about. Next, when looking at something like
'\\[\\[(.*)\\]\\];'

it's hard to see exactly what all the backslashes mean. Which are real and
which are escapes? Try doing
\[\[(.*)\]\];

which gets you the str(reg) instead of repr(reg). Another trick when
you're not 100% what you're looking at is to explode the string like this:
['\\', '[', '\\', '[', '(', '.', '*', ')', '\\', ']', '\\', ']', ';']
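
As a rough sketch of both points: the raw string and the hand-doubled version hold exactly the same characters, and list() is one way (an assumption here, other spellings work too) to explode a string into single characters. The names reg_raw, reg_escaped and chars are illustrative only.

```python
reg_raw = r'\[\[(.*)\]\];'
reg_escaped = '\\[\\[(.*)\\]\\];'   # same characters, each backslash doubled by hand

print(reg_raw == reg_escaped)       # True

chars = list(reg_raw)               # one list element per character
print(len(chars))                   # 13
print(chars.count('\\'))            # 4, the real backslashes in the pattern
```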
 
Sean Brown

Using python 2.4.4 on OpenSolaris 2008.11

I have the following string created by opening a url that has the
following string in it:

td[ct] = [[ ... ]];\r\n

The ... above is what I'm interested in extracting which is really a
whole bunch of text. So I think the regex \[\[(.*)\]\]; should do it.
The problem is it appears that python is escaping the \ in the regex
because I see this:
'\\[\\[(.*)\\]\\];'

Now to me that looks like it would match the string - \[\[ ... \]\];

Which obviously doesn't match anything because there are no literal \ in
the above string. Leaving the \ out of the \[\[ above has re.compile
throw an error because [ is a special regex character. Which is why it
needs to be escaped in the first place.

I am either doing something really wrong, which is very possible, or I've
missed something obvious. Either way, I thought I'd ask why this isn't
working and why it seems to be changing my regex to something else.
 
John Machin

Sean said:
I have the following string ...:  "td[ct] = [[ ... ]];\r\n"
The ... (representing text in the string) is what I'm extracting ....
So I think the regex \[\[(.*)\]\]; should do it.
The problem is it appears that python is escaping the \ in the regex
because I see this:
reg = '\[\[(.*)\]\];'
reg
'\\[\\[(.*)\\]\\];'
Now to me that looks like it would match the string - \[\[ ... \]\];
...

OK, you already have a good answer as to what is happening.
I'll mention that raw strings were put in the language exactly for
regex work.  They are useful for any time you need to use the backslash
character (\) within a string (but not as the final character).
For example:
     len(r'\a\b\c\d\e\f\g\h') == 16 and len('\a\b\c\d\e\f\g\h') == 13

If you get in the habit of typing regex strings as r'...' or r"...",
and examining the patterns with print(somestring), you'll ease your life.
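
A case where the r prefix actually changes the result, unlike Sean's pattern, is \b. The sketch below uses a made-up sample text; the names plain and raw are illustrative.

```python
import re

text = 'a cat sat'   # hypothetical sample text

# Without r, Python converts \b to a backspace character (0x08) before the
# re module ever sees the pattern, so the search looks for backspaces.
plain = re.search('\bcat\b', text)

# With r, re receives backslash + b and treats it as a word boundary.
raw = re.search(r'\bcat\b', text)

print(plain)            # None
print(raw.group(0))     # cat
print(len('\b'))        # 1
print(len(r'\b'))       # 2
```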

All excellent suggestions, but I'm surprised that nobody has mentioned
the re.VERBOSE format.

Manual sez:
'''
re.X
re.VERBOSE
This flag allows you to write regular expressions that look nicer.
Whitespace within the pattern is ignored, except when in a character
class or preceded by an unescaped backslash, and, when a line contains
a '#' neither in a character class or preceded by an unescaped
backslash, all characters from the leftmost such '#' through the end
of the line are ignored.

That means that the two following regular expression objects that
match a decimal number are functionally equal:

a = re.compile(r"""\d +  # the integral part
                   \.    # the decimal point
                   \d *  # some fractional digits""", re.X)
b = re.compile(r"\d+\.\d*")
'''

My comments:
(1) "looks nicer" is not the point; it's understandability
(2) if you need a space, use a character class ->[ ]<- not an
unescaped backslash ->\ <-
(3) the indentation in the manual doesn't fit my idea of "looks
nicer"; I'd do
a = re.compile(r"""
    \d +  # the integral part
    \.    # the decimal point
    \d *  # some fractional digits
    """, re.X)
(4) you can aid understandability by more indentation especially when
you have multiple capturing expressions and (?......) gizmoids e.g.
r"""
    (
        ..... # prefix
    )
    (
        (?......) # look-back assertion
        (?....) # etc etc
    )
"""
Worth a try if you find yourself going nuts getting the parentheses
matching.
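
Applied to the pattern from this thread, a verbose version might look like the sketch below. The non-greedy .*? and the name bracketed are my own choices for illustration, not something from the original post.

```python
import re

bracketed = re.compile(r"""
    \[\[        # two literal opening brackets
    ( .*? )     # group 1: the text in between, non-greedy
    \]\] ;      # two literal closing brackets, then the semicolon
    """, re.VERBOSE)

s = 'td[ct] = [[blah blah]];\r\n'
found = bracketed.search(s).group(1)
print(found)    # blah blah
```

The spaces inside the pattern are ignored by re.VERBOSE; the space inside "blah blah" is in the subject string and is matched by the dot as usual.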

Cheers,
John
 
Gabriel Genellina

On Sat, 24 Jan 2009 19:03:26 -0200, Sean Brown wrote:
Using python 2.4.4 on OpenSolaris 2008.11

I have the following string created by opening a url that has the
following string in it:

td[ct] = [[ ... ]];\r\n

The ... above is what I'm interested in extracting which is really a
whole bunch of text. So I think the regex \[\[(.*)\]\]; should do it.
The problem is it appears that python is escaping the \ in the regex
because I see this:
reg = '\[\[(.*)\]\];'
reg
'\\[\\[(.*)\\]\\];'

Now to me that looks like it would match the string - \[\[ ... \]\];

No. Python's escape character is the backslash \; if you want to include a
backslash inside a string, you have to double it. For example, these are
all single character strings: 'a' '\n' '\\'
Coincidentally (or not), the backslash has a similar meaning in a regular
expression: if you want a string containing \a (two characters) you should
write "\\a".
That's rather tedious and error prone. To help with this, Python allows
for "raw-string literals", where no escape interpretation is done. Just
put an r before the opening quote: r"\(\d+\)" (seven characters; matches
numbers inside parentheses).
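
The character counts above can be checked directly. This is just a quick sketch; the sample text 'call (42) now' is made up for illustration.

```python
import re

print(len('a'))          # 1
print(len('\n'))         # 1, a single newline character
print(len('\\'))         # 1, a single backslash
print(len('\\a'))        # 2, backslash followed by a
print(len(r'\(\d+\)'))   # 7

# The raw-string pattern matches a number inside parentheses.
m = re.search(r'\(\d+\)', 'call (42) now')
print(m.group(0))        # (42)
```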

Also, note that when you *evaluate* an expression in the interpreter (like
the lone "reg" above), it prints the "repr" of the result: for a string,
it is the escaped contents surrounded by quotes. (That's very handy when
debugging, but may be confusing if you don't know how to interpret it)

Third, Python is very permissive with wrong escape sequences: they just
end up in the string, instead of flagging them as an error. In your case,
\[ is an invalid escape sequence, which is left untouched in the string.

py> reg = r'\[\[(.*)\]\];'
py> reg
'\\[\\[(.*)\\]\\];'
py> print reg
\[\[(.*)\]\];
py> len(reg)
13
Which obviously doesn't match anything because there are no literal \ in
the above string. Leaving the \ out of the \[\[ above has re.compile
throw an error because [ is a special regex character. Which is why it
needs to be escaped in the first place.

It works in this example:

py> txt = """
.... Some text
.... and td[ct] = [[ more things ]];
.... more text"""
py> import re
py> m = re.search(reg, txt)
py> m
<_sre.SRE_Match object at 0x00AC66A0>
py> m.groups()
(' more things ',)

So maybe your r.e. doesn't match the text (the final ";"? whitespace?)
For more info, see the Regular Expressions HOWTO at
http://docs.python.org/howto/regex.html
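
To illustrate the kind of mismatch suggested above, here is a sketch under two assumptions that are NOT confirmed anywhere in the thread: that the real text between the brackets spans more than one line, and that whitespace may sit before the final ";". The sample page text is invented. By default "." does not match a newline, so re.DOTALL is needed in the first case.

```python
import re

# Hypothetical page text: the payload spans two lines and a space precedes ";".
page = 'td[ct] = [[ first line\r\nsecond line ]] ;\r\n'

# The original pattern: "." stops at the newline and no space is allowed
# before the semicolon, so nothing matches.
strict = re.search(r'\[\[(.*)\]\];', page)

# A more tolerant variant: DOTALL lets "." cross line ends, \s* allows
# optional whitespace before ";", and .*? keeps the match non-greedy.
tolerant = re.search(r'\[\[(.*?)\]\]\s*;', page, re.DOTALL)

print(strict)                    # None
print(repr(tolerant.group(1)))   # ' first line\r\nsecond line '
```

If the real data looks different, printing repr() of a slice of the downloaded text around "[[" is probably the quickest way to see what the pattern has to cope with.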
 
