RegEx issues

Mark Tolonen · Jan 24, 2009

Sean Brown said:
Using python 2.4.4 on OpenSolaris 2008.11

I have the following string created by opening a url that has the
following string in it:

td[ct] = [[ ... ]];\r\n

The ... above is what I'm interested in extracting which is really a
whole bunch of text. So I think the regex \[\[(.*)\]\]; should do it.
The problem is it appears that python is escaping the \ in the regex
because I see this:

>>> reg = '\[\[(.*)\]\];'
>>> reg
'\\[\\[(.*)\\]\\];'

Now to me looks like it would match the string - \[\[ ... \]\];

You are viewing the repr of the string:

>>> reg='\[\[(.*)\]\];'
>>> reg
'\\[\\[(.*)\\]\\];'
>>> print reg
\[\[(.*)\]\];     <== these are the chars passed to regex

The backslashes tell the regex engine that the [ and ] are literal.

Which obviously doesn't match anything because there are no literal \ in
the above string. Leaving the \ out of the \[\[ above has re.compile
throw an error because [ is a special regex character. Which is why it
needs to be escaped in the first place.

I am either doing something really wrong, which is very possible, or I've
missed something obvious. Either way, I thought I'd ask why this isn't
working and why it seems to be changing my regex to something else.

Did you try it?

>>> import re
>>> s='td[ct] = [[blah blah]];\r\n'
>>> re.search(reg,s).group(1)
'blah blah'

-Mark
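
For anyone who wants to check the repr-versus-print point outside the interactive prompt, here is a minimal sketch. It is written for Python 3 (print is a function), not the 2.4.4 from the original question, and it uses a raw string so that current interpreters do not warn about the \[ escape. The variable names are only illustrative.

```python
import re

# Raw string: the 13 characters \[\[(.*)\]\]; exactly as typed.
reg = r'\[\[(.*)\]\];'

print(repr(reg))  # '\\[\\[(.*)\\]\\];'  (what the prompt echoes: backslashes doubled)
print(reg)        # \[\[(.*)\]\];        (the characters actually handed to re)
print(len(reg))   # 13                   (each backslash is one character)

s = 'td[ct] = [[blah blah]];\r\n'
match = re.search(reg, s)
print(match.group(1))  # blah blah
```

The doubled backslashes exist only in the repr; the pattern itself contains four single backslashes.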

Steve Holden · Jan 24, 2009

Mark said:
Did you try it?

>>> import re
>>> s='td[ct] = [[blah blah]];\r\n'
>>> re.search(reg,s).group(1)
'blah blah'

Beware, though, that regex quantifiers are greedy by default, so if
there's a chance that two [[ ... ]] [[ ... ]] can appear on the same line,
the group in the above pattern will capture everything from the first [[
to the last ]]:

... ]] [[ ...

regards
Steve
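
The greedy behaviour is easy to reproduce. The sketch below assumes a made-up input line with two bracketed chunks, each followed by a semicolon, and compares the original pattern with a non-greedy variant (.*? instead of .*). The non-greedy form is one common way around the problem, not something proposed in the thread itself.

```python
import re

# Hypothetical line containing two [[ ... ]]; chunks.
line = 'a = [[ one ]]; b = [[ two ]];'

# Greedy: .* runs to the end of the line, then backs up to the LAST ]];
greedy = re.search(r'\[\[(.*)\]\];', line).group(1)

# Non-greedy: .*? stops at the FIRST ]]; it can reach.
lazy = re.search(r'\[\[(.*?)\]\];', line).group(1)

# findall with the non-greedy pattern returns every chunk separately.
chunks = re.findall(r'\[\[(.*?)\]\];', line)

print(repr(greedy))  # ' one ]]; b = [[ two '
print(repr(lazy))    # ' one '
print(chunks)        # [' one ', ' two ']
```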

Roy Smith · Jan 24, 2009

Sean Brown said:
The problem is it appears that python is escaping the \ in the regex
because I see this:

reg = '\[\[(.*)\]\];'

The first trick of working with regexes in Python is to *always* use raw
strings. Instead of

reg = '\[\[(.*)\]\];'

you want

reg = r'\[\[(.*)\]\];'

In this case, I think it ends up not mattering, but it's one less thing to
worry about. Next, when looking at something like

'\\[\\[(.*)\\]\\];'

it's hard to see exactly what all the backslashes mean. Which are real and
which are escapes? Try doing

>>> print reg
\[\[(.*)\]\];

which gets you the str(reg) instead of repr(reg). Another trick when
you're not 100% sure what you're looking at is to explode the string like this:

>>> [c for c in reg]
['\\', '[', '\\', '[', '(', '.', '*', ')', '\\', ']', '\\', ']', ';']

MRAB · Jan 24, 2009

Roy Smith wrote:
[snip]

Another trick when you're not 100% sure what you're looking at is to
explode the string like this:

>>> [c for c in reg]
['\\', '[', '\\', '[', '(', '.', '*', ')', '\\', ']', '\\', ']', ';']

A shorter way is list(reg).
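
A small sketch tying the two tricks together, written for Python 3. It compares the raw-string spelling of the pattern with a fully escaped ordinary literal and then explodes it with list(). The names raw and escaped are just for illustration.

```python
# Two spellings of the same 13-character pattern.
raw = r'\[\[(.*)\]\];'          # raw string: backslashes are taken literally
escaped = '\\[\\[(.*)\\]\\];'   # ordinary string: each \\ yields one backslash

print(raw == escaped)  # True

# Exploding the string shows one list element per real character.
chars = list(raw)
print(chars)
print(len(chars))      # 13
```

Each '\\' in the printed list is the repr of a single backslash character, which is why the list has 13 elements and not 17.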

Sean Brown · Jan 24, 2009

Using python 2.4.4 on OpenSolaris 2008.11

I have the following string created by opening a url that has the
following string in it:

td[ct] = [[ ... ]];\r\n

The ... above is what I'm interested in extracting which is really a
whole bunch of text. So I think the regex \[\[(.*)\]\]; should do it.
The problem is it appears that python is escaping the \ in the regex
because I see this:

>>> reg = '\[\[(.*)\]\];'
>>> reg
'\\[\\[(.*)\\]\\];'

Now to me looks like it would match the string - \[\[ ... \]\];

Which obviously doesn't match anything because there are no literal \ in
the above string. Leaving the \ out of the \[\[ above has re.compile
throw an error because [ is a special regex character. Which is why it
needs to be escaped in the first place.

I am either doing something really wrong, which is very possible, or I've
missed something obvious. Either way, I thought I'd ask why this isn't
working and why it seems to be changing my regex to something else.

John Machin · Jan 25, 2009

Sean said:

I have the following string ...: "td[ct] = [[ ... ]];\r\n"
The ... (representing text in the string) is what I'm extracting ....
So I think the regex \[\[(.*)\]\]; should do it.
The problem is it appears that python is escaping the \ in the regex
because I see this:

>>> reg = '\[\[(.*)\]\];'
>>> reg
'\\[\\[(.*)\\]\\];'
Now to me looks like it would match the string - \[\[ ... \]\];
...

OK, you already have a good answer as to what is happening.
I'll mention that raw strings were put in the language exactly for
regex work. They are useful for any time you need to use the backslash
character (\) within a string (but not as the final character).
For example:
len(r'\a\b\c\d\e\f\g\h') == 16 and len('\a\b\c\d\e\f\g\h') == 13

If you get in the habit of typing regex strings as r'...' or r"...",
and examining the patterns with print(somestring), you'll ease your life.
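
In the pattern from this thread the r prefix happens not to change anything, so here is a sketch of a case where it does. The example text is made up. \b is a valid Python escape (backspace), so without the r prefix the regex engine never sees the word-boundary syntax.

```python
import re

text = 'a cat sat'

# Ordinary string: Python turns each \b into a backspace character
# before re ever sees the pattern, so nothing in the text matches.
plain = re.search('\bcat\b', text)

# Raw string: re receives backslash + b, which it reads as a word boundary.
raw = re.search(r'\bcat\b', text)

print(len('\b'), len(r'\b'))  # 1 2
print(plain)                  # None
print(raw.group(0))           # cat
```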

All excellent suggestions, but I'm surprised that nobody has mentioned
the re.VERBOSE format.

Manual sez:
'''
re.X
re.VERBOSE
This flag allows you to write regular expressions that look nicer.
Whitespace within the pattern is ignored, except when in a character
class or preceded by an unescaped backslash, and, when a line contains
a '#' neither in a character class or preceded by an unescaped
backslash, all characters from the leftmost such '#' through the end
of the line are ignored.

That means that the two following regular expression objects that
match a decimal number are functionally equal:

a = re.compile(r"""\d +  # the integral part
                   \.    # the decimal point
                   \d *  # some fractional digits""", re.X)
b = re.compile(r"\d+\.\d*")
'''

My comments:
(1) "looks nicer" is not the point; it's understandability
(2) if you need a space, use a character class ->[ ]<- not an
unescaped backslash ->\ <-
(3) the indentation in the manual doesn't fit my idea of "looks
nicer"; I'd do
    a = re.compile(r"""
        \d +  # the integral part
        \.    # the decimal point
        \d *  # some fractional digits
        """, re.X)
(4) you can aid understandability by more indentation, especially when
you have multiple capturing expressions and (?......) gizmoids, e.g.
    r"""
    (
        .....  # prefix
    )
    (
        (?......)  # look-back assertion
        (?....)    # etc etc
    )
    """
Worth a try if you find yourself going nuts getting the parentheses
matching.

Cheers,
John
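
Applied to the pattern from the original question, the re.VERBOSE layout might look like the sketch below (Python 3). The comments inside the pattern are my own reading of each piece, and the variable names are illustrative.

```python
import re

# Verbose form: whitespace and # comments inside the pattern are ignored.
verbose = re.compile(r"""
    \[\[     # two literal opening brackets
    (.*)     # group 1: everything in between (greedy, as in the original)
    \]\];    # two literal closing brackets and the semicolon
    """, re.VERBOSE)

# Compact form of the same pattern, for comparison.
compact = re.compile(r'\[\[(.*)\]\];')

s = 'td[ct] = [[blah blah]];\r\n'
verbose_result = verbose.search(s).group(1)
compact_result = compact.search(s).group(1)

print(verbose_result)                    # blah blah
print(verbose_result == compact_result)  # True
```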

Gabriel Genellina · Jan 25, 2009

On Sat, 24 Jan 2009 19:03:26 -0200, Sean Brown wrote:

Using python 2.4.4 on OpenSolaris 2008.11

I have the following string created by opening a url that has the
following string in it:

td[ct] = [[ ... ]];\r\n

The ... above is what I'm interested in extracting which is really a
whole bunch of text. So I think the regex \[\[(.*)\]\]; should do it.
The problem is it appears that python is escaping the \ in the regex
because I see this:

>>> reg = '\[\[(.*)\]\];'
>>> reg
'\\[\\[(.*)\\]\\];'

Now to me looks like it would match the string - \[\[ ... \]\];

No. Python's escape character is the backslash \; if you want to include a
backslash inside a string, you have to double it. For example, these are
all single-character strings: 'a' '\n' '\\'
Coincidentally (or not), the backslash is also the escape character in
regular expressions, so the two levels of escaping pile up: to get a
Python string containing \a (two characters) you have to write "\\a".
That's rather tedious and error-prone. To help with this, Python allows
"raw string literals", where no escape interpretation is done. Just
put an r before the opening quote: r"\(\d+\)" (seven characters; matches
a number inside parentheses).

Also, note that when you *evaluate* an expression in the interpreter (like
the lone "reg" above), it prints the "repr" of the result: for a string,
it is the escaped contents surrounded by quotes. (That's very handy when
debugging, but may be confusing if you don't know how to interpret it.)

Third, Python is very permissive with wrong escape sequences: instead of
flagging them as an error, it just leaves them in the string. In your case,
\[ is an invalid escape sequence, which is left untouched in the string.

py> reg = r'\[\[(.*)\]\];'
py> reg
'\\[\\[(.*)\\]\\];'
py> print reg
\[\[(.*)\]\];
py> len(reg)
13

Which obviously doesn't match anything because there are no literal \ in
the above string. Leaving the \ out of the \[\[ above has re.compile
throw an error because [ is a special regex character. Which is why it
needs to be escaped in the first place.

It works in this example:

py> txt = """
... Some text
... and td[ct] = [[ more things ]];
... more text"""
py> import re
py> m = re.search(reg, txt)
py> m
<_sre.SRE_Match object at 0x00AC66A0>
py> m.groups()
(' more things ',)

So maybe your r.e. doesn't match the text (the final ";"? whitespace?)
For more info, see the Regular Expressions HOWTO at
http://docs.python.org/howto/regex.html
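
If the pattern really does fail on the live page, two likely suspects are whitespace before the final ";" and a payload that spans several lines (by default "." does not match a newline). The sketch below is one way to guard against both, written for Python 3. The helper name extract_payload, the sample page text, and the choice of a non-greedy group with re.DOTALL are all my assumptions, since the real page content was never posted.

```python
import re

def extract_payload(text):
    """Return the text between [[ and ]]; or None when nothing matches."""
    # re.DOTALL lets . cross line breaks; \s* tolerates whitespace before ;
    match = re.search(r'\[\[(.*?)\]\]\s*;', text, re.DOTALL)
    if match is None:
        return None  # avoid AttributeError from calling .group() on None
    return match.group(1)

# Hypothetical page text: payload split over two lines, space before ';'.
page = 'junk\r\ntd[ct] = [[ line one\r\nline two ]] ;\r\nmore'

print(repr(extract_payload(page)))          # ' line one\r\nline two '
print(extract_payload('no brackets here'))  # None

# Without re.DOTALL the same pattern finds nothing in this text.
print(re.search(r'\[\[(.*?)\]\]\s*;', page))  # None
```

Checking for None before calling .group() also turns a confusing AttributeError into something you can test for explicitly.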
