Regular expression question -- exclude substring

dreamerbin · Nov 7, 2005

Hi,

I'm having trouble extracting substrings using regular expressions. Here
is my problem:

I want to find the substring that is immediately before a given
substring. For example: from
"00 noise1 01 noise2 00 target 01 target_mark",
want to get
"00 target 01"
which is before
"target_mark".
My regular expression
"(00.*?01) target_mark"
will extract
"00 noise1 01 noise2 00 target 01".

I'm thinking that the solution to my problem might be to use a regular
expression to exclude the substring "target_mark", which will replace
the part of ".*" above. However, I don't know how to exclude a
substring. Can anyone help on this? Or maybe give another solution to
my problem? Thanks very much.

Kent Johnson · Nov 8, 2005

If there is a character that can't appear in the text between the two numbers, match "anything but that character" there instead of ".". For example, if spaces can only appear where you show them, use
"(00 [^ ]* 01) target_mark" or
"(00 \S* 01) target_mark"
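
A quick way to check both patterns against the string from the question is sketched below. The variable names are only illustrative, and the sketch assumes the text between "00" and "01" never contains a space.

```python
import re

s = "00 noise1 01 noise2 00 target 01 target_mark"

# [^ ]* cannot cross a space, so the captured span is limited to
# "00", one space-free token, and "01".
by_negated_class = re.findall(r"(00 [^ ]* 01) target_mark", s)

# \S* is a similar shorthand that refuses any whitespace, not just spaces.
by_non_whitespace = re.findall(r"(00 \S* 01) target_mark", s)

print(by_negated_class)
print(by_non_whitespace)
```

Both patterns fail at the first "00" because " noise2" follows the first "01" rather than " target_mark", so the only match starts at the second "00".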

Kent

google · Nov 8, 2005

Ya, for some reason your non-greedy "?" doesn't seem to be taking.
This works:

re.sub('(.*)(00.*?01) target_mark', r'\2', your_string)
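
Here is a minimal sketch of why the leading greedy group helps, using the string from the question. The re.search variant is an addition for comparison, not part of the suggestion above.

```python
import re

s = "00 noise1 01 noise2 00 target 01 target_mark"

# The leading greedy (.*) swallows as much as it can, so the second group
# starts at the last '00' from which the rest of the pattern still matches.
pattern = re.compile(r"(.*)(00.*?01) target_mark")

# re.sub replaces the matched text with group 2.
via_sub = pattern.sub(r"\2", s)

# re.search returns the same group without rewriting the string.
m = pattern.search(s)
via_search = m.group(2) if m else None

print(via_sub)
print(via_search)
```

Note that re.sub only rewrites the matched portion, so any text following target_mark would stay in the result; pulling out group 2 with re.search avoids that.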

James Stroud · Nov 8, 2005

The non-greedy operator is actually acting as expected. This is because
non-greedy operators are "forward looking", not "backward looking". The match
begins at the first '00' the scan comes across, and the non-greedy .*? then
stops at the first occurrence of '01' that completes the whole match. A greedy
.* would instead match as much as it could, gobbling up all '01's before the
last one, because these also match '.*'. For example:

py> rgx = re.compile(r"(00.*01) target_mark")
py> rgx.findall('00 noise1 01 noise2 00 target 01 target_mark 00 dowhat 01')
['00 noise1 01 noise2 00 target 01 target_mark 00 dowhat 01']
py> rgx = re.compile(r"(00.*?01) target_mark")
py> rgx.findall('00 noise1 01 noise2 00 target 01 target_mark 00 dowhat 01')
['00 noise1 01 noise2 00 target 01', '00 dowhat 01']

My understanding is that backward looking operators are very resource
expensive to implement.
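
One hedged way to work with this forward-scanning behaviour, rather than against it, is a negative lookahead that forbids a second '00' inside the captured span. Nobody in this thread proposed it; the sketch assumes the wanted span never itself contains '00'.

```python
import re

s = "00 noise1 01 noise2 00 target 01 target_mark"

# Plain non-greedy: the match starts at the first '00' the scan reaches.
plain = re.findall(r"(00.*?01) target_mark", s)

# Tempered: every consumed character must not begin another '00', so an
# attempt starting at the first '00' fails and the scan moves on to the
# last '00' before the mark.
tempered = re.findall(r"(00(?:(?!00).)*?01) target_mark", s)

print(plain)
print(tempered)
```

The lookahead is checked at every character of the span, so this trades some speed for not needing a character that is guaranteed absent from the data.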

James

--
James Stroud
UCLA-DOE Institute for Genomics and Proteomics
Box 951570
Los Angeles, CA 90095

http://www.jamesstroud.com/

Kent Johnson · Nov 8, 2005

??? not in my Python:

>>> import re
>>> rgx = re.compile(r"(00.*01) target_mark")
>>> rgx.findall('00 noise1 01 noise2 00 target 01 target_mark 00 dowhat 01')
['00 noise1 01 noise2 00 target 01']
>>> rgx = re.compile(r"(00.*?01) target_mark")
>>> rgx.findall('00 noise1 01 noise2 00 target 01 target_mark 00 dowhat 01')
['00 noise1 01 noise2 00 target 01']

Since target_mark only occurs once in the string, the greedy and non-greedy matches are the same in this case.

Kent

James Stroud · Nov 8, 2005

Somehow my cutting and pasting got messed up. It should be:

py> rgx = re.compile(r"(00.*?01) target_mark")
py> rgx.findall('00 noise1 01 noise2 00 target 01 target_mark 00 dowhat 01 target_mark')
['00 noise1 01 noise2 00 target 01', '00 dowhat 01']
py> rgx = re.compile(r"(00.*01) target_mark")
py> rgx.findall('00 noise1 01 noise2 00 target 01 target_mark 00 dowhat 01 target_mark')
['00 noise1 01 noise2 00 target 01 target_mark 00 dowhat 01']

Sorry about that.

James


Bengt Richter · Nov 8, 2005

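If the delimiting strings are fixed, we can use plain Python string methods, e.g.
(not tested beyond what you see ;-)

>>> s = '00 noise1 01 noise2 00 target 01 target_mark'
>>> # the beg, end and tmk defaults are inferred from the results shown below
>>> def findit(s, beg='00', end='01', tmk=' target_mark'):
...     start = 0
...     while True:
...         t = s.find(tmk, start)            # next target mark
...         if t<0: break
...         start = s.rfind(beg, start, t)    # last beg before that mark
...         if start<0: break
...         e = s.find(end, start, t)         # first end after that beg
...         if e+len(end)==t: # _just_ after
...             yield s[start:e+len(end)]
...         start = t+len(tmk)
...
>>> list(findit(s))
['00 target 01']
>>> # two spaces after 'almost 01', so that span is not adjacent to the mark
>>> s2 = s + ' garbage noise3 00 almost 01  target_mark 00 success 01 target_mark'
>>> list(findit(s2))
['00 target 01', '00 success 01']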

(I didn't enforce exact adjacency the first time. Obviously it would be more efficient
to search for end+tmk directly, instead of searching for tmk, then back to beg and forward to end ;-)

If there can be spurious target_marks, and tricky matching spans, additional logic may be needed.
Too lazy to think about it ;-)
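
The end+tmk shortcut mentioned above could look roughly like the sketch below. The function name find_before_mark, its defaults, and the test strings are all hypothetical. Unlike a version that searches forward for the first end after beg, this one does not reject a span that contains an extra '01' in the middle.

```python
def find_before_mark(s, beg="00", end="01", tmk=" target_mark"):
    """Yield each beg...end span that is immediately followed by tmk."""
    tail = end + tmk
    pos = 0
    while True:
        t = s.find(tail, pos)          # next 'end' directly followed by the mark
        if t < 0:
            break
        start = s.rfind(beg, pos, t)   # last 'beg' before that 'end'
        if start >= 0:
            yield s[start:t + len(end)]
        pos = t + len(tail)            # continue after the mark just handled


s = "00 noise1 01 noise2 00 target 01 target_mark"
# Hypothetical input: two spaces after 'almost 01' keep it from being adjacent.
s2 = s + " garbage noise3 00 almost 01  target_mark 00 success 01 target_mark"

print(list(find_before_mark(s)))
print(list(find_before_mark(s2)))
```

Searching for the joined string means adjacency is enforced by the find itself, so no separate length check is needed.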

Regards,
Bengt Richter
