Regular expression question -- exclude substring

dreamerbin

Hi,

I'm having trouble extracting substrings using regular expression. Here
is my problem:

Want to find the substring that is immediately before a given
substring. For example: from
"00 noise1 01 noise2 00 target 01 target_mark",
want to get
"00 target 01"
which is before
"target_mark".
My regular expression
"(00.*?01) target_mark"
will extract
"00 noise1 01 noise2 00 target 01".

I'm thinking that the solution to my problem might be to use a regular
expression to exclude the substring "target_mark", which will replace
the part of ".*" above. However, I don't know how to exclude a
substring. Can anyone help on this? Or maybe give another solution to
my problem? Thanks very much.
 
Kent Johnson

If there is a character that can't appear in the bit between the numbers, then use everything-but-that instead of "."; for example, if spaces can only appear as you show them, use
"(00 [^ ]* 01) target_mark" or
"(00 \S* 01) target_mark"
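
A minimal sketch of the two patterns above, run against the string from the original question. It assumes, as stated, that the text between "00" and "01" never contains a space; the variable names are only for illustration.

```python
import re

text = "00 noise1 01 noise2 00 target 01 target_mark"

# Assumption: the payload between "00" and "01" contains no spaces,
# so it can be matched with "anything but a space" instead of ".".
by_class = re.search(r"(00 [^ ]* 01) target_mark", text)
by_nonspace = re.search(r"(00 \S* 01) target_mark", text)

print(by_class.group(1))     # 00 target 01
print(by_nonspace.group(1))  # 00 target 01
```

Because the payload can no longer run across a space, the match cannot start at the first "00" and stretch over "noise1 01 noise2".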

Kent
 
google

Ya, for some reason your non-greedy "?" doesn't seem to be taking.
This works:

re.sub('(.*)(00.*?01) target_mark', r'\2', your_string)
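
A short sketch of why the leading greedy group helps, using the string from the original question. The variable names are hypothetical. Note that re.sub only replaces the matched part, so any text after "target_mark" stays in the result; re.search with a group is one way around that.

```python
import re

text = "00 noise1 01 noise2 00 target 01 target_mark"

# The leading greedy ".*" swallows as much as it can, so the "00" that
# opens the second group is the last one from which the rest still matches.
via_sub = re.sub(r"(.*)(00.*?01) target_mark", r"\2", text)
print(via_sub)  # 00 target 01

# re.sub leaves unmatched text in place, so trailing text survives.
leftover = re.sub(r"(.*)(00.*?01) target_mark", r"\2", text + " tail")
print(leftover)  # 00 target 01 tail

# re.search returns only the captured group, or None if nothing matches.
m = re.search(r".*(00.*?01) target_mark", text)
via_search = m.group(1) if m else None
print(via_search)  # 00 target 01
```

With several "target_mark" occurrences this approach only yields the span before the last one.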
 
James Stroud

The non-greedy is actually acting as expected. This is because non-greedy
operators are "forward looking", not "backward looking". So the non-greedy
match begins at the first possible start-of-the-match it comes across and then
stops at the first occurrence of '01' that completes the match. A greedy
operator, by contrast, would match .* as much as it could, gobbling up every
'01' before the last, because these also match '.*'. For example:

py> rgx = re.compile(r"(00.*01) target_mark")
py> rgx.findall('00 noise1 01 noise2 00 target 01 target_mark 00 dowhat 01')
['00 noise1 01 noise2 00 target 01 target_mark 00 dowhat 01']
py> rgx = re.compile(r"(00.*?01) target_mark")
py> rgx.findall('00 noise1 01 noise2 00 target 01 target_mark 00 dowhat 01')
['00 noise1 01 noise2 00 target 01', '00 dowhat 01']

My understanding is that backward looking operators are very resource
expensive to implement.
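
Since the match always starts at the leftmost "00" that can work, one way to express the "exclude a substring" idea from the original question is a negative lookahead that stops the middle part from running across another "00". This is a sketch, not something proposed in the thread, and it assumes the text between the delimiters never itself contains "00".

```python
import re

text = "00 noise1 01 noise2 00 target 01 target_mark 00 dowhat 01 target_mark"

# (?:(?!00).)*? consumes one character at a time, but only at positions
# where "00" does not begin, so a match can never span a second "00".
pattern = re.compile(r"(00(?:(?!00).)*?01) target_mark")

found = pattern.findall(text)
print(found)  # ['00 target 01', '00 dowhat 01']
```

The lookahead is checked at every character, so this is slower than a negated character class, but it does not require a single forbidden character.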

James

--
James Stroud
UCLA-DOE Institute for Genomics and Proteomics
Box 951570
Los Angeles, CA 90095

http://www.jamesstroud.com/
 
Kent Johnson

??? not in my Python:
>>> rgx = re.compile(r"(00.*01) target_mark")
>>> rgx.findall('00 noise1 01 noise2 00 target 01 target_mark 00 dowhat 01')
['00 noise1 01 noise2 00 target 01']
>>> rgx = re.compile(r"(00.*?01) target_mark")
>>> rgx.findall('00 noise1 01 noise2 00 target 01 target_mark 00 dowhat 01')
['00 noise1 01 noise2 00 target 01']

Since target_mark only occurs once in the string the greedy and non-greedy match is the same in this case.

Kent
 
James Stroud

Somehow my cutting and pasting got messed up. It should be:

py> rgx = re.compile(r"(00.*?01) target_mark")
py> rgx.findall('00 noise1 01 noise2 00 target 01 target_mark 00 dowhat 01 target_mark')
['00 noise1 01 noise2 00 target 01', '00 dowhat 01']
py> rgx = re.compile(r"(00.*01) target_mark")
py> rgx.findall('00 noise1 01 noise2 00 target 01 target_mark 00 dowhat 01 target_mark')
['00 noise1 01 noise2 00 target 01 target_mark 00 dowhat 01']

Sorry about that.

James

 
Bengt Richter

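If the delimiting strings are fixed, we can use plain Python string methods, e.g.,
(not tested beyond what you see ;-)

>>> # Reconstructed setup (assumed): sample string and function header,
>>> # inferred from the names the body uses.
>>> s = "00 noise1 01 noise2 00 target 01 target_mark"
>>> def findit(s, beg='00', end='01', tmk=' target_mark'):
...     start = 0
...     while True:
...         t = s.find(tmk, start)
...         if t<0: break
...         start = s.rfind(beg, start, t)
...         if start<0: break
...         e = s.find(end, start, t)
...         if e+len(end)==t: # _just_ after
...             yield s[start:e+len(end)]
...         start = t+len(tmk)
...
>>> list(findit(s))
['00 target 01']
>>> # Assumed: two spaces after "almost 01", so that span is not adjacent to the mark.
>>> s2 = s + ' garbage noise3 00 almost 01  target_mark 00 success 01 target_mark'
>>> list(findit(s2))
['00 target 01', '00 success 01']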

(I didn't enforce exact adjacency the first time, obviously it would be more efficient
to search for end+tmk instead of tmk and back to beg and forward to end ;-)
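
The shortcut mentioned above, searching for end+tmk directly and then looking back for beg, might be sketched like this. The function name, the default delimiter values and the second test string are assumptions for illustration, not taken from the post.

```python
def find_before_mark(s, beg="00", end="01", tmk=" target_mark"):
    """Yield each beg...end span whose end sits immediately before tmk."""
    needle = end + tmk  # adjacency is enforced by searching for both at once
    pos = 0
    while True:
        hit = s.find(needle, pos)
        if hit < 0:
            break
        # Look backwards for the nearest beg before the matched end.
        start = s.rfind(beg, pos, hit)
        if start >= 0:
            yield s[start:hit + len(end)]
        pos = hit + len(needle)


s = "00 noise1 01 noise2 00 target 01 target_mark"
# Hypothetical input: "almost" is followed by extra text before the mark.
s2 = s + " garbage 00 almost 01 junk target_mark 00 success 01 target_mark"

print(list(find_before_mark(s)))   # ['00 target 01']
print(list(find_before_mark(s2)))  # ['00 target 01', '00 success 01']
```

Unlike the generator above it, this version does not check for an earlier end between beg and the matched one, so a span such as "00 a 01 b 01" would be returned whole.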

If there can be spurious target_marks, and tricky matching spans, additional logic may be needed.
Too lazy to think about it ;-)

Regards,
Bengt Richter
 
