Regular expression question -- exclude substring

D

dreamerbin

Hi,

I'm having trouble extracting substrings using regular expression. Here
is my problem:

Want to find the substring that is immediately before a given
substring. For example: from
"00 noise1 01 noise2 00 target 01 target_mark",
want to get
"00 target 01"
which is before
"target_mark".
My regular expression
"(00.*?01) target_mark"
will extract
"00 noise1 01 noise2 00 target 01".

I'm thinking that the solution to my problem might be to use a regular
expression to exclude the substring "target_mark", which will replace
the part of ".*" above. However, I don't know how to exclude a
substring. Can anyone help on this? Or maybe give another solution to
my problem? Thanks very much.
 
H

hiwa

I think your description of requirement is vague and imprecise.
Here is a stub in the dark:

/*
from
"00 noise1 01 noise2 00 target 01 target_mark",
want to get
"00 target 01"
which is before
"target_mark".
*/
import java.util.regex.*;

public class Dreamer{

public static void main(String[] args){
String input = "00 noise1 01 noise2 00 target 01 target_mark";
String regex = "00\\s\\S+\\s01(?=\\starget_mark)";

Pattern pat = Pattern.compile(regex);
Matcher mat = pat.matcher(input);
while (mat.find()){
System.out.println(mat.group());
}
}
}
 
B

Benji

My regular expression
"(00.*?01) target_mark"
will extract
"00 noise1 01 noise2 00 target 01".
I'm thinking that the solution to my problem might be to use a regular
expression to exclude the substring "target_mark", which will replace
the part of ".*" above. However, I don't know how to exclude a
substring. Can anyone help on this? Or maybe give another solution to
my problem? Thanks very much.

in order to get the smallest substring out of that, you have to exclude
"00" from appearing in .*. (I'm not sure why you have the question mark -
* intrinsically includes 0 repetitions)

since you can't say "anything that doesn't contain 00", my best whack at
it would be
"(00((0[^0])|([^0]))*01) target_mark"
 
B

Benji

My regular expression
"(00.*?01) target_mark"
will extract
"00 noise1 01 noise2 00 target 01".
I'm thinking that the solution to my problem might be to use a regular
expression to exclude the substring "target_mark", which will replace
the part of ".*" above. However, I don't know how to exclude a
substring. Can anyone help on this? Or maybe give another solution to
my problem? Thanks very much.

in order to get the smallest substring out of that, you have to exclude
"00" from appearing in .*. (I'm not sure why you have the question mark -
* intrinsically includes 0 repetitions)

since you can't say "anything that doesn't contain 00", you have to say
"(either 0 followed by not 0, or not 0) repeated"
my best whack at it would be
"(00((0[^0])|([^0]))*01) target_mark"
 
B

Benji

hiwa said:
I think your description of requirement is vague and imprecise.
Here is a stub in the dark:

stab? =)
String regex = "00\\s\\S+\\s01(?=\\starget_mark)";

this will not work if "noise" contains any whitespace. (I'm not sure
if it can or not - like you said, it was vague.)
 
D

dreamerbin

Great! Thanks a lot, guys. I appologize for my vague problem
description. Actually "noise" does contain whitespace since it is
normally a long string. I'm not sure if hiwa's solution can be tuned to
meet this requirement. For Benji's solution, fortunately "00" is short.
So this does solve my problem. But I'm curious to know how to solve the
general problem if the start symbol ("00" here) is long, such as
"startsymbol". Do we have to write down all the combinations in the
regular expression? Why did you say 'you can't say "anything that
doesn't contain 00"'? Is it a common sense?
 
J

John C. Bollinger

Benji said:
in order to get the smallest substring out of that, you have to exclude
"00" from appearing in .*. (I'm not sure why you have the question mark -
* intrinsically includes 0 repetitions)

The ? makes the * a "reluctant quantifier". Together, *? means the
smallest number of repetitions necessary for a successful match, and it
is a concise way of solving the type problem you raise. Unfortunately,
it needs help to work in a situation where the opening and closing
delimiters are different. Understanding why that's the case goes hand
in hand with finding a solution; the key here is that the regex engine
works from the left, and only backtracks when it absolutely has to. In
the example, the regex engine latches on to the first 00 as the start of
the group, and never has to backtrack (to start the group at the second
00 instead) because it can obtain a complete match without doing so.

Here's an example of a pattern that works:
Pattern pat = Pattern.compile(".*(00.*?01) target_mark");
Matcher matcher = pat.matcher(
"00 noise1 01 noise2 00 target 01 target_mark");
if (matcher.matches()) { // or .find() or .lookingAt()
System.out.println("The pattern selected this group: '"
+ matcher.group(1) + "'");
} else {
System.out.println("The pattern did not match");
}

Output is:
The pattern selected this group: '00 target 01'

*Why* that works is left as an exercise for the reader :)

Extra credit: is the reluctant quantifier necessary in the above
pattern? Would a greedy ("normal") one do the job as well? (Hint:
yes.) Why or why not?
since you can't say "anything that doesn't contain 00", you have to say
"(either 0 followed by not 0, or not 0) repeated"
my best whack at it would be
"(00((0[^0])|([^0]))*01) target_mark"

And that one does the job for the sample data too, though it can fail if
the target or noise is permitted to abut the 01 delimiter without
intervening whitespace.
 
B

Benji

"startsymbol". Do we have to write down all the combinations in the
regular expression? Why did you say 'you can't say "anything that
doesn't contain 00"'? Is it a common sense?

Yes, you would have to write all of the combinations that it could not be.

Because of the way that regular expressions are turned into DFAs,
implementing a generalized "not" would be exactly as complicated as
doing what I did. (the size of the DFA would be 2^n, where n is the
length of the string)

While it would be possible for them to implement, they chose not to,
since if someone did something like

"^(this really long string)", then the compiled DFA could be much larger
than the user intended if they didn't know what they were doing.
 
J

John C. Bollinger

But I'm curious to know how to solve the
general problem if the start symbol ("00" here) is long, such as
"startsymbol". Do we have to write down all the combinations in the
regular expression? Why did you say 'you can't say "anything that
doesn't contain 00"'? Is it a common sense?

Benji gave a good answer to your actual question. I just wanted to
reiterate Chris Uppal's recent assertion in another thread that regexes
are often not the right tool for the job. It may be faster, clearer,
and all-around better to tokenize the input into grammatical tokens
first ("startsymbol" would then be a single token, for instance) and
then parse it with appropriate code (often not very difficult). For
parsing complex languages you might want to look into the class of tools
that are designed for this job, lexical analyzers and parsers / compiler
compilers.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,774
Messages
2,569,599
Members
45,173
Latest member
GeraldReund
Top