regexp lookahead

Michael Powe

I experimented a bit with the Java regexp lookahead functionality, and
the results don't make sense to me. The test is below.

8<========================================>8
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DateTest
{
    public static void main (String [] args)
    {
        // negative lookahead: \S+ must NOT match right after the "["
        String re = "(.*)\\[(?!\\S+)\\](.*)";
        // positive lookahead: \S+ must match right after the "["
        //String re = "(.*)\\[(?=\\S+)\\](.*)";
        String test = "this is [sometext] and some more";
        String test2 = "this is [] and some more";

        Pattern p = Pattern.compile(re);

        // first input: text between the brackets
        Matcher m = p.matcher(test);
        if (m.find()) {
            System.out.println("success match one");
            for (int i = 0; i <= m.groupCount(); i++) {
                System.out.println("Group " + i + " " + m.group(i));
            }
        } else {
            System.out.println("fail match one");
        }

        // second input: nothing between the brackets
        Matcher m2 = p.matcher(test2);
        if (m2.find()) {
            System.out.println("success match two");
            for (int i = 0; i <= m2.groupCount(); i++) {
                System.out.println("Group " + i + " " + m2.group(i));
            }
        } else {
            System.out.println("fail match two");
        }
    } // end main
}

8<========================================>8
Here's the output for positive lookahead:

cd /home/powem/src/java/
/opt/jdk1.5/bin/java DateTest

fail match one
success match two
Group 0 this is [] and some more
Group 1 this is
Group 2 and some more

And for negative lookahead:

cd /home/powem/src/java/
/opt/jdk1.5/bin/java DateTest

fail match one
fail match two
8<========================================>8

Thus, negative lookahead appears to be useless, since it fails whether
the text is there or not. Positive lookahead appears to do the
opposite of what you would expect: it fails if the condition is true
(text is there) and succeeds if the condition is false.

Things that make me go "hmmm."

Am I making some fundamental error here?

Thanks.

mp

--
Michael Powe (e-mail address removed) Naugatuck CT USA
"We had pierced the veneer of outside things. We had `suffered,
starved, and triumphed, groveled down yet grasped at glory, grown
bigger in the bigness of the whole.' We had seen God in his
splendors, heard the text that Nature renders. We had reached the
naked soul of man." -- Sir Ernest Shackleton, <South>
 
Jussi Piitulainen

Michael said:
Michael> I experimented a bit with the Java regexp lookahead functionality,
Michael> and the results don't make sense to me. The test is below.

....

Negative lookahead:
String re = "(.*)\\[(?!\\S+)\\](.*)";

The look-ahead pattern and the following pattern match at the same
position: (?!\S+) matches the empty string between the \[ and
something that _fails_ to match \S+ at that position, and that
something should start with the \]. Where can this happen?

Positive lookahead:
String re = "(.*)\\[(?=\\S+)\\](.*)";

The look-ahead pattern and the following pattern match at the same
position: (?=\S+) matches the empty string between the \[ and before
an \S+, and that \S+ should start with the \]. Where can this happen?

(Javadoc for 1.4.2 was not too helpful here, so I experimented a bit,
never having used these myself.)
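
To make the "same position" point concrete, here is a minimal sketch that prints where a match ends with and without lookahead. The class and method names are made up for illustration, and the input is just the bracketed part of the test string.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ZeroWidthDemo {
    // Returns the end offset of the first match, or -1 if nothing matches.
    static int endOfFirstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.end() : -1;
    }

    public static void main(String[] args) {
        String input = "[sometext]";
        // \S+ is part of the match here, so the match runs to the end of the input.
        System.out.println(endOfFirstMatch("\\[\\S+", input));      // 10
        // \S+ is only tested here; the match still ends right after the '['.
        System.out.println(endOfFirstMatch("\\[(?=\\S+)", input));  // 1
        // The negative form fails, because \S+ can match right after the '['.
        System.out.println(endOfFirstMatch("\\[(?!\\S+)", input));  // -1
    }
}
```

Whatever follows the lookahead in the pattern therefore starts matching immediately after the '[', not after the text that the lookahead inspected.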
 
Michael Powe

Jussi> ...

Jussi> Negative lookahead:
Jussi> String re = "(.*)\\[(?!\\S+)\\](.*)";

Jussi> The look-ahead pattern and the following pattern match at
Jussi> the same position: (?!\S+) matches the empty string between
Jussi> the \[ and something that _fails_ to match \S+ at that
Jussi> position, and that something should start with the
Jussi> \]. Where can this happen?

In my test, it happens everywhere -- the regexp fails when there's
nothing there and when there's text there.

Jussi> Positive lookahead:
Jussi> String re = "(.*)\\[(?=\\S+)\\](.*)";

Jussi> The look-ahead pattern and the following pattern match at
Jussi> the same position: (?=\S+) matches the empty string between
Jussi> the \[ and before an \S+, and that \S+ should start with
Jussi> the \]. Where can this happen?

The reason for my testing was that the regexp fails to match the
case where there is nothing between the brackets. Note that the
brackets are not included in the group, they are part of the original
text string only:

[(\\S+)]

This fails as indicated in my example. It's explainable but, to me
anyway, counterintuitive that "positive" lookahead -- which is
supposed to *confirm* the existence of a match -- fails when there is
a match and succeeds when there isn't.

In the real-world case that led me to examine the lookahead option, I
had a regexp matching a long string (9 group captures) that failed
when one of the expected groups, inside a bracket pair, was
empty. \\S+ does not match inside [] and thus caused the whole regex
to fail. I would like to see a useful, nontrivial application of
lookahead. It doesn't appear to me that there is one.

And the negative lookahead just appears broken.

Jussi> (Javadoc for 1.4.2 was not too helpful here, so I
Jussi> experimented a bit, never having used these myself.)

I actually have Habibi's book, _Java Regular Expressions_, but IMO it
is not very useful if you already have good knowledge of regex. It
does have some value as a method reference and for information about
how things work behind the scenes. However, I don't know that I
needed to spend that much money for that amount of information. Its
explanations and sample code for lookahead, however, are incomplete
and trivial, respectively. And finding a typographical error on page
2 and another on page 3 is really off-putting.

Ironically, Habibi criticizes Perl's conditional construct in regex,
and it is exactly that construct that I need in the case described
here.

Thanks.

mp
 
Jussi Piitulainen

Michael said:
Jussi> Negative lookahead:
Jussi> String re = "(.*)\\[(?!\\S+)\\](.*)";

Jussi> The look-ahead pattern and the following pattern match at
Jussi> the same position: (?!\S+) matches the empty string between
Jussi> the \[ and something that _fails_ to match \S+ at that
Jussi> position, and that something should start with the
Jussi> \]. Where can this happen?

Michael> In my test, it happens everywhere -- the regexp fails when there's
Michael> nothing there and when there's text there.

Right, except I would say _nowhere_ rather than everywhere. If (?!\S+)
matches, \] does not. If \] matches, (?!\S+) does not.

Jussi> Positive lookahead:
Jussi> String re = "(.*)\\[(?=\\S+)\\](.*)";

Jussi> The look-ahead pattern and the following pattern match at
Jussi> the same position: (?=\S+) matches the empty string between
Jussi> the \[ and before an \S+, and that \S+ should start with
Jussi> the \]. Where can this happen?

Michael> The reason for my testing was because the regexp fails to match the
Michael> case where there is nothing between the brackets. Note that the

I thought that was the case that succeeded. That pattern is just like
(.*)\[\](.*) with an extra condition that the part of input that
matches \](.*) must also match \S+, which it does, since the ] is
there.

Are you sure that you understand that a lookahead pattern always
consumes an empty string? So your whole pattern can only match a pair
of brackets [], with the two groups on each side of it.

Michael> In the real-world case that led me to examine the lookahead option,
Michael> I had a regexp matching a long string (9 group captures) that failed
Michael> when one of the expected groups, inside a bracket pair, was empty.
Michael> \\S+ does not match inside [] and thus caused the whole regex to
Michael> fail.

\S matches the right bracket, and eats it, too. (?=\S+) also matches
the right bracket but doesn't eat it.
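
A small sketch of that difference, run against the empty-bracket test string from the original post. The class and helper names are invented for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EatOrPeek {
    // Returns the text of the first match, or null if nothing matches.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String input = "this is [] and some more";
        // \S+ consumes the ']' ...
        System.out.println(firstMatch("\\[\\S+", input));          // []
        // ... so nothing is left for a following \] and the match fails.
        System.out.println(firstMatch("\\[\\S+\\]", input));       // null
        // The lookahead only peeks at the ']', which \] can then consume.
        System.out.println(firstMatch("\\[(?=\\S+)\\]", input));   // []
    }
}
```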

Nine groups sounds rather complicated. Do you need to do it all in one
expression?

Michael> I would like to see a useful, nontrivial application of lookahead.
Michael> It doesn't appear to me that there is one.

I think there is a candidate in the other post I made this morning,
where someone wanted to split a certain file at each <?xml...>
thingamajig in it.
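
Roughly like this, as a sketch only: the sample input is invented, and the split pattern is my guess at what fits that problem, not a quote from the other thread. Because the lookahead consumes nothing, each piece keeps its own <?xml ...?> declaration.

```java
class SplitAtXmlDecl {
    // Splits before every "<?xml" without removing it from the pieces.
    // On Java 8 and later a zero-width match at index 0 adds no empty
    // leading element; older versions may return one.
    static String[] splitDocuments(String input) {
        return input.split("(?=<\\?xml)");
    }

    public static void main(String[] args) {
        // Hypothetical input: two XML documents concatenated in one string.
        String input = "<?xml version=\"1.0\"?><a/>"
                     + "<?xml version=\"1.0\"?><b/>";
        for (String part : splitDocuments(input)) {
            System.out.println(part);
        }
    }
}
```

Splitting on a plain "<\\?xml" instead would throw the delimiter away, which is the kind of case where lookahead earns its keep.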

(Which reminds me, you might consider the use of non-greedy patterns,
like .*?, since those .* try to eat the bracket pairs, too, and that
may lead to something that feels unintuitive.)
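
A sketch of what I mean, on a made-up input with two bracket pairs (the names are invented for the example):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyVsLazy {
    // Returns capturing groups 1..n of the first match, or null if none.
    static String[] groups(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.find()) {
            return null;
        }
        String[] result = new String[m.groupCount()];
        for (int i = 1; i <= m.groupCount(); i++) {
            result[i - 1] = m.group(i);
        }
        return result;
    }

    public static void main(String[] args) {
        String input = "a [x] b [y] c";
        // Greedy: the first .* swallows the first bracket pair as well.
        // Groups: "a [x] b ", "y", " c"
        for (String g : groups("(.*)\\[(.*)\\](.*)", input)) {
            System.out.println("greedy: '" + g + "'");
        }
        // Reluctant: .*? stops at the first '[' and the first ']'.
        // Groups: "a ", "x", " b [y] c"
        for (String g : groups("(.*?)\\[(.*?)\\](.*)", input)) {
            System.out.println("lazy:   '" + g + "'");
        }
    }
}
```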

Michael> And the negative lookahead just appears broken.

Let me contrive an example of sorts: a maximal digit sequence not
bounded by a . or a - or an e.

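import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NonLook {
    // "_" is no longer a legal identifier in recent Java versions, hence "args"
    public static void main(String[] args) {
        // lookbehind: not preceded by '.', 'e', '-' or another digit;
        // lookahead: not followed by '.', 'e' or '-'
        Matcher m = Pattern
            .compile("(?<![.e\\-\\d])\\d++(?![.e\\-])")
            .matcher("pi 3.14 314e-2 1024 e 2.7 27e-1 31415926");
        while (m.find()) {
            System.out.println(m.group(0));
        }
    }
}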

Ok, I had to throw in a lookbehind, a possessive quantifier in \d++,
and a \d inside the lookbehind. This does not eat the preceding or
following character, and matches even where there is no following
character at all. It seems to work.

Jussi> (Javadoc for 1.4.2 was not too helpful here, so I
Jussi> experimented a bit, never having used these myself.)

Michael> I actually have Habibi's book, _Java Regular Expressions_, but IMO
Michael> it is not very useful if you already have good knowledge of regex.

Does it tell what (?>X) does? Sun's doc says it matches "X, as an
independent, non-capturing group". I have no idea what an independent
group is. (I know that I'm not looking at the latest documentation.)
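
For what it is worth, in other regex flavours this construct is usually called an atomic group: once the group has matched, the engine does not backtrack into it to try a different way of matching. A minimal sketch of the observable difference follows; the patterns and class name are made up for the demonstration.

```java
import java.util.regex.Pattern;

class AtomicGroupDemo {
    // True if the whole input matches the regex.
    static boolean matchesAll(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // Ordinary group: "bc" is tried first, the final c then fails,
        // so the engine backtracks into the group and tries "b" instead.
        System.out.println(matchesAll("a(?:bc|b)c", "abc"));   // true
        // Independent group: "bc" is kept once matched, so the final c
        // has nothing left to match and the whole attempt fails.
        System.out.println(matchesAll("a(?>bc|b)c", "abc"));   // false
        // With one more c there is no need to backtrack at all.
        System.out.println(matchesAll("a(?>bc|b)c", "abcc"));  // true
    }
}
```

A possessive quantifier such as \d++ behaves like (?>\d+) in this respect.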

....
Michael> Ironically, Habibi criticizes Perl's conditional construct in regex,
Michael> and it is exactly that construct that I need in the case described
Michael> here.

There are likely to be other ways.

If your problem is that a pair of brackets in your input may contain
an empty string that you need to match, then you need to match an
empty string there. There is no way around that.
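
One possible way, sketched on the two test strings from the original post: use \S* instead of \S+ inside the brackets, so that the group may be empty. Whether that suits the real nine-group pattern is an assumption on my part, since that pattern was not posted; the class and method names are invented.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EmptyBrackets {
    // \S* also matches the empty string, so "[]" is accepted.
    private static final Pattern P =
        Pattern.compile("(.*)\\[(\\S*)\\](.*)");

    // Returns the text between the brackets (possibly ""), or null if
    // the input contains no bracket pair that the pattern accepts.
    static String bracketContent(String input) {
        Matcher m = P.matcher(input);
        return m.find() ? m.group(2) : null;
    }

    public static void main(String[] args) {
        // Prints 'sometext'
        System.out.println("'" + bracketContent("this is [sometext] and some more") + "'");
        // Prints '' (an empty group, but still a match)
        System.out.println("'" + bracketContent("this is [] and some more") + "'");
    }
}
```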
 
