Java Regex execution order

M

Martin Gregorie

Quick question: in a regex with alternative matching strings, is it safe
to assume that the alternatives will always be tested in the order given?

For example, I have a String, s where s = "xxx\rxxxx\nxxxx\r\nxxxx\r\n";
is it safe to assume that String r = s.replaceAll("\r\n|\r|\n", "\r\n");
will always yield r = "xxx\r\nxxxx\r\nxxxx\r\nxxxx\r\n" for all strings
containing \r, \n or \r\n ?

Its doing that with the (limited) test set I've run it on so far, but I'd
like to know that isn't a fluke. In case you're wondering, I need a
quarantee that, no matter what programs have handled a file on a variety
of operating systems, after applying replaceAll() the line terminators
will always by CRLF because this affects a hash check on the whole file.
 
S

Stefan Ram

Martin Gregorie said:
in a regex with alternative matching strings, is it safe
to assume that the alternatives will always be tested in the order given?

I never heard of such a guarantee. So unless you can find it
in the JavaDocs, it is not safe to assume this.
 
A

Arne Vajhøj

Quick question: in a regex with alternative matching strings, is it safe
to assume that the alternatives will always be tested in the order given?

For example, I have a String, s where s = "xxx\rxxxx\nxxxx\r\nxxxx\r\n";
is it safe to assume that String r = s.replaceAll("\r\n|\r|\n", "\r\n");
will always yield r = "xxx\r\nxxxx\r\nxxxx\r\nxxxx\r\n" for all strings
containing \r, \n or \r\n ?

Its doing that with the (limited) test set I've run it on so far, but I'd
like to know that isn't a fluke. In case you're wondering, I need a
quarantee that, no matter what programs have handled a file on a variety
of operating systems, after applying replaceAll() the line terminators
will always by CRLF because this affects a hash check on the whole file.

http://www.pcre.org/pcre.txt says:

"Because alternative branches are tried from left to right"

and http://perldoc.perl.org/perlre.html says:

"Alternatives are tried from left to right, so the first alternative
found for which the entire expression matches, is the one that is
chosen. This means that alternatives are not necessarily greedy."

and
http://download.oracle.com/javase/6/docs/api/java/util/regex/Pattern.html says:

"The Pattern engine performs traditional NFA-based matching with ordered
alternation as occurs in Perl 5."

so I think you are good.

Arne
 
M

markspace

"The Pattern engine performs traditional NFA-based matching with ordered
alternation as occurs in Perl 5."


Arne is correct, alternation is tested left to right. However,
something that won't affect your use case but good to be aware of in the
general case: target strings are also matched left to right and that
takes precedence over the matching in alternations.

For example, for the pattern "test|Java|is" and the target string:
"This is a Java REGEX test string."
you might expect the "test" to be matched at offset 21, but instead the
match is "is" at offset 2.


import java.util.regex.Matcher;
import static java.util.regex.Pattern.compile;

public class RegexAlternation {
static String s = "This is a Java REGEX test string.";
public static void main( String[] args ) {
Matcher m = compile("test|Java|is").matcher( s );
if( m.find() )
System.out.println( "found match at position "+m.start() );
System.out.println( " "+s );
System.out.println( repeat( " ", m.start() ) + "^" );
}
private static CharSequence repeat( String string, int start ) {
StringBuilder sb = new StringBuilder( string);
for( int i = 0; i < start; i++ )
sb.append( string );
return sb;
}
}

run:
found match at position 2
This is a Java REGEX test string.
^
BUILD SUCCESSFUL (total time: 1 second)
 
M

Martin Gregorie

Arne is correct, alternation is tested left to right. However,
something that won't affect your use case but good to be aware of in the
general case: target strings are also matched left to right and that
takes precedence over the matching in alternations.

For example, for the pattern "test|Java|is" and the target string: "This
is a Java REGEX test string."
you might expect the "test" to be matched at offset 21, but instead the
match is "is" at offset 2.
I'd expect that: I realise that what I'm doing is rather a special case
and just wanted to know that it wasn't going to transform \r\n into
\r\n\r\n by being tripped into trying the 2nd and 3rd alternations first.

Thanks, Arne and markspace for confirming that.

I'd had a quick read through 'regular expressions' in the SDK Javadocs
and didn't notice anything saying whether alternation order was
significant. Normally it wouldn't matter because I wouldn't write a regex
where one alternation is a subset of another in the same list. In fact I
think this is the first time I've ever needed to do that - and I write a
lot of regexes as part of Spamassassin rule development.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,770
Messages
2,569,583
Members
45,073
Latest member
DarinCeden

Latest Threads

Top