Is matching against several regexes so clumsy?

joosteto

/*
I'd like to search for several regexes in a (large) String, walking
through the string.
In order not to copy the String all the time, I thought I'd use
matcherObject.find(position), where position is set to
matcherObject.end() whenever a regex is found.
For example, search for the regexes:
ABLEWORD: \b\S*able\b
FULWORD: \b\S*ful\b
ANYWORD: \b\S+\b
SPACE: \s+

The only way I found was to create a Pattern and a Matcher for each
regex I want to search for, and to use \\G to make
matcherObject.find(position) start at position (not at the "previous
match", as the documentation claims), as I do in the code below.

Now, my question is: does it really have to be this clumsy?
(declaring two objects for each regex, having to copy the end position
from the last match, etc.)

And does "\G" really mean "match from the start index given to
matcherObject.find(index)", and not "match from the end of the
previous match", as claimed by the documentation at
http://java.sun.com/docs/books/tutorial/essential/regex/bounds.html
*/

import java.util.regex.Matcher;
import java.util.regex.Pattern;

class Scan {
    public Scan() {
    }

    public static void main(String[] args) {
        int pos = 0;
        String s = "a beautiful string with matchable words";

        // One Pattern/Matcher pair per token type. \G is meant to anchor
        // each attempt at the index passed to find(pos).
        Pattern able = Pattern.compile("\\G\\b(\\S*able)\\b");
        Matcher matchAble = able.matcher(s);

        Pattern ful = Pattern.compile("\\G\\b(\\S*ful)\\b");
        Matcher matchFul = ful.matcher(s);

        Pattern any = Pattern.compile("\\G(\\S+)");
        Matcher matchAny = any.matcher(s);

        Pattern space = Pattern.compile("\\G(\\s+)");
        Matcher matchSpace = space.matcher(s);

        // Try the patterns in priority order at pos, then advance pos
        // to the end of whichever one matched.
        while (pos < s.length()) {
            if (matchAble.find(pos)) {
                pos = matchAble.end();
                System.out.print("ABLE: \"" + matchAble.group(1) + "\", ");
            } else if (matchFul.find(pos)) {
                pos = matchFul.end();
                System.out.print("FUL: \"" + matchFul.group(1) + "\", ");
            } else if (matchAny.find(pos)) {
                pos = matchAny.end();
                System.out.print("ANY: \"" + matchAny.group(1) + "\", ");
            } else if (matchSpace.find(pos)) {
                pos = matchSpace.end();
                System.out.print("SPACE: \"" + matchSpace.group(1) + "\", ");
            } else {
                System.out.println("No match found at: \"" + s.substring(pos) + "\"");
                break;
            }
        }
    }
}
 
Oliver Wong

/*
I'd like to search for several regexes in a (large) String, walking
through the string.
In order not to copy the String all the time, I thought I'd use
matcherObject.find(position), where position is set to
matcherObject.end() whenever a regex is found.
For example, search for the regexes:
ABLEWORD: \b\S*able\b
FULWORD: \b\S*ful\b
ANYWORD: \b\S+\b
SPACE: \s+

The only way I found was to create a Pattern and a Matcher for each
regex I want to search for, and use \\G
to make the matcherObject.find(position) start at position (not the
"previous match" as the documentation
claims), as I do in the code below.

Now, my question is: does it really have to be this clumsy?
(declaring two objects for each regex, having to copy end position
from last match, etc)
*/

It looks like you're reinventing lexical analysis. You may find it
less clumsy to reuse the existing algorithms and tools:
http://en.wikipedia.org/wiki/Lexical_analysis

/*
And, does "\G" really mean match from start index for
matcherObject.find(index), and not match from end
of previous match, as claimed by the documentation
http://java.sun.com/docs/books/tutorial/essential/regex/bounds.html
*/

I'd assume the documentation is correct, but I haven't verified it
personally.

- Oliver
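
As a rough illustration of the lexical-analysis idea, the four token types can be folded into one alternation with named groups, so that a single Pattern and a single Matcher walk the string. This is only a sketch: the class name TokenScan, the group names, and the scan method are invented here, and the order of the alternatives is assumed to mirror the if/else-if priority of the original code. Named groups need Java 7 or later.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TokenScan {
    // Hypothetical group names; earlier alternatives take priority.
    private static final String[] KINDS = {"ABLE", "FUL", "ANY", "SPACE"};

    // \G keeps every match glued to the end of the previous one,
    // so plain find() never skips over unmatched input.
    private static final Pattern TOKEN = Pattern.compile(
        "\\G(?:(?<ABLE>\\b\\S*able\\b)"
        + "|(?<FUL>\\b\\S*ful\\b)"
        + "|(?<ANY>\\S+)"
        + "|(?<SPACE>\\s+))");

    static List<String> scan(String s) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(s);
        while (m.find()) {
            // Exactly one named group is non-null for each match.
            for (String kind : KINDS) {
                if (m.group(kind) != null) {
                    tokens.add(kind + ": \"" + m.group(kind) + "\"");
                    break;
                }
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (String token : scan("a beautiful string with matchable words")) {
            System.out.println(token);
        }
    }
}
```

There is no position bookkeeping left in the loop, because the matcher itself remembers where the previous token ended. For anything much larger than four token types, a generated lexer is probably still the better tool.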
 
Joshua Cranmer

And, does "\G" really mean match from start index for
matcherObject.find(index), and not match from end of previous match, as
claimed by the documentation
http://java.sun.com/docs/books/tutorial/essential/regex/bounds.html

I would trust the documentation, especially given your code:
[snip]
Pattern able=Pattern.compile("\\G\\b(\\S*able)\\b");
Matcher matchAble=able.matcher(s);

Pattern ful=Pattern.compile("\\G\\b(\\S*ful)\\b");
Matcher matchFul=ful.matcher(s);
[snip]

"\G" probably means from the end of the previous match, but you're using
four different matchers, so the end of the "previous" match that each
Matcher sees is not the one you are thinking of.
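
If the concern is that each Matcher keeps its own notion of the previous match, one option is to share a single Matcher and swap patterns into it with usePattern, using region and lookingAt instead of \G to pin each attempt to the current position. A minimal sketch, with the class name SharedMatcherScan and its scan method made up for illustration; the leading \b of the original patterns is dropped here for simplicity:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SharedMatcherScan {
    private static final String[] NAMES = {"ABLE", "FUL", "ANY", "SPACE"};
    private static final Pattern[] PATTERNS = {
        Pattern.compile("\\S*able\\b"),
        Pattern.compile("\\S*ful\\b"),
        Pattern.compile("\\S+"),
        Pattern.compile("\\s+"),
    };

    static List<String> scan(String s) {
        List<String> tokens = new ArrayList<>();
        Matcher m = PATTERNS[0].matcher(s);   // the only Matcher object
        int pos = 0;
        while (pos < s.length()) {
            boolean matched = false;
            for (int i = 0; i < PATTERNS.length && !matched; i++) {
                // Swap the pattern in, restrict the region to [pos, end),
                // and require the match to begin exactly at pos.
                m.usePattern(PATTERNS[i]).region(pos, s.length());
                if (m.lookingAt()) {
                    tokens.add(NAMES[i] + ": \"" + m.group() + "\"");
                    pos = m.end();            // absolute index into s
                    matched = true;
                }
            }
            if (!matched) {
                tokens.add("No match found at: \"" + s.substring(pos) + "\"");
                break;
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(scan("a beautiful string with matchable words"));
    }
}
```

The position still has to be carried by hand, but there is one Matcher instead of four and no reliance on how \G behaves.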
 
Roedy Green

I'd like to search for several regexes in a (large) String, walking
through the string.
In order not to copy the String all the time, I thought I'd use
matcherObject.find(position), where position is set to
matcherObject.end() whenever a regex is found.
For example, search for the regexes:
ABLEWORD: \b\S*able\b
FULWORD: \b\S*ful\b
ANYWORD: \b\S+\b
SPACE: \s+

What you might do, if you need more speed, is use a Boyer-Moore
algorithm to search for several strings simultaneously. When you find
a decent candidate, fire up your regexes.

I have written a single-search Boyer-Moore you could use to get
started.

See http://mindprod.com/products1.html#BOYER

Regexes are for lightweight parsing tasks. You might be needing a
parser. See http://mindprod.com/jgloss/parser.html
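
A very rough sketch of the prefilter idea, assuming String.indexOf as a stand-in for a real Boyer-Moore search (it is not Boyer-Moore, and it looks for one literal at a time): find the literal "able" cheaply, back up to the start of the surrounding word, and only then run the regex. The class and method names are hypothetical, and Character.isWhitespace only approximates the regex class \s.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PrefilterDemo {
    private static final Pattern ABLE_WORD = Pattern.compile("\\S*able\\b");

    static List<String> ableWords(String s) {
        List<String> hits = new ArrayList<>();
        Matcher m = ABLE_WORD.matcher(s);
        int from = 0;
        while (true) {
            // Cheap literal search; a Boyer-Moore routine would slot in here.
            int hit = s.indexOf("able", from);
            if (hit < 0) {
                break;
            }
            // Back up to the start of the whitespace-delimited word.
            int start = hit;
            while (start > 0 && !Character.isWhitespace(s.charAt(start - 1))) {
                start--;
            }
            // Only now pay for the regex, anchored at the word start.
            m.region(start, s.length());
            if (m.lookingAt()) {
                hits.add(m.group());
                from = m.end();
            } else {
                from = hit + 1;   // false candidate, e.g. "tables"
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        System.out.println(ableWords("a beautiful string with matchable words"));
    }
}
```

Whether this is any faster than running the regex directly would have to be measured on the real input.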
 
joosteto

And, does "\G" really mean match from start index for
matcherObject.find(index), and not match from end of previous match, as
claimed by the documentation
http://java.sun.com/docs/books/tutorial/essential/regex/bounds.html

I would trust the documentation, especially given your code:
[snip]
Pattern able=Pattern.compile("\\G\\b(\\S*able)\\b");
Matcher matchAble=able.matcher(s);
Pattern ful=Pattern.compile("\\G\\b(\\S*ful)\\b");
Matcher matchFul=ful.matcher(s);
[snip]

"\G" probably means from the end of the previous match, but you're using
four different matchers, so the end of the "previous" match that each
Matcher sees is not the one you are thinking of.

The code works perfectly OK, and \G matches not from the end of the
previous match, but from the index passed to matchFul.find(index). That
is indeed not how it is described in the manual.
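
This can be checked with a small experiment. Matcher.find(int) is documented to reset the matcher before searching, so there is no previous match left for \G to refer to, and in the OpenJDK implementation \G then appears to anchor at the index passed in. A sketch, with the class name GAnchorDemo and the probe method made up for the purpose:

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GAnchorDemo {
    // Returns what each of three calls matched, or null where the call failed.
    static String[] probe(String input) {
        Matcher m = Pattern.compile("\\G\\S+").matcher(input);
        String[] seen = new String[3];
        if (m.find()) {
            seen[0] = m.group();   // first find(): \G sits at index 0
        }
        if (m.find()) {
            seen[1] = m.group();   // plain find(): \G sits at the end of the previous match
        }
        if (m.find(4)) {
            seen[2] = m.group();   // find(int): matcher is reset, search starts at 4
        }
        return seen;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(probe("foo bar baz")));
    }
}
```

With the input "foo bar baz", the second call should fail because the previous match ended at a space, while the third should match "bar" if \G follows the index given to find(int). That would reconcile the observation with the manual: "end of the previous match" describes plain find(), and find(int) simply discards the previous match.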
 
