RegEx partial matching

Stanimir Stamenkov

Is it possible to detect a partial match at the end of the supplied
data? For example:

String data = "A regular expression, specified as a string, "
        + "must first be compiled into an instance of this class. The resulting "
        + "pattern can then be used";

String search = "can then be used to create";

Pattern pattern = Pattern.compile(search);

Matcher matcher = pattern.matcher(data);

matcher.find();
...

Obviously the above won't match, but the initial data is only one
chunk from an input stream (for example). I want to detect whether
the pattern has been partially matched, and at which position the
partial match begins, so I can prepend it to the next data chunk
and continue matching.
 
bilbo

I assume your data is coming from a stream, which is why you can't
search the whole string at once. The java.util.regex package doesn't
give you any way to do what you want without rewriting your regex to
match partial strings. If you know how long the input stream that
you're searching is going to be, then you can implement your own
java.lang.CharSequence that wraps your stream and pass that to
pattern.matcher(). However, since CharSequence has a length() method,
this can't be implemented for streams in general.
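
For the known-length case, a minimal sketch of such a wrapper might look like the following. The class name KnownLengthCharSequence and its charsRead() method are made up for illustration, and the sketch assumes the caller can supply the exact number of characters up front. It keeps everything it has read in memory, so it saves reads, not space.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;

/** Hypothetical helper: exposes a Reader of known length as a CharSequence. */
final class KnownLengthCharSequence implements CharSequence {
    private final Reader reader;
    private final int length;                              // must be known up front
    private final StringBuilder buffer = new StringBuilder();

    KnownLengthCharSequence(Reader reader, int length) {
        this.reader = reader;
        this.length = length;
    }

    @Override
    public int length() {
        return length;
    }

    @Override
    public char charAt(int index) {
        if (index < 0 || index >= length) {
            throw new IndexOutOfBoundsException("index " + index);
        }
        fillTo(index);
        return buffer.charAt(index);
    }

    @Override
    public CharSequence subSequence(int start, int end) {
        if (start < 0 || end > length || start > end) {
            throw new IndexOutOfBoundsException(start + ".." + end);
        }
        if (end > 0) {
            fillTo(end - 1);
        }
        return buffer.substring(start, end);
    }

    @Override
    public String toString() {
        if (length > 0) {
            fillTo(length - 1);
        }
        return buffer.toString();
    }

    /** Number of characters pulled from the reader so far. */
    int charsRead() {
        return buffer.length();
    }

    /** Reads from the stream until the character at 'index' is buffered. */
    private void fillTo(int index) {
        try {
            while (buffer.length() <= index) {
                int c = reader.read();
                if (c < 0) {
                    throw new IllegalStateException("stream shorter than declared length");
                }
                buffer.append((char) c);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

You would then pass an instance to pattern.matcher() like any other CharSequence; the stream is only read as far as the matcher actually looks.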

The Jakarta-Regexp library uses a CharacterIterator interface instead
of a CharSequence. CharacterIterator can be implemented over streams,
so you can do what you're trying to do, as in

import org.apache.regexp.*;
...
java.io.Reader myReader = ...;
RE regex = new RE("Your long string");
// match(CharacterIterator, int) takes the index at which to start matching
regex.match(new ReaderCharacterIterator(myReader), 0);

You can get the Jakarta-Regexp library here:
http://jakarta.apache.org/regexp/

Adam
 
Chris

Stanimir Stamenkov said:
Is it possible to detect a partial match at the end of the supplied
data?, i.e.:

String data = "A regular expression, specified as a string, "
        + "must first be compiled into an instance of this class. The resulting "
        + "pattern can then be used";

String search = "can then be used to create";

Pattern pattern = Pattern.compile(search);

Matcher matcher = pattern.matcher(data);

matcher.find();
...

It is obvious the above won't match but then the initial data is
only a chunk from an input stream (for example), so I want to detect
if the pattern has been partially matched and at which position the
partial match begins so I could prepend it to the next data chunk
and continue matching.

You can't know that it's a true match until you get the next chunk. So you'd
have to set a "might be a match" flag, fetch the next chunk, concatenate it,
and then try to match again. Or check to see if the first part of the next
chunk matched the last part of the regex.
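
One possible way to get such a "might be a match" flag, assuming Java 5 or later, is Matcher.hitEnd(), which reports whether the last match operation ran into the end of the input. The sketch below is only an illustration (the names PartialMatch and firstCandidate are made up): it tries each start position with lookingAt() and returns the first one where the pattern either matches or is still matching when the data runs out.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PartialMatch {
    /**
     * Returns the first index in data from which the pattern either matches
     * or hits the end of data while still matching, or -1 if there is none.
     * Hypothetical helper; cost grows with data length times match effort.
     */
    static int firstCandidate(Pattern pattern, CharSequence data) {
        Matcher m = pattern.matcher(data);
        for (int i = 0; i < data.length(); i++) {
            m.region(i, data.length());   // restrict matching to data[i..end)
            if (m.lookingAt() || m.hitEnd()) {
                return i;                 // full match, or more input could change the result
            }
        }
        return -1;
    }
}
```

Everything before the returned index can be dropped, and the rest prepended to the next chunk before matching again. Trying every start position is slow for large chunks, so treat this as a starting point only.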

This logic is tricky enough that I wouldn't bother trying. Instead, access
your chunks through a custom object that implements the CharSequence
interface. Feed that to your matcher. That way you can fetch the next chunk
when you run out of chars in the first chunk, and it's all transparent to
your matcher. This is a much cleaner design.
 
Stanimir Stamenkov

/Chris/:
You can't know that it's a true match until you get the next chunk. So you'd
have to set a "might be a match" flag, fetch the next chunk, concatenate it,
and then try to match again. Or check to see if the first part of the next
chunk matched the last part of the regex.

That's exactly what I'm saying I need to do and what I need the
partial "might be a" match detection for.

This logic is tricky enough that I wouldn't bother trying. Instead, access
your chunks through a custom object that implements the CharSequence
interface. Feed that to your matcher. That way you can fetch the next chunk
when you run out of chars in the first chunk, and it's all transparent to
your matcher. This is a much cleaner design.

If you read the other reply from "bilbo" (Adam), you will notice that
the CharSequence interface can't be implemented over a stream in general.
 
Stanimir Stamenkov

/bilbo/:
The Jakarta-Regexp library uses a CharacterIterator interface instead
of a CharSequence. CharacterIterator can be implemented over streams,
so you can do what you're trying to do, as in

import org.apache.regexp.*;
...
java.io.Reader myReader = ...;
RE regex = new RE("Your long string");
regex.match(new ReaderCharacterIterator(myReader), 0);

You can get the Jakarta-Regexp library here:
http://jakarta.apache.org/regexp/

Thank you, Adam - this is exactly what I need.
 
Yamin

Stanimir Stamenkov said:
if the pattern has been partially matched and at which position the
partial match begins so I could prepend it to the next data chunk
and continue matching.

Maybe someone can help directly with the regular expression stuff.
I'd imagine it would just be a big blob of patterning, but from your
above comment, it looks like you're reading from some source (network
or file) and you want to match a string that can span two successive
reads.

There is a very simple solution to this.
Just keep appending data to your main buffer, and do a simple pattern
match. If you don't want your buffer to be too big, simply chop off
any early data that you know cannot match the string.

Suppose the string you're trying to match is 10 characters long.
If you've read 20 characters already, and you have not found a full
match, you can safely discard the lower 10 characters. Normally I
like to be safe and just keep 3x the length of the pattern I'm looking
for.
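
For a fixed string, the buffer-and-chop idea can be sketched roughly as below. The names StreamSearch and indexOf, and the chunkSize parameter, are illustrative assumptions. This version keeps only the last (pattern length - 1) characters after each failed search, which is the minimum that can still be the start of a match; keeping more, as suggested above, is harmless.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;

final class StreamSearch {
    /**
     * Returns the offset (counted from the start of the stream) of the first
     * occurrence of needle, or -1 if the stream ends without a match.
     * Hypothetical helper for fixed strings only, not regular expressions.
     */
    static long indexOf(Reader in, String needle, int chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        if (needle.isEmpty()) {
            return 0;
        }
        StringBuilder window = new StringBuilder();
        char[] chunk = new char[chunkSize];
        long discarded = 0;                       // characters already chopped off
        int keep = needle.length() - 1;           // longest possible partial match
        try {
            int n;
            while ((n = in.read(chunk)) != -1) {
                window.append(chunk, 0, n);
                int hit = window.indexOf(needle);
                if (hit >= 0) {
                    return discarded + hit;
                }
                // No full match yet: drop everything that cannot start one.
                if (window.length() > keep) {
                    int drop = window.length() - keep;
                    window.delete(0, drop);
                    discarded += drop;
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return -1;
    }
}
```

The window never grows beyond one chunk plus the pattern length, however long the stream is.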

Yamin
 
Stanimir Stamenkov

/Yamin/:
There is a very simple solution to this.
Just keep appending data to your main buffer, and do a simple pattern
match. If you don't want your buffer to be too big, simply chop off
any early data that you know cannot match the string.

How do I know which earlier data cannot match the regular
expression? (see below)

Suppose the string you're trying to match is 10 characters long.
If you've read 20 characters already, and you have not found a full
match, you can safely discard the lower 10 characters. Normally I
like to be safe and just keep 3x the length of the pattern I'm looking
for.

I want to match full-featured regular expressions, not just fixed
strings (as I've included in my example for simplicity). I can't
possibly determine the result length which would match some regular
expression as in:

opentag (.|\n)* closetag

What would be the length of "(.|\n)*"?

The only reasonable solution is to use a regex library that supports
a streaming source, as Adam (bilbo) has already proposed in another
message in this thread.
 
bilbo

Stanimir said:
Is it possible to detect a partial match at the end of the supplied
data?, i.e.:

String data = "A regular expression, specified as a string, "
        + "must first be compiled into an instance of this class. The resulting "
        + "pattern can then be used";

String search = "can then be used to create";

Pattern pattern = Pattern.compile(search);

Matcher matcher = pattern.matcher(data);

matcher.find();
...

It is obvious the above won't match but then the initial data is
only a chunk from an input stream (for example), so I want to detect
if the pattern has been partially matched and at which position the
partial match begins so I could prepend it to the next data chunk
and continue matching.

Just found out that Java 1.5 actually includes a solution to this
problem as well. Use the java.util.Scanner class, which includes
methods to search various kinds of streams.

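InputStream stream = ...;
Scanner scanner = new Scanner(stream);
Pattern pattern = Pattern.compile(search);
String match;
// A horizon of 0 means no limit: the scanner keeps reading until it
// finds a match or reaches the end of the input.
while ((match = scanner.findWithinHorizon(pattern, 0)) != null) {
    System.out.println(match);
}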
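
To convince yourself that this works across read boundaries, here is a small self-contained sketch. The classes TrickleReader and StreamScan are invented for the demonstration: the reader hands out only a few characters per read, so any match has to span several reads. Note that with a horizon of 0 the Scanner may end up buffering all of the input while it searches.

```java
import java.io.IOException;
import java.io.Reader;
import java.util.Scanner;
import java.util.regex.Pattern;

/** Hypothetical reader that returns at most 'max' characters per read call. */
final class TrickleReader extends Reader {
    private final Reader delegate;
    private final int max;

    TrickleReader(Reader delegate, int max) {
        this.delegate = delegate;
        this.max = max;
    }

    @Override
    public int read(char[] cbuf, int off, int len) throws IOException {
        return delegate.read(cbuf, off, Math.min(len, max));
    }

    @Override
    public void close() throws IOException {
        delegate.close();
    }
}

final class StreamScan {
    /** Returns the first match of pattern in the stream, or null if none. */
    static String findFirst(Reader source, Pattern pattern, int charsPerRead) {
        Scanner scanner = new Scanner(new TrickleReader(source, charsPerRead));
        // Horizon 0: search without bound, ignoring the scanner's delimiters.
        return scanner.findWithinHorizon(pattern, 0);
    }
}
```

For example, searching "The resulting pattern can then be used\nto create a matcher" for the pattern used\s+to\s+create with four characters per read still finds the match that straddles the line break.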
 
