Search byte[] for pattern

I

Ian Pilcher

I'm looking for an API that will search an array of bytes for a
specified byte pattern -- basically the same thing as
String.indexOf(String).

Does such a method exist?

Thanks!
 
C

Chris Smith

C

Chris Smith

Ian said:
I'm looking for an API that will search an array of bytes for a
specified byte pattern -- basically the same thing as
String.indexOf(String).

Does such a method exist?

Oops, I meant to convert the code before sending to use byte[] instead
of String. Here's a conversion based on the code from the page I
mentioned earlier (and some API changes to make it more like indexOf):

/**
* Knuth-Morris-Pratt Algorithm for Pattern Matching
*/
public class KMPMatch {
/**
* Finds the first occurrence of the pattern in the text.
*/
public boolean indexOf(byte[] data, byte[] pattern) {
int[] failure = computeFailure(pattern);

int j = 0;
if (data.length == 0) return false;

for (int i = 0; i < data.length; i++) {
while (j > 0 && pattern[j] != data) {
j = failure[j - 1];
}
if (pattern[j] == data) { j++; }
if (j == pattern.length) {
return i - pattern.length + 1;
}
}
return -1;
}

/**
* Computes the failure function using a boot-strapping process,
* where the pattern is matched against itself.
*/
private int[] computeFailure(byte[] pattern) {
int[] failure = new int[pattern.length];

int j = 0;
for (int i = 1; i < pattern.length; i++) {
while (j > 0 && pattern[j] != pattern) {
j = failure[j - 1];
}
if (pattern[j] == pattern) {
j++;
}
failure = j;
}

return failure;
}
}


--
www.designacourse.com
The Easiest Way to Train Anyone... Anywhere.

Chris Smith - Lead Software Developer/Technical Trainer
MindIQ Corporation
 
I

Ian Pilcher

Chris said:
Sure, there's one right here, for example:

I take it from your response that the Java platform API does not provide
such a method. <sigh>

Thanks!
 
G

Gregory A. Swarthout

If it's really basically the same thing, why not:

int index = new String(byteArray1).indexOf(new String(byteArray2));

Greg
 
C

Chris Smith

If it's really basically the same thing, why not:

int index = new String(byteArray1).indexOf(new String(byteArray2));

Well, because that code does something very different than what you're
imagining it to do. The code there takes byteArray1, converts it to a
text String using some platform-dependent default character encoding,
then does the same thing with byteArray2, and then determines the index
of one String within the other.

The result is going to vary widely depending on the platform it's run
on, and is certainly not going to reliably be the same as what the OP
clearly wanted -- to find the location of occurrence of one byte-
sequence within a larger byte-sequence.

--
www.designacourse.com
The Easiest Way to Train Anyone... Anywhere.

Chris Smith - Lead Software Developer/Technical Trainer
MindIQ Corporation
 
G

Gregory A. Swarthout

Chris Smith said:
Well, because that code does something very different than what you're
imagining it to do. The code there takes byteArray1, converts it to a
text String using some platform-dependent default character encoding,
then does the same thing with byteArray2, and then determines the index
of one String within the other.

The result is going to vary widely depending on the platform it's run
on, and is certainly not going to reliably be the same as what the OP
clearly wanted -- to find the location of occurrence of one byte-
sequence within a larger byte-sequence.

You're right, but using String(byteArray1, charsetName) for your
String definitions would remove platform-dependence and could very
well be analagous to what the OP was after.

Greg
 
C

Chris Smith

Gregory said:
You're right, but using String(byteArray1, charsetName) for your
String definitions would remove platform-dependence and could very
well be analagous to what the OP was after.

Could be. The requirements would be that the charset:

1. Is a straight 8-bit character encoding.
2. Contains a unique character mapping for each possible byte value.

A violation of the first rule would skew the indexOf result, and would
also create the danger that the beginning of a match in the byte data
would be missed because it starts within a character in the converted
form. A violation of the second rule could create false positives in
matching, if two different bytes are mapped to the same character (or
are both unmapped, in which case String would map them to '?').

I am not familiar enough with the range of available character encodings
to know if there's one that fits that description. Do you know if there
is such a character encoding?

In any case, it would require comments to point out the kludgy solution,
as I would presume on seeing such a thing that the author was confused,
and try to fix it.

--
www.designacourse.com
The Easiest Way to Train Anyone... Anywhere.

Chris Smith - Lead Software Developer/Technical Trainer
MindIQ Corporation
 
S

Steve Horsley

Chris said:
Could be. The requirements would be that the charset:

1. Is a straight 8-bit character encoding.
2. Contains a unique character mapping for each possible byte value.

A violation of the first rule would skew the indexOf result, and would
also create the danger that the beginning of a match in the byte data
would be missed because it starts within a character in the converted
form. A violation of the second rule could create false positives in
matching, if two different bytes are mapped to the same character (or
are both unmapped, in which case String would map them to '?').

I am not familiar enough with the range of available character encodings
to know if there's one that fits that description. Do you know if there
is such a character encoding?

In any case, it would require comments to point out the kludgy solution,
as I would presume on seeing such a thing that the author was confused,
and try to fix it.
Are you playing Devil's Advocate, or do you know something that I
don't? I thought that "ISO8859_1" always gave direct 1-1 mapping
between bytes(0-255) and char values.
Am I missing a gotcha? Control char values maybe?

"new String(byte[], 0)" should do the trick too, though it is deprecated.

I agree, converting bytes to a String just to pattern match is
'orrible. Better (and probably lots faster) to rustle up your own
search method. You'd think the Byte class would have a utility for
that, wouldn't you.

Steve
 
C

Chris Smith

Steve said:
Are you playing Devil's Advocate, or do you know something that I
don't? I thought that "ISO8859_1" always gave direct 1-1 mapping
between bytes(0-255) and char values.
Am I missing a gotcha? Control char values maybe?

IIRC, ISO 8859-1 does not define character mappings for byte values in
the range of 128-160 or so. As such, the technique suggested would
produce a false positive for patterns that differed only in the choice
of value from within that range.

--
www.designacourse.com
The Easiest Way to Train Anyone... Anywhere.

Chris Smith - Lead Software Developer/Technical Trainer
MindIQ Corporation
 
S

Steve Horsley

Chris said:
IIRC, ISO 8859-1 does not define character mappings for byte values in
the range of 128-160 or so. As such, the technique suggested would
produce a false positive for patterns that differed only in the choice
of value from within that range.

Ooh!
That one goes in my little black book. Thanks.

Steve
 
I

Ian Pilcher

Steve said:
I agree, converting bytes to a String just to pattern match is 'orrible.
Better (and probably lots faster) to rustle up your own search method.
You'd think the Byte class would have a utility for that, wouldn't you.

Only if Sun were serious about using Java for real programs. Of course,
then it would support things like unsigned integer types too. (Legacy
data, what a concept!)
 
D

Dale King

Steve Horsley said:
Ooh!
That one goes in my little black book. Thanks.


I suggest erasing it from your little black book, because it isn't true.

ISO8859-1 defines symbols for all values in the range 0 to 255 and they
correspond exactly with the symbols assigned to values 0-255 in Unicode.

The values 128-159 are most definitely defined by ISO8859-1, it is just that
they are assigned to control characters, just as they are in Unicode.
Because they are assigned to control characters instead of printable
characters they usually do not get shown in a code chart and this has led to
this false belief that they are unassigned.
 
D

Dale King

Chris Smith said:
IIRC, ISO 8859-1 does not define character mappings for byte values in
the range of 128-160 or so. As such, the technique suggested would
produce a false positive for patterns that differed only in the choice
of value from within that range.


Well, you don't recall correctly. 128-159 are mapped, its just that they are
mapped to the "C1" control characters, so many discussions of ISO8859-1
don't talk about this range.
 
D

Dale King

Chris Smith said:
IIRC, ISO 8859-1 does not define character mappings for byte values in
the range of 128-160 or so. As such, the technique suggested would
produce a false positive for patterns that differed only in the choice
of value from within that range.


Well, you don't recall correctly. 128-159 are mapped, its just that they are
mapped to the "C1" control characters, so many discussions of ISO8859-1
don't talk about this range, since most people are only interested in the
printable characters.

See http://www.cv.nrao.edu/~abridle/ISO8859-1.htm
 
C

Chris Smith

Dale said:
Well, you don't recall correctly. 128-159 are mapped, its just that they are
mapped to the "C1" control characters, so many discussions of ISO8859-1
don't talk about this range.

Okay, then. So ISO 8859-1 does match those requirements, and will work.
Thanks!

--
www.designacourse.com
The Easiest Way to Train Anyone... Anywhere.

Chris Smith - Lead Software Developer/Technical Trainer
MindIQ Corporation
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,755
Messages
2,569,536
Members
45,007
Latest member
obedient dusk

Latest Threads

Top