Search for byte pattern in a binary file.

  • Thread starter Ryan Tan via JavaKB.com
  • Start date
R

Ryan Tan via JavaKB.com

Hi there, does anyone have a quick way to search for a byte pattern in a binary file?

Ie. I have a byte[] with a series of byte values and I want to locate the occurances of this pattern in a binary file.

The only way I figure so far is to read the binary file with a reasonable sized byte[] (buffer) and manually comparing it byte by byte. This will take a long time and there is complication in data that crosses two buffers (two reads).

Thank you in advance. I'm so glad I discovered this forum today.

_____________________
Message posted via http://www.javakb.com
 
A

Andrew Thompson

On Thu, 18 Nov 2004 06:35:12 GMT, a new poster via JavaKB.com wrote:

*I thought it better to mark this one as specifically off-topic.*
Thank you in advance. I'm so glad I discovered this forum today.

Speaking of which, Mike.
<http://groups.google.com/[email protected]>
Thank you for implementing some of the changes suggested.

I feel the text on the main page is now much more accurate.

Some other matters still require further attention.
_____________________
Message posted via http://www.javakb.com

Note that '_____________________' is *not* a valid sig. delimiter,
whereas Thomas suggested '-- ', which is the accepted sig. delimiter.

The lack of links to resources on Usenet etiquette is of continued concern.

A basic understanding of netiquette will ultimately assist your clients
greatly by helping them to get answers, rather then being harangued for
- not doing proper research, or
- posting to a number of groups (do you offer the ability to post
to multiple groups at a time?), or
- treating the other posters of the group as if they were a 'help-desk'.
 
A

Andrew Thompson

Thank you for your swift reply m Thompson, but what does your post mean? Sorry, I'm a new kid on the block :p

My apologies Ryan, I was not even sure you would see my post.*

I was having a discussion with one of the people who speak for the
(new) forum that you are using to post to usenet.

Many people who post from forums get confused about what they are
actually posting *to*. In this case, you are posting to an usenet
news group, and if your ISP carries 'groups' you can access them
in your email/news client.

One Usenet archive is Google itself, you can find this group here..
<http://groups.google.com/groups?group=comp.lang.java.programmer>

But whereas Google have big, friendly help pages like this..
<http://groups.google.com/googlegroups/help.html>
that explain (some of) how to get the best of news-groups,
other web-interfaces to usenet can be ..less attentive.

* Some posts that appear in the group (and on Google) will not
appear on the site from which you are posting.

In any case.
a) Check out the slew of helpful resources in the Java FAQ
available at my site. <http://www.physci.org/codes/javafaq.jsp>
b) I hope you get the answer you are after.
c) Welcome to the group. :)

HTH (..Hope that helps)
 
R

Ryan Tan via JavaKB.com

Thanks for the explanation Mr Thompson, I'll go have a peek at your website :). I've already searched through comp.lang.java.* at Googlegroup. Thanks for the help :)

_____________________
Message posted via http://www.javakb.com
 
T

Thomas Weidenfeller

Ryan said:
Hi there, does anyone have a quick way to search for a byte pattern in a binary file?

Look at (Google for) algorithms like Boyer-Moore, Knuth-Morris-Pratt or
Rabin-Karp.


/Thomas
 
T

Thomas Weidenfeller

Ryan said:
Thank you for your swift reply m Thompson, but what does your post mean? Sorry, I'm a new kid on the block :p

Andrew's posting was directed at Mike/Michael from javakb. The Usenet
postings generated by javakb were initially not very "Usenet compliant".
Mike fixed this (thanks Mike!), but apparently missed one change, the
change to use a proper sig (sig = signature) delimiter to separate stuff
like
_____________________
Message posted via http://www.javakb.com

from the main message. The correct sig delimiter is '-- ' (dash, dash
space) as the only contents of a line, and not the '________'. This is
AFAIR defined in RFC 1855.

The second thing, also directed at javakb was that they apparently
forward some of the messages on the board to Usenet, but don't properly
educate the users about the customs and expected behavior on Usenet (aka
netiquette).

If this is all greek to you, just forward Andrews and/or mine posting to
Mike or whoever is currently in charge at javakb.

/Thomas
 
F

Frank

Andrew said:
Note that '_____________________' is *not* a valid sig. delimiter,
whereas Thomas suggested '-- ', which is the accepted sig. delimiter.

The lack of links to resources on Usenet etiquette is of continued concern.

A basic understanding of netiquette

While we're on that subject. Perhaps some of us should make an effort to
trim their sig to 4 lines or less..? ;)
 
X

xarax

Thomas Weidenfeller said:
binary file?

Look at (Google for) algorithms like Boyer-Moore, Knuth-Morris-Pratt or
Rabin-Karp.

His main concern is about finding the byte pattern
when it crosses a buffer boundary. The algorithms
mentioned above presume the entire file is in memory.
 
M

Michael Borgwardt

xarax said:
His main concern is about finding the byte pattern
when it crosses a buffer boundary. The algorithms
mentioned above presume the entire file is in memory.

His main concern seemed to be about that *and* about the speed. The
algorithms Thomas referred him to will take care of the speed concern
as far as possible. The finding of patterns covering buffer boundaries
is a minor technical problem.

I'd solve it by keeping two buffers in memory and only replacing the "older"
one when you've gone through it. Oh, and refuse to search for a pattern that's
larger than the buffer size.
 
T

Thomas Weidenfeller

xarax said:
His main concern is about finding the byte pattern
when it crosses a buffer boundary. The algorithms
mentioned above presume the entire file is in memory.

Not at all. Knuth-Morris-Pratt requires no backing-up in the data to be
tested, so that one can even be run with a byte-by-byte read method from
a file.

Boyer-Moore requires some backing-up to verify a potential match. If I
am not mistaken, at worst as much as the wanted pattern is long. So this
can e.g. easily be handed with a back-buffer of that size.

I think Rabin-Karp requires a few passes over the data, and is not 100%
reliable, due to using hashes, so this might not be a good idea for
searching files.

Oh, and it is of course possible to implement the algorithms on top of
memory mapped I/O or RandomAccessFile, although I wouldn't do it.

/Thomas
 
C

Chris Smith

Michael Borgwardt said:
I'd solve it by keeping two buffers in memory and only replacing the "older"
one when you've gone through it. Oh, and refuse to search for a pattern that's
larger than the buffer size.

Actually, one of the algorithms posted does partially solve the buffer
problem. The Knuth-Morris-Pratt algorithm is a forward-only algorithm,
which means that you ought to be able to wrap the byte[] in a
ByteArrayInputStream, and just call read() in succession to get the
characters you need. Note that this is not true, for example, of Boyer-
Moore, so this is a unique characteristic of the algorithm. When you
may need to look backwards (such as in Boyer-Moore), the code to search
a large file on disk becomes much more complex and error-prone.

On the other hand, Boyer-Moore sometimes doesn't need to examine every
byte. If the search pattern is extremely large (on the order of several
kilobytes, for example), this could be a win in performance for a disk-
based search, since disk access is extremely slow.

(I'm not familiar with Rabin-Karp, which Thomas also mentioned. I
therefore have no comment on its advantages or disadvantages for this
problem.)

--
www.designacourse.com
The Easiest Way To Train Anyone... Anywhere.

Chris Smith - Lead Software Developer/Technical Trainer
MindIQ Corporation
 
C

Chris Smith

Oops! Guess I should have waited for your response before writing my
own!

Thomas Weidenfeller said:
Boyer-Moore requires some backing-up to verify a potential match. If I
am not mistaken, at worst as much as the wanted pattern is long. So this
can e.g. easily be handed with a back-buffer of that size.

True, but only if the buffer is larger than the search pattern. Most of
the time, I imagine this would be the case. It would, though, require
special handling for searching at the end of the buffer (for example,
you'd probably use two chained buffers so that at least one buffer's
worth of data would always be available).

--
www.designacourse.com
The Easiest Way To Train Anyone... Anywhere.

Chris Smith - Lead Software Developer/Technical Trainer
MindIQ Corporation
 
T

Thomas Weidenfeller

Chris said:
Oops! Guess I should have waited for your response before writing my
own!

Why? This is Usenet :)
True, but only if the buffer is larger than the search pattern.

I really have to look it up. AFAIR, B-M requires a maximum back-up of
the size of the search pattern. It would be best to have that as a
buffer. Otherwise things can become really slow - if one e.g. uses a
random access file.
Most of
the time, I imagine this would be the case. It would, though, require
special handling for searching at the end of the buffer (for example,
you'd probably use two chained buffers so that at least one buffer's
worth of data would always be available).

Without closer examination of the algorithm, I would go for a
ring-buffer of the size of the wanted pattern, and a byte-by-byte read.
Each read byte is added to the ring-buffer at the current write position
and then further examined according to the algorithm. The write index of
the ring buffer is advanced, and when the end of the buffer is reached,
the write index is wrapped around to the first index of the buffer.

When having to back up, a read index in the buffer can be moved around
in the buffer, relative to the write index.

/Thomas
 
C

Chris Smith

Thomas Weidenfeller said:
I really have to look it up. AFAIR, B-M requires a maximum back-up of
the size of the search pattern. It would be best to have that as a
buffer. Otherwise things can become really slow - if one e.g. uses a
random access file.

Yes, I'd agree in general.

However, I can imagine certain situations, as I said in my other
response, where you're looking for maximum performance on large search
patterns and decide to use Boyer-Moore with a much smaller buffer (say,
512 bytes or so) and RandomAccessFile, with the hope that you won't need
to read the majority of the file. You'd need to be very careful not to
trigger idiosynchrasies of disks that are optimized for forward access,
though.
Without closer examination of the algorithm, I would go for a
ring-buffer of the size of the wanted pattern, and a byte-by-byte read.

Yes, that's also an option. It's a mere matter of a constant multiplier
of space efficiency, and we all know that doesn't matter, right? ;)

--
www.designacourse.com
The Easiest Way To Train Anyone... Anywhere.

Chris Smith - Lead Software Developer/Technical Trainer
MindIQ Corporation
 
T

Tim Jowers

Thomas Weidenfeller said:
Look at (Google for) algorithms like Boyer-Moore, Knuth-Morris-Pratt or
Rabin-Karp.


/Thomas

Which "quick" are you talking about?

String.split() using a regular expression of what you want to match as
StringTokenizer only works on char delims I think. Or maybe something
like this: http://javaalmanac.com/egs/java.util.regex/Tokenize.html.
Or did you mean quick performance?

Seems to me that the algorithm can be optimised by skipping
compares. On what unit? For instance, if your first byte set is
"abbacdedf" then you could measure from statistical sampling (or
measures from a book) that "b,c,d, or f" are much less common than "a
or e", then you can cut the sample set up by matching on those letter.
E.g. start in the middle. If the set of 8 characters does not contain
a "b" or an "f", then recurse each side taking blocks of 8 characters
- cutting your processing time down by 4x or more. Thus, I project
your division is determined by the data and that an optimal division
would account for pattern frequencies in the match set and the input
set. Of course this algorithm can be reversed to use patterns in the
incoming data versus the match set. OTOH, you could do a string
tokenization on the string to be matched and just get a faster
machine. :)

What did you come up with?

TimJowers
 
A

Andrew Thompson

While we're on that subject. Perhaps some of us should make an effort to
trim their sig to 4 lines or less..? ;)

Standards of etiquette are mostly set by the contributors to a group.

I'll make you a deal, Frank. If you start a new thread about sig.
delimiters and can get any *three* regulars on this group to agree
with your view, I will shorten my sig..
 
R

Ryan Tan via JavaKB.com

Hi, thank you all for you invaluable input. I will have a good look at the suggested search algorithm.

In my case, the byte patterns to search are small (less than 100 bytes). So, I was thinking of reading a buffer (say 2048 bytes) and apply one of the algorithm before reading again from (the last position - the size of the pattern) and so on. This way my overhead is only the pattern size. What do you think? If this is the case, I would have to use a forward-searching algorithm like Knuth-Morris-Pratt.

Mr Jowers, by saying "quick" I meant a quick method of coding and also best performance (which sometimes coincide). Since I am interested in processing binary files, there is not much sense in using a dictionary (correct me if I am wrong) because all the data are completely random. And also, all the string functions would not work well with binary data.

Thanks again for all the input. Man, I wish someone has already done a byte[].indexOf method :p

_____________________
Message posted via http://www.javakb.com
 
A

Andrew Thompson

...
I'll make you a deal, Frank. If you start a new thread about sig.
delimiters ..

Ughhh.. I meant sig. *length*.
..and can get any *three* regulars on this group to agree
with your view, I will shorten my sig..

(below, for those unfamiliar with the abbreviation 'sig.')
 
C

Chris Smith

Ryan Tan via JavaKB.com said:
Thanks again for all the input. Man, I wish someone has already done a
byte[].indexOf method :p

Google told me that there's a SourceForge project at
http://mockrunner.sourceforge.net/ that contains exactly that. The code
is in a class called com.mockrunner.util.ArrayUtil, and the method is
(somewhat confusingly) named contains... but it acts just like indexOf
does in java.lang.String.

Note that byte array searching is NOT the primary purpose of mockrunner;
the code just happens to exist by coincidence. I know nothing about the
quality of the code there. Mockrunner uses an Apache-style license, so
you are free to use the code in your own project. See the license for
details about your coincident obligations, though, should you choose to
do so.

--
www.designacourse.com
The Easiest Way To Train Anyone... Anywhere.

Chris Smith - Lead Software Developer/Technical Trainer
MindIQ Corporation
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,744
Messages
2,569,482
Members
44,900
Latest member
Nell636132

Latest Threads

Top