Find all occurences of a bit sequence in a file

Daneel · Feb 8, 2007

Hello!

I'm looking for an algorithm which finds all occurences of a bit
sequence (e.g., "0001") in a file. This sequence can start at any bit
in the file (it is not byte aligned).

I have some ideas of how to approach the problem (1) reading file into
unsigned char buffer, 2) defining bit structure, 3) comparing the
first 4 bits of the buffer with the bit structure, 4) shifting the
char buffer one left, 5) repeat at step 3)) but I'm not sure if this
is a good or the right approach?

It would be great to get some input, possibly with sample source code.

Many thanks,
Michael

KoopaJah · Feb 8, 2007

Hello Daneel,

An easy way to do this could be to work on characters instead of bits. I
mean you could store each bit as a character in a string (0001 becomes
0x30 0x30 0x30 0x31). It should simplify a lot the pattern search!

Don't know if you can use this method,

Remi J.

Daneel a écrit :

Daneel · Feb 8, 2007

An easy way to do this could be to work on characters instead of bits. I

mean you could store each bit as a character in a string (0001 becomes
0x30 0x30 0x30 0x31). It should simplify a lot the pattern search!

Yes, but this would only work if the bit sequence in the file is byte
(or in this case char) aligned (which it is not, as stated in my
initial post), right?

All the best,
Michael

KoopaJah · Feb 8, 2007

Daneel a écrit :

Yes, but this would only work if the bit sequence in the file is byte
(or in this case char) aligned (which it is not, as stated in my
initial post), right?

Hmmm i guess i did not explain myself properly. If you represent each
bit as a character, you do not need to be char/byte aligned. If your
binary content of your file is 0001001100110100, you need to represent
this binary as an array of 16 characters (so 16*8 bits) each char
corresponding to one of your bit with the value 0x30 or 0x31 ONLY (to
represent '0' or '1')

If you are searching for the pattern 0100 in binary, it corresponds to
the patterns 0x30 0x31 0x30 0x30 in your char array representation

Searching a pattern on such a string becomes really easy even if it asks
for 8th more memory to store the file content

Sincerely,

Remi J.

Daneel · Feb 8, 2007

An easy way to do this could be to work on characters instead of bits. I

Hmmm i guess i did not explain myself properly. If you represent each
bit as a character, you do not need to be char/byte aligned. If your
binary content of your file is 0001001100110100, you need to represent
this binary as an array of 16 characters (so 16*8 bits) each char
corresponding to one of your bit with the value 0x30 or 0x31 ONLY (to
represent '0' or '1')

If you are searching for the pattern 0100 in binary, it corresponds to
the patterns 0x30 0x31 0x30 0x30 in your char array representation

Searching a pattern on such a string becomes really easy even if it asks
for 8th more memory to store the file content

Oh, I see. How would you translate a binary file into such a string?

Thanks,
Michael

Mark P · Feb 8, 2007

Daneel said:
Hello!

I'm looking for an algorithm which finds all occurences of a bit
sequence (e.g., "0001") in a file. This sequence can start at any bit
in the file (it is not byte aligned).

I have some ideas of how to approach the problem (1) reading file into
unsigned char buffer, 2) defining bit structure, 3) comparing the
first 4 bits of the buffer with the bit structure, 4) shifting the
char buffer one left, 5) repeat at step 3)) but I'm not sure if this
is a good or the right approach?

It would be great to get some input, possibly with sample source code.

Many thanks,
Michael

See, e.g., http://en.wikipedia.org/wiki/Knuth-Morris-Pratt_algorithm

You'll have better luck posting algorithm questions on comp.programming

-Mark

Default User · Feb 8, 2007

KoopaJah said:
Hello Daneel,

Please don't top-post. Your replies belong following or interspersed
with properly trimmed quotes. See the majority of other posts in the
newsgroup, or the group FAQ list:
<http://www.parashift.com/c++-faq-lite/how-to-post.html>

W Karas · Feb 8, 2007

Hello!

I'm looking for an algorithm which finds all occurences of a bit
sequence (e.g., "0001") in a file. This sequence can start at any bit
in the file (it is not byte aligned).

I have some ideas of how to approach the problem (1) reading file into
unsigned char buffer, 2) defining bit structure, 3) comparing the
first 4 bits of the buffer with the bit structure, 4) shifting the
char buffer one left, 5) repeat at step 3)) but I'm not sure if this
is a good or the right approach?

It would be great to get some input, possibly with sample source code.

Many thanks,
Michael

(Assuming an 8-bit byte) if your pattern is 8 bits or less, it might
help
you to first fill in these two arrays:

unsigned char unsplit[256], split[256][256];

where unsplit[val] gives the number of occurences of the pattern
in the binary representation of "val", and split[val1][val2] gives the
number of occurences of the pattern in the (16-bit) concatenation
of the binary representations of "val1"and "val2", minus
(unsplit[val1] + unsplit[val2]). If b1, b2, b3 ... are the bytes in
the file, then the solution is:

unsplit[b1] + split[b1][b2] + unsplit[b2] + split[b2][b3]
+ unsplit[b3] + split[b3][b4] + ...

The file might have to be several hundred Kbytes for the overhead
of generating the arrays to be worth it. The rate of cache evictions
of the arrays would also probably have a big influence on
performance.

paul.joseph.davis · Feb 9, 2007

Hello!

Click to expand...

I'm looking for an algorithm which finds all occurences of a bit
sequence (e.g., "0001") in a file. This sequence can start at any bit
in the file (it is not byte aligned).

Click to expand...

I have some ideas of how to approach the problem (1) reading file into
unsigned char buffer, 2) defining bit structure, 3) comparing the
first 4 bits of the buffer with the bit structure, 4) shifting the
char buffer one left, 5) repeat at step 3)) but I'm not sure if this
is a good or the right approach?

Click to expand...

It would be great to get some input, possibly with sample source code.

Click to expand...

Many thanks,
Michael

Click to expand...

(Assuming an 8-bit byte) if your pattern is 8 bits or less, it might
help
you to first fill in these two arrays:

unsigned char unsplit[256], split[256][256];

where unsplit[val] gives the number of occurences of the pattern
in the binary representation of "val", and split[val1][val2] gives the
number of occurences of the pattern in the (16-bit) concatenation
of the binary representations of "val1"and "val2", minus
(unsplit[val1] + unsplit[val2]). If b1, b2, b3 ... are the bytes in
the file, then the solution is:

unsplit[b1] + split[b1][b2] + unsplit[b2] + split[b2][b3]
+ unsplit[b3] + split[b3][b4] + ...

The file might have to be several hundred Kbytes for the overhead
of generating the arrays to be worth it. The rate of cache evictions
of the arrays would also probably have a big influence on
performance.

Another method to do this would to be to make 8 copies of your search
pattern. Making each copy shifted by one bit. (Note, after 8 shifts
you have what you started with + 1 byte of zeros.)

Then perform a regular NxM comparison of all 8 copies on the input
data paying special attention to the last byte of the pattern. (First
shited copy you're going to want to only match 7 of 8 bits, second
shifted copy 6 of 8) and so on.

Then its just adding up the total matches for each shifted pattern.
The benefit you get from this are that you can apply any of the string
matching algorithms with a slight modification for the last byte match
(There are string matching algorithms that are sub-linear).

Hth,
Paul Davis

Program to find occurences of a word in a file	18	Nov 20, 2007
Finding a bit sequence in another	11	Oct 22, 2009
Cycle around a sequence	9	Feb 8, 2012
searching sequence in file	8	Mar 11, 2009
frequency count or number of occurences of a number in an array	5	Mar 12, 2008
show all subset of a set	11	Feb 6, 2014
Data saving in condition of changing reality	0	Apr 29, 2022
How can I get my menu inside of a menu to function properly?	1	Jan 19, 2023

Find all occurences of a bit sequence in a file

Daneel

KoopaJah

Daneel

KoopaJah

Daneel

Mark P

Default User

W Karas

paul.joseph.davis

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads