Find all occurences of a bit sequence in a file

D

Daneel

Hello!

I'm looking for an algorithm which finds all occurences of a bit
sequence (e.g., "0001") in a file. This sequence can start at any bit
in the file (it is not byte aligned).

I have some ideas of how to approach the problem (1) reading file into
unsigned char buffer, 2) defining bit structure, 3) comparing the
first 4 bits of the buffer with the bit structure, 4) shifting the
char buffer one left, 5) repeat at step 3)) but I'm not sure if this
is a good or the right approach?

It would be great to get some input, possibly with sample source code.

Many thanks,
Michael
 
K

KoopaJah

Hello Daneel,

An easy way to do this could be to work on characters instead of bits. I
mean you could store each bit as a character in a string (0001 becomes
0x30 0x30 0x30 0x31). It should simplify a lot the pattern search!

Don't know if you can use this method,

Remi J.

Daneel a écrit :
 
D

Daneel

An easy way to do this could be to work on characters instead of bits. I
mean you could store each bit as a character in a string (0001 becomes
0x30 0x30 0x30 0x31). It should simplify a lot the pattern search!

Yes, but this would only work if the bit sequence in the file is byte
(or in this case char) aligned (which it is not, as stated in my
initial post), right?

All the best,
Michael
 
K

KoopaJah

Daneel a écrit :
Yes, but this would only work if the bit sequence in the file is byte
(or in this case char) aligned (which it is not, as stated in my
initial post), right?

Hmmm i guess i did not explain myself properly. If you represent each
bit as a character, you do not need to be char/byte aligned. If your
binary content of your file is 0001001100110100, you need to represent
this binary as an array of 16 characters (so 16*8 bits) each char
corresponding to one of your bit with the value 0x30 or 0x31 ONLY (to
represent '0' or '1')

If you are searching for the pattern 0100 in binary, it corresponds to
the patterns 0x30 0x31 0x30 0x30 in your char array representation


Searching a pattern on such a string becomes really easy even if it asks
for 8th more memory to store the file content

Sincerely,

Remi J.
 
D

Daneel

An easy way to do this could be to work on characters instead of bits. I
Hmmm i guess i did not explain myself properly. If you represent each
bit as a character, you do not need to be char/byte aligned. If your
binary content of your file is 0001001100110100, you need to represent
this binary as an array of 16 characters (so 16*8 bits) each char
corresponding to one of your bit with the value 0x30 or 0x31 ONLY (to
represent '0' or '1')

If you are searching for the pattern 0100 in binary, it corresponds to
the patterns 0x30 0x31 0x30 0x30 in your char array representation

Searching a pattern on such a string becomes really easy even if it asks
for 8th more memory to store the file content

Oh, I see. How would you translate a binary file into such a string?

Thanks,
Michael
 
M

Mark P

Daneel said:
Hello!

I'm looking for an algorithm which finds all occurences of a bit
sequence (e.g., "0001") in a file. This sequence can start at any bit
in the file (it is not byte aligned).

I have some ideas of how to approach the problem (1) reading file into
unsigned char buffer, 2) defining bit structure, 3) comparing the
first 4 bits of the buffer with the bit structure, 4) shifting the
char buffer one left, 5) repeat at step 3)) but I'm not sure if this
is a good or the right approach?

It would be great to get some input, possibly with sample source code.

Many thanks,
Michael

See, e.g., http://en.wikipedia.org/wiki/Knuth-Morris-Pratt_algorithm

You'll have better luck posting algorithm questions on comp.programming

-Mark
 
W

W Karas

Hello!

I'm looking for an algorithm which finds all occurences of a bit
sequence (e.g., "0001") in a file. This sequence can start at any bit
in the file (it is not byte aligned).

I have some ideas of how to approach the problem (1) reading file into
unsigned char buffer, 2) defining bit structure, 3) comparing the
first 4 bits of the buffer with the bit structure, 4) shifting the
char buffer one left, 5) repeat at step 3)) but I'm not sure if this
is a good or the right approach?

It would be great to get some input, possibly with sample source code.

Many thanks,
Michael

(Assuming an 8-bit byte) if your pattern is 8 bits or less, it might
help
you to first fill in these two arrays:

unsigned char unsplit[256], split[256][256];

where unsplit[val] gives the number of occurences of the pattern
in the binary representation of "val", and split[val1][val2] gives the
number of occurences of the pattern in the (16-bit) concatenation
of the binary representations of "val1"and "val2", minus
(unsplit[val1] + unsplit[val2]). If b1, b2, b3 ... are the bytes in
the file, then the solution is:

unsplit[b1] + split[b1][b2] + unsplit[b2] + split[b2][b3]
+ unsplit[b3] + split[b3][b4] + ...

The file might have to be several hundred Kbytes for the overhead
of generating the arrays to be worth it. The rate of cache evictions
of the arrays would also probably have a big influence on
performance.
 
P

paul.joseph.davis

I'm looking for an algorithm which finds all occurences of a bit
sequence (e.g., "0001") in a file. This sequence can start at any bit
in the file (it is not byte aligned).
I have some ideas of how to approach the problem (1) reading file into
unsigned char buffer, 2) defining bit structure, 3) comparing the
first 4 bits of the buffer with the bit structure, 4) shifting the
char buffer one left, 5) repeat at step 3)) but I'm not sure if this
is a good or the right approach?
It would be great to get some input, possibly with sample source code.
Many thanks,
Michael

(Assuming an 8-bit byte) if your pattern is 8 bits or less, it might
help
you to first fill in these two arrays:

unsigned char unsplit[256], split[256][256];

where unsplit[val] gives the number of occurences of the pattern
in the binary representation of "val", and split[val1][val2] gives the
number of occurences of the pattern in the (16-bit) concatenation
of the binary representations of "val1"and "val2", minus
(unsplit[val1] + unsplit[val2]). If b1, b2, b3 ... are the bytes in
the file, then the solution is:

unsplit[b1] + split[b1][b2] + unsplit[b2] + split[b2][b3]
+ unsplit[b3] + split[b3][b4] + ...

The file might have to be several hundred Kbytes for the overhead
of generating the arrays to be worth it. The rate of cache evictions
of the arrays would also probably have a big influence on
performance.

Another method to do this would to be to make 8 copies of your search
pattern. Making each copy shifted by one bit. (Note, after 8 shifts
you have what you started with + 1 byte of zeros.)

Then perform a regular NxM comparison of all 8 copies on the input
data paying special attention to the last byte of the pattern. (First
shited copy you're going to want to only match 7 of 8 bits, second
shifted copy 6 of 8) and so on.

Then its just adding up the total matches for each shifted pattern.
The benefit you get from this are that you can apply any of the string
matching algorithms with a slight modification for the last byte match
(There are string matching algorithms that are sub-linear).

Hth,
Paul Davis
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,769
Messages
2,569,579
Members
45,053
Latest member
BrodieSola

Latest Threads

Top