Finding a bit string in another bitstring

J

jacob navia

Here is a function to do the equivalent of strstr with bitstrings.

The basic ideas is to prepare a table of shifted patterns, and compare
bytes instead of bits, what makes things faster and easier.

I would like to thank Mr Morris Keesan and Mr Peter Nilsson of this discussion
group for their help with the bit shifting.

I present this function for you to shake it. Please try to find bugs,
inconsistencies or things that could be written better.

Thanks in advance for your help
-------------------------------------------------------------------bitstrstr.c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <limits.h>
static unsigned char mask[] = { 255,127,63,31,15,7,3,1};
static unsigned char maskI[]= { 255,1,3,7,15,31,63,127};
/*------------------------------------------------------------------------
Procedure: bitBitstr ID:1
Purpose: Finds the bit position of the given bit-pattern in
the given text. The algorithm is as follows:
(1) It builds a table of CHAR_BIT lines, each line
holding a one bit right shifted pattern.
(2) For each byte in the text it will compare the
byte with each byte in the shifted pattern table. A
match is found if all bytes are equal.
(3) The first byte and the last byte must be masked
because they contain some bit positions with unused
bits.
Input: The bit-string to be searched and its length, and
the bit-pattern and its length. Both lengths in bits
Output: Zero if the pattern isn't found, or 1+bit-index of
the starting bit of the pattern if the pattern is
found.
Errors: If there is no memory it will return "pattern not
found" (zero).
------------------------------------------------------------------------*/
size_t bitBitstr(unsigned char *text,size_t textSize,unsigned char *pat,size_t patSize)
{
size_t patByteLength = (patSize+CHAR_BIT-1)/CHAR_BIT;
size_t i,j,k,stop;
unsigned char *ShiftTable,*ptable,*pText;

if (patSize > textSize) /* eliminate wrong arguments */
return 0;
/* Allocate the shift table. If no memory abort with zero. */
ShiftTable = calloc(patByteLength,CHAR_BIT);
if (ShiftTable == NULL) {
return 0;
}
/* The first line of the shift table is unshifted. Just copy it */
memcpy(ShiftTable,pat,patByteLength);
ptable = ShiftTable + patByteLength; /* Advance poinbter to next line */
for (i=1; i<CHAR_BIT; i++) {
/* Copy the last line into the new line */
memcpy(ptable,ptable-patByteLength,patByteLength);
/* Now shift it by one bit */
k=patByteLength;
for (j = 0; k--; ptable++) {
j = (j << CHAR_BIT) | *ptable;
*ptable = j >> 1;
}
}

/* We should stop when the number of bytes left is less than the pattern length */
stop = (textSize+CHAR_BIT-1)/CHAR_BIT - patByteLength;

/* For each byte in the input text */
for (k=0; k <= stop; k++) {
/* For each line in the shift table */
for (i=0; i<CHAR_BIT;i++) {
ptable = ShiftTable+i*patByteLength; /* will point to the table line */
pText = text+k; /* Will point at the current start character in the text */
/* The first character needs a mask since there could be
unused bits */
if ((*pText++ & mask) != *ptable++)
break;
/* Search each byte in the input text until last byte minus 1 */
for (j=1; j<patByteLength-1; j++) {
if (*pText++ != *ptable++)
break;
}
if (j == patByteLength-1) {
/* Test the last byte with the complemented mask */
if ((*pText & maskI[patSize&(CHAR_BIT-1)]) == *ptable) {
free(ShiftTable);
return 1+(k*CHAR_BIT)+i;
}
}
}
}
/* There was no match */
free(ShiftTable);
return 0;
}

#ifdef TEST
#include <stdio.h>
int main(void)
{
unsigned char *txt="\xfb\x1\x2\x3\x4\xba\xdb\xad\x5\x7\x2\x77";
unsigned char *pat="\xba\xdb\xad\x1";
size_t txtlen=12*CHAR_BIT;
size_t patlen=1+3*CHAR_BIT;
size_t result = bitBitstr(txt,txtlen,pat,patlen);
if (result != 41)
printf("failed (%u)\n",result);
}

#endif
 
R

robertwessel2

Here is a function to do the equivalent of strstr with bitstrings.

The basic ideas is to prepare a table of shifted patterns, and compare
bytes instead of bits, what makes things faster and easier.

I would like to thank Mr Morris Keesan and Mr Peter Nilsson of this discussion
group for their help with the bit shifting.

I present this function for you to shake it. Please try to find bugs,
inconsistencies or things that could be written better.

Thanks in advance for your help
-------------------------------------------------------------------bitstrst­r.c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <limits.h>
static unsigned char mask[] = { 255,127,63,31,15,7,3,1};
static unsigned char maskI[]= { 255,1,3,7,15,31,63,127};
/*------------------------------------------------------------------------
  Procedure:     bitBitstr ID:1
  Purpose:       Finds the bit position of the given bit-pattern in
                 the given text. The algorithm is as follows:
                 (1) It builds a table of CHAR_BIT lines, each line
                 holding a one bit right shifted pattern.
                 (2) For each byte in the text it will compare the
                 byte with each byte in the shifted pattern table. A
                 match is found if all bytes are equal.
                 (3) The first byte and the last byte must be masked
                 because they contain some bit positions with unused
                 bits.
  Input:         The bit-string to be searched and its length, and
                 the bit-pattern and its length. Both lengths in bits
  Output:        Zero if the pattern isn't found, or 1+bit-index of
                 the starting bit of the pattern if the pattern is
                 found.
  Errors:        If there is no memory it will return "pattern not
                 found" (zero).
------------------------------------------------------------------------*/
size_t bitBitstr(unsigned char *text,size_t textSize,unsigned char *pat,size_t patSize)
{
        size_t patByteLength = (patSize+CHAR_BIT-1)/CHAR_BIT;
        size_t i,j,k,stop;
        unsigned char *ShiftTable,*ptable,*pText;

        if (patSize > textSize) /* eliminate wrong arguments */
                return 0;
        /* Allocate the shift table. If no memory abort with zero.. */
        ShiftTable = calloc(patByteLength,CHAR_BIT);
        if (ShiftTable == NULL) {
                return 0;
        }
        /* The first line of the shift table is unshifted. Just copy it */
        memcpy(ShiftTable,pat,patByteLength);
        ptable = ShiftTable + patByteLength; /* Advance poinbter to next line */
        for (i=1; i<CHAR_BIT; i++) {
                /* Copy the last line into the new line */
                memcpy(ptable,ptable-patByteLength,patByteLength);
                /* Now shift it by one bit */
                k=patByteLength;
            for (j = 0; k--; ptable++) {
                j = (j << CHAR_BIT) | *ptable;
                *ptable = j >> 1;
            }
        }

        /* We should stop when the number of bytes left is less than the pattern length */
        stop = (textSize+CHAR_BIT-1)/CHAR_BIT - patByteLength;

        /* For each byte in the input text */
        for (k=0; k <= stop; k++) {
                /* For each line in the shift table */
                for (i=0; i<CHAR_BIT;i++) {
                        ptable = ShiftTable+i*patByteLength; /* will point to the table line */
                        pText = text+k; /* Will point at the current start character in the text */
                        /* The first character needs a mask since there could be
                           unused bits */
                        if ((*pText++ & mask) != *ptable++)
                                break;
                        /* Search each byte in the input text until last byte minus 1 */
                        for (j=1; j<patByteLength-1; j++) {
                                if (*pText++ != *ptable++)
                                        break;
                        }
                        if (j == patByteLength-1) {
                                /* Test the last byte with the complemented mask */
                                if ((*pText & maskI[patSize&(CHAR_BIT-1)]) == *ptable) {
                                        free(ShiftTable);
                                        return 1+(k*CHAR_BIT)+i;
                                }
                        }
                }
        }
        /* There was no match */
        free(ShiftTable);
        return 0;

}

#ifdef TEST
#include <stdio.h>
int main(void)
{
        unsigned char *txt="\xfb\x1\x2\x3\x4\xba\xdb\xad\x5\x7\x2\x77";
        unsigned char *pat="\xba\xdb\xad\x1";
        size_t txtlen=12*CHAR_BIT;
        size_t patlen=1+3*CHAR_BIT;
        size_t result = bitBitstr(txt,txtlen,pat,patlen);
        if (result != 41)
                printf("failed (%u)\n",result);

}

#endif



Two comments off the bat:

First, your parameterization for different CHAR_BIT values is
incomplete. Specifically the masking operations remain dependent on
eight bit chars.

Second, given the amount of storage you're allocating, and the
considerable amount of work you're doing preparing the shifted
patterns, might a binary KMP not be a better choice? It strikes me
that you wouldn’t need all that much more auxiliary storage, and while
its speed going bit-by bit wont be great, you're still doing a lot of
work looking at each input byte with each of the shifted patterns, and
I suspect it might well be faster for all but the shortest text/
pattern combinations.
 
F

Flash Gordon

Two comments off the bat:

First, your parameterization for different CHAR_BIT values is
incomplete. Specifically the masking operations remain dependent on
eight bit chars.

Second, given the amount of storage you're allocating, and the
considerable amount of work you're doing preparing the shifted
patterns, might a binary KMP not be a better choice? It strikes me
that you wouldn’t need all that much more auxiliary storage, and while
its speed going bit-by bit wont be great, you're still doing a lot of
work looking at each input byte with each of the shifted patterns, and
I suspect it might well be faster for all but the shortest text/
pattern combinations.


Additional points if you stick with this method...

use malloc rather than calloc, since the allocated memory will all be
overwritten

fall back to a method not requiring extra memory if the allocation fails
rather than failing

the user cannot tell a failure from the bit string matching on the first bit
 
J

jacob navia

(e-mail address removed) a écrit :
Two comments off the bat:

First, your parameterization for different CHAR_BIT values is
incomplete. Specifically the masking operations remain dependent on
eight bit chars.

Yes, I will enclose those in #if (CHAR_BIT == 8)
Second, given the amount of storage you're allocating, and the
considerable amount of work you're doing preparing the shifted
patterns, might a binary KMP not be a better choice? It strikes me
that you wouldn’t need all that much more auxiliary storage, and while
its speed going bit-by bit wont be great, you're still doing a lot of
work looking at each input byte with each of the shifted patterns, and
I suspect it might well be faster for all but the shortest text/
pattern combinations.

According to the literature I have read (it was user923005 that gave
me a set of literature pointers) this will be MUCH faster than adapting an
existing algorithm with bit shifting. Anyway, I haven't implemented any
algorithm here (yet), nor KMP nor skip test, nor any other, I am using a
completely naive algorithm in the part that goes

for (j=1; j<patbytelen; j++) {
if ... etc
}

That can be surely be improved, but I want to debug THIS first,
then (maybe, if I have the time) I can implement a better one.

Thanks for your remarks
 
J

jacob navia

Flash Gordon a écrit :
Additional points if you stick with this method...

use malloc rather than calloc, since the allocated memory will all be
overwritten

Yes, but I used calloc because of the multiplication!

calloc should do the multiplication without overflow. If *I* do the
multiplication it could overflow. But maybe it is better to test for
overflow in C, and then call malloc, for big bit strings...

:)

fall back to a method not requiring extra memory if the allocation fails
rather than failing

That would be very complicated! Within the cpontainer library of course, I will
call the standard function for no memory errors, as all other containers. This
function was just a piece to test the algorithm before incorporating it into
the rest of the software. Bitstrings are interesting but I am spending a lot of time
in them...
the user cannot tell a failure from the bit string matching on the first
bit

Yes. In the library that will be different.

Thanks for your remarks
 
B

Ben Bacarisse

jacob navia said:
Flash Gordon a écrit :

Yes, but I used calloc because of the multiplication!

You later do the same multiplication anyway. The calloc is:

ShiftTable = calloc(patByteLength,CHAR_BIT);

and later you have:

for (i=0; i<CHAR_BIT;i++) {
ptable = ShiftTable+i*patByteLength;
/* ... */
}
calloc should do the multiplication without overflow. If *I* do the
multiplication it could overflow. But maybe it is better to test for
overflow in C, and then call malloc, for big bit strings...

That would solve the problem.

<snip>
 
K

Keith Thompson

jacob navia said:
Flash Gordon a écrit :

Yes, but I used calloc because of the multiplication!

calloc should do the multiplication without overflow. If *I* do the
multiplication it could overflow. But maybe it is better to test for
overflow in C, and then call malloc, for big bit strings...

:)
[...]

Yes, calloc *should* do the multiplication without overflow (or at
least return NULL if the implied multiplication would have
overflowed^H^H^H^H^H^H^H^H^H^H wrapped around), but not all
implementations do this correctly.
 
F

Flash Gordon

jacob said:
Flash Gordon a écrit :

Yes, but I used calloc because of the multiplication!

calloc should do the multiplication without overflow. If *I* do the
multiplication it could overflow. But maybe it is better to test for
overflow in C, and then call malloc, for big bit strings...

:)

I suspected your reason, but I think the inefficiencies outweigh it.
That would be very complicated!

Well, the initial check is easy, and you then basically have two
implementations of the function, one with the current method and one
with a different method that does not require extra memory.
Within the cpontainer library of course,
I will
call the standard function for no memory errors, as all other
containers. This
function was just a piece to test the algorithm before incorporating it
into
the rest of the software. Bitstrings are interesting but I am spending a
lot of time
in them...

They raise interesting problems, but I've not had need of them, although
I have needed effectively bit arrays for other reasons.
Yes. In the library that will be different.

Thanks for your remarks

No problem.
 
B

bartc

jacob navia said:
Flash Gordon a écrit :

Yes, but I used calloc because of the multiplication!

calloc should do the multiplication without overflow. If *I* do the
multiplication it could overflow. But maybe it is better to test for
overflow in C, and then call malloc, for big bit strings...

Precalculate the largest value of patByteLength that will not overflow when
multiplied by CHAR_BIT. If patByteLength is less than or equal to this
limit, use malloc, otherwise...
That would be very complicated!

Not really. You have a shifttable of 791 bytes (or whatever comfortable
size) as a local array.

If the patByteLength*CHAR_BIT is less or equal than 791, then ShiftTable
points to this local array, otherwise it's allocated. And a quick check when
freeing.

Of more concern is the speed of the whole thing. If the speed is irrelevant,
and you have bit indexing, then that would a quick solution until you have
time for something better.

But, what are the typical usage patterns? You're assuming a long bitstring
will be searched within a long bitstring, but perhaps a short bitstring
within a long one is more typical (or even a short bitstring within another
short one). (A short bitstring is one that fits into a machine word.) These
cases can be streamlined.

Another assumption I think you've made are that both *text and *pat are
aligned to a byte boundary. Is this going to always be the case? Someone
calls this function and it returns a match at position N. Then he wants to
search for something else starting at N+patSize; that new *text may not
necessarily be aligned, and may be too big to copy and shift.
 
B

Barry Schwarz

Here is a function to do the equivalent of strstr with bitstrings.

The basic ideas is to prepare a table of shifted patterns, and compare
bytes instead of bits, what makes things faster and easier.

I would like to thank Mr Morris Keesan and Mr Peter Nilsson of this discussion
group for their help with the bit shifting.

I present this function for you to shake it. Please try to find bugs,
inconsistencies or things that could be written better.

Thanks in advance for your help
-------------------------------------------------------------------bitstrstr.c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <limits.h>
static unsigned char mask[] = { 255,127,63,31,15,7,3,1};
static unsigned char maskI[]= { 255,1,3,7,15,31,63,127};

Since these values are intended to be masks and not decimal values, I
suggest initializing them with 0x type values just for the
self-documentation benefit.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,755
Messages
2,569,537
Members
45,020
Latest member
GenesisGai

Latest Threads

Top