Q: Recoding some masking routine

F

feck

Hi,

Got a query, don't know if anyone would be kind enough to help...

I have to either match a string exactly eg
"123456789ABCDEF123456789ABCDEF" == "123456789ABCDEF123456789ABCDEF",
or if a packed byte bit mask is set mask1 = 1, mask2 = 4,
"123456789ABCDEF123456789ABCDEF" == "123456789ABCDEF1234x6789ABCDEx".
The mask means that the character at x can match either the original
number, or a set character ('x'). The original string can be up to 80
chars long with 10 8-bit bytes containg the masks.

I have done this, by a plodding for loop with lots of ifs, alas it's
untidy and slow. Any pointers to tidy & speed would be a great help.

Thank-you.
 
C

CBFalconer

Got a query, don't know if anyone would be kind enough to help...

I have to either match a string exactly eg
"123456789ABCDEF123456789ABCDEF" == "123456789ABCDEF123456789ABCDEF",
or if a packed byte bit mask is set mask1 = 1, mask2 = 4,
"123456789ABCDEF123456789ABCDEF" == "123456789ABCDEF1234x6789ABCDEx".
The mask means that the character at x can match either the original
number, or a set character ('x'). The original string can be up to 80
chars long with 10 8-bit bytes containg the masks.

I have done this, by a plodding for loop with lots of ifs, alas it's
untidy and slow. Any pointers to tidy & speed would be a great help.

The optimum will probably depend on your data. I suggest an
initial check with strcmp. If that shows equal, no more fuss. You
might want to write your own str_cmp, and have it return a pointer
to the unequal location. You can then proceed accordingly.

--
If you want to post a followup via groups.google.com, ensure
you quote enough for the article to make sense. Google is only
an interface to Usenet; it's not Usenet itself. Don't assume
your readers can, or ever will, see any previous articles.
More details at: <http://cfaj.freeshell.org/google/>
 
F

feck

What is the relationship between x and the mask?
if that byte is masked, the field can either be it's original value
or one other pre-defined character
so string to compare against = "12345678", mask = 2 (00000010),
pre-defined char = "Z"

Strings to compare:-
"12345678" = match
"123456Z8" = match

"12345Z78" = NO match
"12345578" = NO match

Hope that makes it a little clearer.
 
M

mark_bluemel

Hi,

Got a query, don't know if anyone would be kind enough to help...

I have to either match a string exactly eg
"123456789ABCDEF123456789ABCDEF" == "123456789ABCDEF123456789ABCDEF",
or if a packed byte bit mask is set mask1 = 1, mask2 = 4,
"123456789ABCDEF123456789ABCDEF" == "123456789ABCDEF1234x6789ABCDEx".
The mask means that the character at x can match either the original
number, or a set character ('x'). The original string can be up to 80
chars long with 10 8-bit bytes containg the masks.

I have done this, by a plodding for loop with lots of ifs, alas it's
untidy and slow. Any pointers to tidy & speed would be a great help.

As far as I understand the required behaviour, you have a string
("original") of up to 80 bytes which you are examining, and another
string ("pattern") of the same size (?) against which you are
comparing it. In addition to simple character equality, you need to
test bytes at specific locations - bytes at these locations in
"original" can either match the corresponding byte in "pattern" or
match a single specific other byte. Locations for this dual matching
are specified by bit masks in an additional byte array, with one byte
in the array corresponding to 8 bytes in the original string. Is that
correct so far?

My natural instinct here is to split the matching down into 8-byte
"chunks". This was fairly easy to do and IMHO didn't seem hugely ugly
(I'm not going to post my code yet, as it's far from clear that I've
understood the requirement). How ugly is your code? Could you post a
small self-contained example to show what you've currently tried?

I haven't measured speed in my case. Why do you feel it's slow in
yours? What measures have you used? Have you profiled to the code to
see where the time is going?
 
B

Ben Bacarisse

Hi,

Got a query, don't know if anyone would be kind enough to help...

I have to either match a string exactly eg
"123456789ABCDEF123456789ABCDEF" == "123456789ABCDEF123456789ABCDEF",
or if a packed byte bit mask is set mask1 = 1, mask2 = 4,
"123456789ABCDEF123456789ABCDEF" == "123456789ABCDEF1234x6789ABCDEx".
The mask means that the character at x can match either the original
number, or a set character ('x'). The original string can be up to 80
chars long with 10 8-bit bytes containg the masks.

I have done this, by a plodding for loop with lots of ifs, alas it's
untidy and slow. Any pointers to tidy & speed would be a great help.

It would help more if you posted what *you* have. We could then
decide if we see anything better. Anyway, in case it help, I'd do it
like this in the first instance:

#include <limits.h>

#define BIT_SET(mask, x) ((mask)[(x) / CHAR_BIT] & (1 << (x) % CHAR_BIT))

int quasi_match(const char *s1, const char *s2,
const unsigned char *mask, char optional)
{
int i = 0;
while (s1 && (s1 == s2 || BIT_SET(mask, i) && s2 == optional))
i += 1;
return s1 == s2;
}

The meaning (numbering) of the mask bits can be changed to reflect
your usage simply by altering the arithmetic in the macro.

There is quite a lot of arithmetic in the macro but:
(a) the compiler should make /CHAR_BIT and %CHAR_BIT fast on most
machines;
(b) it is only used when the characters *don't* match.
 
F

feck

As far as I understand the required behaviour, you have a string
("original") of up to 80 bytes
No, can be up to 40, but actually shorter (37) (5 mask bytes * 8 bits
= 40 'masks').
which you are examining, and another
string ("pattern") of the same size (?) against which you are
comparing it. In addition to simple character equality, you need to
test bytes at specific locations - bytes at these locations in
"original" can either match the corresponding byte in "pattern" or
match a single specific other byte. Locations for this dual matching
are specified by bit masks in an additional byte array, with one byte
in the array corresponding to 8 bytes in the original string. Is that
correct so far?
100% correct. Either I'm good at explaining (*grin*) or you can read
between the lines (more likely *sigh*).
My natural instinct here is to split the matching down into 8-byte
"chunks". This was fairly easy to do and IMHO didn't seem hugely ugly
(I'm not going to post my code yet, as it's far from clear that I've
understood the requirement). How ugly is your code?
So ugly the code wears a mask!
Could you post a small self-contained example to show what you've currently tried?
Of course I can:-


//str1 is the string to compare
//str2 is the const comparison string,
//it has six extra bytes, 5 for the masks and 1 that is the actual
value we need (Action)
//DEF_STR_LEN is the length of str2
//Z is the 'mask' character so the strings can match, or if mask
applies can match to original or the 'Z'.
//Like I said, loads of if's and just looks a mess to me (And I wrote
it!). I HAVEN'T profiled it yet (dreading it!),
//but this is called frequently so the fastest it can be (within
reason).

if (!strncmp((char *)str1,
(char *)str2,
(DEF_STR_LEN - 6))
) Action = str2[(DEF_STR_LEN - 1)];
else if (groupMask)
{
j = 0;
do
{
//Frig for strs that aren't a multiple of 8.
if (j == 4) l = 4; else l = 7;
{
k = 0;
do
{
if (*(str1+(j*8)+k) == str2[((j*8) +
k)]) match = 1;
else if (str2[(DEF_STR_LEN - 2 + j)]
<< k & 0x80) if(*(str1+(j*8)+k) == 'Z') match = 1;
else match = 0;
} while ((k++ < l) && match);
}
} while ((++j < 5) && match);
if (match) action = str2[(DEF_STR_LEN - 1)];
}
I haven't measured speed in my case. Why do you feel it's slow in
yours? What measures have you used? Have you profiled to the code to
see where the time is going?
Not profiled (yet), but I can't see how it's going to wizz!

Kind Regards.
 
F

feck

On Thu, 24 May 2007 17:31:42 GMT, (e-mail address removed) wrote:

SNIP
if (!strncmp((char *)str1,
(char *)str2,
(DEF_STR_LEN - 6))
) Action = str2[(DEF_STR_LEN - 1)];
else if (groupMask)
{
j = 0;
do
{
//Frig for strs that aren't a multiple of 8.
if (j == 4) l = 4; else l = 7;
{
k = 0;
do
{
if (*(str1+(j*8)+k) == str2[((j*8) +
k)]) match = 1;
else if (str2[(DEF_STR_LEN - 2 + j)]
<< k & 0x80) if(*(str1+(j*8)+k) == 'Z') match = 1;
Oooops, cut & paste! That should read:-
else if ((str2[(DEF_STR_LEN - 2 + j)] << k & 0x80) && (*(str1+(j*8)+k)
== 'Z')) match = 1;
else match = 0;
} while ((k++ < l) && match);
}
} while ((++j < 5) && match);
if (match) action = str2[(DEF_STR_LEN - 1)];
}
SNIP
 
M

mark_bluemel

Here's my off-the-cuff version. I like Ben's version for terseness,
but I wanted to produce something which expressed my understanding of
the problem - which is that the masks naturally split the original and
pattern strings into chunks.

I'd start with this, or something like it, and then look at profiling
it to find out where time was being eaten and focus on how that could
be trimmed.

#include <stdio.h>
#include <stdlib.h>
#include <limits.h>

int matches(char *original,char *pattern, unsigned char *masks,char
fixed);
int chunkMatches(char *original,char *pattern, unsigned char mask,char
fixed,int chunkSize);

int main(void) {
/* One good and 2 bad strings */
char *original[] = {
"12Z456789ZBCDEF123456789ABCDEF",
"12Z456789ZBCDEF123X56789ABCDEF",
"12Z456789XBCDEF123456789ABCDEF",
NULL
};
char *pattern = "123456789ABCDEF123456789ABCDEF";

/* allow the "fixed" value in char 3 of first 8 byte chunk
char 2 of second 8 byte chunk, but not in the third */
unsigned char masks[3]= { 1 << 5, 1 << 6, 0 };
char fixed = 'Z';
char **o;
for(o=original;*o != NULL;o++) {
printf("[%s] %s match [%s](%c)\n",
*o,matches(*o,pattern,masks,'Z')?"did":"did
not",pattern,fixed);
}
}

int matches(char *original,char *pattern, unsigned char *masks,char
fixed) {
int chunks;
int oLen = strlen(original);
int pLen = strlen(pattern);

if (oLen != pLen) {
return 0;
}

chunks = oLen/CHAR_BIT + ((oLen % CHAR_BIT) != 0);

while(chunks--) {
int chunkSize = strlen(original);
if (chunkSize > CHAR_BIT) {
chunkSize = CHAR_BIT;
}
if (!chunkMatches(original,pattern,*masks,fixed,chunkSize)){
return 0;
}
original += chunkSize;
pattern += chunkSize;
masks += 1;
}

return 1;
}

int chunkMatches(char *original,char *pattern, unsigned char mask,char
fixed,int chunkSize) {
unsigned char currentMask = 1<<(CHAR_BIT - 1);

while(chunkSize--) {
if (*original == *pattern) {
;
} else if ((currentMask & mask) && (*original == fixed)) {
;
} else {
return 0;
}
original++;
pattern++;
currentMask >>= 1;
}
}
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,744
Messages
2,569,482
Members
44,901
Latest member
Noble71S45

Latest Threads

Top