how to simplify/shorten this loop of bit shifting code?

B

ben

i have a bit of code, that works absolutely fine as is, but seems over
complicated/long winded. is there anyway to shorten/simplify it?

the code is below. description of it: it's like strcpy in that it
copies one block of data to another block of data until the block that
is being copied contains a zero/null. the difference with this code is
that it's doing 4bits at a time (all the values are 4bits) and the two
blocks of data may not be aligned the same - that is the start of the
data that's being copied could be the rightmost 4bits of a normal byte,
and the start of the data that's being written to could be the leftmost
4bits of a byte for example - so out of sync with each other as it
were. also the terminating NULL is a 4bit value.



// 'datatocopy' is a u_int8_t pointer to the start of the data to copy.

// 'tocopyto' is a u_int8_t pointer to the start of the space to
// copy to.

// 'st' is an int for adding to datatocopy to step along it (the number
// of 4bit blocks from the left (starting at 0)).

// 'st2' dito, but for tocopyto.

while( ((*(datatocopy + ((st % 2 ? st-1 : st)/2 ))) >> (st % 2 ? 0 :
4) & 0xf) != 0 ) { // steps along and reads 4bits at a time

*(tocopyto + ((st2 % 2 ? st2 - 1 : st2) / 2) ) |= 0xf << (st2
% 2 ? 0 : 4); // zero the 4 bits...

*(tocopyto + ((st2 % 2 ? st2 - 1 : st2) / 2) ) ^= 0xf << (st2
% 2 ? 0 : 4); // ...about to be written to

tmp = *(datatocopy + ((st % 2 ? st-1 : st)/2 )); // get the
8bit byte that includes required 4bits

if( st % 2 ) { // it's odd (required 4bits are on the right).
tmp |= 0xf0; // delete the,
tmp ^= 0xf0; // left 4bits.
} else // it's even (required 4bits on the left).
tmp >>= 4; // shift to the right.

*(tocopyto + ((st2 % 2 ? st2 - 1 : st2) / 2)) |= tmp << ( st2
% 2 ? 0 : 4); // copy the 4bits

st++, st2++; // increment 4bit block counters
}



i'm using an intermediate temperory variable half way through - is that
necessary? it just seems an awful lot to do what it's doing.

i'd just like to really consolidate it somehow - possible? easy?
pointers to how to?

any help much apprecaited.

thanks, ben.
 
B

ben

i missed something obvious, but i'd still like to know if it could be
consolidated further? thanks.


// 'datatocopy' is a u_int8_t pointer to the start of the data to copy.

// 'tocopyto' is a u_int8_t pointer to the start of the space to
// copy to.

// 'st' is an int for adding to datatocopy to step along it (the number
// of 4bit blocks from the left (starting at 0)).

// 'st2' dito, but for tocopyto.


while( (tmp = ((*(datatocopy + ((st % 2 ? st-1 : st)/2 ))) >> (st % 2
? 0 : 4) & 0xf)) != 0 ) { // step along and read 4bits at a time

*(tocopyto + ((st2 % 2 ? st2 - 1 : st2) / 2) ) |= 0xf << (st2
% 2 ? 0 : 4); // zero the 4 bits....

*(tocopyto + ((st2 % 2 ? st2 - 1 : st2) / 2) ) ^= 0xf << (st2
% 2 ? 0 : 4); // ....about to be written to

*(tocopyto + ((st2 % 2 ? st2 - 1 : st2) / 2)) |= tmp << ( st2
% 2 ? 0 : 4); // copy the bits

st++, st2++;
}
 
C

Christian Bau

i missed something obvious, but i'd still like to know if it could be
consolidated further? thanks.


// 'datatocopy' is a u_int8_t pointer to the start of the data to copy.

// 'tocopyto' is a u_int8_t pointer to the start of the space to
// copy to.

// 'st' is an int for adding to datatocopy to step along it (the number
// of 4bit blocks from the left (starting at 0)).

// 'st2' dito, but for tocopyto.


while( (tmp = ((*(datatocopy + ((st % 2 ? st-1 : st)/2 ))) >> (st % 2
? 0 : 4) & 0xf)) != 0 ) { // step along and read 4bits at a time

*(tocopyto + ((st2 % 2 ? st2 - 1 : st2) / 2) ) |= 0xf << (st2
% 2 ? 0 : 4); // zero the 4 bits....

Quite unlikely that |= will set anything to zero.
*(tocopyto + ((st2 % 2 ? st2 - 1 : st2) / 2) ) ^= 0xf << (st2
% 2 ? 0 : 4); // ....about to be written to

*(tocopyto + ((st2 % 2 ? st2 - 1 : st2) / 2)) |= tmp << ( st2
% 2 ? 0 : 4); // copy the bits

st++, st2++;
}

(st2 % 2 ? st2 - 1 : st2) / 2

is exactly the same as

st2 / 2

I would most likely write something like

for (;;) {
src_index = st / 2;
src_shift = st % 2 ? 0 : 4;
src_value = (src [src_index] >> src_shift) & 0x0f;
if (src_value == 0) break;

dst_index = st2 / 2;
dst_shift = st2 % 2 ? 0 : 4;
dst [dst_index] &= ~(0xf << dst_shift);
dst [dst_index] |= src_value << dst_shift;
}
 
C

Chris Torek

i have a bit of code, that works absolutely fine as is, but seems over
complicated/long winded. is there anyway to shorten/simplify it?

Certainly.

(See original for longer description and original code and variable
names -- the idea is to copy 4-bit units from C99-style 8-bit
uint8_t storage units, from source to destination, not necessarily
at the same 4-bit offset. The mask is apparently 0x0f, then 0xf0,
then 0x0f, then 0xf0, etc., based on nybble offset of 0, 1, 2, etc.,
as counted by "st" and "st2".)
// 'st' is an int for adding to datatocopy to step along it (the number
// of 4bit blocks from the left (starting at 0)).

while( ((*(datatocopy + ((st % 2 ? st-1 : st)/2 ))) >> (st % 2 ? 0 :
4) & 0xf) != 0 ) { // steps along and reads 4bits at a time

Note that if st is always nonnegative, there is no need to test
"st % 2" before dividing by 2, as 0/2 is 0, 1/2 is 0, 2/2 is 1, 3/2
is 1, and so on. Just making use of that fact would simplify the
code quite a bit.

If the loop is expected to run for relatively many 4-bit "nybbles",
though, I would compute an initial set of loop invariants, e.g.:

s: points to the uint8_t that contains the source nybble (4-bit unit)
d: points to the uint8_t that contains the destination nybble
sm: source mask, 0x0f or 0xf0 based on whether st is even or odd
dm: destination mask, 0x0f or 0xf0, based on st2 respectively

s = &datatocopy[st / 2];
d = &tocopyto[st2 / 2];
sm = st % 2 ? 0xf0 : 0x0f;
dm = st2 % 2 ? 0xf0 : 0x0f;

At this point you have a choice of "compact code" vs "fast code".
There are two possibilities: sm == dm, or sm != dm, and if you
expect the loop to run quite a lot, you might want to check. If
the two masks are the same, you can copy 8-bit units at a time
with no masking at all (since uint8_t is exactly 8 bits) until
you hit the first 8-bit byte that contains a 4-bit nybble that
is zero:

if (sm == dm) {
... take care of first upper nybble if required
(but note that this might be 0!) ...
/* then do whole bytes while possible */
if (did not already stop)
while ((*s | (*s >> 4)) & 0x0f) {
*d++ = *s++;
st += 2;
st2 += 2;
}
... then take care of last nybble if required ...
} else {
...
}

Even when the masks differ you can still read and copy quickly,
although this gets into "diminishing returns" territory, by reading
two source 8-bit-bytes, composing a destination byte, storing that,
and then reading one new source byte and reusing the "left over"
nybble in the other previously-read source byte.

Assuming you go for "compact code" instead, though, we just set up
the loop invariants as shown above, then do this:

while ((bits = *s & sm) != 0) {
/*
* We have some nonzero bits to copy, saved up in "bits".
* Move them to whichever half is required...
*/
if (sm != dm) {
if (sm & 0xf0) /* same as "if st % 2" */
bits >>= 4;
else
bits <<= 4;
}
*d = (*d & ~dm) | bits;
st++;
st2++;
#if 0
/* restore invariants -- this is one way, but not as compact */
if (sm & 0x0f) /* same as "if st%2 == 0" */
sm = 0xf0;
else {
sm = 0x0f;
s++;
}
if (dm & 0x0f)
dm = 0xf0;
else {
dm = 0x0f;
d++;
}
#else
/*
* The other way is to just recompute them from scratch,
* same as at the top of the loop. It is likely that the
* compiler will re-use that code (and/or we could compute just
* s and sm here, and compute d and dm just after the "while").
*/
s = &datatocopy[st / 2];
d = &tocopyto[st2 / 2];
sm = st % 2 ? 0xf0 : 0x0f; /* or see below */
dm = st2 % 2 ? 0xf0 : 0x0f;
#endif
}

Finally, it is worth considering just doing one "offsets do not
match" special case, because it removes a number of inside-the-loop
tests even when copying 4 bits at a time. We also get to use some
special properties: in 8 bits, ~0x0f is 0xf0, and vice versa; and
since st and st2 are (presumably) never negative, st%2 and st&1
produce the same values. (The compiler should optimize this away
for you, of course, if st and st2 have unsigned types, so it just
becomes a question of which you think is more readable.)

s = &datatocopy[st / 2];
d = &tocopyto[st2 / 2];
mask = st & 1 ? 0xf0 : 0x0f;

if ((st & 1) == (st2 & 1)) {
/*
* Nybble masks match, so do it the "easy way" (no shifting).
* Note that we want to increment the 8-bit pointers only when
* the values in st and st2 are odd, which is also easy to do
* by an arithmetic trick (provided we add to s and d *before*
* increment st and st2).
*/
while ((bits = *s & mask) != 0) {
*d = (*d & ~mask) | bits;
s += st & 1; /* +1 if odd, +0 if even */
d += st2 & 1;
st++;
st2++;
mask = ~mask; /* 0x0f => 0xf0; 0xf0 => 0x0f */
}
} else {
/*
* Nybble masks do not match; we have to alternate shifting
* left and right by 4. We can either unroll the loop a bit,
* or just test either st or st2 inside it (as here).
*/
while ((bits = *s & mask) != 0) {
if (st & 1) {
/*
* Source is odd (upper) nybble so move down.
* We know that (source) mask == 0xf0 here so
* dest mask should be 0x0f. Might as well
* increment s (and not d) here too.
*/
bits >>= 4;
*d = (*d & ~0x0f) | bits;
s++;
} else {
/*
* Source is even (lower) nybble so move up.
* Again could use constants...
*/
bits <<= 4;
*d = (*d & ~0xf0) | bits;
d++;
}
st++;
st2++;
mask = ~mask;
}
}

Unrolling the "do not mask" loop is ugly, but just for illustration,
here is a version using a goto (the other way is to have two separate
unrolled loops!):

} else {
if (st & 1)
goto opposite_mask_st_odd;
/* unrolled loop, in which st is even and st2 is odd at top: */
for (;;) {
/* here st is even and st2 is odd */
if ((bits = *s & 0x0f) == 0)
break;
*d = (*d & ~0xf0) | (bits << 4);
st++, st2++, d++;
opposite_mask_st_odd:
/* here st is odd and st2 is even */
if ((bits = *s & 0xf0) == 0)
break;
*d = (*d & ~0x0f) | (bits >> 4);
st++, st2++, s++;
}
}

Unrolling actually eliminates the "mask" variable entirely since
we just do each half, and eliminates the even/odd test inside the
loop -- so it winds up being about as long, code-wise.

Note that one can unroll even the "mask is the same" loop, but in
that case it is not quite as clever -- it just lets you use constants
for the masks, and eliminate s++ and d++ in one half of the unrolled
loop. If this were assembly, I probably would do it; in C... well,
*maybe* if profiling showed a lot of time going here.

(Incidentally, I will add here that I spent a lot of time optimizing
assembly-language memcpy() on SPARCV9, where you use some special
registers and instructions that require 64-*byte* alignment. This
needs a lot of leading and trailing mopping-up work, but is
fundamentally the same problem.)
 
B

ben

Christian Bau said:
Quite unlikely that |= will set anything to zero.

of course. dear oh dear. should have made use of the ~ flip operating
having set to 1's. :/ thanks for pointing that out.
(st2 % 2 ? st2 - 1 : st2) / 2

is exactly the same as

st2 / 2

right, i see. so if you've got an odd number and you divide by two (in
an int) you're guaranteed that'll round up downwards? is that ok to
rely on that? i guess it must be. ok thanks.
I would most likely write something like

for (;;) {
src_index = st / 2;
src_shift = st % 2 ? 0 : 4;
src_value = (src [src_index] >> src_shift) & 0x0f;
if (src_value == 0) break;

dst_index = st2 / 2;
dst_shift = st2 % 2 ? 0 : 4;
dst [dst_index] &= ~(0xf << dst_shift);
dst [dst_index] |= src_value << dst_shift;
}

that looks great although there doesn't seem to be any incrementing of
steppers (or am i just not seeing that? - don't think so) but that
shouldn't be hard to add. i'll give it a go now. it certainly looks
clearner and simpler than my version - great, thanks very much for your
time :)

ben.
 
B

ben

btw, this took me *ages* to digest - still haven't completely - an
amazing reply.


Chris Torek said:
Note that if st is always nonnegative, there is no need to test
"st % 2" before dividing by 2, as 0/2 is 0, 1/2 is 0, 2/2 is 1, 3/2
is 1, and so on. Just making use of that fact would simplify the
code quite a bit.

yes i now realise that. i thought it was not advised to rely on that
but that is obviously not the case.
If the loop is expected to run for relatively many 4-bit "nybbles",
though, I would compute an initial set of loop invariants, e.g.:

s: points to the uint8_t that contains the source nybble (4-bit unit)
d: points to the uint8_t that contains the destination nybble
sm: source mask, 0x0f or 0xf0 based on whether st is even or odd
dm: destination mask, 0x0f or 0xf0, based on st2 respectively

s = &datatocopy[st / 2];
d = &tocopyto[st2 / 2];
sm = st % 2 ? 0xf0 : 0x0f;
dm = st2 % 2 ? 0xf0 : 0x0f;

At this point you have a choice of "compact code" vs "fast code".
There are two possibilities: sm == dm, or sm != dm, and if you
expect the loop to run quite a lot, you might want to check. If
the two masks are the same, you can copy 8-bit units at a time
with no masking at all (since uint8_t is exactly 8 bits) until
you hit the first 8-bit byte that contains a 4-bit nybble that
is zero:

if (sm == dm) {
... take care of first upper nybble if required

did you mean lower there? as in the right most 4bits of the first 8bit
byte? i could be wrong.
(but note that this might be 0!) ...
/* then do whole bytes while possible */
if (did not already stop)
while ((*s | (*s >> 4)) & 0x0f) {

i really can't get my head round that line. it's goal is to find out if
either nybble is zero, right? but it doesn't does it? (again, i could
easily be wrong here but that's the way it seems to me). eg:

01000000 -*s
00000100 -*s>>4
01000100 - the OR'd answer
00001111 -0xf
00000100 -the &'d answer

so that would be a posative and the loop would get run. and the same
happens if the two nybbles are swapped round i think. it seems the only
way that'd give a negative answer to the while() is if *s was 0 ? in
which case that line might aswell be while(*s)

i must have must have something wrong?

*d++ = *s++;
st += 2;
st2 += 2;
}
... then take care of last nybble if required ...
} else {
...
}

Even when the masks differ you can still read and copy quickly,
although this gets into "diminishing returns" territory, by reading
two source 8-bit-bytes, composing a destination byte, storing that,
and then reading one new source byte and reusing the "left over"
nybble in the other previously-read source byte.

Assuming you go for "compact code" instead, though, we just set up
the loop invariants as shown above, then do this:

while ((bits = *s & sm) != 0) {
/*
* We have some nonzero bits to copy, saved up in "bits".
* Move them to whichever half is required...
*/
if (sm != dm) {
if (sm & 0xf0) /* same as "if st % 2" */
bits >>= 4;
else
bits <<= 4;
}
*d = (*d & ~dm) | bits;
st++;
st2++;
#if 0
/* restore invariants -- this is one way, but not as compact */
if (sm & 0x0f) /* same as "if st%2 == 0" */
sm = 0xf0;
else {
sm = 0x0f;
s++;
}
if (dm & 0x0f)
dm = 0xf0;
else {
dm = 0x0f;
d++;
}
#else
/*
* The other way is to just recompute them from scratch,
* same as at the top of the loop. It is likely that the
* compiler will re-use that code (and/or we could compute just
* s and sm here, and compute d and dm just after the "while").
*/
s = &datatocopy[st / 2];
d = &tocopyto[st2 / 2];
sm = st % 2 ? 0xf0 : 0x0f; /* or see below */
dm = st2 % 2 ? 0xf0 : 0x0f;
#endif
}

Finally, it is worth considering just doing one "offsets do not
match" special case, because it removes a number of inside-the-loop
tests even when copying 4 bits at a time. We also get to use some
special properties: in 8 bits, ~0x0f is 0xf0, and vice versa; and
since st and st2 are (presumably) never negative, st%2 and st&1
produce the same values. (The compiler should optimize this away
for you, of course, if st and st2 have unsigned types, so it just
becomes a question of which you think is more readable.)

s = &datatocopy[st / 2];
d = &tocopyto[st2 / 2];
mask = st & 1 ? 0xf0 : 0x0f;

if ((st & 1) == (st2 & 1)) {
/*
* Nybble masks match, so do it the "easy way" (no shifting).
* Note that we want to increment the 8-bit pointers only when
* the values in st and st2 are odd, which is also easy to do
* by an arithmetic trick (provided we add to s and d *before*
* increment st and st2).
*/
while ((bits = *s & mask) != 0) {
*d = (*d & ~mask) | bits;
s += st & 1; /* +1 if odd, +0 if even */
d += st2 & 1;
st++;
st2++;
mask = ~mask; /* 0x0f => 0xf0; 0xf0 => 0x0f */ i like that line :)
}
} else {
/*
* Nybble masks do not match; we have to alternate shifting
* left and right by 4. We can either unroll the loop a bit,
* or just test either st or st2 inside it (as here).
*/
while ((bits = *s & mask) != 0) {
if (st & 1) {
/*
* Source is odd (upper) nybble so move down.

hmm, maybe i've got the terminology wrong - i thought upper was on the
left and lower on the right. odd is when the nybble is on the right,
right?
* We know that (source) mask == 0xf0 here so
* dest mask should be 0x0f. Might as well
* increment s (and not d) here too.
*/
bits >>= 4;
*d = (*d & ~0x0f) | bits;
s++;
} else {
/*
* Source is even (lower) nybble so move up.
* Again could use constants...
*/
bits <<= 4;
*d = (*d & ~0xf0) | bits;
d++;
}
st++;
st2++;
mask = ~mask;
}
}

yes i think that's the one i'm going to go for.
Unrolling the "do not mask" loop is ugly, but just for illustration,
here is a version using a goto (the other way is to have two separate
unrolled loops!):

} else {
if (st & 1)
goto opposite_mask_st_odd;
/* unrolled loop, in which st is even and st2 is odd at top: */
for (;;) {
/* here st is even and st2 is odd */
if ((bits = *s & 0x0f) == 0)
break;
*d = (*d & ~0xf0) | (bits << 4);
st++, st2++, d++;
opposite_mask_st_odd:
/* here st is odd and st2 is even */
if ((bits = *s & 0xf0) == 0)
break;
*d = (*d & ~0x0f) | (bits >> 4);
st++, st2++, s++;
}
}

Unrolling actually eliminates the "mask" variable entirely since
we just do each half, and eliminates the even/odd test inside the
loop -- so it winds up being about as long, code-wise.

Note that one can unroll even the "mask is the same" loop, but in
that case it is not quite as clever -- it just lets you use constants
for the masks, and eliminate s++ and d++ in one half of the unrolled
loop. If this were assembly, I probably would do it; in C... well,
*maybe* if profiling showed a lot of time going here.

(Incidentally, I will add here that I spent a lot of time optimizing
assembly-language memcpy() on SPARCV9, where you use some special
registers and instructions that require 64-*byte* alignment. This
needs a lot of leading and trailing mopping-up work, but is
fundamentally the same problem.)

wow that was quite a stunning reply Chris :) thanks very much. it's
made my head spin a bit. still haven't fathommed it all, but a good
amount of it went in successfully. i'm going to go over it again
tomorrow. very interesting and educational though that's for sure.

thanks-a-lot, ben.
 
C

Chris Torek

yes i now realise that. i thought it was not advised to rely on that
but that is obviously not the case.

For negative numbers, there are multiple possibilities in C89 --
(-1)/2 can be either 0 or -1. For positive numbers (or C99) there
is only one allowed answer (1/2 => 0). That was why I put in the
"assuming these are nonnegative".
did you mean lower there? as in the right most 4bits of the first 8bit
byte? i could be wrong.

Naming can get tricky -- if bits 0,1,2,3 make up the value 0x0f and
bits 4,5,6,7 make up the value 0xf0, I would call the latter an "upper
nybble", but some documentation says the opposite (but tends to number
"bit 0" from the left so that bits 0 through 3 make up the value 0xf0!).
Since you are taking an otherwise-atomic unit -- a C99-style "uint8_t"
-- apart, you wind up imposing an ordering: do you take the left
("upper", as I called it) 0xf0 bits first, then the right ("lower")
bits, or the other way around? In my previous article I assumed it
was lower, then upper -- little-endian nybble order, as it were. If
this is wrong the code needs some adjustment.
i really can't get my head round that line. it's goal is to find out if
either nybble is zero, right? but it doesn't does it?

Whoops, right you are. It stops when *both* are zero.

There seems to be no really good shortcut for "either one zero":
ORing the two nybbles tests for "both zero", ANDing them fails if
the bits do not match up, and XOR is no help at all. You could
use a lookup table, I suppose.
 
N

Norbert Juffa

Chris Torek said:
Whoops, right you are. It stops when *both* are zero.

There seems to be no really good shortcut for "either one zero":
ORing the two nybbles tests for "both zero", ANDing them fails if
the bits do not match up, and XOR is no help at all. You could
use a lookup table, I suppose.

One could use the nibble-analogon to Alan Mycroft's trick for finding
a zero byte in a 32-bit word: ((x - 0x01010101) & (0x80808080 & ~x))
is non-zero iff at least one of the four bytes is a zero byte. For
the nibble test, this gives us ((*s - 0x11) & (0x88 & (~ *s))).

But there is only 2-way parallelism here so I doubt it would be much
faster than the obvious (!((*s & 0x0f) && (*s >> 4))).

-- Norbert
 
B

ben

Chris Torek said:
For negative numbers, there are multiple possibilities in C89 --
(-1)/2 can be either 0 or -1. For positive numbers (or C99) there
is only one allowed answer (1/2 => 0). That was why I put in the
"assuming these are nonnegative".

that's good to know.
Naming can get tricky -- if bits 0,1,2,3 make up the value 0x0f and
bits 4,5,6,7 make up the value 0xf0, I would call the latter an "upper
nybble", but some documentation says the opposite (but tends to number
"bit 0" from the left so that bits 0 through 3 make up the value 0xf0!).
Since you are taking an otherwise-atomic unit -- a C99-style "uint8_t"
-- apart, you wind up imposing an ordering: do you take the left
("upper", as I called it) 0xf0 bits first, then the right ("lower")
bits, or the other way around? In my previous article I assumed it
was lower, then upper -- little-endian nybble order, as it were. If
this is wrong the code needs some adjustment.

the way you name it corresponds with how i see it - left=upper
right=lower. i must have misunderstood which actual nybble needed to be
dealt with at that moment (twice i think) which made me think there was
a naming issue.
Whoops, right you are. It stops when *both* are zero.

There seems to be no really good shortcut for "either one zero":
ORing the two nybbles tests for "both zero", ANDing them fails if
the bits do not match up, and XOR is no help at all. You could
use a lookup table, I suppose.

i see Norbert has posted a way - i'll have a go with that.

great, thanks again :)

ben.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,744
Messages
2,569,483
Members
44,903
Latest member
orderPeak8CBDGummies

Latest Threads

Top