bitcounting on 30000+ bits, but not byte aligned

Eric

I have an array that contains 30000+ bits. The size varies at runtime.
I need to move through this chunk of memory and count bits every so
often. Example: the first 9 bits have 2 ones, the next 10 bits have 4
ones, the next 9 bits have 8 ones... I only know the number of bits
that need to be counted at runtime. But I do know it will be two
different lengths that are consecutive numbers; it could be just one
length, but that will be rare. I know the pattern the lengths come in
at runtime. The lengths will normally be around 32. I am using this to
scale down large 1-bit images using a box filter. The different
lengths come from a DDA that creates a filter that evenly spreads the
small box size and large box size over the image. So I will actually
do this same bit counting 30000+ times, depending on the height of the
image. Speed is very important. I can't use asm, though, since it will
run on different platforms. I think there should be some way to
precompute a lot of this, since all the rows will have the same
pattern of bit lengths that I need to count. I know about using a
precomputed table for counting bits in 1 byte, and I am sure that will
be part of it. But what else can I precompute to save time? Thank you.
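
To give an idea of where the lengths come from, the DDA boils down to
a Bresenman-style error accumulator spreading src_width source bits
over dst_width output pixels, roughly like this (the names are just
for illustration):

/* Each output pixel gets a box of floor(src_width / dst_width) or
   ceil(src_width / dst_width) source bits: two consecutive lengths. */
void make_lengths(int src_width, int dst_width, int *lengths)
{
    int i, err = 0;
    for (i = 0; i < dst_width; i++) {
        int len = src_width / dst_width;
        err += src_width % dst_width;
        if (err >= dst_width) {   /* error overflowed: use the large box */
            err -= dst_width;
            len++;
        }
        lengths[i] = len;
    }
}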
 
Walter Tross

Eric 2004-06-24:
[original question snipped]

I guess you would be more on-topic in comp.graphics.algorithms, although
I would try to explain the "lengths" a bit better if I were you; I
didn't get it, although I have done 1-bit image processing professionally.
All that comes to my mind is: if the all-zeroes and/or all-ones cases
are frequent, handle them with an if before doing anything else.
BTW, I guess you know this trick for counting bits:
bits = 0;
while (x) {
    x &= x - 1;   /* clears the lowest set bit */
    bits++;
}

Walter Tross
 
CFG

Why not just use a lookup table?

bits = lut[x]; // lut contains 256 elements and x is a byte

instead of

while (x) {
    x &= x - 1;
    bits++;
}
 
Walter Tross

CFG 2004-06-25:
Why not just use a lookup table?

bits = lut[x]; // lut contains 256 elements and x is a byte

instead of

while (x) {
    x &= x - 1;
    bits++;
}

Maybe I was not clear enough. I said "BTW", and I meant the trick to be
used mainly to build the lookup table. It can also be used instead of the
lookup table when ones are known to be very rare (it practically includes
the if (!x) test, which I would put in front of the rest if ones are less rare).
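
For instance, a sketch of building the 256-entry table with that trick
(assuming a global bit_count[] as in the other replies):

static unsigned char bit_count[256];

void init_bit_count(void)
{
    int i;
    for (i = 0; i < 256; i++) {
        int x = i, bits = 0;
        while (x) {        /* clear the lowest set bit */
            x &= x - 1;
            bits++;
        }
        bit_count[i] = (unsigned char)bits;
    }
}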

Walter Tross
 
John Cochran

Eric wrote:
[original question snipped]

You already mention using a precomputed table, which is a very good
start. The only other thing I can think of is if your range doesn't
start or stop on an even byte boundary. If that's the case, you will
also want a small table of masks to AND with the byte prior to using
the lookup table. Some quick pseudocode would be:

handle 1st byte, which is potentially partial.
handle all full bytes.
handle last byte, which is also potentially partial.

The 1st and last bytes would be handled by ANDing them prior to using
the lookup table.
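
As a rough sketch of the whole thing (assuming bit 0 is the MSB of the
first byte, and bit_count is the 256-entry table):

/* Count the ones in bits [start, start+len) of buf; len must be > 0.
   The masks for the partial first and last bytes are computed inline
   here; they could come from a precomputed table instead. */
unsigned count_range(const unsigned char *buf, unsigned start, unsigned len)
{
    unsigned end = start + len;                /* one past the last bit */
    unsigned first = start >> 3, last = (end - 1) >> 3;
    unsigned char head = (unsigned char)(0xFF >> (start & 7));
    unsigned char tail = (unsigned char)(0xFF << (7 - ((end - 1) & 7)));
    unsigned count;

    if (first == last)                         /* range within one byte */
        return bit_count[buf[first] & head & tail];

    count = bit_count[buf[first] & head];      /* partial first byte */
    while (++first < last)                     /* full middle bytes  */
        count += bit_count[buf[first]];
    count += bit_count[buf[last] & tail];      /* partial last byte  */
    return count;
}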

Also, if you have a reasonably small upper limit on the length you're
checking, you may wish to handle the group of whole bytes with a
switch statement. Something like this:

switch (number_of_bytes) {
case 8: count += bit_count[*p++];
case 7: count += bit_count[*p++];
case 6: count += bit_count[*p++];
case 5: count += bit_count[*p++];
case 4: count += bit_count[*p++];
case 3: count += bit_count[*p++];
case 2: count += bit_count[*p++];
case 1: count += bit_count[*p++];
}
 
CFG

Also, if you have a reasonably small upper limit on the length you're
checking, you may wish to handle the group of whole bytes with a
switch statement. Something like this:

switch (number_of_bytes) {
case 8: count += bit_count[*p++];
case 7: count += bit_count[*p++];
case 6: count += bit_count[*p++];
case 5: count += bit_count[*p++];
case 4: count += bit_count[*p++];
case 3: count += bit_count[*p++];
case 2: count += bit_count[*p++];
case 1: count += bit_count[*p++];
}


I would rewrite this switch as follows:

switch (number_of_bytes) {
case 8: count += bit_count[p[7]];
case 7: count += bit_count[p[6]];
case 6: count += bit_count[p[5]];
case 5: count += bit_count[p[4]];
case 4: count += bit_count[p[3]];
case 3: count += bit_count[p[2]];
case 2: count += bit_count[p[1]];
case 1: count += bit_count[p[0]];
}
p += 8;
 
John Cochran

I would rewrite this switch as follows:

switch (number_of_bytes) {
case 8: count += bit_count[p[7]];
[...]
case 1: count += bit_count[p[0]];
}
p += 8;
^^^^^^

p += number_of_bytes;
 
Eric

You already mention using a precomputed table, which is a very good
start. The only other thing I can think of is if your range doesn't
start or stop on an even byte boundary. If that's the case, you will
also want a small table of masks to AND with the byte prior to using
the lookup table. Some quick pseudocode would be:

handle 1st byte, which is potentially partial.
handle all full bytes.
handle last byte, which is also potentially partial.

The 1st and last bytes would be handled by ANDing them prior to using
the lookup table.

This is exactly what I was thinking of. I think I will also need a
table of offsets from the start of the array to tell me which byte each
group starts in, because most of the time the last byte will be partial,
causing the next group to start in the same byte, but sometimes it
will land evenly on a byte boundary, causing the next group to start in
the next byte.
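
Roughly what I have in mind for the precomputation (the names are just
for illustration; bits numbered MSB-first as above):

/* Built once from the DDA lengths, then reused for every row. */
struct group {
    unsigned first_byte, last_byte;  /* may be the same byte        */
    unsigned char head_mask;         /* for the partial first byte  */
    unsigned char tail_mask;         /* for the partial last byte   */
};

void precompute_groups(const int *lengths, int ngroups, struct group *g)
{
    unsigned bit = 0;
    int i;
    for (i = 0; i < ngroups; i++) {
        unsigned end = bit + lengths[i];       /* one past the last bit */
        g[i].first_byte = bit >> 3;
        g[i].last_byte  = (end - 1) >> 3;
        g[i].head_mask  = (unsigned char)(0xFF >> (bit & 7));
        g[i].tail_mask  = (unsigned char)(0xFF << (7 - ((end - 1) & 7)));
        bit = end;
    }
}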
Also, if you have a reasonably small upper limit on the length you're
checking, you may wish to handle the group of whole bytes with a
switch statement. Something like this:

I hadn't thought of doing this. Thanks. I may use this along with
adding a default: that can handle any number of bytes. That way I get
the speedup for the most common cases, but I don't break if I need to
use a longer group of whole bytes. I could even build a 2-byte table
and use it along with my 1-byte table. I would need to test that,
though; the 2-byte table may cause too many cache misses. I am trying
to keep all the lookup tables and the array of bits in cache. Glad L2
cache is so big now. Should I try to keep it all in L1, or is L2 fine?
I need to find that out. The array of bits probably doesn't get much
benefit from being in cache, though, since most of the bytes are only
read once, although the memory address doesn't change.
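
If the 2-byte table turns out to be worth it, it would at least be
cheap to build from the 1-byte one (64 KB, so it may or may not stay
in cache); a sketch:

static unsigned char bit_count16[65536];

void init_bit_count16(void)   /* call after bit_count[] is filled in */
{
    unsigned i;
    for (i = 0; i < 65536; i++)
        bit_count16[i] = bit_count[i & 0xFF] + bit_count[i >> 8];
}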
switch (number_of_bytes) {
case 8: count += bit_count[*p++];
[...]
case 1: count += bit_count[*p++];
}

I don't really see how this will count the bits, though. Say
number_of_bytes = 4; then count += bit_count[*p++] gives me the bit
count of one byte, and you exit the switch statement. Should it not be
written as:

switch (number_of_bytes) {
case 8: count = bit_count[*p++] + bit_count[*p++] + ...;
case 7: count = bit_count[*p++] + bit_count[*p++] + ...;
case 6: count = bit_count[*p++] + bit_count[*p++] + ...;
case 5: count = bit_count[*p++] + bit_count[*p++] + ...;
case 4: count = bit_count[*p++] + bit_count[*p++] + ...;
case 3: count = bit_count[*p++] + bit_count[*p++] + ...;
case 2: count = bit_count[*p++] + bit_count[*p++];
case 1: count = bit_count[*p++];
}
 
John Cochran

[quoted switch example and proposed rewrite snipped]
Nope. You forgot about the unique behavior of switch statements in C and C++,
namely that their default behavior is to fall through to the next case unless
a "break" statement is encountered. Notice that the example I gave you
doesn't have any break statements. Also notice that the order of the case
labels is descending. Your idea of having a default clause is a good one.
The method that I would use would be:

switch (number_of_bytes) {
default:
    /* loop to handle many bytes... */
    break;
case 8: count += bit_count[*p++];
case 7: count += bit_count[*p++];
case 6: count += bit_count[*p++];
case 5: count += bit_count[*p++];
case 4: count += bit_count[*p++];
case 3: count += bit_count[*p++];
case 2: count += bit_count[*p++];
case 1: count += bit_count[*p++];
case 0: ;
}

Notice the additional case 0 and the fact that I placed the default FIRST.
The reason for the added case 0 is that in the first example, without a
default, the 0 case did no work. In the new version the default does work,
and I don't see any good reason to have the zero case tested twice (once by
the switch, again by the loop in the default case). The reason I placed
the default case first is to allow the fast path (cases 0 through 8) to
possibly fall off the end of the switch statement and avoid an unneeded
jump opcode in the compiled version. The default path will have a jump
opcode to skip past the other cases, but since it's the slow path anyway,
an extra jump shouldn't be noticed.

Come to think of it, if your loop in the default case processes only
(number_of_bytes - 8) bytes, you can get rid of the break statement and
allow it to fall through to the rest of the switch statement.
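
Putting those two ideas together, something like this (a sketch):

switch (number_of_bytes) {
default:
    {   /* count all but the last 8 bytes, then fall through */
        unsigned n = number_of_bytes - 8;
        do
            count += bit_count[*p++];
        while (--n);
    }
    /* FALLTHROUGH: exactly 8 bytes remain */
case 8: count += bit_count[*p++];
case 7: count += bit_count[*p++];
case 6: count += bit_count[*p++];
case 5: count += bit_count[*p++];
case 4: count += bit_count[*p++];
case 3: count += bit_count[*p++];
case 2: count += bit_count[*p++];
case 1: count += bit_count[*p++];
case 0: ;
}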
 
josh

Eric said:
[original question snipped]

So you might get something like:
32 33 33 33 32 32 33 33 32 ...
but it would be the same for every row?
Is there an upper limit on the length of a run?

You might try doing it sideways, but that wouldn't play very nicely with
the cache. If you could flip the bitmap on its diagonal and interleave
things in blocks of unsigned long, you could cut your passes by a factor
of sizeof(unsigned long)*CHAR_BIT. But moving all the data around would
probably negate any benefit you'd get from that.

You might run the DDA once and store each position as a byte increment
and a mask:

struct RunRec
{
    unsigned cbAdvance;
    std::vector<unsigned char> Mask;
};

Then at each step, advance the pointer by cbAdvance bytes, AND the value
there with the Mask, and count how many bits you have. Then you don't
need bit shifting for every row. (Of course, std::vector<unsigned char>
is probably not the best choice...)
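
Per row, using it would look something like this (a C-style sketch of
the same idea; the names are for illustration only, and bit_count is
the 256-entry table from the other replies):

struct run {
    unsigned advance;           /* bytes to step from the previous run */
    unsigned nbytes;            /* bytes covered by this run           */
    const unsigned char *mask;  /* nbytes mask bytes                   */
};

/* p points at the first byte of the run, after advancing. */
unsigned count_run(const unsigned char *p, const struct run *r)
{
    unsigned i, count = 0;
    for (i = 0; i < r->nbytes; i++)
        count += bit_count[p[i] & r->mask[i]];
    return count;
}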

-josh
 
