efficient approach to check a set bit

B

Ben Bacarisse

bartc said:
Ben Bacarisse said:
A solution using such a de Bruijn sequence has already been posted[1].
On my x86_64 machine it is the fastest solution so far:

mine & and shift: 84ns
mod 37 + lookup: 43ns
de Bruijn multiply plus lookup: 19ns

All are compiler with optimisation (the function ended up inlined and
%37 was done as previouslt descibed).

The de Bruijn method omits the x & -x from the bithacks site since we
don't need it here:

int bit_set_4(uint32_t x)
{
static const int MultiplyDeBruijnBitPosition[32] = {
0, 1, 28, 2, 29, 14, 24, 3, 30, 22, 20, 15, 25, 17, 4, 8,
31, 27, 13, 23, 21, 19, 16, 7, 26, 12, 18, 6, 11, 5, 10, 9
};
return MultiplyDeBruijnBitPosition[x * 0x077CB531U >> 27];
}

That's pretty good, about as fast (on my machine) as a function
containing just a single BSF inline instruction.

But it might be worth pointing out that this code I think only works
properly if there is one and only one bit set. For x==0, it gives the
same result as x==1, and for x's with extra bits set, it seems to go
wrong.

You mean pointed out more explicitly than my saying that I removed the
x & -x from the linked bithack because we don't need it in this
example? More explicitly than all the text up thread about "if you
really have only one bit set then..."? :)
BSF (find lowest set bit) also doesn't deal with x = 0, but gives
expected results on the rest.

To add to my timings:

gcc asm("bsf"): 14ns;
 
B

bartc

Ben Bacarisse said:
bartc said:
Ben Bacarisse said:
A solution using such a de Bruijn sequence has already been posted[1].
On my x86_64 machine it is the fastest solution so far:

mine & and shift: 84ns
mod 37 + lookup: 43ns
de Bruijn multiply plus lookup: 19ns

All are compiler with optimisation (the function ended up inlined and
%37 was done as previouslt descibed).
int bit_set_4(uint32_t x)
{
static const int MultiplyDeBruijnBitPosition[32] = {
0, 1, 28, 2, 29, 14, 24, 3, 30, 22, 20, 15, 25, 17, 4, 8,
31, 27, 13, 23, 21, 19, 16, 7, 26, 12, 18, 6, 11, 5, 10, 9
};
return MultiplyDeBruijnBitPosition[x * 0x077CB531U >> 27];
}

That's pretty good, about as fast (on my machine) as a function
containing just a single BSF inline instruction.

But it might be worth pointing out that this code I think only works
properly if there is one and only one bit set. For x==0, it gives the
same result as x==1, and for x's with extra bits set, it seems to go
wrong.

You mean pointed out more explicitly than my saying that I removed the
x & -x from the linked bithack because we don't need it in this
example? More explicitly than all the text up thread about "if you
really have only one bit set then..."? :)

Yeah but that was last week.
To add to my timings:

gcc asm("bsf"): 14ns;

I hadn't looked at your actual figures too closely before, but they don't
seem that fast.

On my gcc -O0 3.4.5 x86-32 setup, I'm getting about 1.5 secs for 320 million
calls, about 5ns each (summary of code below). With your original code, 3.5
to 7 secs without inlining (11 to 22ns). My interpreted language, even, does
11ns. Properly inlined assembler I think was some 0.6secs, around 2ns.

(With gcc -O1/2/3, the timings just go to zero.)

Obvuiously your tests must be very different or include extra overheads.

/* bit_set_4() defined with unsigned long (32-bit) parameter */
int main(void) {
int a;
int i,j,k,n;
#define N 1000000

//starttimer();

for (n=0; n<N; ++n){
j=1;
for (i=0; i<32; ++i) {
a=bit_set_4(j);
a=bit_set_4(j);
a=bit_set_4(j);
a=bit_set_4(j);
a=bit_set_4(j);
a=bit_set_4(j);
a=bit_set_4(j);
a=bit_set_4(j);
a=bit_set_4(j);
a=bit_set_4(j);
j<<=1;
}
}
//stoptimer();

}

I'm sure that's 320 million calls to bit_set_4() in there.
 
B

Ben Bacarisse

bartc said:
Ben Bacarisse said:
bartc said:
A solution using such a de Bruijn sequence has already been posted[1].
On my x86_64 machine it is the fastest solution so far:

mine & and shift: 84ns
mod 37 + lookup: 43ns
de Bruijn multiply plus lookup: 19ns

All are compiler with optimisation (the function ended up inlined and
%37 was done as previouslt descibed).
To add to my timings:

gcc asm("bsf"): 14ns;

I hadn't looked at your actual figures too closely before, but they
don't seem that fast.

Ah, yes. I used the wrong divisor when I divided to get ns per op
from total seconds. Double checking, I now get:

mask and shift: 5.22ns minus overheads: 4.99ns
mod 37: 2.71ns 2.48ns
de Bruijn: 1.18ns 0.95ns
asm(BSF): 0.89ns 0.66ns
overhead (estimate): 0.23ns
On my gcc -O0 3.4.5 x86-32 setup, I'm getting about 1.5 secs for 320
million calls, about 5ns each (summary of code below). With your
original code, 3.5 to 7 secs without inlining (11 to 22ns). My
interpreted language, even, does 11ns. Properly inlined assembler I
think was some 0.6secs, around 2ns.

The above seem reasonable now. For example, 0.66ns is
16.5 cycles of my processor. The Intel manual suggests that BSF has a
latency of 16 cycles. The de Bruijn code is therefore about 1.4 times
the speed of the single instruction.
(With gcc -O1/2/3, the timings just go to zero.)

My timing are for optimised code (-O2 seems to be as fast as it gets).
Obvuiously your tests must be very different or include extra
overheads.

No, just faulty arithmetic!

<snip>
 
M

Moi

But it might be worth pointing out that this code I think only works
properly if there is one and only one bit set. For x==0, it gives the
same result as x==1, and for x's with extra bits set, it seems to go
wrong.

BSF (find lowest set bit) also doesn't deal with x = 0, but gives
expected results on the rest.


GNU's libc on linux has the ffs() ffsl() and ffsll() functions,
which are basically a BSF instruction (if present on the hardware),
; but with 1 added, so the rightmost bit will be returned as 1.
A return value of zero indicates "no bits were set".

Personally I would prefer -1 as a return value
for "no bits set", but that's a matter of taste.

HTH,
AvK
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,769
Messages
2,569,580
Members
45,055
Latest member
SlimSparkKetoACVReview

Latest Threads

Top