bitset<32> and bitset<64> efficiency

N

Ninds

Hi,

I would like to know whether using bitset<32> for bit operations on a 32-bit machine would generally be as efficient as using a 32-bit int.
Moreover, if the answer is yes, would it also hold for bitset<64> and a 64-bit int on a 64-bit arch?
I realise the standard says nothing about the implementation, so there is no definitive answer, but what is 'likely' to be the case?


Thanks
N
 
J

Johannes Bauer

I would like to know whether using bitset<32> for bit operations on a 32-bit machine would generally be as efficient as using a 32-bit int.
Moreover, if the answer is yes, would it also hold for bitset<64> and a 64-bit int on a 64-bit arch?
I realise the standard says nothing about the implementation, so there is no definitive answer, but what is 'likely' to be the case?

Why don't you try it out for your setup? I just tried:

#include <bitset>
#include <string>
#include <iostream>

int main(int argc, char ** argv) {
std::bitset<64> i, j;
std::cerr << sizeof(i) << std::endl;
__asm__ __volatile__("nop");
i.set(argc);
__asm__ __volatile__("nop");
std::cerr << i << std::endl;
__asm__ __volatile__("nop");
j.set(64 - argc);
__asm__ __volatile__("nop");
std::cerr << j << std::endl;
__asm__ __volatile__("nop");
i ^= j;
__asm__ __volatile__("nop");
std::cerr << i << std::endl;
return 0;
}

on Linux x86_64 with g++ 4.6.2. The "set" always results in a call, but
the xor is done as if it were an int, i.e.:

// First set
400ce0: ba 01 00 00 00 mov $0x1,%edx
400ce5: 48 63 f3 movslq %ebx,%rsi
400ce8: 48 89 e7 mov %rsp,%rdi
400ceb: e8 60 01 00 00 callq 400e50
<std::bitset<64ul>::set(unsigned long, bool)>

[...]
// Second set
400d07: be 40 00 00 00 mov $0x40,%esi
400d0c: 48 8d 7c 24 10 lea 0x10(%rsp),%rdi
400d11: ba 01 00 00 00 mov $0x1,%edx
400d16: 29 de sub %ebx,%esi
400d18: 48 63 f6 movslq %esi,%rsi
400d1b: e8 30 01 00 00 callq 400e50
<std::bitset<64ul>::set(unsigned long, bool)>

[...]
// XOR operation
400d39: 48 8b 44 24 10 mov 0x10(%rsp),%rax
400d3e: 48 31 04 24 xor %rax,(%rsp)

A bit curious: I'd have thought "set" would get inlined/optimized away,
but YMMV.

Best regards,
Johannes

 
W

W Karas

[Quotes Ninds' question and Johannes Bauer's reply in full.]
It seems that typical C++ compilers often fail to inline, even when doing so would result in object code that is both smaller and faster. It's a very frustrating aspect of using C++. Can anyone comment on why this is the case? In the case of GCC, I suppose one cannot look a gift horse in the mouth. But the problem seems to exist with compilers that must be licensed at significant cost as well.
 
I

Ian Collins

W Karas wrote:

Please clean up the mess google makes of your replies!
It seems that typical C++ compilers will often fail to inline, even
when doing so would result in object code that was BOTH smaller and
faster. It's a very frustrating aspect of using C++. Can anyone
comment on why this is the case? In the case of GCC, I suppose one
cannot look a gift-horse in the mouth. But the problem seems to
exist with compilers which must be licensed at significant cost as
well.

Typical C++ compilers do a decent job of inlining when optimisation is
enabled. Both compilers I tried (g++ and Sun CC) inline the original
example (which was too mangled in your reply to re-quote).
 
N

Ninds

Hi,

I would like to know whether using bitset<32> for bit operations on a 32-bit machine would generally be as efficient as using a 32-bit int.
Moreover, if the answer is yes, would it also hold for bitset<64> and a 64-bit int on a 64-bit arch?
I realise the standard says nothing about the implementation, so there is no definitive answer, but what is 'likely' to be the case?

Thanks
N
Hi,
Thanks for the replies... but I am afraid I am none the wiser. To put my question in context:
I wanted to write a small, concise piece of code to compute the SHA-256 hash that would be portable (hence free of asm) and self-contained. I realised from the outset that with these constraints it wouldn't be the speediest kid on the block, but I didn't want it to be ridiculously inefficient either. I couldn't think of a cleaner solution than using std::bitset: this way I didn't need to concern myself with the underlying architecture, and at the same time I would be working with the language rather than against it.
I am, however, curious as to why rotation methods are not implemented for bitset (a plain-integer rotate is sketched after the code below, for comparison).

I now have a prototype which seems to work fine, implemented from the pseudocode on Wikipedia:

#include<iostream>
#include<string>
#include<sstream>
#include<vector>
#include<iomanip>
#include<bitset>

const unsigned int h[8] ={
0x6a09e667, 0xbb67ae85, 0x3c6ef372, 0xa54ff53a, 0x510e527f, 0x9b05688c, 0x1f83d9ab, 0x5be0cd19};

const unsigned int k[64] ={
0x428a2f98, 0x71374491, 0xb5c0fbcf, 0xe9b5dba5, 0x3956c25b, 0x59f111f1, 0x923f82a4, 0xab1c5ed5,
0xd807aa98, 0x12835b01, 0x243185be, 0x550c7dc3, 0x72be5d74, 0x80deb1fe, 0x9bdc06a7, 0xc19bf174,
0xe49b69c1, 0xefbe4786, 0x0fc19dc6, 0x240ca1cc, 0x2de92c6f, 0x4a7484aa, 0x5cb0a9dc, 0x76f988da,
0x983e5152, 0xa831c66d, 0xb00327c8, 0xbf597fc7, 0xc6e00bf3, 0xd5a79147, 0x06ca6351, 0x14292967,
0x27b70a85, 0x2e1b2138, 0x4d2c6dfc, 0x53380d13, 0x650a7354, 0x766a0abb, 0x81c2c92e, 0x92722c85,
0xa2bfe8a1, 0xa81a664b, 0xc24b8b70, 0xc76c51a3, 0xd192e819, 0xd6990624, 0xf40e3585, 0x106aa070,
0x19a4c116, 0x1e376c08, 0x2748774c, 0x34b0bcb5, 0x391c0cb3, 0x4ed8aa4a, 0x5b9cca4f, 0x682e6ff3,
0x748f82ee, 0x78a5636f, 0x84c87814, 0x8cc70208, 0x90befffa, 0xa4506ceb, 0xbef9a3f7, 0xc67178f2};

template<size_t M>
struct bit_op
{
template<size_t N>
static std::bitset<N> &RoL(std::bitset<N> &data)
{
std::bitset<N> leftShift = data<<M;
data>>=(N-M);
data |= leftShift;
return data;
}

template<size_t N>
static std::bitset<N> &RoR(std::bitset<N> &data)
{
std::bitset<N> rightShift = data>>M;
data<<=(N-M);
data |= rightShift;
return data;
}

template<size_t N>
static std::bitset<N> roL(const std::bitset<N> &data)
{
std::bitset<N> temp(data);
return RoL(temp);
}

template<size_t N>
static std::bitset<N> roR(const std::bitset<N> &data)
{
std::bitset<N> temp(data);
return RoR(temp);
}
};

void clean(std::vector<std::bitset<32> > &chunk)
{
for(int i(0); i< chunk.size(); ++i)
{
chunk[i].reset();
}
}

int getChunck(std::istream &theStream, std::vector<std::bitset<32> > &chunk)
{
clean(chunk);
int Count =0;
while(Count < 512/8) // 64 bytes per chunk
{
char c;
if(!theStream.get(c)) // stop at end of input; also keeps any NUL bytes
break;
std::bitset<32> data = ((unsigned long)(unsigned char)c); // cast via unsigned char to avoid sign extension
data<<=((3-(Count & 3))*8); // big-endian byte order within each 32-bit word
chunk[(Count>>2)] |= data;
++Count;
}
return Count;
}

void process(std::vector<std::bitset<32> > &w,std::vector<std::bitset<32> >&hvec)
{
for (int i(16); i< 64; ++i)
{
std::bitset<32> s0 = (bit_op<7>::roR(w[i-15])) ^ (bit_op<18>::roR(w[i-15])) ^ (w[i-15]>>3);
std::bitset<32> s1 = (bit_op<17>::roR(w[i-2])) ^ (bit_op<19>::roR(w[i-2])) ^ (w[i-2]>>10);
w[i] = std::bitset<32>(w[i-16].to_ulong() + s0.to_ulong() + w[i-7].to_ulong() + s1.to_ulong());
}
std::bitset<32> a = hvec[0];
std::bitset<32> b = hvec[1];
std::bitset<32> c = hvec[2];
std::bitset<32> d = hvec[3];
std::bitset<32> e = hvec[4];
std::bitset<32> f = hvec[5];
std::bitset<32> g = hvec[6];
std::bitset<32> h = hvec[7];

for(int i(0); i< 64; ++i)
{
std::bitset<32> S0 = (bit_op<2>::roR(a)) ^ (bit_op<13>::roR(a)) ^ (bit_op<22>::roR(a));
std::bitset<32> maj = (a & b) ^ (a & c) ^ (b & c);
std::bitset<32> t2(S0.to_ulong() + maj.to_ulong());
std::bitset<32> S1 = (bit_op<6>::roR(e)) ^ (bit_op<11>::roR(e)) ^ (bit_op<25>::roR(e));
std::bitset<32> ch = (e & f) ^ ((~e) & g);
std::bitset<32> t1 = h.to_ulong() + S1.to_ulong() + ch.to_ulong() + k[i] + w[i].to_ulong();

h = g;
g = f;
f = e;
e = d.to_ulong() + t1.to_ulong();
d = c;
c = b;
b = a;
a = t1.to_ulong() + t2.to_ulong() ;
}
hvec[0] = hvec[0].to_ulong() + a.to_ulong();
hvec[1] = hvec[1].to_ulong() + b.to_ulong();
hvec[2] = hvec[2].to_ulong() + c.to_ulong();
hvec[3] = hvec[3].to_ulong() + d.to_ulong();
hvec[4] = hvec[4].to_ulong() + e.to_ulong();
hvec[5] = hvec[5].to_ulong() + f.to_ulong();
hvec[6] = hvec[6].to_ulong() + g.to_ulong();
hvec[7] = hvec[7].to_ulong() + h.to_ulong();
}

int main(int argc, char* argv[])
{
if(argc<2)
{
std::cout <<"Need a string to hash";
return -1;
}
std::string theTestString = argv[1] ;
for(int i(2); i< argc; ++i)
{
theTestString+=" ";
theTestString+= argv[i];
}
std::istringstream theStream(theTestString);
std::vector<std::bitset<32> > hvec(8);

for(int i(0); i< 8; ++i)
{
hvec[i] = h[i];
}

std::vector<std::bitset<32> > chunk(64);
bool finished = false;
unsigned int byteCount = 0;
unsigned int chunkCount = 0;
int count =0;
while((count=getChunck(theStream,chunk))==64)
{
process(chunk,hvec);
byteCount+=count;
}
byteCount+=count;
int chunkIndex = 0;
int byteIndex = 0;
if(count !=0)
{
chunkIndex = count/4;
byteIndex = count - (count/4)*4;
}
chunk[chunkIndex]|= (1ul<<((4-byteIndex)*8-1)); // append the single '1' padding bit (0x80 byte); 1ul avoids signed overflow at bit 31
std::bitset<64> length(byteCount);
length<<=3;
unsigned long lower = (length & ((std::bitset<64>().set())>>32)).to_ulong();
unsigned long upper = (length>>32).to_ulong(); // shift the high word down so it survives the later narrowing to bitset<32>

if(chunkIndex > 13)
{
process(chunk,hvec);
clean(chunk);
}
chunk[14]|=upper;
chunk[15]|=lower;
process(chunk,hvec);

std::vector<unsigned char> theComputed;
for(int i(0); i< 8; ++i)
{
for(int j(0); j<4; ++j)
{
std::cout << std::hex << ((hvec[i]>>((3-j)*8))&std::bitset<32>(255)).to_ulong() <<"-";
}
}
}
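
[For comparison with the bit_op templates above, here is a minimal sketch, not from the thread, of the same rotation done on a plain 32-bit unsigned integer, which is what SHA-256 implementations usually operate on; rotr32 and S1 are illustrative helper names:]

#include <stdint.h>

// Rotate a 32-bit word right by n bits (valid for 0 < n < 32).
// Compilers typically recognise this two-shift idiom and emit a single ror instruction.
inline uint32_t rotr32(uint32_t x, unsigned n)
{
    return (x >> n) | (x << (32 - n));
}

// SHA-256's S1 function written with a plain integer,
// the counterpart of bit_op<6/11/25>::roR in the code above:
inline uint32_t S1(uint32_t e)
{
    return rotr32(e, 6) ^ rotr32(e, 11) ^ rotr32(e, 25);
}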
 
J

Johannes Bauer

Just make sure to compile with optimizations on:

Oh well, I'm not *that* stupid:

[~/tmp]: g++ -Wall -O3 x.cc -o x
[~/tmp]: objdump --demangle -d x | grep call | grep bitset | grep '::set'
400ceb: e8 60 01 00 00 callq 400e50
<std::bitset<64ul>::set(unsigned long, bool)>
400d1b: e8 30 01 00 00 callq 400e50
<std::bitset<64ul>::set(unsigned long, bool)>
[~/tmp]: g++ -dumpversion
4.6.2

It is definitely not inlined even though I specified -O3.

Regards,
Johannes

 
J

Johannes Bauer

Typical C++ compilers do a decent job of inlining when optimisation is
enabled. Both compilers I tried (g++ and Sun CC) inline the original
example (which was too mangled in your reply to re-quote).

Which version did you try? I used 4.6.2 on Linux x86_64 with -O3 and got
no inlining.

Best regards,
Johannes

 
N

Ninds

Which version did you try? I used 4.6.2 on Linux x86_64 with -O3 and got no inlining.

Best regards,
Johannes

Is it possible that:
1. For your case, on x86_64 there is no need for inlining, since bitset<64> degenerates exactly to a native int?
2. The same case on a 32-bit machine would require a call, since the set op is no longer atomic?
3. For bitset<128> your code makes a call?

N
 
J

Johannes Bauer

On 29.11.2012 15:58, Ninds wrote:

[Quoting chaos]

As others have mentioned, PLEASE take care of the quoting mess that your
newsreader produces. It makes your messages hard to decipher.
Is it possible that:
1. For your case, on x86_64 there is no need for inlining, since bitset<64> degenerates exactly to a native int?

Well, a bitset<64> in my example is 8 bytes wide while a native int is 4
bytes. But even if it did degenerate exactly into a native int, it would
still make a lot of sense for the compiler to inline the code in order to
simplify it and get rid of the call overhead.

2. The same case on a 32-bit machine would require a call, since the set op is no longer atomic?

This doesn't make sense. How does atomicity come into play here? There's
no guarantee that a bitset alters the bits atomically.

3. For bitset<128> your code makes a call?

Yes.

Regards,
Johannes
 
J

Johannes Bauer

Try with g++ 4.7 and -Ofast..

-Ofast doesn't change anything with g++ 4.6.2, underlining the fact
that your "Just make sure to compile with optimizations on" was
ill-advised. Apparently older g++ versions don't inline this construct,
for whatever reason.

Regards,
Johannes


 
N

Ninds

This doesn't make sense. How does atomicity come into play here? There's no guarantee that a bitset alters the bits atomically.
Perhaps I was a bit unclear.
What I meant was that if the native int is 32 bits, then it is plausible that the compiler generates different code for the set method of bitset<32> than for, say, bitset<33>. I just mean that for bitset<32> the setting of all the bits is atomic (I accept it's an ill-chosen term), whereas for bitset<33> it's not.
Moreover, for similar reasons, the code generated for the set method of bitset<64> on x86_64 may be quite different from the code generated for it on a 32-bit machine.
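
[The storage difference is easy to check. A sketch, not from the thread; the sizes in the comments assume a typical libstdc++, which stores a bitset in whole unsigned long words, and are implementation-defined:]

#include <bitset>
#include <iostream>

int main()
{
    std::cout << sizeof(std::bitset<32>) << '\n'; // typically 4 on x86-32, 8 on x86_64
    std::cout << sizeof(std::bitset<33>) << '\n'; // typically 8 on both
    std::cout << sizeof(std::bitset<64>) << '\n'; // typically 8 on both
    std::cout << sizeof(std::bitset<65>) << '\n'; // typically 12 on x86-32, 16 on x86_64
    return 0;
}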
 
M

Marc

Ninds said:
I would like to know whether using bitset<32> for bit operations on a
32-bit machine would generally be as efficient as using a 32-bit int.
Moreover, if the answer is yes, would it also hold for bitset<64> and
a 64-bit int on a 64-bit arch?
I realise the standard says nothing about the implementation, so there
is no definitive answer, but what is 'likely' to be the case?

When bitset does the same thing as you would do with an integer (say
with operator| for instance), you can expect roughly the same code in
the end. But if you use .set(i), the standard requires that bitset test
whether i is in the range and throw otherwise. If you were not going to
do such a test with integers, that will make bitset slower.

If you are concerned about the generated code, you will have to look
through it for anything suspicious. Note that if you can't read asm,
with gcc you can get a first idea of what optimizations happen by looking
at the file produced by -fdump-tree-optimized (it looks like C code).
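
[A small sketch, not from the thread, illustrating the range check Marc describes: set(pos) must throw std::out_of_range for an invalid position, while operator[] and the whole-set bitwise operators do no per-bit checking:]

#include <bitset>
#include <iostream>
#include <stdexcept>

int main()
{
    std::bitset<64> b;

    b.set(3);    // range-checked: would throw std::out_of_range for an index >= 64
    b[5] = true; // operator[] is unchecked, like bit-twiddling on a plain integer

    try {
        b.set(64); // out of range: the standard requires this to throw
    } catch (const std::out_of_range &) {
        std::cerr << "set(64) threw, as required\n";
    }

    // Whole-bitset operations map directly onto word operations:
    b |= std::bitset<64>(0xff); // typically compiles to a single 64-bit OR
    std::cout << b.count() << '\n';
    return 0;
}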
 
J

Johannes Bauer

That is not true. A version of gcc older than yours does inlining when
minimal optimizations are turned on:

Hmm, in any case, this shows that the OP should look closely, as it
apparently depends on some more factors -- I'm really curious why the
optimization is not happening on my system, actually. Since you're on
x86-32, I tried to compile with "-m32", but still I cannot get my g++ to
optimize the 'set' call. Weird. Output of g++/uname at bottom.

Best regards,
Johannes

[~/tmp]: LC_ALL=C g++ -v
Using built-in specs.
COLLECT_GCC=/usr/x86_64-pc-linux-gnu/gcc-bin/4.6.2/g++
COLLECT_LTO_WRAPPER=/usr/libexec/gcc/x86_64-pc-linux-gnu/4.6.2/lto-wrapper
Target: x86_64-pc-linux-gnu
Configured with:
/opt/tmp-portage/portage/sys-devel/gcc-4.6.2/work/gcc-4.6.2/configure
--prefix=/usr --bindir=/usr/x86_64-pc-linux-gnu/gcc-bin/4.6.2
--includedir=/usr/lib/gcc/x86_64-pc-linux-gnu/4.6.2/include
--datadir=/usr/share/gcc-data/x86_64-pc-linux-gnu/4.6.2
--mandir=/usr/share/gcc-data/x86_64-pc-linux-gnu/4.6.2/man
--infodir=/usr/share/gcc-data/x86_64-pc-linux-gnu/4.6.2/info
--with-gxx-include-dir=/usr/lib/gcc/x86_64-pc-linux-gnu/4.6.2/include/g++-v4
--host=x86_64-pc-linux-gnu --build=x86_64-pc-linux-gnu --disable-altivec
--disable-fixed-point --without-ppl --without-cloog --enable-lto
--enable-nls --without-included-gettext --with-system-zlib
--disable-werror --enable-secureplt --enable-multilib
--disable-libmudflap --disable-libssp --enable-libgomp
--with-python-dir=/share/gcc-data/x86_64-pc-linux-gnu/4.6.2/python
--enable-checking=release --enable-java-awt=gtk
--enable-languages=c,c++,java,fortran --enable-shared
--enable-threads=posix --enable-__cxa_atexit --enable-clocale=gnu
--enable-targets=all --with-bugurl=http://bugs.gentoo.org/
--with-pkgversion='Gentoo 4.6.2 p1.4, pie-0.5.0'
Thread model: posix
gcc version 4.6.2 (Gentoo 4.6.2 p1.4, pie-0.5.0)

[~/tmp]: uname -a
Linux joequad 3.3.7 #6 SMP PREEMPT Wed Nov 28 14:59:17 CET 2012 x86_64
Intel(R) Core(TM)2 Quad CPU Q9550 @ 2.83GHz GenuineIntel GNU/Linux


 
W

W Karas

In your g++ 4.6 try to increase -finline-limit to an appropriate value:

:~$ g++-4.6 -O1 -c x.cc -o x
:~$ objdump --demangle -d x | grep call | grep bitset | grep '::set'
f: e8 00 00 00 00 callq 14 <std::bitset<64ul>::set(unsigned long, bool)+0x14>

:~$ g++-4.6 -O1 -finline-limit=26 -c x.cc -o x
:~$ objdump --demangle -d x | grep call | grep bitset | grep '::set'

:~$ g++-4.6 -O1 -finline-limit=25 -c x.cc -o x
:~$ objdump --demangle -d x | grep call | grep bitset | grep '::set'
f: e8 00 00 00 00 callq 14 <std::bitset<64ul>::set(unsigned long, bool)+0x14>
17: e8 00 00 00 00 callq 1c <std::bitset<64ul>::set(unsigned long, bool)+0x1c>

I would get rid of this compile option. -O, even at the minimum level, should enable inlining. There should instead be an option specifying the cumulative number of additional instructions that recursive inlining of a call may add before the recursion is stopped. The default for this should generally be zero, since any nominal saving in executed instructions from code-growing inlining can be more than offset by a larger cache working set. A configurable limit on the depth of inlining is desirable, but it should only apply when direct or indirect inlining of the same function is detected.
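
[If one wants a guarantee rather than a tunable heuristic, GCC and Clang also accept a per-function attribute that overrides the inlining cost model; a sketch, not from the thread:]

// GCC/Clang-specific: always_inline makes the compiler inline the call
// regardless of its cost model or the -finline-limit setting.
__attribute__((always_inline)) inline void set_bit(unsigned long &word, unsigned i)
{
    word |= 1ul << i; // the unchecked equivalent of bitset::set(i)
}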
 
