bitset<32> and bitset<64> efficiency

Discussion in 'C++' started by Ninds, Nov 28, 2012.

  1. Ninds

    Ninds Guest

    Hi,

    I would like to know whether using bitset<32> for bit operations on a 32bit machine would generally be as efficient as using a 32bit int.
    Moreover, if the answer is yes would it also hold for bitset<64> and 64bit int on a 64 bit arch.
    I realise the standard says nothing about the implementation so there is no definitive answer but what is 'likely' to be the case ?


    Thanks
    N
     
    Ninds, Nov 28, 2012
    #1
    1. Advertising

  2. On 28.11.2012 09:58, Ninds wrote:

    > I would like to know whether using bitset<32> for bit operations on a 32bit machine would generally be as efficient as using a 32bit int.
    > Moreover, if the answer is yes would it also hold for bitset<64> and 64bit int on a 64 bit arch.
    > I realise the standard says nothing about the implementation so there is no definitive answer but what is 'likely' to be the case ?


    Why don't you try it out for your constellation? I just tried:

    #include <bitset>
    #include <string>
    #include <iostream>

    int main(int argc, char ** argv) {
    std::bitset<64> i, j;
    std::cerr << sizeof(i) << std::endl;
    __asm__ __volatile__("nop");
    i.set(argc);
    __asm__ __volatile__("nop");
    std::cerr << i << std::endl;
    __asm__ __volatile__("nop");
    j.set(64 - argc);
    __asm__ __volatile__("nop");
    std::cerr << j << std::endl;
    __asm__ __volatile__("nop");
    i ^= j;
    __asm__ __volatile__("nop");
    std::cerr << i << std::endl;
    return 0;
    }

    on Linux x86_64 with g++ 4.6.2. The "set" always results in a call, but
    the xor is done as if it were a int, i.e.:

    // First set
    400ce0: ba 01 00 00 00 mov $0x1,%edx
    400ce5: 48 63 f3 movslq %ebx,%rsi
    400ce8: 48 89 e7 mov %rsp,%rdi
    400ceb: e8 60 01 00 00 callq 400e50
    <std::bitset<64ul>::set(unsigned long, bool)>

    [...]
    // Second set
    400d07: be 40 00 00 00 mov $0x40,%esi
    400d0c: 48 8d 7c 24 10 lea 0x10(%rsp),%rdi
    400d11: ba 01 00 00 00 mov $0x1,%edx
    400d16: 29 de sub %ebx,%esi
    400d18: 48 63 f6 movslq %esi,%rsi
    400d1b: e8 30 01 00 00 callq 400e50
    <std::bitset<64ul>::set(unsigned long, bool)>

    [...]
    // XOR operation
    400d39: 48 8b 44 24 10 mov 0x10(%rsp),%rax
    400d3e: 48 31 04 24 xor %rax,(%rsp)

    A bit curious, I'd have thought "set" would get inlined/optimized away,
    but YMMV.

    Best regards,
    Johannes

    --
    >> Wo hattest Du das Beben nochmal GENAU vorhergesagt?

    > Zumindest nicht öffentlich!

    Ah, der neueste und bis heute genialste Streich unsere großen
    Kosmologen: Die Geheim-Vorhersage.
    - Karl Kaos über Rüdiger Thomas in dsa <hidbv3$om2$>
     
    Johannes Bauer, Nov 28, 2012
    #2
    1. Advertising

  3. Ninds

    W Karas Guest

    On Wednesday, November 28, 2012 5:02:55 AM UTC-5, Johannes Bauer wrote:
    > On 28.11.2012 09:58, Ninds wrote:
    >
    >
    >
    > > I would like to know whether using bitset<32> for bit operations on a 32bit machine would generally be as efficient as using a 32bit int.

    >
    > > Moreover, if the answer is yes would it also hold for bitset<64> and 64bit int on a 64 bit arch.

    >
    > > I realise the standard says nothing about the implementation so there is no definitive answer but what is 'likely' to be the case ?

    >
    >
    >
    > Why don't you try it out for your constellation? I just tried:
    >
    >
    >
    > #include <bitset>
    >
    > #include <string>
    >
    > #include <iostream>
    >
    >
    >
    > int main(int argc, char ** argv) {
    >
    > std::bitset<64> i, j;
    >
    > std::cerr << sizeof(i) << std::endl;
    >
    > __asm__ __volatile__("nop");
    >
    > i.set(argc);
    >
    > __asm__ __volatile__("nop");
    >
    > std::cerr << i << std::endl;
    >
    > __asm__ __volatile__("nop");
    >
    > j.set(64 - argc);
    >
    > __asm__ __volatile__("nop");
    >
    > std::cerr << j << std::endl;
    >
    > __asm__ __volatile__("nop");
    >
    > i ^= j;
    >
    > __asm__ __volatile__("nop");
    >
    > std::cerr << i << std::endl;
    >
    > return 0;
    >
    > }
    >
    >
    >
    > on Linux x86_64 with g++ 4.6.2. The "set" always results in a call, but
    >
    > the xor is done as if it were a int, i.e.:
    >
    >
    >
    > // First set
    >
    > 400ce0: ba 01 00 00 00 mov $0x1,%edx
    >
    > 400ce5: 48 63 f3 movslq %ebx,%rsi
    >
    > 400ce8: 48 89 e7 mov %rsp,%rdi
    >
    > 400ceb: e8 60 01 00 00 callq 400e50
    >
    > <std::bitset<64ul>::set(unsigned long, bool)>
    >
    >
    >
    > [...]
    >
    > // Second set
    >
    > 400d07: be 40 00 00 00 mov $0x40,%esi
    >
    > 400d0c: 48 8d 7c 24 10 lea 0x10(%rsp),%rdi
    >
    > 400d11: ba 01 00 00 00 mov $0x1,%edx
    >
    > 400d16: 29 de sub %ebx,%esi
    >
    > 400d18: 48 63 f6 movslq %esi,%rsi
    >
    > 400d1b: e8 30 01 00 00 callq 400e50
    >
    > <std::bitset<64ul>::set(unsigned long, bool)>
    >
    >
    >
    > [...]
    >
    > // XOR operation
    >
    > 400d39: 48 8b 44 24 10 mov 0x10(%rsp),%rax
    >
    > 400d3e: 48 31 04 24 xor %rax,(%rsp)
    >
    >
    >
    > A bit curious, I'd have thought "set" would get inlined/optimized away,
    >
    > but YMMV.
    >
    >
    >
    > Best regards,
    >
    > Johannes


    It seems that typical C++ compilers will often fail to inline, even when doing so would result in object code that was BOTH smaller and faster. It's a very frustrating aspect of using C++. Can anyone comment on why this is the case? In the case of GCC, I suppose one cannot look a gift-horse in the mouth. But the problem seems to exist with compilers which must be licensed at significant cost as well.
     
    W Karas, Nov 29, 2012
    #3
  4. Ninds

    Ian Collins Guest

    W Karas wrote:

    Please clean up the mess google makes of your replies!

    > It seems that typical C++ compilers will often fail to inline, even
    > when doing so would result in object code that was BOTH smaller and
    > faster. It's a very frustrating aspect of using C++. Can anyone
    > comment on why this is the case? In the case of GCC, I suppose one
    > cannot look a gift-horse in the mouth. But the problem seems to
    > exist with compilers which must be licensed at significant cost as
    > well.


    Typical C++ compilers do a decent job of inlining when optimisation is
    enabled. Both compilers I tried (g++ and Sun CC) inline the original
    example (which was too mangled in your reply to re-quote).

    --
    Ian Collins
     
    Ian Collins, Nov 29, 2012
    #4
  5. Ninds

    Ninds Guest

    On Wednesday, 28 November 2012 08:58:59 UTC, Ninds wrote:
    > Hi,
    >
    >
    >
    > I would like to know whether using bitset<32> for bit operations on a 32bit machine would generally be as efficient as using a 32bit int.
    >
    > Moreover, if the answer is yes would it also hold for bitset<64> and 64bit int on a 64 bit arch.
    >
    > I realise the standard says nothing about the implementation so there is no definitive answer but what is 'likely' to be the case ?
    >
    >
    >
    >
    >
    > Thanks
    >
    > N


    Hi,
    Thanks for the replies .. but I am afraid I am none the wiser. To put my question in context:
    I wanted to write a small concise bit of code to compute the SHA256 hash that would be portable (hence free of asm) and isolated. I realised from the outset that with these constraints it wouldn't be the speediest kid on the block but I didn't want it to be ridiculously inefficient either. I couldn't think of a cleaner solution that using std::bitset this way I didn't needto concern myself with the underlying architecture and at the same time I would be working with the language rather than against it.

    I now have a prototype which seems to work fine ans is simply C++ implementation of the pseudo code on Wikipedia:

    #include<iostream>
    #include<string>
    #include<sstream>
    #include<vector>
    #include<iomanip>
    #include<bitset>

    const unsigned int h[8] ={
    0x6a09e667, 0xbb67ae85, 0x3c6ef372, 0xa54ff53a, 0x510e527f, 0x9b05688c, 0x1f83d9ab, 0x5be0cd19};

    const unsigned int k[64] ={
    0x428a2f98, 0x71374491, 0xb5c0fbcf, 0xe9b5dba5, 0x3956c25b, 0x59f111f1, 0x923f82a4, 0xab1c5ed5,
    0xd807aa98, 0x12835b01, 0x243185be, 0x550c7dc3, 0x72be5d74, 0x80deb1fe, 0x9bdc06a7, 0xc19bf174,
    0xe49b69c1, 0xefbe4786, 0x0fc19dc6, 0x240ca1cc, 0x2de92c6f, 0x4a7484aa, 0x5cb0a9dc, 0x76f988da,
    0x983e5152, 0xa831c66d, 0xb00327c8, 0xbf597fc7, 0xc6e00bf3, 0xd5a79147, 0x06ca6351, 0x14292967,
    0x27b70a85, 0x2e1b2138, 0x4d2c6dfc, 0x53380d13, 0x650a7354, 0x766a0abb, 0x81c2c92e, 0x92722c85,
    0xa2bfe8a1, 0xa81a664b, 0xc24b8b70, 0xc76c51a3, 0xd192e819, 0xd6990624, 0xf40e3585, 0x106aa070,
    0x19a4c116, 0x1e376c08, 0x2748774c, 0x34b0bcb5, 0x391c0cb3, 0x4ed8aa4a, 0x5b9cca4f, 0x682e6ff3,
    0x748f82ee, 0x78a5636f, 0x84c87814, 0x8cc70208, 0x90befffa, 0xa4506ceb, 0xbef9a3f7, 0xc67178f2};

    template<size_t M>
    struct bit_op
    {
    template<size_t N>
    static std::bitset<N> &RoL(std::bitset<N> &data)
    {
    std::bitset<N> leftShift = data<<M;
    data>>=(N-M);
    data |= leftShift;
    return data;
    }

    template<size_t N>
    static std::bitset<N> &RoR(std::bitset<N> &data)
    {
    std::bitset<N> rightShift = data>>M;
    data<<=(N-M);
    data |= rightShift;
    return data;
    }

    template<size_t N>
    static std::bitset<N> roL(const std::bitset<N> &data)
    {
    std::bitset<N> temp(data);
    return RoL(temp);
    }

    template<size_t N>
    static std::bitset<N> roR(const std::bitset<N> &data)
    {
    std::bitset<N> temp(data);
    return RoR(temp);
    }
    };

    void clean(std::vector<std::bitset<32> > &chunk)
    {
    for(int i(0); i< chunk.size(); ++i)
    {
    chunk.reset();
    }
    }

    int getChunck(std::istream &theStream, std::vector<std::bitset<32> > &chunk)
    {
    clean(chunk);
    int Count =0;
    while(!theStream.eof() && Count < 512/8)
    {
    char c=0;
    theStream.get(c);
    if(c!=0)
    {
    std::bitset<32> data = ((unsigned long )c);
    data<<=((3-(Count & 3))*8);
    chunk[(Count>>2)] |= data;
    ++Count;
    }
    }
    return Count;
    }

    void process(std::vector<std::bitset<32> > &w,std::vector<std::bitset<32> >&hvec)
    {
    for (int i(16); i< 64; ++i)
    {
    std::bitset<32> s0 = (bit_op<7>::roR(w[i-15])) ^ (bit_op<18>::roR(w[i-15])) ^ (w[i-15]>>3);
    std::bitset<32> s1 = (bit_op<17>::roR(w[i-2])) ^ (bit_op<19>::roR(w[i-2])) ^ (w[i-2]>>10);
    w = std::bitset<32>(w[i-16].to_ulong() + s0.to_ulong() + w[i-7].to_ulong() + s1.to_ulong());
    }
    std::bitset<32> a = hvec[0];
    std::bitset<32> b = hvec[1];
    std::bitset<32> c = hvec[2];
    std::bitset<32> d = hvec[3];
    std::bitset<32> e = hvec[4];
    std::bitset<32> f = hvec[5];
    std::bitset<32> g = hvec[6];
    std::bitset<32> h = hvec[7];

    for(int i(0); i< 64; ++i)
    {
    std::bitset<32> S0 = (bit_op<2>::roR(a)) ^ (bit_op<13>::roR(a)) ^ (bit_op<22>::roR(a));
    std::bitset<32> maj = (a & b) ^ (a & c) ^ (b & c);
    std::bitset<32> t2(S0.to_ulong() + maj.to_ulong());
    std::bitset<32> S1 = (bit_op<6>::roR(e)) ^ (bit_op<11>::roR(e)) ^ (bit_op<25>::roR(e));
    std::bitset<32> ch = (e & f) ^ ((~e) & g);
    std::bitset<32> t1 = h.to_ulong() + S1.to_ulong() + ch.to_ulong() + k + w.to_ulong() ;

    h = g;
    g = f;
    f = e;
    e = d.to_ulong() + t1.to_ulong();
    d = c;
    c = b;
    b = a;
    a = t1.to_ulong() + t2.to_ulong() ;
    }
    hvec[0] = hvec[0].to_ulong() + a.to_ulong();
    hvec[1] = hvec[1].to_ulong() + b.to_ulong();
    hvec[2] = hvec[2].to_ulong() + c.to_ulong();
    hvec[3] = hvec[3].to_ulong() + d.to_ulong();
    hvec[4] = hvec[4].to_ulong() + e.to_ulong();
    hvec[5] = hvec[5].to_ulong() + f.to_ulong();
    hvec[6] = hvec[6].to_ulong() + g.to_ulong();
    hvec[7] = hvec[7].to_ulong() + h.to_ulong();
    }
     
    Ninds, Nov 29, 2012
    #5
  6. Ninds

    Ninds Guest

    On Wednesday, 28 November 2012 08:58:59 UTC, Ninds wrote:
    > Hi,
    >
    >
    >
    > I would like to know whether using bitset<32> for bit operations on a 32bit machine would generally be as efficient as using a 32bit int.
    >
    > Moreover, if the answer is yes would it also hold for bitset<64> and 64bit int on a 64 bit arch.
    >
    > I realise the standard says nothing about the implementation so there is no definitive answer but what is 'likely' to be the case ?
    >
    >
    >
    >
    >
    > Thanks
    >
    > N

    Hi,
    Thanks for the replies .. but I am afraid I am none the wiser. To put my question in context:
    I wanted to write a small concise bit of code to compute the SHA256 hash that would be portable (hence free of asm) and isolated. I realised from the outset that with these constraints it wouldn't be the speediest kid on the block but I didn't want it to be ridiculously inefficient either. I couldn't think of a cleaner solution that using std::bitset this way I didn't needto concern myself with the underlying architecture and at the same time I would be working with the language rather than against it.
    I am however curious as to why rotations methods are not implemented for bitset.

    I now have a prototype which seems to work fine implemented from the pseudocode on Wikipedia.

    #include<iostream>
    #include<string>
    #include<sstream>
    #include<vector>
    #include<iomanip>
    #include<bitset>

    const unsigned int h[8] ={
    0x6a09e667, 0xbb67ae85, 0x3c6ef372, 0xa54ff53a, 0x510e527f, 0x9b05688c, 0x1f83d9ab, 0x5be0cd19};

    const unsigned int k[64] ={
    0x428a2f98, 0x71374491, 0xb5c0fbcf, 0xe9b5dba5, 0x3956c25b, 0x59f111f1, 0x923f82a4, 0xab1c5ed5,
    0xd807aa98, 0x12835b01, 0x243185be, 0x550c7dc3, 0x72be5d74, 0x80deb1fe, 0x9bdc06a7, 0xc19bf174,
    0xe49b69c1, 0xefbe4786, 0x0fc19dc6, 0x240ca1cc, 0x2de92c6f, 0x4a7484aa, 0x5cb0a9dc, 0x76f988da,
    0x983e5152, 0xa831c66d, 0xb00327c8, 0xbf597fc7, 0xc6e00bf3, 0xd5a79147, 0x06ca6351, 0x14292967,
    0x27b70a85, 0x2e1b2138, 0x4d2c6dfc, 0x53380d13, 0x650a7354, 0x766a0abb, 0x81c2c92e, 0x92722c85,
    0xa2bfe8a1, 0xa81a664b, 0xc24b8b70, 0xc76c51a3, 0xd192e819, 0xd6990624, 0xf40e3585, 0x106aa070,
    0x19a4c116, 0x1e376c08, 0x2748774c, 0x34b0bcb5, 0x391c0cb3, 0x4ed8aa4a, 0x5b9cca4f, 0x682e6ff3,
    0x748f82ee, 0x78a5636f, 0x84c87814, 0x8cc70208, 0x90befffa, 0xa4506ceb, 0xbef9a3f7, 0xc67178f2};

    template<size_t M>
    struct bit_op
    {
    template<size_t N>
    static std::bitset<N> &RoL(std::bitset<N> &data)
    {
    std::bitset<N> leftShift = data<<M;
    data>>=(N-M);
    data |= leftShift;
    return data;
    }

    template<size_t N>
    static std::bitset<N> &RoR(std::bitset<N> &data)
    {
    std::bitset<N> rightShift = data>>M;
    data<<=(N-M);
    data |= rightShift;
    return data;
    }

    template<size_t N>
    static std::bitset<N> roL(const std::bitset<N> &data)
    {
    std::bitset<N> temp(data);
    return RoL(temp);
    }

    template<size_t N>
    static std::bitset<N> roR(const std::bitset<N> &data)
    {
    std::bitset<N> temp(data);
    return RoR(temp);
    }
    };

    void clean(std::vector<std::bitset<32> > &chunk)
    {
    for(int i(0); i< chunk.size(); ++i)
    {
    chunk.reset();
    }
    }

    int getChunck(std::istream &theStream, std::vector<std::bitset<32> > &chunk)
    {
    clean(chunk);
    int Count =0;
    while(!theStream.eof() && Count < 512/8)
    {
    char c=0;
    theStream.get(c);
    if(c!=0)
    {
    std::bitset<32> data = ((unsigned long )c);
    data<<=((3-(Count & 3))*8);
    chunk[(Count>>2)] |= data;
    ++Count;
    }
    }
    return Count;
    }

    void process(std::vector<std::bitset<32> > &w,std::vector<std::bitset<32> >&hvec)
    {
    for (int i(16); i< 64; ++i)
    {
    std::bitset<32> s0 = (bit_op<7>::roR(w[i-15])) ^ (bit_op<18>::roR(w[i-15])) ^ (w[i-15]>>3);
    std::bitset<32> s1 = (bit_op<17>::roR(w[i-2])) ^ (bit_op<19>::roR(w[i-2])) ^ (w[i-2]>>10);
    w = std::bitset<32>(w[i-16].to_ulong() + s0.to_ulong() + w[i-7].to_ulong() + s1.to_ulong());
    }
    std::bitset<32> a = hvec[0];
    std::bitset<32> b = hvec[1];
    std::bitset<32> c = hvec[2];
    std::bitset<32> d = hvec[3];
    std::bitset<32> e = hvec[4];
    std::bitset<32> f = hvec[5];
    std::bitset<32> g = hvec[6];
    std::bitset<32> h = hvec[7];

    for(int i(0); i< 64; ++i)
    {
    std::bitset<32> S0 = (bit_op<2>::roR(a)) ^ (bit_op<13>::roR(a)) ^ (bit_op<22>::roR(a));
    std::bitset<32> maj = (a & b) ^ (a & c) ^ (b & c);
    std::bitset<32> t2(S0.to_ulong() + maj.to_ulong());
    std::bitset<32> S1 = (bit_op<6>::roR(e)) ^ (bit_op<11>::roR(e)) ^ (bit_op<25>::roR(e));
    std::bitset<32> ch = (e & f) ^ ((~e) & g);
    std::bitset<32> t1 = h.to_ulong() + S1.to_ulong() + ch.to_ulong() + k + w.to_ulong() ;

    h = g;
    g = f;
    f = e;
    e = d.to_ulong() + t1.to_ulong();
    d = c;
    c = b;
    b = a;
    a = t1.to_ulong() + t2.to_ulong() ;
    }
    hvec[0] = hvec[0].to_ulong() + a.to_ulong();
    hvec[1] = hvec[1].to_ulong() + b.to_ulong();
    hvec[2] = hvec[2].to_ulong() + c.to_ulong();
    hvec[3] = hvec[3].to_ulong() + d.to_ulong();
    hvec[4] = hvec[4].to_ulong() + e.to_ulong();
    hvec[5] = hvec[5].to_ulong() + f.to_ulong();
    hvec[6] = hvec[6].to_ulong() + g.to_ulong();
    hvec[7] = hvec[7].to_ulong() + h.to_ulong();
    }

    int main(int argc, char* argv[])
    {
    if(argc<2)
    {
    std::cout <<"Need a string to hash";
    return -1;
    }
    std::string theTestString = argv[1] ;
    for(int i(2); i< argc; ++i)
    {
    theTestString+=" ";
    theTestString+= argv;
    }
    std::istringstream theStream(theTestString);
    std::vector<std::bitset<32> > hvec(8);

    for(int i(0); i< 8; ++i)
    {
    hvec = h;
    }

    std::vector<std::bitset<32> > chunk(64);
    bool finished = false;
    unsigned int byteCount = 0;
    unsigned int chunkCount = 0;
    int count =0;
    while((count=getChunck(theStream,chunk))==64)
    {
    process(chunk,hvec);
    byteCount+=count;
    }
    byteCount+=count;
    int chunkIndex = 0;
    int byteIndex = 0;
    if(count !=0)
    {
    chunkIndex = count/4;
    byteIndex = count - (count/4)*4;
    }
    chunk[chunkIndex]|= (1<<((4-byteIndex)*8-1));
    std::bitset<64> length(byteCount);
    length<<=3;
    unsigned long lower = (length & ((std::bitset<64>().set())>>32)).to_ulong();
    unsigned long upper = (length & ((std::bitset<64>().set())<<32)).to_ulong();

    if(chunkIndex > 13)
    {
    process(chunk,hvec);
    clean(chunk);
    }
    chunk[14]|=upper;
    chunk[15]|=lower;
    process(chunk,hvec);

    std::vector<unsigned char> theComputed;
    for(int i(0); i< 8; ++i)
    {
    for(int j(0); j<4; ++j)
    {
    std::cout << std::hex << ((hvec>>((3-j)*8))&std::bitset<32>(255)).to_ulong() <<"-";
    }
    }
    }
     
    Ninds, Nov 29, 2012
    #6
  7. On 29.11.2012 02:11, Luca Risolia wrote:

    > Just make sure to compile with optimizations on:


    Oh well, I'm not *that* stupid:

    [~/tmp]: g++ -Wall -O3 x.cc -o x
    [~/tmp]: objdump --demangle -d x | grep call | grep bitset | grep '::set'
    400ceb: e8 60 01 00 00 callq 400e50
    <std::bitset<64ul>::set(unsigned long, bool)>
    400d1b: e8 30 01 00 00 callq 400e50
    <std::bitset<64ul>::set(unsigned long, bool)>
    [~/tmp]: g++ -dumpversion
    4.6.2

    Is is definitely not inlined even though I specified -O3.

    Regards,
    Johannes

    --
    >> Wo hattest Du das Beben nochmal GENAU vorhergesagt?

    > Zumindest nicht öffentlich!

    Ah, der neueste und bis heute genialste Streich unsere großen
    Kosmologen: Die Geheim-Vorhersage.
    - Karl Kaos über Rüdiger Thomas in dsa <hidbv3$om2$>
     
    Johannes Bauer, Nov 29, 2012
    #7
  8. On 29.11.2012 02:46, Ian Collins wrote:

    > Typical C++ compilers do a decent job of inlining when optimisation is
    > enabled. Both compilers I tried (g++ and Sun CC) inline the original
    > example (which was too mangled in your reply to re-quote).


    Which version did you try? I used 4.6.2 on Linux x86_64 with -O3 and got
    no inlining.

    Best regards,
    Johannes

    --
    >> Wo hattest Du das Beben nochmal GENAU vorhergesagt?

    > Zumindest nicht öffentlich!

    Ah, der neueste und bis heute genialste Streich unsere großen
    Kosmologen: Die Geheim-Vorhersage.
    - Karl Kaos über Rüdiger Thomas in dsa <hidbv3$om2$>
     
    Johannes Bauer, Nov 29, 2012
    #8
  9. Ninds

    Ninds Guest

    On Thursday, 29 November 2012 13:25:17 UTC, Johannes Bauer wrote:
    > On 29.11.2012 02:46, Ian Collins wrote:
    >
    >
    >
    > > Typical C++ compilers do a decent job of inlining when optimisation is

    >
    > > enabled. Both compilers I tried (g++ and Sun CC) inline the original

    >
    > > example (which was too mangled in your reply to re-quote).

    >
    >
    >
    > Which version did you try? I used 4.6.2 on Linux x86_64 with -O3 and got
    >
    > no inlining.
    >
    >
    >
    > Best regards,
    >
    > Johannes
    >
    >
    >
    > --
    >
    > >> Wo hattest Du das Beben nochmal GENAU vorhergesagt?

    >
    > > Zumindest nicht öffentlich!

    >
    > Ah, der neueste und bis heute genialste Streich unsere großen
    >
    > Kosmologen: Die Geheim-Vorhersage.
    >
    > - Karl Kaos über Rüdiger Thomas in dsa <hidbv3$om2$>


    Is it possible that:
    1. For your case, on x86_64 there would be no need for inlining since bitset<64> degenerates exactly to native int ?
    2. The same case on a 32bit machine would require a call since the set op is no longer atomic
    3. For bitset<128> you case makes a call ?

    N
     
    Ninds, Nov 29, 2012
    #9
  10. On 29.11.2012 15:58, Ninds wrote:

    [Quoting chaos]

    As others have mentioned, PLEASE take care of the quoting mess that your
    newsreader produces. It makes your messages hard to decipher.

    > Is it possible that:
    > 1. For your case, on x86_64 there would be no need for inlining since bitset<64> degenerates exactly to native int ?


    Well, a bitset<64> on my example is 8 bytes wide while a native int is 4
    bytes. But if it did exactly degenerate into a native int, it would
    still make lots of sense of the compiler to inline the code in order to
    simplify it and get rid of the call overhead.

    > 2. The same case on a 32bit machine would require a call since the set op is no longer atomic


    This doesn't make sense. How does atomicity come into play here? There's
    no guarantee a bitset does atomically alter the bits.

    > 3. For bitset<128> you case makes a call ?


    Yes.

    Regards,
    Johannes


    --
    >> Wo hattest Du das Beben nochmal GENAU vorhergesagt?

    > Zumindest nicht öffentlich!

    Ah, der neueste und bis heute genialste Streich unsere großen
    Kosmologen: Die Geheim-Vorhersage.
    - Karl Kaos über Rüdiger Thomas in dsa <hidbv3$om2$>
     
    Johannes Bauer, Nov 29, 2012
    #10
  11. On 29.11.2012 19:00, Luca Risolia wrote:
    > On 29/11/2012 14:24, Johannes Bauer wrote:
    >
    >> Is is definitely not inlined even though I specified -O3.

    >
    > Try with g++ 4.7 and -Ofast..


    -Ofast doesn't change anything with g++ on 4.6.2, underlining the fact
    that your "Just make sure to compile with optimizations on" was
    ill-advised. Apparently older g++ versions don't inline this construct
    for whatever reason.

    Regards,
    johannes


    --
    >> Wo hattest Du das Beben nochmal GENAU vorhergesagt?

    > Zumindest nicht öffentlich!

    Ah, der neueste und bis heute genialste Streich unsere großen
    Kosmologen: Die Geheim-Vorhersage.
    - Karl Kaos über Rüdiger Thomas in dsa <hidbv3$om2$>
     
    Johannes Bauer, Nov 29, 2012
    #11
  12. Ninds

    Ninds Guest

    On Thursday, 29 November 2012 17:47:58 UTC, Johannes Bauer wrote:
    > On 29.11.2012 15:58, Ninds wrote:
    >
    >
    >
    > [Quoting chaos]
    >
    >
    >
    > As others have mentioned, PLEASE take care of the quoting mess that your
    >
    > newsreader produces. It makes your messages hard to decipher.
    >
    >
    >
    > > Is it possible that:

    >
    > > 1. For your case, on x86_64 there would be no need for inlining since bitset<64> degenerates exactly to native int ?

    >
    >
    >
    > Well, a bitset<64> on my example is 8 bytes wide while a native int is 4
    >
    > bytes. But if it did exactly degenerate into a native int, it would
    >
    > still make lots of sense of the compiler to inline the code in order to
    >
    > simplify it and get rid of the call overhead.
    >
    >
    >
    > > 2. The same case on a 32bit machine would require a call since the set op is no longer atomic

    >
    >
    >
    > This doesn't make sense. How does atomicity come into play here? There's
    >
    > no guarantee a bitset does atomically alter the bits.
    >
    >
    >
    > > 3. For bitset<128> you case makes a call ?

    >
    >
    >
    > Yes.
    >
    >
    >
    > Regards,
    >
    > Johannes
    >
    >
    >
    >
    >
    > --
    >
    > >> Wo hattest Du das Beben nochmal GENAU vorhergesagt?

    >
    > > Zumindest nicht �ffentlich!

    >
    > Ah, der neueste und bis heute genialste Streich unsere gro�en
    >
    > Kosmologen: Die Geheim-Vorhersage.
    >
    > - Karl Kaos �ber R�diger Thomas in dsa <hidbv3$om2$>


    Perhaps I was a bit unclear.
    What I meant was that if the native int is 32 bits then it is plausible that the compiler generates different code for the set method for bitset<32> than for say bitset<33>. I just mean that for bitset<32> the setting of all the bits is atomic (I accept it's an ill chosen term) whereas for bitset<33> it's not the case.
    Moreover for similar reasons the code for the set method on a x86_64 for bitset<64> maybe quite different from the code generated for the set method on a 32bit machine.
     
    Ninds, Nov 29, 2012
    #12
  13. Ninds

    Marc Guest

    Ninds wrote:

    > I would like to know whether using bitset<32> for bit operations on a
    > 32bit machine would generally be as efficient as using a 32bit int.
    > Moreover, if the answer is yes would it also hold for bitset<64> and
    > 64bit int on a 64 bit arch.
    > I realise the standard says nothing about the implementation so there
    > is no definitive answer but what is 'likely' to be the case ?


    When bitset does the same thing as you would do with an integer (say
    with operator| for instance), you can expect roughly the same code in
    the end. But if you use .set(i), the standard requires that bitset test
    whether i is in the range and throw otherwise. If you were not going to
    do such a test with integers, that will make bitset slower.

    If you are concerned about the generated code, you will have to look
    through it for anything suspicious. Note that if you can't read asm,
    with gcc, you can get a first idea of what optimizations happen looking
    at the file produced by -fdump-tree-optimized (looks like C code).
     
    Marc, Nov 30, 2012
    #13
  14. On 30.11.2012 19:24, Luca Risolia wrote:
    > On 29/11/2012 19:40, Johannes Bauer wrote:
    >> -Ofast doesn't change anything with g++ on 4.6.2, underlining the fact
    >> that your "Just make sure to compile with optimizations on" was
    >> ill-advised. Apparently older g++ versions don't inline this construct
    >> for whatever reason.

    >
    > That is not true. A version of gcc older than yours does inlining when
    > minimal optimizations are turned on:


    Hmm, in any case, this shows that the OP should look closely, as it
    apparently depends on some more factors -- I'm really curious why the
    optimization is not happening on my system, actually. Since you're on
    x86-32, I tried to compile with "-m32", but still I cannot get my g++ to
    optimize the 'set' call. weird. Output of g++/uname at bottom.

    Best regards,
    Johannes

    [~/tmp]: LC_ALL=C g++ -v
    Using built-in specs.
    COLLECT_GCC=/usr/x86_64-pc-linux-gnu/gcc-bin/4.6.2/g++
    COLLECT_LTO_WRAPPER=/usr/libexec/gcc/x86_64-pc-linux-gnu/4.6.2/lto-wrapper
    Target: x86_64-pc-linux-gnu
    Configured with:
    /opt/tmp-portage/portage/sys-devel/gcc-4.6.2/work/gcc-4.6.2/configure
    --prefix=/usr --bindir=/usr/x86_64-pc-linux-gnu/gcc-bin/4.6.2
    --includedir=/usr/lib/gcc/x86_64-pc-linux-gnu/4.6.2/include
    --datadir=/usr/share/gcc-data/x86_64-pc-linux-gnu/4.6.2
    --mandir=/usr/share/gcc-data/x86_64-pc-linux-gnu/4.6.2/man
    --infodir=/usr/share/gcc-data/x86_64-pc-linux-gnu/4.6.2/info
    --with-gxx-include-dir=/usr/lib/gcc/x86_64-pc-linux-gnu/4.6.2/include/g++-v4
    --host=x86_64-pc-linux-gnu --build=x86_64-pc-linux-gnu --disable-altivec
    --disable-fixed-point --without-ppl --without-cloog --enable-lto
    --enable-nls --without-included-gettext --with-system-zlib
    --disable-werror --enable-secureplt --enable-multilib
    --disable-libmudflap --disable-libssp --enable-libgomp
    --with-python-dir=/share/gcc-data/x86_64-pc-linux-gnu/4.6.2/python
    --enable-checking=release --enable-java-awt=gtk
    --enable-languages=c,c++,java,fortran --enable-shared
    --enable-threads=posix --enable-__cxa_atexit --enable-clocale=gnu
    --enable-targets=all --with-bugurl=http://bugs.gentoo.org/
    --with-pkgversion='Gentoo 4.6.2 p1.4, pie-0.5.0'
    Thread model: posix
    gcc version 4.6.2 (Gentoo 4.6.2 p1.4, pie-0.5.0)

    [~/tmp]: uname -a
    Linux joequad 3.3.7 #6 SMP PREEMPT Wed Nov 28 14:59:17 CET 2012 x86_64
    Intel(R) Core(TM)2 Quad CPU Q9550 @ 2.83GHz GenuineIntel GNU/Linux


    --
    >> Wo hattest Du das Beben nochmal GENAU vorhergesagt?

    > Zumindest nicht öffentlich!

    Ah, der neueste und bis heute genialste Streich unsere großen
    Kosmologen: Die Geheim-Vorhersage.
    - Karl Kaos über Rüdiger Thomas in dsa <hidbv3$om2$>
     
    Johannes Bauer, Dec 2, 2012
    #14
  15. Ninds

    W Karas Guest

    On Sunday, December 2, 2012 7:37:44 AM UTC-5, Luca Risolia wrote:
    > On 02/12/2012 09:31, Johannes Bauer wrote:
    >
    > > Hmm, in any case, this shows that the OP should look closely, as it

    >
    > > apparently depends on some more factors -- I'm really curious why the

    >
    > > optimization is not happening on my system, actually. Since you're on

    >
    > > x86-32, I tried to compile with "-m32", but still I cannot get my g++ to

    >
    > > optimize the 'set' call. weird. Output of g++/uname at bottom.

    >
    >
    >
    > In your g++ 4.6 try to increase -finline-limit to an appropriate value:
    >
    >
    >
    > :~$ g++-4.6 -O1 -c x.cc -o x
    >
    > :~$ objdump --demangle -d x | grep call | grep bitset | grep '::set'
    >
    > f: e8 00 00 00 00 callq 14 <std::bitset<64ul>::set(unsigned
    >
    > long, bool)+0x14>
    >
    >
    >
    > :~$ g++-4.6 -O1 -finline-limit=26 -c x.cc -o x
    >
    > :~$ objdump --demangle -d x | grep call | grep bitset | grep '::set'
    >
    >
    >
    > :~$ g++-4.6 -O1 -finline-limit=25 -c x.cc -o x
    >
    > :~$ objdump --demangle -d x | grep call | grep bitset | grep '::set'
    >
    > f: e8 00 00 00 00 callq 14 <std::bitset<64ul>::set(unsigned
    >
    > long, bool)+0x14>
    >
    > 17: e8 00 00 00 00 callq 1c <std::bitset<64ul>::set(unsigned
    >
    > long, bool)+0x1c>


    I would get rid of this compile option. -O even at the minimum level should enable inlining. There should be an option that would specify the cumulative number of additional instructions that recursive inlining of a call could add before the recursion was stopped. The default for this should generally be zero, since bigger and nominally fewer instructions executed can be more than offset by a larger cache working set. A configurable limit on depth of inlining is desirable, but it should only apply when direct or indirect inlining of the same function is detected.
     
    W Karas, Dec 3, 2012
    #15
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. GSK
    Replies:
    0
    Views:
    402
  2. The Eeediot
    Replies:
    3
    Views:
    539
    =?Utf-8?B?QW5kcmV3IENvcmxleSwgTUNTRCwgTUNEQkE=?=
    Nov 16, 2004
  3. shaun roe
    Replies:
    2
    Views:
    496
    Andrew Koenig
    Nov 6, 2004
  4. SpOiLeR
    Replies:
    5
    Views:
    1,368
    SpOiLeR
    Mar 16, 2005
  5. Thormod Johansen

    Vector and list iterators and efficiency

    Thormod Johansen, Mar 26, 2007, in forum: C++
    Replies:
    2
    Views:
    482
    mlimber
    Mar 26, 2007
Loading...

Share This Page