Optimizing C

Discussion in 'C Programming' started by Richard G. Riley, Mar 13, 2006.

  1. Without resorting to asm chunks I'm working on a few small routines
    which manipulate bitmasks. I'm looking for any guidance on writing C
    in a manner which tilts the compiler's hand in, if possible, a
    compiler/underlying-processor-independent way: although to be fair I
    can't see this stuff running on anything other than x86, but who knows.

    I found some OK info here:

    http://www.eventhelix.com/RealtimeMantra/Basics/OptimizingCAndCPPCode.htm
    http://www.devx.com/amd/Article/21314

    But any techniques for doing the standard bit fiddles without
    incurring extra unnecessary CPU bandwidth would be nice. By standard
    bit fiddles I mean: shift & rotate, ANDing, ORing, shift till a bit is
    found, etc. The operations will be occurring on sequences of
    words, and the smallest word will be the standard CPU word size (32
    bits). By sequences of words I mean: I might, for example, be locating
    the first non-zero bit from the left in a sequence of 64 words.

    Any pointers appreciated.
     
    Richard G. Riley, Mar 13, 2006
    #1

  2. Richard G. Riley wrote:

    > Without resorting to asm chunks I'm working on a few small routines
    > which manipulate bitmasks. I'm looking for any guidance on writing C
    > in a manner which tilts the compiler's hand in, if possible, a
    > compiler/underlying-processor-independent way: although to be fair I
    > can't see this stuff running on anything other than x86, but who knows.
    >
    > I found some OK info here:
    >
    > http://www.eventhelix.com/RealtimeMantra/Basics/OptimizingCAndCPPCode.htm
    > http://www.devx.com/amd/Article/21314
    >
    > But any techniques for doing the standard bit fiddles without
    > incurring extra unnecessary CPU bandwidth would be nice. By standard
    > bit fiddles I mean: shift & rotate, ANDing, ORing, shift till a bit is
    > found, etc. The operations will be occurring on sequences of
    > words, and the smallest word will be the standard CPU word size (32
    > bits). By sequences of words I mean: I might, for example, be locating
    > the first non-zero bit from the left in a sequence of 64 words.


    First, it's not at all clear what you're trying to speed
    up, nor what you have measured and found to be too slow. A
    "generic" optimization is often ill-directed; optimizing "on
    general principles" is often ill-advised. If you have a specific
    operation you'd like to speed up, it'd be a good idea to show
    the actual code you're currently using, along with measurements
    of its current speed and an estimate of the amount of speedup
    you simply MUST achieve to make the code useful.

    As to finding the first non-zero bit in an array of 64 ints,
    the first thing to do is nail down what you mean by "first," lest
    endianness issues pollute the solutions. Having done that, you
    should also decide what form the answer should take -- it may be
    simpler (or harder) to get a bit mask than a bit index, and you
    may benefit by adjusting the rest of the code to use the more
    easily obtained answer.

    If one-bits are expected to be fairly sparse, you'll probably
    want to search through largish "chunks" to find one that isn't
    all zeroes, then search within that chunk for the proper bit.
    You could do the search an `int' at a time with an ordinary loop,
    or you might use memchr() to search byte-by-byte. The memchr()
    method might (or might not) take longer for the search, but note
    that it leaves a smaller space for the bit-by-bit search to explore.
    You'd need to implement both and measure.
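
    In outline, the word-at-a-time version might look something like this
    (a sketch only, assuming `unsigned long' words with no surprises):

    /* Return the index of the first word containing a set bit,
       or n if every word is zero. */
    size_t first_nonzero_word(const unsigned long *buf, size_t n)
    {
        size_t i;
        for (i = 0; i < n; i++)
            if (buf[i] != 0)
                break;
        return i;
    }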

    If you've used memchr() and have located a non-zero byte, it
    might make sense to use a precalculated 256-element table to finish
    the job without further "computation." Then again, it might not:
    modern CPUs are much faster than memory, and might be able to
    complete a multi-step computation in less time than a memory fetch.
    More measurements are needed.
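
    To make the table idea concrete (a sketch; the layout is one choice
    among many): precompute, for every byte value, the index of its
    highest set bit, then index the table with the non-zero byte you found.

    /* hibit[b] = index (0..7) of the highest set bit in byte b.
       hibit[0] is never consulted, since we only look up non-zero bytes. */
    static unsigned char hibit[256];

    static void init_hibit(void)
    {
        int b, i;
        for (b = 1; b < 256; b++)
            for (i = 7; i >= 0; i--)
                if (b & (1 << i)) {
                    hibit[b] = (unsigned char)i;
                    break;
                }
    }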

    If you've searched int-by-int you could dive directly into
    the second search, or you could first use a couple of comparisons
    to decide which of the four bytes holds the "first" non-zero and
    then proceed as if memchr() had located that byte. Which will be
    faster? Measure, measure, measure.
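
    Concretely (a sketch, taking "first" to mean the most significant
    byte of a 32-bit word known to be non-zero):

    /* Which byte of non-zero w holds the most significant set bit?
       Returns 0 for the most significant byte, 3 for the least. */
    unsigned msb_byte_index(unsigned long w)
    {
        if (w & 0xFFFF0000UL)
            return (w & 0xFF000000UL) ? 0 : 1;
        return (w & 0x0000FF00UL) ? 2 : 3;
    }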

    If you need a bit index and don't want to use a table, the
    obvious test-and-shift-and-count approach is probably best. If
    you need a mask as opposed to an index, the `x & (x - 1)' trick
    may be faster. Measure, measure, -- is there an echo in here?
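
    For the record (a sketch; note that both tricks operate on the
    *lowest* set bit of x):

    /* x is any non-zero unsigned word. */
    unsigned long cleared = x & (x - 1);  /* x with its lowest set bit cleared */
    unsigned long lowmask = x & -x;       /* a mask of just the lowest set bit */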

    Finally, an impression of the two Web pages you cited. A
    quick glance suggests they're filled with well-intentioned and
    sometimes reasonable advice, but ALL of it is out of context.
    (Necessarily so, of course: the authors of the Web pages have no
    idea what kind of program you're trying to write.) Even expert
    advice can become nonsense when taken out of context. Your
    grandmother's doctor is (presumably) an expert; do you therefore
    take the same medicines that are prescribed for your grandmother?
    IMHO, it is best to read and understand advice of the sort these
    pages offer, but NOT to follow it until and unless you have reason
    to believe it applies to your situation.

    --
    Eric Sosman
     
    Eric Sosman, Mar 13, 2006
    #2

  3. Richard G. Riley wrote:
    > [...]
    > Any pointers appreciated.


    Maybe http://www.jjj.de/bitwizardry/ is a good start.

    a+, ld.
     
    Laurent Deniau, Mar 13, 2006
    #3
  4. On 2006-03-13, Eric Sosman <> wrote:
    > Richard G. Riley wrote:
    >
    >> [...]

    >
    > First, it's not at all clear what you're trying to speed
    > up, nor what you have measured and found to be too slow. A


    Hi Eric,

    I'm looking for advice/pointers on the operators mentioned above. It's
    not a case of anything being too slow: it's a question of ensuring or
    approaching optimum performance as early as possible. I did a lot of
    this stuff at assembler level years ago when writing high-performance
    VGA libraries, but want, if possible, to leave it all in C now.

    From my opening paragraph:

    shift & rotate operations: left & right shift operations
    anding: bitwise AND operations
    oring: bitwise OR operations

    I am writing some small libraries for generating image masks,
    transparency masks and edge silhouettes. Some more of the details are below.

    > "generic" optimization is often ill-directed; optimizing "on
    > general principles" is often ill-advised. If you have a specific
    > operation you'd like to speed up, it'd be a good idea to show
    > the actual code you're currently using, along with measurements
    > of its current speed and an estimate of the amount of speedup
    > you simply MUST achieve to make the code useful.


    I'm looking for general guidelines: nothing spectacularly
    specific. You do give some good guidelines, some of which had
    crossed my mind already, but certainly not all.

    >
    > As to finding the first non-zero bit in an array of 64 ints,
    > the first thing to do is nail down what you mean by "first," lest
    > endianness issues pollute the solutions. Having done that, you


    That's in the details: but here we can assume that the sequence is
    high bit == leftmost pixel. Also that any colour issues are segregated
    into 3 or more colour planes: initial code can assume a monochrome
    bitmap.

    I don't really think endianness comes into it, since the operations will
    be taken care of at the word level. Even if it wasn't monochrome, the
    most significant X bits would be the leftmost pixel, where X is the colour
    depth: in that case I would see the only code issues being those of
    increasing the bit width of the test mask and the number of shifts
    done in order to check the next pixel.

    (Just to clarify the endian thing: all work is in memory buffers and
    is in no way related to video HW addressing.)

    > should also decide what form the answer should take -- it may be
    > simpler (or harder) to get a bit mask than a bit index, and you
    > may benefit by adjusting the rest of the code to use the more
    > easily obtained answer.


    That's something to think about, OK. Although I always considered just
    ANDing and shifting 0x80000000 until a non-zero result was
    indicated. The mask can be generated from the bit by subtracting one
    (e.g. 0x40000000 - 1 -> 0x3fffffff) and re-adding the bit.

    After the leftmost pixel is detected we detect the rightmost, and if we
    are creating a mask of words for the rest of the scanline it's then trivial.

    >
    > If one-bits are expected to be fairly sparse, you'll probably
    > want to search through largish "chunks" to find one that isn't
    > all zeroes, then search within that chunk for the proper bit.
    > You could do the search an `int' at a time with an ordinary loop,
    > or you might use memchr() to search byte-by-byte. The memchr()
    > method might (or might not) take longer for the search, but note
    > that it leaves a smaller space for the bit-by-bit search to explore.
    > You'd need to implement both and measure.


    Yes, check the word for 0 before looking for a set bit: simple enough, and
    a very practical operation with no real overhead at the time the word is
    read. I don't expect the images to be sparse, so I'm not sure it's
    worth the overhead of calling a library function to find it. Considering
    leftmost-pixel detection, the following seems fairly optimal to me in C
    (and this is the type of thing I'm looking for advice on).

    (Early caveat: all pseudo-C code assumes a 32-bit word, for clarity and
    to avoid getting lost in type sizes.)


    bitmap = 0;
    while (columnsToCheck-- && !(bitmap = *imagePtr++))
        ;
    if (bitmap) {
        /* now find leftmost pixel */
    }


    Should that be optimal enough?

    >
    > If you've used memchr() and have located a non-zero byte, it


    memchr() would certainly not be on the agenda, I think: it needs a
    character to compare against, whereas the implicit recognition of "zero"
    or "non-zero" is faster. Also, I read in words (properly aligned, of
    course) and not bytes: bytes would certainly strangle the application.

    > might make sense to use a precalculated 256-element table to finish
    > the job without further "computation." Then again, it might not:


    Could you expand on that table bit a little? I don't know about this. Is
    it something CPU-specific?

    > modern CPU's are much faster than memory, and might be able to
    > complete a multi-step computation in less time than a memory fetch.
    > More measurements are needed.
    >
    > If you've searched int-by-int you could dive directly into
    > the second search, or you could first use a couple of comparisons
    > to decide which of the four bytes holds the "first" non-zero and
    > then proceed as if memchr() had located that byte. Which will be
    > faster? Measure, measure, measure.
    >
    > If you need a bit index and don't want to use a table, the
    > obvious test-and-shift-and-count approach is probably best. If
    > you need a mask as opposed to an index, the `x & (x - 1)' trick


    Could you expand on that? I would see it as simple as

    (find left pixel mask)

    /* we know the bitmask is non-zero, so a bit check is a must */
    for (bit = 0x80000000; !(bit & bitmap); bit >>= 1)
        ;
    /* word bit mask protects all bits from the leftmost pixel on */
    leftWordMask = bit | (bit - 1);

    > may be faster. Measure, measure, -- is there an echo in here?


    I've used a few profilers on VAX, Convex and on MS, but will be using
    gcov for initial run-time analysis on Linux - though I'm not sure it's
    what I want for real execution-time profiling. But I'm certainly
    looking for "C hints" on faster ways with bitmaps first; then I can
    decide on a design approach. I think the code above (silly errors
    notwithstanding) appears to be relatively quick, and the core "loop" areas
    are small enough that I would expect their execution footprint to get cached.

    >
    > Finally, an impression of the two Web pages you cited. A
    > quick glance suggests they're filled with well-intentioned and
    > sometimes reasonable advice, but ALL of it is out of context.


    Yes & no: it has some general advice on standard stuff which can help:
    alignment, powers of two, etc. It even discusses using machine words
    at some stage instead of smaller units, which can otherwise cause excessive
    overhead, I think. Certainly helpful to anyone embarking on optimizing
    anything in C for the first time.

    > (Necessarily so, of course: the authors of the Web pages have no
    > idea what kind of program you're trying to write.) Even expert
    > advice can become nonsense when taken out of context. Your
    > grandmother's doctor is (presumably) an expert; do you therefore
    > take the same medicines that are prescribed for your grandmother?
    > IMHO, it is best to read and understand advice of the sort these
    > pages offer, but NOT to follow it until and unless you have reason
    > to believe it applies to your situation.
    >


    I appreciate your reply,

    thanks for taking the time to do so.
     
    Richard G. Riley, Mar 13, 2006
    #4
  5. On 2006-03-13, Laurent Deniau <> wrote:
    >
    > Maybe http://www.jjj.de/bitwizardry/ is a good start.
    >
    > a+, ld.


    Great! Thanks: surprised I didn't google it up in the first place. Nice
    resource, cheers.
     
    Richard G. Riley, Mar 13, 2006
    #5
  6. Richard G. Riley wrote:
    > Without resorting to asm chunks I'm working on a few small routines
    > which manipulate bitmasks. I'm looking for any guidance on writing C
    > in a manner which tilts the compiler's hand in, if possible, a
    > compiler/underlying-processor-independent way: although to be fair I
    > can't see this stuff running on anything other than x86, but who knows.


    Perhaps useless info if you only run on x86 but...

    The Motorola 68020 has special bitfield opcodes which some
    C compilers will generate *if* you declare the data as C bitfields.
    (One of these special opcodes -- 'bfffo' -- is unusual and can come in
    *very* handy sometimes, but I doubt any compiler will use *it*.)
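
    (For anyone unfamiliar with the declaration style meant -- a sketch,
    with made-up field names:)

    struct pixel_flags {
        unsigned int visible : 1;   /* one-bit flag        */
        unsigned int dirty   : 1;   /* one-bit flag        */
        unsigned int depth   : 6;   /* six-bit small value */
    };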

    James Dow Allen
     
    James Dow Allen, Mar 13, 2006
    #6
  7. Richard G. Riley wrote On 03/13/06 10:43:
    > On 2006-03-13, Eric Sosman <> wrote:
    >
    >>Richard G. Riley wrote:
    >>
    >>
    >>>[...]

    >>
    >> First, it's not at all clear what you're trying to speed
    >>up, nor what you have measured and found to be too slow. A

    >
    >
    > Hi Eric,
    >
    > I'm looking for advice/pointers on the operators mentioned above. Its
    > not a case of anything being too slow : its a question of ensuring or
    > approaching optimum performance as early as possible. [...]


    D.E. Knuth (elaborating on a remark by Hoare): "We should
    forget about small efficiencies, say about 97% of the time:
    premature optimization is the root of all evil."

    W.A. Wulf: "More computing sins are committed in the name
    of efficiency (without necessarily achieving it) than for any
    other single reason, including blind stupidity."

    Jackson's Laws of Program Optimization:
    FIRST LAW: Don't do it.
    SECOND LAW (for experts only): Don't do it yet.

    See also <http://www.flounder.com/optimization.htm>.

    ... and that's all I'll offer on the matter. Until you
    have made measurements or done calculations that demonstrate
    a performance problem, "approaching optimum performance as
    early as possible" is folly. If you wish to behave foolishly,
    go ahead -- but do it on your own; I'm not going to assist
    you to your ruin.

    --
     
    Eric Sosman, Mar 13, 2006
    #7
  8. On 2006-03-13, Eric Sosman <> wrote:
    >
    >
    > D.E. Knuth (elaborating on a remark by Hoare): "We should
    > forget about small efficiencies, say about 97% of the time:
    > premature optimization is the root of all evil."
    > [...]


    Thanks for those references: all very valid if designing a large
    system. I am designing/implementing a very low-level system with fairly
    specific requirements, and a priori knowledge of what can be achieved
    at the earliest stages, to guide the design to embrace known optimal
    techniques, is very much relevant. The key here is "known optimal
    techniques": if such techniques are already worked out (as the
    link another poster gave recommends) then it would be foolish to dismiss
    them out of hand.
     
    Richard G. Riley, Mar 13, 2006
    #8
  9. Richard G. Riley wrote:
    > [...]
    > Any pointers appreciated.


    Isolate the bit scan routine into its own function, and write it in C
    as simply as possible. Then special-case the function for x86/x86-64
    using the appropriate bit scan machine instruction (bsf, bsr). Make
    sure the function actually gets inlined.

    This is really OT, but I mention it because I've yet to see a compiler
    that actually generates bit scan instructions, and the speedup from
    using them is around 10x (in my experience). Bit scanning is heavily
    used in some areas of networking, particularly in packet
    classification.

    However, this is probably the _only_ thing you should code in assembly,
    as bit twiddling C code is normally compiled to very tight assembly.
    In fact, it's most often faster than non-expert hand-written assembly,
    simply because the compiler knows more about the minutiae of the machine
    architecture and instruction scheduling.
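
    As a sketch of the shape I mean (assuming GCC on 32-bit x86 for the
    special case; the fallback is plain C):

    /* Index (0 = LSB) of the highest set bit of x; x must be non-zero. */
    static inline unsigned bit_scan_reverse(unsigned long x)
    {
    #if defined(__GNUC__) && defined(__i386__)
        unsigned long idx;
        __asm__("bsrl %1, %0" : "=r" (idx) : "rm" (x));
        return (unsigned)idx;
    #else
        unsigned idx = 0;
        while (x >>= 1)
            idx++;
        return idx;
    #endif
    }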


    Mark F. Haigh
     
    Mark F. Haigh, Mar 13, 2006
    #9
  10. On 2006-03-13, Mark F. Haigh <> wrote:
    > Richard G. Riley wrote:
    >> [...]

    >
    > Isolate the bit scan routine into its own function, and write it in C
    > as simply as possible. Then special-case the function for x86/x86-64
    > using the appropriate bit scan machine instruction (bsf, bsr). Make
    > sure the function actually gets inlined.
    > [...]


    Thanks Mark: I'll take a look and might try that at a later date.
    I'm "off" assembler at the moment, so I'll put it on the back
    burner until the "optimized C" parts are in place - although to be
    honest there's not a lot there other than good common-sense
    programming. The link another poster gave is interesting too:

    http://www.jjj.de/bitwizardry/

    And I've just located an x86 reference PDF: bsf/bsr are certainly the
    instructions for boundary bit checks.



    cheers,

    richard.


    --
    Debuggers : you know it makes sense.
    http://heather.cs.ucdavis.edu/~matloff/UnixAndC/CLanguage/Debug.html#tth_sEc
     
    Richard G. Riley, Mar 13, 2006
    #10
  11. Eric Sosman <> writes:
    [...]
    > If one-bits are expected to be fairly sparse, you'll probably
    > want to search through largish "chunks" to find one that isn't
    > all zeroes, then search within that chunk for the proper bit.
    > You could do the search an `int' at a time with an ordinary loop,
    > or you might use memchr() to search byte-by-byte. The memchr()
    > method might (or might not) take longer for the search, but note
    > that it leaves a smaller space for the bit-by-bit search to explore.
    > You'd need to implement both and measure.

    [...]

    I don't think memchr() would be useful. It scans for a byte equal to
    a specified value; you want to scan for bytes *unequal* to 0.

    If you're searching for a 1 bit in a chunk of memory, checking first
    whether each word (where "word" probably means something like an
    unsigned int or unsigned long) is equal to 0 seems like an obvious
    optimization. But if you care about absolute portability (i.e., code
    that will work with any possible conforming C implementation), it can
    get you into trouble.

    It's guaranteed that, for any integer type, all-bits-zero is a
    representation of the value 0. (Neither the C90 nor the C99 standard
    says this, but n1124 does; this change was made in TC2 in response to
    DR #263. I've never heard of an implementation that doesn't have this
    property anyway, and I'd be comfortable depending on it.) For
    one's-complement and sign-magnitude representations, there are two
    distinct representations of zero (+0 and -0), but you can avoid that
    by using an unsigned type. But unsigned types *can* have padding
    bits, so even if buf==0, you might still have missed a 1 bit.

    unsigned char can't have padding bits, and every object can be viewed
    as an array of unsigned char, so you can safely compare each byte to 0
    to find the first 1 bit. But this is likely to be less efficient than
    scanning by words.
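
    For example (a sketch of the fully portable byte-wise scan):

    #include <stddef.h>

    /* Index of the first non-zero byte of the object at p, or n if
       all n bytes are zero. Valid for any object representation. */
    size_t first_nonzero_byte(const void *p, size_t n)
    {
        const unsigned char *b = p;
        size_t i;
        for (i = 0; i < n; i++)
            if (b[i] != 0)
                break;
        return i;
    }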

    You can scan by words *if* the unsigned type you're using has no
    padding bits. You can test this automatically: for example, compare
    CHAR_BIT * sizeof(unsigned long) with the number of value bits implied
    by ULONG_MAX; if the range takes up all the bits, there can be no
    padding bits. If this test fails, you have to fall back to some other
    algorithm -- which means you have two different algorithms to test.

    You also have to account for alignment issues (unless you use unsigned
    char).

    If you don't care about absolute portability, you can get away with
    making whatever assumptions you can get away with. I suggest
    documenting any assumptions you make that aren't guaranteed by the
    standard. If you can test your assumptions, either during compilation
    or at run time, you can fall back to a "#error" directive or use
    assert(), so code will fail early and cleanly if your assumptions are
    violated.
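
    For instance (a sketch of the run-time variant; the function names
    are arbitrary):

    #include <assert.h>
    #include <limits.h>

    /* Number of value bits in unsigned long, counted from ULONG_MAX. */
    static unsigned ulong_value_bits(void)
    {
        unsigned n = 0;
        unsigned long m;
        for (m = ULONG_MAX; m != 0; m >>= 1)
            n++;
        return n;
    }

    static void check_assumptions(void)
    {
        /* Fail early and cleanly if unsigned long has padding bits. */
        assert(ulong_value_bits() == CHAR_BIT * sizeof(unsigned long));
    }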

    To summarize: portable source-level micro-optimization is hard.

    --
    Keith Thompson (The_Other_Keith) <http://www.ghoti.net/~kst>
    San Diego Supercomputer Center <*> <http://users.sdsc.edu/~kst>
    We must do something. This is something. Therefore, we must do this.
     
    Keith Thompson, Mar 13, 2006
    #11
  12. On 2006-03-13, Keith Thompson <> wrote:
    > Eric Sosman <> writes:
    > [...]
    >> If one-bits are expected to be fairly sparse, you'll probably
    >> want to search through largish "chunks" to find one that isn't
    >> all zeroes, then search within that chunk for the proper bit.
    >> You could do the search an `int' at a time with an ordinary loop,
    >> or you might use memchr() to search byte-by-byte. The memchr()
    >> method might (or might not) take longer for the search, but note
    >> that it leaves a smaller space for the bit-by-bit search to explore.
    >> You'd need to implement both and measure.

    > [...]
    >
    > I don't think memchr() would be useful. It scans for a byte equal to
    > a specified value; you want to scan for bytes *unequal* to 0.


    I don't either: see my other reply. But to summarise: it uses bytes,
    which is plain silly, and it requires a per-character comparison which
    is not required when checking for "non-zero".

    >
    > If you're searchin for a 1 bit in a chunk of memory, checking first
    > whether each word (where "word" probably means something like an
    > unsigned int or unsigned long) is equal to 0 seems like an obvious
    > optimization. But if you care about absolute portability (i.e., code
    > that will work with any possible conforming C implementation), it can
    > get you into trouble.
    >
    > It's guaranteed that, for any integer type, all-bits-zero is a
    > representation of the value 0. (Neither the C90 nor the C99 standard
    > says this, but n1124 does; this change was made in TC2 in response to
    > DR #263. I've never heard of an implementation that doesn't have this
    > property anyway, and I'd be comfortable depending on it.) For
    > one's-complement and sign-magnitude representations, there are two
    > distinct representations of zero (+0 and -0), but you can avoid that
    > by using an unsigned type. But unsigned types *can* have padding
    > bits, so even if buf==0, you might still have missed a 1 bit.


    Could you please explain this? If I have calculated (or even know) that the
    underlying word size is 32 bits and I use an unsigned int to represent
    this in C, then how do padding bits make any difference? Isn't the
    variable guaranteed to go from 0 to 2^32-1?

    Keeping in mind that it is 32-bit "image data" put into 32-bit memory
    storage (an array of unsigned ints).

    I think the term "padding bits" is confusing me here. It's not another
    one of these horrible things that means, for truly portable code, we
    can't assume that a 4-byte word actually takes 4 bytes in memory,
    is it? And since I'm accessing that data via an "unsigned long" array,
    would it make any difference to the code?

    (I think the example code in my other reply was something like ...)
    ...
    unsigned int *refImageData = &arrImageDataScanLine[0];
    ...

    while (columnsToCheck-- && !(imageWord = *refImageData++))
        ;


    This is surely portable? Slap me down if it's not: that I can take.

    >
    > unsigned char can't have padding bits, and every object can be viewed
    > as an array of unsigned char, so you can safely compare each byte to 0
    > to find the first 1 bit. But this is likely to be less efficient than
    > scanning by words.
    >
    > You can scan by words *if* the unsigned type you're using has no
    > padding bits. You can test this automatically: for example, compare
    > CHAR_BIT * sizeof(unsigned long) with the number of value bits implied
    > by ULONG_MAX; if the range takes up all the bits, there can be no
    > padding bits. If this test fails, you have to fall back to some other
    > algorithm -- which means you have two different algorithms to test.
    >
    > You also have to account for alignment issues (unless you use unsigned
    > char).


    Assume it is guaranteed to align to natural word alignment. Otherwise it
    would kill the compiler-produced optimisations triggered by using the
    natural word-size data type. This type of thing was covered well, IMO, in
    the two initial links I gave in the OP.

    >
    > If you don't care about absolute portability, you can get away with
    > making whatever assumptions you can get away with. I suggest
    > documenting any assumptions you make that aren't guaranteed by the
    > standard. If you can test your assumptions, either during compilation
    > or at run time, you can fall back to a "#error" directive or use
    > assert(), so code will fail early and cleanly if your assumptions are
    > violated.
    >
    > To summarize: portable source-level micro-optimization is hard.
    >


    I don't doubt it: but I'm happy for it to be optimized for x86 and
    "work" for other platforms :-;
     
    Richard G. Riley, Mar 13, 2006
    #12
  13. "Richard G. Riley" <> writes:
    > On 2006-03-13, Keith Thompson <> wrote:

    [...]
    >> It's guaranteed that, for any integer type, all-bits-zero is a
    >> representation of the value 0. (Neither the C90 nor the C99 standard
    >> says this, but n1124 does; this change was made in TC2 in response to
    >> DR #263. I've never heard of an implementation that doesn't have this
    >> property anyway, and I'd be comfortable depending on it.) For
    >> one's-complement and sign-magnitude representations, there are two
    >> distinct representations of zero (+0 and -0), but you can avoid that
    >> by using an unsigned type. But unsigned types *can* have padding
    >> bits, so even if buf==0, you might still have missed a 1 bit.

    >
    > Could you please explain this? If I have calculated (or even know) that the
    > underlying word size is 32 bits and I use an unsigned int to represent
    > this in C, then how do padding bits make any difference? Isn't the
    > variable guaranteed to go from 0 to 2^32-1?


    No, it isn't. The guaranteed range of unsigned int is only 0 to 65535.
    But, for example, an implementation could legally have:
    CHAR_BIT == 8
    sizeof(unsigned int) == 4
    UINT_MAX == 65535
    An unsigned int then has 32 bits: 16 value bits and 16 padding bits.

    For details, see section 6.2.6.2 of the C99 standard.

    [...]

    >> To summarize: portable source-level micro-optimization is hard.
    >>

    >
    > I don't doubt it: but I'm happy for it to be optimized for x86 and
    > "work" for other platforms :-;


    If you make too many assumptions (for example, that unsigned int has
    no padding bits), your code might be optimized for x86 and *not* "work"
    for some other platforms.

    --
    Keith Thompson (The_Other_Keith) <http://www.ghoti.net/~kst>
    San Diego Supercomputer Center <*> <http://users.sdsc.edu/~kst>
    We must do something. This is something. Therefore, we must do this.
     
    Keith Thompson, Mar 13, 2006
    #13
  14. On 2006-03-13, Keith Thompson <> wrote:
    > "Richard G. Riley" <> writes:
    >> On 2006-03-13, Keith Thompson <> wrote:

    > [...]
    >>> It's guaranteed that, for any integer type, all-bits-zero is a
    >>> representation of the value 0. (Neither the C90 nor the C99 standard
    >>> says this, but n1124 does; this change was made in TC2 in response to
    >>> DR #263. I've never heard of an implementation that doesn't have this
    >>> property anyway, and I'd be comfortable depending on it.) For
    >>> one's-complement and sign-magnitude representations, there are two
    >>> distinct representations of zero (+0 and -0), but you can avoid that
    >>> by using an unsigned type. But unsigned types *can* have padding
    >>> bits, so even if buf==0, you might still have missed a 1 bit.

    >>
    >> Could you please explain this? If I have calculated (or even know) that the
    >> underlying word size is 32 bits and I use an unsigned int to represent
    >> this in C, then how do padding bits make any difference? Isn't the
    >> variable guaranteed to go from 0 to 2^32-1?

    >
    > No, it isn't. The guaranteed range of unsigned int is only 0 to 65535.
    > But, for example, an implementation could legally have:
    > CHAR_BIT == 8
    > sizeof(unsigned int) == 4
    > UINT_MAX == 65535
    > An unsigned int then has 32 bits: 16 value bits and 16 padding bits.
    >
    > For details, see section 6.2.6.2 of the C99 standard.


    Wow. That would break a lot of code. Assuming some situation like
    that, is there a constant "MAX BITS FOR ULONG" or "MAX WORD SIZE IN BITS"?
    E.g., to get a platform-independent assurance of the number of bits one can
    use in a single HW word? Having said that, in my sample code in the
    initial reply to Eric, I would only need to recalculate my "start
    mask" from 0x80000000 to "whatever" when I calculate "usable bits per
    HW word" at program init. Since the overhead of doing that calc is
    negligible compared to the other computation, I could do that.
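
    Something like this at init, perhaps (a sketch; derived from
    ULONG_MAX so that padding bits don't matter):

    #include <limits.h>

    /* Highest *value* bit of unsigned long: for ULONG_MAX == 2^N - 1
       this yields 2^(N-1), whatever padding the type carries. */
    unsigned long start_mask(void)
    {
        unsigned long m = ULONG_MAX;
        return m - (m >> 1);
    }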

    >
    > [...]
    >
    >>> To summarize: portable source-level micro-optimization is hard.
    >>>

    >>
    >> I don't doubt it: but I'm happy for it to be optimized for x86 and
    >> "work" for other platforms :-;

    >
    > If you make too many assumptions (for example, that unsigned int has
    > no padding bits), your code might be optimized for x86 and *not* "work"
    > for some other platforms.
    >
     
    Richard G. Riley, Mar 13, 2006
    #14
  15. On 2006-03-13, Keith Thompson <> wrote:
    >>> To summarize: portable source-level micro-optimization is hard.
    >>>

    >>
    >> I don't doubt it: but I'm happy for it to be optimized for x86 and
    >> "work" for other platforms :-;

    >
    > If you make too many assumptions (for example, that unsigned int has
    > no padding bits), your code might be optimized for x86 and *not* "work"
    > for some other platforms.


    That's not to say that portable source-level optimization isn't done.

    There is a macro in stdio.h on freebsd that is optimized for pcc on VAX.

    Source-level optimization only requires non-universal assumptions about
    what's fast, not about what works.
     
    Jordan Abel, Mar 13, 2006
    #15
  16. Jordan Abel <> writes:
    > On 2006-03-13, Keith Thompson <> wrote:
    >>>> To summarize: portable source-level micro-optimization is hard.
    >>>
    >>> I don't doubt it: but I'm happy for it to be optimized for x86 and
    >>> "work" for other platforms :-;

    >>
    >> If you make too many assumptions (for example, that unsigned int has
    >> no padding bits), your code might be optimized for x86 and *not* "work"
    >> for some other platforms.

    >
    > That's not to say that portable source-level optimization isn't done.
    >
    > There is a macro in stdio.h on freebsd that is optimized for pcc on VAX.
    >
    > Source-level optimization only requires non-universal assumptions about
    > what's fast, not about what works.


    I think you missed half of my point.

    *Correctly portable* source-level optimization only requires
    non-universal assumptions about what's fast, but it's entirely
    possible (and too common) to do source-level optimization that makes
    the code faster on some platforms, and incorrect on others.

    I discussed a specific instance of this upthread: the assumption that
    a particular type has no padding bits.

    It's also possible, with a little extra work, to write code that
    *tests* this assumption, and falls back to a more portable algorithm
    if the test fails; this is useful only if you get everything right.

    --
    Keith Thompson (The_Other_Keith) <http://www.ghoti.net/~kst>
    San Diego Supercomputer Center <*> <http://users.sdsc.edu/~kst>
    We must do something. This is something. Therefore, we must do this.
     
    Keith Thompson, Mar 13, 2006
    #16
  17. "Richard G. Riley" <> wrote in news:isjge3-9u8.ln1
    @cern.frontroom:

    > On 2006-03-13, Eric Sosman <> wrote:
    >> [...]

    >
    > Thanks for those references: all very valid if designing a large
    > system.


    All very valid for any type of system.

    > I am designing/implementing a very low-level system with fairly
    > specific requirements,


    We do not know what those specific requirements, constraints, and
    operating parameters are (they would also be off-topic here, I am assuming).

    > and a priori


    I think you are missing the point that what will work best in that
    narrow environment cannot be known *a priori* (see
    http://wordnet.princeton.edu/perl/webwn?s=a priori), only
    *a posteriori*:

    a priori (adv): derived by logic, without observed facts
    a posteriori (adv) [opposed to: a priori]: derived from observed facts

    Sinan
    --
    A. Sinan Unur <>
    (remove .invalid and reverse each component for email address)

    comp.lang.perl.misc guidelines on the WWW:
    http://mail.augustmail.com/~tadmc/clpmisc/clpmisc_guidelines.html
     
    A. Sinan Unur, Mar 14, 2006
    #17
  18. On 2006-03-14, A. Sinan Unur <> wrote:
    > "Richard G. Riley" <> wrote in news:isjge3-9u8.ln1
    > @cern.frontroom:
    >
    >> On 2006-03-13, Eric Sosman <> wrote:
    >>> [...]

    >>
    >> Thanks for those references: all very valid if designing a large
    >> system.

    >
    > All very valid for any type of system.


    Possibly. I know, however, that if I have a known
    bottleneck/limitation at a certain part of a system then
    I'd better take that into account at an early stage or it's going to
    be hell later. This could be an API to a graphics HW processor, for
    instance. We can all throw big-shot Laws at things: but often we
    choose C as a solution because it does allow us to get down and
    dirty. At the end of the day I know the format of my data: what goes
    around it or provides it is immaterial to the topic in hand, i.e.
    methods for fast bitstream scanning and manipulation in C. The other
    posts in the thread, bar one, have been constructive.

    >
    >> I am designing/implementing a very low-level system with fairly
    >> specific requirements,

    >
    > We do not know what those specific requirements, constraints, and
    > operating parameters are (they would also be off-topic here, I am
    > assuming).


    What don't you know? I want fast ways in C to handle bit data. I have some
    answers. C is on-topic here. I specifically said I want to keep it all
    in C.

    >
    >> and a priori

    >
    > I think you are missing the point that what will work best in that
    > narrow environment cannot be known *a priori* (see


    Of course it can: and precisely because of that environment. A
    smaller environment makes it easier to make such a statement:

    "relating to or derived by reasoning from self-evident propositions"

    I will concur that "self-evident" to me might not match that which is
    self-evident to you. I will also agree that I might have got the use
    of that word arseways, but since no one else has corrected it yet I'll
    assume I have some breathing space :-;

    Remember, we're not inventing a square circle here: there's no big
    mission. I simply want anyone with any experience of doing hardcore
    bit-field manipulation to give me some pointers. And they have.


    [snip]
     
    Richard G. Riley, Mar 14, 2006
    #18
  19. Richard G. Riley wrote:
    > On 2006-03-13, Keith Thompson <> wrote:
    >> [...]

    >
    > Wow. That would break a lot of code. Assuming some situation like
    > that, is there a constant "MAX BITS FOR ULONG" or "MAX WORD SIZE IN BITS"?
    > E.g., to get a platform-independent assurance of the number of bits one can
    > use in a single HW word? Having said that, in my sample code in the
    > initial reply to Eric, I would only need to recalculate my "start
    > mask" from 0x80000000 to "whatever" when I calculate "usable bits per
    > HW word" at program init. Since the overhead of doing that calc is
    > negligible compared to the other computation, I could do that.


    If you intend "fast and portable", then consider doing the following:
    build a "guaranteedly portable" version of your intended-to-be-fast
    stuff. Then start documenting your assumptions for your "fast header":
    add a source file which is not linked with the rest of the code but fails
    compilation (during preprocessing or actual compiling) if one of
    these assumptions is violated.

    Say you have determined that sizeof(int)*CHAR_BIT is 32 bits; then
    0x7FFFFFFF - INT_MAX should be 0. In other words, neither

    char inttestl[0x7FFFFFFF - (long)INT_MAX + 1L];

    nor

    char inttestu[(long)INT_MAX - 0x7FFFFFFF + 1L];

    should fail to compile. Actually, we have typedefs not unlike the
    C99 intN_t in our code, with appropriate maximum and minimum values
    #define'd or otherwise provided, so we test only the provided values
    and fail with

    #if (!defined XX_INT32_MAX) || (XX_INT32_MAX != 2147483647)
    # error Bleargh -- no Int32 type
    #endif

    or something similar.

    There was a thread about preprocessor and compile time checks (or it
    evolved to that) some time ago, starting at
    <43838308$0$20847$-online.net> if my
    repository of potentially useful stuff remembers correctly.


    Cheers
    Michael
    --
    E-Mail: Mine is an /at/ gmx /dot/ de address.
     
    Michael Mair, Mar 14, 2006
    #19
  20. On 2006-03-14, Michael Mair <> wrote:
    > Richard G. Riley wrote:
    >> [...]

    >
    > If you intend "fast and portable", then consider doing the following:

    That is the best. My initial requirements are slightly smaller: fast
    on x86 as I originally specified, and "works elsewhere when it might be
    needed the day hell freezes over" .-; Having said that, I have taken
    Thompson's comments to heart about word padding for some reason: I see
    no real overhead (when keeping everything in C) to ensuring platform
    compatibility at the C level. It will be, I admit, the first code I
    ever wrote where a machine word can potentially hold fewer values than
    its size indicates: unless of course I do come across an unforeseen
    performance hit, in which case bye-bye good practice.

    > Build a "guaranteedly portable" version of your intended to be fast
    > stuff.
    > Then start documenting your assumptions for your "fast header": Add a
    > source file which is not linked with the rest of the code but fails
    > compilation (during preprocessing or actual compiling) if one of
    > these assumptions is violated.
    > Say you have determined that sizeof(int)*CHAR_BIT is 32 Bits, then
    > 0xFFFFFFFF - INT_MAX should be 0. In other words, neither
    > char inttestl[0x7FFFFFFF - (long)INT_MAX + 1L];


    Can't assume that: CHAR_BIT will be 8 or so and sizeof(int) will be 4,
    but KT pointed out that the maximum unsigned int might still be 65535
    due to padding bits. I asked for an easy way to determine "usable bit
    size" but got no reply, so I guess it's just some code to do at
    init. Not too far up on the priority list, I must admit.
     
    Richard G. Riley, Mar 14, 2006
    #20
