Optimizing C


Richard G. Riley

Without resorting to asm chunks, I'm working on a few small routines
which manipulate bitmasks. I'm looking for any guidance on writing C
in a manner which tilts the compiler's hand in, if possible, a
compiler/underlying-processor-independent way : although to be fair I
can't see this stuff running on anything other than x86, but who knows.

I found some ok info here:

http://www.eventhelix.com/RealtimeMantra/Basics/OptimizingCAndCPPCode.htm
http://www.devx.com/amd/Article/21314

But any techniques for doing the standard bit fiddles without
incurring extra, unnecessary CPU bandwidth would be nice. By standard
bit fiddles I mean : shift & rotate, anding, oring, shifting until a
bit is found, etc. The operations will be occurring on sequences of
words and the smallest word will be the standard CPU word size (32
bits). By sequences of words : I might, for example, be locating
the first non-zero bit from the left in a sequence of 64 words.

Any pointers appreciated.
 

Eric Sosman

Richard said:
Without resorting to asm chunks, I'm working on a few small routines
which manipulate bitmasks. I'm looking for any guidance on writing C
in a manner which tilts the compiler's hand in, if possible, a
compiler/underlying-processor-independent way : although to be fair I
can't see this stuff running on anything other than x86, but who knows.

I found some ok info here:

http://www.eventhelix.com/RealtimeMantra/Basics/OptimizingCAndCPPCode.htm
http://www.devx.com/amd/Article/21314

But any techniques for doing the standard bit fiddles without
incurring extra, unnecessary CPU bandwidth would be nice. By standard
bit fiddles I mean : shift & rotate, anding, oring, shifting until a
bit is found, etc. The operations will be occurring on sequences of
words and the smallest word will be the standard CPU word size (32
bits). By sequences of words : I might, for example, be locating
the first non-zero bit from the left in a sequence of 64 words.

First, it's not at all clear what you're trying to speed
up, nor what you have measured and found to be too slow. A
"generic" optimization is often ill-directed; optimizing "on
general principles" is often ill-advised. If you have a specific
operation you'd like to speed up, it'd be a good idea to show
the actual code you're currently using, along with measurements
of its current speed and an estimate of the amount of speedup
you simply MUST achieve to make the code useful.

As to finding the first non-zero bit in an array of 64 ints,
the first thing to do is nail down what you mean by "first," lest
endianness issues pollute the solutions. Having done that, you
should also decide what form the answer should take -- it may be
simpler (or harder) to get a bit mask than a bit index, and you
may benefit by adjusting the rest of the code to use the more
easily obtained answer.

If one-bits are expected to be fairly sparse, you'll probably
want to search through largish "chunks" to find one that isn't
all zeroes, then search within that chunk for the proper bit.
You could do the search an `int' at a time with an ordinary loop,
or you might use memchr() to search byte-by-byte. The memchr()
method might (or might not) take longer for the search, but note
that it leaves a smaller space for the bit-by-bit search to explore.
You'd need to implement both and measure.
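A rough sketch of the word-at-a-time version, assuming a 32-bit
unsigned int with no padding bits and taking "first" to mean the most
significant bit of the earliest non-zero word (names are only
illustrative):

#include <stddef.h>

/* Return the bit index (0 = MSB of word 0) of the first set bit in
   buf[0..n-1], or -1 if every word is zero. */
long first_set_bit(const unsigned int *buf, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++) {
        if (buf[i] != 0) {                 /* cheap word-level test */
            unsigned int word = buf[i];
            unsigned int mask = 0x80000000u;
            long bit = 0;

            while (!(word & mask)) {       /* bit-by-bit within the word */
                mask >>= 1;
                bit++;
            }
            return (long)i * 32 + bit;
        }
    }
    return -1;
}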

If you've used memchr() and have located a non-zero byte, it
might make sense to use a precalculated 256-element table to finish
the job without further "computation." Then again, it might not:
modern CPUs are much faster than memory, and might be able to
complete a multi-step computation in less time than a memory fetch.
More measurements are needed.
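One possible shape for such a table, filled once at start-up and
indexed by the non-zero byte found; "first" is again taken as the most
significant set bit, and whether it actually beats a short computation
is, as above, something to measure:

/* high_bit_index[b] = index (0 = MSB of the byte) of the highest set
   bit in byte b, for b > 0; entry 0 is unused. */
static unsigned char high_bit_index[256];

static void init_high_bit_table(void)
{
    int b, i;

    for (b = 1; b < 256; b++) {
        for (i = 0; i < 8; i++) {
            if (b & (0x80 >> i)) {
                high_bit_index[b] = (unsigned char)i;
                break;
            }
        }
    }
}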

If you've searched int-by-int you could dive directly into
the second search, or you could first use a couple of comparisons
to decide which of the four bytes holds the "first" non-zero and
then proceed as if memchr() had located that byte. Which will be
faster? Measure, measure, measure.
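One way of reading the "couple of comparisons" idea, sketched for a
32-bit word with byte 0 taken as the most significant:

/* Which of the four bytes of a non-zero 32-bit word holds the most
   significant set bit?  Two comparisons, no loop. */
static int high_byte_index(unsigned int w)    /* precondition: w != 0 */
{
    if (w & 0xFFFF0000u)
        return (w & 0xFF000000u) ? 0 : 1;
    else
        return (w & 0x0000FF00u) ? 2 : 3;
}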

If you need a bit index and don't want to use a table, the
obvious test-and-shift-and-count approach is probably best. If
you need a mask as opposed to an index, the `x & (x - 1)' trick
may be faster. Measure, measure, -- is there an echo in here?
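A sketch of the family of tricks involved, all phrased in terms of the
lowest set bit of an unsigned word; which variant pays off, and whether
"lowest" is the bit actually wanted, is another thing to measure and
pin down:

unsigned int clear_lowest(unsigned int x) { return x & (x - 1); }  /* drop lowest set bit    */
unsigned int lowest_mask(unsigned int x)  { return x & (~x + 1); } /* isolate lowest set bit */

/* the obvious test-and-shift-and-count, for when an index is needed */
int lowest_index(unsigned int x)          /* precondition: x != 0 */
{
    int i = 0;

    while (!(x & 1u)) {
        x >>= 1;
        i++;
    }
    return i;
}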

Finally, an impression of the two Web pages you cited. A
quick glance suggests they're filled with well-intentioned and
sometimes reasonable advice, but ALL of it is out of context.
(Necessarily so, of course: the authors of the Web pages have no
idea what kind of program you're trying to write.) Even expert
advice can become nonsense when taken out of context. Your
grandmother's doctor is (presumably) an expert; do you therefore
take the same medicines that are prescribed for your grandmother?
IMHO, it is best to read and understand advice of the sort these
pages offer, but NOT to follow it until and unless you have reason
to believe it applies to your situation.
 

Laurent Deniau

Richard said:
Without resorting to asm chunks, I'm working on a few small routines
which manipulate bitmasks. I'm looking for any guidance on writing C
in a manner which tilts the compiler's hand in, if possible, a
compiler/underlying-processor-independent way : although to be fair I
can't see this stuff running on anything other than x86, but who knows.

I found some ok info here:

http://www.eventhelix.com/RealtimeMantra/Basics/OptimizingCAndCPPCode.htm
http://www.devx.com/amd/Article/21314

But any techniques for doing the standard bit fiddles without
incurring extra, unnecessary CPU bandwidth would be nice. By standard
bit fiddles I mean : shift & rotate, anding, oring, shifting until a
bit is found, etc. The operations will be occurring on sequences of
words and the smallest word will be the standard CPU word size (32
bits). By sequences of words : I might, for example, be locating
the first non-zero bit from the left in a sequence of 64 words.

Any pointers appreciated.

Maybe http://www.jjj.de/bitwizardry/ is a good start.

a+, ld.
 

Richard G. Riley

First, it's not at all clear what you're trying to speed
up, nor what you have measured and found to be too slow. A

Hi Eric,

I'm looking for advice/pointers on the operators mentioned above. It's
not a case of anything being too slow : it's a question of ensuring or
approaching optimum performance as early as possible. I did a lot of
this stuff at assembler level years ago when writing high-performance
VGA libraries, but want, if possible, to leave it all in C now.

From my opening paragraph :

shift & rotate operations : left & right shift operations
anding : bitwise and operations
oring : bitwise or operations

I am writing some small libraries for generating image masks,
transparency masks and edge silhouettes. Some more of the details are below.
"generic" optimization is often ill-directed; optimizing "on
general principles" is often ill-advised. If you have a specific
operation you'd like to speed up, it'd be a good idea to show
the actual code you're currently using, along with measurements
of its current speed and an estimate of the amount of speedup
you simply MUST achieve to make the code useful.

I'm looking for general guidelines : nothing spectacularly
specific. You do give some good guidelines though, some of which had
crossed my mind already, but certainly not all.
As to finding the first non-zero bit in an array of 64 ints,
the first thing to do is nail down what you mean by "first," lest
endianness issues pollute the solutions. Having done that, you

That's in the details : but here we can assume that the sequence is
high bit == leftmost pixel. Also that any colour issues are segregated
into 3 or more colour planes : initial code can assume a monochrome
bitmap.

I don't really think endianness comes into it, since the operations will
be taken care of at the word level. Even if it weren't monochrome,
the most significant X bits would be the leftmost pixel, where X is the
colour depth : in this case the only code issues I'd see are
increasing the bit width of the test mask and the number of shifts
done in order to check the next pixel.

(Just to clarify the endian thing : all work is in memory buffers and
is in no way related to video HW addressing.)
should also decide what form the answer should take -- it may be
simpler (or harder) to get a bit mask than a bit index, and you
may benefit by adjusting the rest of the code to use the more
easily obtained answer.

That's something to think about, OK. Although I'd always considered just
anding and shifting 0x80000000 until a non-zero result was
indicated. The mask can be generated from the bit by subtracting one
(e.g. 0x40000000 - 1 -> 0x3fffffff) and re-adding the bit.

After the leftmost pixel is detected we detect the rightmost, and if we
are creating a mask of words for the rest of the scanline it's then trivial.
If one-bits are expected to be fairly sparse, you'll probably
want to search through largish "chunks" to find one that isn't
all zeroes, then search within that chunk for the proper bit.
You could do the search an `int' at a time with an ordinary loop,
or you might use memchr() to search byte-by-byte. The memchr()
method might (or might not) take longer for the search, but note
that it leaves a smaller space for the bit-by-bit search to explore.
You'd need to implement both and measure.

Yes, check the word for 0 before looking for a set bit : simple enough &
a very practical operation with no real overhead at the time of the word
read. I don't expect the images to be sparse, so I'm not sure it's
worth the overhead of calling a library function to find it. Considering
leftmost pixel detection, this seems fairly optimal to me in C (and
it's this type of thing I'm looking for advice on):

(Early caveat : all the pseudo-C code assumes a 32-bit word, for clarity
and to avoid getting lost in type sizes.)


bitmap = 0;
while (columnsToCheck-- && !(bitmap = *imagePtr++))
    ;
if (bitmap) {
    /* now find leftmost pixel */
}


Should that be optimal enough?
If you've used memchr() and have located a non-zero byte, it

memchr() would certainly not be on the agenda, I think : it needs a
character to compare against, whereas the implicit recognition of "zero"
or "non-zero" is faster. Also, I read in words (properly aligned of
course) and not bytes : bytes would certainly strangle the application.
might make sense to use a precalculated 256-element table to finish
the job without further "computation." Then again, it might not:

Could you expand on that table bit a little : I don't know about this. Is
it something CPU-specific?
modern CPUs are much faster than memory, and might be able to
complete a multi-step computation in less time than a memory fetch.
More measurements are needed.

If you've searched int-by-int you could dive directly into
the second search, or you could first use a couple of comparisons
to decide which of the four bytes holds the "first" non-zero and
then proceed as if memchr() had located that byte. Which will be
faster? Measure, measure, measure.

If you need a bit index and don't want to use a table, the
obvious test-and-shift-and-count approach is probably best. If
you need a mask as opposed to an index, the `x & (x - 1)' trick

Could you expand on that? I would see it as simple as

(find left pixel mask)

/* we know the bitmap is non-zero, so a set bit is guaranteed */
for (bit = 0x80000000; !(bit & bitmap); bit >>= 1)
    ;
/* word bit mask covers all bits from the leftmost pixel on */
leftWordMask = bit | (bit - 1);
may be faster. Measure, measure, -- is there an echo in here?

I've used a few profilers on VAX, Convex and MS platforms : but I will be
using gcov for initial run-time analysis on Linux - though I'm not sure
if it's what I want for real execution-time profiling. But I'm certainly
looking for "C hints" on faster ways with bitmaps first : then I can
decide on a design approach. I think the code above (silly errors
notwithstanding) appears to be relatively quick, and the core "loop" areas
are small enough that I would expect their execution footprint to get cached.
Finally, an impression of the two Web pages you cited. A
quick glance suggests they're filled with well-intentioned and
sometimes reasonable advice, but ALL of it is out of context.

Yes & no : it has some general advice on standard stuff which can help
: alignment, powers of two, etc. It even discusses the use of machine words
at some stage instead of smaller units, which can otherwise cause excessive
overhead, I think. Certainly helpful to anyone embarking on optimizing
anything in C for the first time.
(Necessarily so, of course: the authors of the Web pages have no
idea what kind of program you're trying to write.) Even expert
advice can become nonsense when taken out of context. Your
grandmother's doctor is (presumably) an expert; do you therefore
take the same medicines that are prescribed for your grandmother?
IMHO, it is best to read and understand advice of the sort these
pages offer, but NOT to follow it until and unless you have reason
to believe it applies to your situation.

I appreciate your reply,

thanks for taking the time to do so.
 

James Dow Allen

Richard said:
Without resorting to asm chunks, I'm working on a few small routines
which manipulate bitmasks. I'm looking for any guidance on writing C
in a manner which tilts the compiler's hand in, if possible, a
compiler/underlying-processor-independent way : although to be fair I
can't see this stuff running on anything other than x86, but who knows.

Perhaps useless info if you only run on x86 but...

The Motorola 68020 has special bitfield opcodes which some
C compilers will generate *if* you declare the data as C bitfields.
(One of these special opcodes -- 'bfffo' -- is unusual and can come in
*very* handy sometimes, but I doubt any compiler will use *it*.)

James Dow Allen
 

Eric Sosman

Richard G. Riley wrote on 03/13/06 10:43:

First, it's not at all clear what you're trying to speed
up, nor what you have measured and found to be too slow. A


Hi Eric,

I'm looking for advice/pointers on the operators mentioned above. It's
not a case of anything being too slow : it's a question of ensuring or
approaching optimum performance as early as possible. [...]

D.E. Knuth (elaborating on a remark by Hoare): "We should
forget about small efficiencies, say about 97% of the time:
premature optimization is the root of all evil."

W.A. Wulf: "More computing sins are committed in the name
of efficiency (without necessarily achieving it) than for any
other single reason, including blind stupidity."

Jackson's Laws of Program Optimization:
FIRST LAW: Don't do it.
SECOND LAW (for experts only): Don't do it yet.

See also <http://www.flounder.com/optimization.htm>.

... and that's all I'll offer on the matter. Until you
have made measurements or done calculations that demonstrate
a performance problem, "approaching optimum performance as
early as possible" is folly. If you wish to behave foolishly,
go ahead -- but do it on your own; I'm not going to assist
you to your ruin.
 

Richard G. Riley

D.E. Knuth (elaborating on a remark by Hoare): "We should
forget about small efficiencies, say about 97% of the time:
premature optimization is the root of all evil."

W.A. Wulf: "More computing sins are committed in the name
of efficiency (without necessarily achieving it) than for any
other single reason, including blind stupidity."

Jackson's Laws of Program Optimization:
FIRST LAW: Don't do it.
SECOND LAW (for experts only): Don't do it yet.

See also <http://www.flounder.com/optimization.htm>.

... and that's all I'll offer on the matter. Until you
have made measurements or done calculations that demonstrate
a performance problem, "approaching optimum performance as
early as possible" is folly. If you wish to behave foolishly,
go ahead -- but do it on your own; I'm not going to assist
you to your ruin.

Thanks for those references : all very valid if designing a large
system. I am designing/implementing a very low-level system with fairly
specific requirements : and a priori knowledge of what can be achieved
at the earliest stages, to guide the design to embrace known optimal
techniques, is very much relevant. The key here is "known optimal
techniques" : if there are such techniques already worked out (as the
link another poster gave recommends) then it would be foolish to dismiss
them out of hand.
 

Mark F. Haigh

Richard said:
Without resorting to asm chunks, I'm working on a few small routines
which manipulate bitmasks. I'm looking for any guidance on writing C
in a manner which tilts the compiler's hand in, if possible, a
compiler/underlying-processor-independent way : although to be fair I
can't see this stuff running on anything other than x86, but who knows.

I found some ok info here:

http://www.eventhelix.com/RealtimeMantra/Basics/OptimizingCAndCPPCode.htm
http://www.devx.com/amd/Article/21314

But any techniques for doing the standard bit fiddles without
incurring extra, unnecessary CPU bandwidth would be nice. By standard
bit fiddles I mean : shift & rotate, anding, oring, shifting until a
bit is found, etc. The operations will be occurring on sequences of
words and the smallest word will be the standard CPU word size (32
bits). By sequences of words : I might, for example, be locating
the first non-zero bit from the left in a sequence of 64 words.

Any pointers appreciated.

Isolate the bit scan routine into its own function, and write it in C
as simply as possible. Then special-case the function for x86/x86-64
using the appropriate bit scan machine instruction (bsf, bsr). Make
sure the function actually gets inlined.

This is really OT, but I mention it because I've yet to see a compiler
that actually generates bit scan instructions, and the speedup from
using them is around 10x (in my experience). Bit scanning is heavily
used in some areas of networking, particularly in packet
classification.

However, this is probably the _only_ thing you should code in assembly,
as bit twiddling C code is normally compiled to very tight assembly.
In fact, it's most often faster than non-expert hand-written assembly,
simply because the compiler knows more about the minutiae of the machine
architecture and instruction scheduling.
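A rough sketch of that shape: a plain C fallback with a
compiler-specific special case. The GCC-style builtin below is an
assumption about the toolchain rather than anything guaranteed, and
bit 31 is taken as the "leftmost" bit:

/* Index of the most significant set bit of a non-zero 32-bit word
   (31 = MSB ... 0 = LSB). */
static int bsr32(unsigned int x)              /* precondition: x != 0 */
{
#if defined(__GNUC__)
    return 31 - __builtin_clz(x);             /* typically becomes a bsr/lzcnt */
#else
    int i = 31;

    while (!(x & 0x80000000u)) {
        x <<= 1;
        i--;
    }
    return i;
#endif
}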


Mark F. Haigh
(e-mail address removed)
 

Richard G. Riley

Isolate the bit scan routine into its own function, and write it in C
as simply as possible. Then special-case the function for x86/x86-64
using the appropriate bit scan machine instruction (bsf, bsr). Make
sure the function actually gets inlined.

This is really OT, but I mention it because I've yet to see a compiler
that actually generates bit scan instructions, and the speedup from
using them is around 10x (in my experience). Bit scanning is heavily
used in some areas of networking, particularly in packet
classification.

However, this is probably the _only_ thing you should code in assembly,
as bit twiddling C code is normally compiled to very tight assembly.
In fact, it's most often faster than non-expert hand-written assembly,
simply because the compiler knows more about the minutiae of the machine
architecture and instruction scheduling.

Thanks Mark : I'll take a look and might try that at a later date.
I'm "off" assembler at the moment, so I'll put that on the back
burner until the "optimized C" parts are in place - although to be
honest there's not a lot there other than good common-sense
programming. The link another poster gave is interesting too:

http://www.jjj.de/bitwizardry/

But I've just located an x86 reference PDF and they are certainly the
instructions for boundary bit checks.
Mark F. Haigh
(e-mail address removed)

cheers,

richard.
 

Keith Thompson

Eric Sosman said:
If one-bits are expected to be fairly sparse, you'll probably
want to search through largish "chunks" to find one that isn't
all zeroes, then search within that chunk for the proper bit.
You could do the search an `int' at a time with an ordinary loop,
or you might use memchr() to search byte-by-byte. The memchr()
method might (or might not) take longer for the search, but note
that it leaves a smaller space for the bit-by-bit search to explore.
You'd need to implement both and measure.
[...]

I don't think memchr() would be useful. It scans for a byte equal to
a specified value; you want to scan for bytes *unequal* to 0.

If you're searching for a 1 bit in a chunk of memory, checking first
whether each word (where "word" probably means something like an
unsigned int or unsigned long) is equal to 0 seems like an obvious
optimization. But if you care about absolute portability (i.e., code
that will work with any possible conforming C implementation), it can
get you into trouble.

It's guaranteed that, for any integer type, all-bits-zero is a
representation of the value 0. (Neither the C90 nor the C99 standard
says this, but n1124 does; this change was made in TC2 in response to
DR #263. I've never heard of an implementation that doesn't have this
property anyway, and I'd be comfortable depending on it.) For
one's-complement and sign-magnitude representations, there are two
distinct representations of zero (+0 and -0), but you can avoid that
by using an unsigned type. But unsigned types *can* have padding
bits, so even if buf==0, you might still have missed a 1 bit.

unsigned char can't have padding bits, and every object can be viewed
as an array of unsigned char, so you can safely compare each byte to 0
to find the first 1 bit. But this is likely to be less efficient than
scanning by words.

You can scan by words *if* the unsigned type you're using has no
padding bits. You can test this automatically: for example, compare
the value bits of ULONG_MAX against CHAR_BIT * sizeof(unsigned long);
if the range takes up all the bits, there can be no padding bits. If
this test fails, you have to fall back to some other algorithm --
which means you have two different algorithms to test.
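A sketch of one such run-time test, assuming nothing beyond <limits.h>;
the function name is illustrative:

#include <limits.h>

/* Nonzero if unsigned long has padding bits: compare the number of
   value bits (counted from ULONG_MAX) with the object size in bits. */
static int ulong_has_padding_bits(void)
{
    unsigned long max = ULONG_MAX;
    unsigned int value_bits = 0;

    while (max != 0) {
        value_bits++;
        max >>= 1;
    }
    return value_bits != CHAR_BIT * sizeof(unsigned long);
}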

You also have to account for alignment issues (unless you use unsigned
char).

If you don't care about absolute portability, you can get away with
making whatever assumptions you can get away with. I suggest
documenting any assumptions you make that aren't guaranteed by the
standard. If you can test your assumptions, either during compilation
or at run time, you can fall back to a "#error" directive or use
assert(), so code will fail early and cleanly if your assumptions are
violated.

To summarize: portable source-level micro-optimization is hard.
 

Richard G. Riley

Eric Sosman said:
If one-bits are expected to be fairly sparse, you'll probably
want to search through largish "chunks" to find one that isn't
all zeroes, then search within that chunk for the proper bit.
You could do the search an `int' at a time with an ordinary loop,
or you might use memchr() to search byte-by-byte. The memchr()
method might (or might not) take longer for the search, but note
that it leaves a smaller space for the bit-by-bit search to explore.
You'd need to implement both and measure.
[...]

I don't think memchr() would be useful. It scans for a byte equal to
a specified value; you want to scan for bytes *unequal* to 0.

I don't either : see my other reply. But to summarise : it uses bytes,
which is plain silly, and it requires a per-character comparison, which
is not required when checking for "non-zero".
If you're searching for a 1 bit in a chunk of memory, checking first
whether each word (where "word" probably means something like an
unsigned int or unsigned long) is equal to 0 seems like an obvious
optimization. But if you care about absolute portability (i.e., code
that will work with any possible conforming C implementation), it can
get you into trouble.

It's guaranteed that, for any integer type, all-bits-zero is a
representation of the value 0. (Neither the C90 nor the C99 standard
says this, but n1124 does; this change was made in TC2 in response to
DR #263. I've never heard of an implementation that doesn't have this
property anyway, and I'd be comfortable depending on it.) For
one's-complement and sign-magnitude representations, there are two
distinct representations of zero (+0 and -0), but you can avoid that
by using an unsigned type. But unsigned types *can* have padding
bits, so even if buf==0, you might still have missed a 1 bit.


Could you please explain this? If I have calculated (or even know...) that the
underlying word size is 32 bits and I use an unsigned int to represent
this in C, then how do padding bits make any difference? Isn't the
variable guaranteed to go from 0 to 2^32-1?

Keeping in mind that it is 32 bit "image data" put into 32 bit memory
storage (array of unsigned ints).

I think the term "padding bits" is confusing me here. It's not another
one of these horrible things that means, for truly portable code, we
can't assume that a 4-byte word actually takes 4 bytes in memory,
is it? And since I'm accessing that data via an "unsigned long" array,
would it make any difference to the code?

(I think the example code in my other reply was something like ...)
...
unsigned int *refImageData = &arrImageDataScanLine[0];
...

while (columnsToCheck-- && !(imageWord = *refImageData++))
    ;


This is surely portable? Slap me down if it's not : that I can take.
unsigned char can't have padding bits, and every object can be viewed
as an array of unsigned char, so you can safely compare each byte to 0
to find the first 1 bit. But this is likely to be less efficient than
scanning by words.

You can scan by words *if* the unsigned type you're using has no
padding bits. You can test this automatically: for example, compare
the value bits of ULONG_MAX against CHAR_BIT * sizeof(unsigned long);
if the range takes up all the bits, there can be no padding bits. If
this test fails, you have to fall back to some other algorithm --
which means you have two different algorithms to test.

You also have to account for alignment issues (unless you use unsigned
char).

Assume it's guaranteed to align to natural word alignment. Otherwise it
would kill the compiler-produced optimisations triggered by using the
natural word-size data type. This type of thing was covered well, IMO,
in the two initial links I gave in the OP.
If you don't care about absolute portability, you can get away with
making whatever assumptions you can get away with. I suggest
documenting any assumptions you make that aren't guaranteed by the
standard. If you can test your assumptions, either during compilation
or at run time, you can fall back to a "#error" directive or use
assert(), so code will fail early and cleanly if your assumptions are
violated.

To summarize: portable source-level micro-optimization is hard.

I don't doubt it : but I'm happy for it to be optimized for x86 and
"work" for other platforms :-;
 

Keith Thompson

Richard G. Riley said:
It's guaranteed that, for any integer type, all-bits-zero is a
representation of the value 0. (Neither the C90 nor the C99 standard
says this, but n1124 does; this change was made in TC2 in response to
DR #263. I've never heard of an implementation that doesn't have this
property anyway, and I'd be comfortable depending on it.) For
one's-complement and sign-magnitude representations, there are two
distinct representations of zero (+0 and -0), but you can avoid that
by using an unsigned type. But unsigned types *can* have padding
bits, so even if buf==0, you might still have missed a 1 bit.


Could you please explain this? If I have calculated (or even know...) that the
underlying word size is 32 bits and I use an unsigned int to represent
this in C, then how do padding bits make any difference? Isn't the
variable guaranteed to go from 0 to 2^32-1?


No, it isn't. The guaranteed range of unsigned int is only 0 to 65535.
But, for example, an implementation could legally have:
CHAR_BIT == 8
sizeof(unsigned int) == 4
INT_MAX == 32767
An unsigned int then has 32 bits: 16 value bits and 16 padding bits.

For details, see section 6.2.6.2 of the C99 standard.

[...]
I don't doubt it : but I'm happy for it to be optimized for x86 and
"work" for other platforms :-;

If you make too many assumptions (for example, that unsigned int has
no padding bits), your code might be optimized for x86 and *not* "work"
for some other platforms.
 
R

Richard G. Riley

Richard G. Riley said:
It's guaranteed that, for any integer type, all-bits-zero is a
representation of the value 0. (Neither the C90 nor the C99 standard
says this, but n1124 does; this change was made in TC2 in response to
DR #263. I've never heard of an implementation that doesn't have this
property anyway, and I'd be comfortable depending on it.) For
one's-complement and sign-magnitude representations, there are two
distinct representations of zero (+0 and -0), but you can avoid that
by using an unsigned type. But unsigned types *can* have padding
bits, so even if buf==0, you might still have missed a 1 bit.


Could you please explain this? If I have calculated (or even know...) that the
underlying word size is 32 bits and I use an unsigned int to represent
this in C, then how do padding bits make any difference? Isn't the
variable guaranteed to go from 0 to 2^32-1?


No, it isn't. The guaranteed range of unsigned int is only 0 to 65535.
But, for example, an implementation could legally have:
CHAR_BIT == 8
sizeof(unsigned int) == 4
INT_MAX == 32767
An unsigned int then has 32 bits: 16 value bits and 16 padding bits.

For details, see section 6.2.6.2 of the C99 standard.


Wow. That would break a lot of code. Assuming some situation like
that, is there a constant "MAX BITS FOR ULONG" or "MAX WORD SIZE IN BITS"? E.g.
to get a platform-independent assurance of the number of bits one can
use in a single HW word? Having said that, in my sample code in the
initial reply to Eric, I would only need to recalculate my "start
mask" from 0x80000000 to "whatever" when I calculate "usable bits per
HW word" at program init. Since the overhead of doing that calc is
essentially zero compared to the other computation, I could do that.
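Something like this sketch is what I have in mind for the init-time
calculation (names illustrative, nothing beyond <limits.h> assumed):

#include <limits.h>

static unsigned int start_mask;    /* replaces the hard-coded 0x80000000 */
static unsigned int usable_bits;   /* value bits per unsigned int        */

static void init_bit_params(void)
{
    unsigned int max = UINT_MAX;

    usable_bits = 0;
    while (max != 0) {             /* count the value bits */
        usable_bits++;
        max >>= 1;
    }
    start_mask = 1u << (usable_bits - 1);
}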
[...]
I don't doubt it : but I'm happy for it to be optimized for x86 and
"work" for other platforms :-;

If you make too many assumptions (for example, that unsigned int has
no padding bits), your code might be optimized for x86 and *not* "work"
for some other platforms.
 

Jordan Abel

If you make too many assumptions (for example, that unsigned int has
no padding bits), your code might be optimized for x86 and *not* "work"
for some other platforms.

That's not to say that portable source-level optimization isn't done.

There is a macro in stdio.h on freebsd that is optimized for pcc on VAX.

Source-level optimization only requires non-universal assumptions about
what's fast, not about what works.
 

Keith Thompson

Jordan Abel said:
That's not to say that portable source-level optimization isn't done.

There is a macro in stdio.h on freebsd that is optimized for pcc on VAX.

Source-level optimization only requires non-universal assumptions about
what's fast, not about what works.

I think you missed half of my point.

*Correctly portable* source-level optimization only requires
non-universal assumptions about what's fast, but it's entirely
possible (and too common) to do source-level optimization that makes
the code faster on some platforms, and incorrect on others.

I discussed a specific instance of this upthread: the assumption that
a particular type has no padding bits.

It's also possible, with a little extra work, to write code that
*tests* this assumption, and falls back to a more portable algorithm
if the test fails; this is useful only if you get everything right.
 

A. Sinan Unur

Thanks for those references : all very valid if designing a large
system.

All very valid for any type of system.
I am designing/implementing a very low-level system with fairly
specific requirements :

We do not know what those specific requirements, constraints, and
operating parameters are (they would also be off-topic here, I am assuming).
and a priori

I think you are missing the point that what will work best in that
narrow environment cannot be known *a priori* (see
http://wordnet.princeton.edu/perl/webwn?s=a priori), only
*a posteriori*:

Adverb

* a priori (derived by logic, without observed facts)
  antonym: a posteriori [opposed to: a priori] (derived from observed facts)

Sinan
 

Richard G. Riley

All very valid for any type of system.

Possibly. I know, however, that if I have a known
bottleneck/limitation at a certain part of a system, then
I'd better take that into account at an early stage or it's going to
be hell later. This could be an API to a graphics HW processor, for
instance. We can all throw big-shot Laws at things : but often we
choose C as a solution because it does allow us to get down and
dirty. At the end of the day I know the format of my data : what goes
around it or provides it is immaterial to the topic in hand : i.e.
methods for fast bitstream scanning and manipulation in C. The other
posts in the thread, bar one, have been constructive.
We do not know what those specific requirements, constraints, and
operating parameters are (they would also be off-topic here, I am
assuming).

What don't you know? I want fast ways in C to handle bit data. I have some
answers. C is on-topic here. I specifically said I want to keep it all
in C.
I think you are missing the point that what will work best in that
narrow environment cannot be known *a priori* (see

Of course it can : and precisely because of that environment. A
smaller environment makes it easier to make such a statement :

" relating to or derived by reasoning from self-evident propositions"

I will concur that "self-evident" to me might not match that which is
self-evident to you. I will also agree that I might have got the use
of that word arseways, but since no one else has corrected it yet I'll
assume I have some breathing space :-;

Remember, we're not inventing a square circle here : there's no big
mission. I simply want anyone with any experience of doing hardcore
bit-field manipulation to give me some pointers. And they have.


*snip
 

Michael Mair

Richard said:
Richard G. Riley said:
[...]

It's guaranteed that, for any integer type, all-bits-zero is a
representation of the value 0. (Neither the C90 nor the C99 standard
says this, but n1124 does; this change was made in TC2 in response to
DR #263. I've never heard of an implementation that doesn't have this
property anyway, and I'd be comfortable depending on it.) For
one's-complement and sign-magnitude representations, there are two
distinct representations of zero (+0 and -0), but you can avoid that
by using an unsigned type. But unsigned types *can* have padding
bits, so even if buf==0, you might still have missed a 1 bit.

Could you please explain this? If I have calculated (or even know...) that the
underlying word size is 32 bits and I use an unsigned int to represent
this in C, then how do padding bits make any difference? Isn't the
variable guaranteed to go from 0 to 2^32-1?


No, it isn't. The guaranteed range of unsigned int is only 0 to 65535.
But, for example, an implementation could legally have:
CHAR_BIT == 8
sizeof(unsigned int) == 4
INT_MAX == 32767
An unsigned int then has 32 bits: 16 value bits and 16 padding bits.

For details, see section 6.2.6.2 of the C99 standard.


Wow. That would break a lot of code. Assuming some situation like
that, is there a constant "MAX BITS FOR ULONG" or "MAX WORD SIZE IN BITS"? E.g.
to get a platform-independent assurance of the number of bits one can
use in a single HW word? Having said that, in my sample code in the
initial reply to Eric, I would only need to recalculate my "start
mask" from 0x80000000 to "whatever" when I calculate "usable bits per
HW word" at program init. Since the overhead of doing that calc is
essentially zero compared to the other computation, I could do that.


If you intend "fast and portable", then consider doing the following:
Build a "guaranteedly portable" version of your intended to be fast
stuff.
Then start documenting your assumptions for your "fast header": Add a
source file which is not linked with the rest of the code but fails
compilation (during preprocessing or actual compiling) if one of
these assumptions is violated.
Say you have determined that sizeof(int)*CHAR_BIT is 32 bits; if there
are no padding bits, then 0x7FFFFFFF - INT_MAX should be 0. In other
words, neither
char inttestl[0x7FFFFFFF - (long)INT_MAX + 1L];
nor
char inttestu[(long)INT_MAX - 0x7FFFFFFF + 1L];
should fail to compile; actually, we have typedefs not unlike the
C99 intN_t in our code, with appropriate maximum and minimum values
#define'd or otherwise provided, so we test only the provided values
and fail with
#if (!defined XX_INT32_MAX) || (XX_INT32_MAX != 2147483647)
# error Bleargh -- no Int32 type
#endif
or something similar.
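Folded into one such unlinked "assumption file", the checks might look
like this sketch; the macro name and the particular assumptions are
illustrative, and the negative-array-size typedef is one common way of
forcing a compile-time failure:

#include <limits.h>

/* Compiles only if expr is nonzero: a negative array size is a
   constraint violation, so a false assumption stops the build. */
#define ASSUMPTION(name, expr) typedef char assume_##name[(expr) ? 1 : -1]

ASSUMPTION(char_is_8_bits,         CHAR_BIT == 8);
ASSUMPTION(uint_is_4_bytes,        sizeof(unsigned int) == 4);
ASSUMPTION(uint_has_32_value_bits, UINT_MAX == 0xFFFFFFFF);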

There was a thread about preprocessor and compile time checks (or it
evolved to that) some time ago, starting at
<[email protected]> if my
repository of potentially useful stuff remembers correctly.


Cheers
Michael
 

Richard G. Riley

Richard said:
[...]

It's guaranteed that, for any integer type, all-bits-zero is a
representation of the value 0. (Neither the C90 nor the C99 standard
says this, but n1124 does; this change was made in TC2 in response to
DR #263. I've never heard of an implementation that doesn't have this
property anyway, and I'd be comfortable depending on it.) For
one's-complement and sign-magnitude representations, there are two
distinct representations of zero (+0 and -0), but you can avoid that
by using an unsigned type. But unsigned types *can* have padding
bits, so even if buf==0, you might still have missed a 1 bit.

Could you please explain this? If I have calculated (or even know...) that the
underlying word size is 32 bits and I use an unsigned int to represent
this in C, then how do padding bits make any difference? Isn't the
variable guaranteed to go from 0 to 2^32-1?

No, it isn't. The guaranteed range of unsigned int is only 0 to 65535.
But, for example, an implementation could legally have:
CHAR_BIT == 8
sizeof(unsigned int) == 4
INT_MAX == 32767
An unsigned int then has 32 bits: 16 value bits and 16 padding bits.

For details, see section 6.2.6.2 of the C99 standard.


Wow. That would break a lot of code. Assuming some situation like
that, is there a constant "MAX BITS FOR ULONG" or "MAX WORD SIZE IN BITS"? E.g.
to get a platform-independent assurance of the number of bits one can
use in a single HW word? Having said that, in my sample code in the
initial reply to Eric, I would only need to recalculate my "start
mask" from 0x80000000 to "whatever" when I calculate "usable bits per
HW word" at program init. Since the overhead of doing that calc is
essentially zero compared to the other computation, I could do that.


If you intend "fast and portable", then consider doing the

following:

That is the best. My initial requirements are slightly smaller : fast
on x86 as I originally specified, and "works elsewhere when it might be
needed the day hell freezes over" .-; Having said that, I have taken
Thompson's comments about word padding to heart for some reason : I see
no real overhead (when keeping everything in C) to ensuring platform
compatibility at the C level. It will be, I admit, the first code I
ever wrote where a machine word can potentially hold fewer values than
its size indicates : unless of course I do come across an unforeseen
performance hit, in which case bye-bye good practice.
Build a "guaranteedly portable" version of your intended to be fast
stuff.
Then start documenting your assumptions for your "fast header": Add a
source file which is not linked with the rest of the code but fails
compilation (during preprocessing or actual compiling) if one of
these assumptions is violated.
Say you have determined that sizeof(int)*CHAR_BIT is 32 bits; if there
are no padding bits, then 0x7FFFFFFF - INT_MAX should be 0. In other
words, neither
char inttestl[0x7FFFFFFF - (long)INT_MAX + 1L];

Can't assume that : CHAR_BIT will be 8 or so and sizeof(int) will be 4,
but KT pointed out that the maximum unsigned int might still be 65535
due to padding bits. I asked for an easy way to determine "usable bit
size" but got no reply, so I guess it's just some code to do at
init. Not too far up on the priority list, I must admit, though.
 
