David Brown said:
Here is one way of being sure of signed shifts that I
believe will be correct (at least on "normal" processors with
two's complement arithmetic, and support for the given integer
sizes):
#include <stdint.h>

// Note that when pre-processing, the same rules should apply as
// during runtime for the target
#if (-1 >> 2) < 0
// Right-shift duplicates the sign bit ("arithmetic right shift")
static inline int32_t sign_extending_right_shift(int32_t x) {
    return x >> 1;
}
static inline int32_t zero_extending_right_shift(int32_t x) {
    // Shift as unsigned, so the vacated bit is guaranteed to be zero
    return (int32_t) (((uint32_t) x) >> 1);
}
#else
// Right-shift zeros the sign bit ("logical right shift")
static inline int32_t sign_extending_right_shift(int32_t x) {
    // (x - 1) / 2 matches an arithmetic shift (rounding towards
    // negative infinity) for negative x
    if (x < 0) return (x - 1) / 2;
    return x >> 1;
}
static inline int32_t zero_extending_right_shift(int32_t x) {
    // On this branch the plain shift already zero-fills
    return x >> 1;
}
#endif
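As a quick sanity check of these helpers (assuming the definitions above
and a 32-bit int32_t - the values are only illustrative):

#include <assert.h>

int main(void) {
    // Arithmetic shift of -5 gives -3 (rounding towards negative infinity)
    assert(sign_extending_right_shift(-5) == -3);
    // Zero-extending shift of -5 treats the bits as 0xfffffffb,
    // giving 0x7ffffffd (2147483645)
    assert(zero_extending_right_shift(-5) == 2147483645);
    // Non-negative values behave the same either way
    assert(sign_extending_right_shift(6) == 3);
    assert(zero_extending_right_shift(6) == 3);
    return 0;
}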
Better (using 'int' rather than the platform-dependent 'int32_t'):
When you are dealing with shifts and other bit manipulation, you
almost certainly know the sizes of the data you are working with.
[snip elaboration re int vs int32_t]
By focusing on this incidental distinction you are missing
the point. The examples I gave apply to any integer type
(adjusting various declarations appropriately), including
int32_t. However, because the point is about writing code
that will work on all platforms, including those where
int32_t cannot be provided, the examples use int instead
of int32_t. How to make these available with explicit
widths is a separate discussion - possibly an important
discussion, but nevertheless a separate discussion.
The versions I gave also work for all integer types in the same way.
When I write C, I aim for the code to work on different platforms
(unless the code is directly accessing target-specific details, which is
not uncommon for my embedded code). I try to write in a way that will
work as expected on realistic platforms, and will produce compile-time
errors if any of my assumptions do not hold. So if I want to work with
32-bit values, I use the platform-independent type "int32_t" (or perhaps
"int_least32_t", depending on the circumstances). I rarely use plain
"int", precisely because my code often needs to be portable across
different platforms. ("int" is usually synonymous with "int_fast16_t"
as far as I am concerned.)
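To give a minimal sketch of what I mean (the particular check and the
function name here are only illustrative):

#include <stdint.h>

// A compile-time check of one of my assumptions: right-shifting a
// negative value duplicates the sign bit.  (As with the earlier #if,
// this relies on the preprocessor following the target's rules.)
#if (-1 >> 1) != -1
#error "This code assumes arithmetic right shift of signed values"
#endif

// int_least32_t is guaranteed to exist on every conforming
// implementation, and to be at least 32 bits wide.
static inline int_least32_t halve_floor(int_least32_t x) {
    return x >> 1;   // floor(x / 2), given the assumption checked above
}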
(You may consider this a separate discussion, but you brought it up
here. I will try not to add more on the topic in this thread.)
If you had bothered to try actually compiling it, as I did, I
believe you would find my version generates code that is just as
fast as the smaller of your two cases, and faster than the other,
on platforms of interest (I tried two).
I tried the code, and reached the same basic conclusion.
However, it is only as small and fast when optimisation is enabled (not
everyone enables optimisation - I think that is generally a mistake on
their part, but it is still true), and with a good compiler (I have used
many compilers that would not have a chance of optimising your code
here). And we only know that it is as fast as the plain "x >> 1"
version on platforms which implement "x >> 1" as an arithmetic right
shift - we don't know what code a compiler would generate on a platform
that implemented "x >> 1" as a logical right shift, because we don't
have easy access to such a platform (if one even exists).
So we know your version is never faster than mine on any platform we can
try, and that it is slower than mine on some platforms (I tried
compiling for the 8-bit AVR, using avr-gcc 4.5.1), and slower if
optimisation is disabled. We know nothing about the case for platforms
which use logical right shift.
Minor point: your statement that this way is "bigger" is simply
false. My version has one line of function body. Your version
has either one or two, or a total of three, and that is not
counting the preprocessor statements to choose between the two
alternatives. If we take into account the other needed overhead
then your version is much bigger.
I meant bigger object code, not source code (I can count source lines
too). But to be fair, your code is not bigger than mine on "typical"
platforms.
My version has no need of either preprocessor tests or static
asserts, and works in all conforming implementations. Given
that it is just as fast or faster than the larger and less
general version, I see no reason to prefer the latter.
See above. If your version were always as optimal, then I would mostly
agree. (I still think your version is less clear - but as I said in
another post, that is a subjective opinion.)
No, it isn't. The casting version works correctly only
on platforms that use two's complement. This version
doesn't have that restriction.
Would that be because a cast /could/ change the bit representation of
the data when converting from unsigned to signed? I am not sure if I am
getting the details of the standards correct here, but I believe a cast
between signed int and unsigned int will count as a "conversion". The
cast from signed to unsigned is well defined, and will work even on
one's complement systems (-1 converts to 0xffffffff for 32-bit ints).
But conversion from unsigned 0xffffffff back to -1 is apparently
implementation-defined. If that interpretation is correct, then perhaps
my version is implementation-dependent even for two's complement
systems. I'd appreciate a "ruling" on this point.
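To put the question concretely (the values here assume 32-bit types and
are purely illustrative):

#include <stdint.h>

// Signed -> unsigned: defined by the standard as reduction modulo 2^32,
// so -1 always converts to 0xffffffff, whatever the representation.
uint32_t u = (uint32_t) -1;          // 0xffffffff

// Unsigned -> signed: 0xffffffff does not fit in int32_t, so the result
// is implementation-defined (or an implementation-defined signal is
// raised) - this is the conversion in question.
int32_t s = (int32_t) 0xffffffffu;   // -1 on common implementations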
As far as I can see from the standards, your type-punning method should
be correct.
However, the casting method will work correctly on platforms that are
realistic for running new code (how many targets are /not/ two's
complement?).
Testing with gcc on amd64 and the AVR shows that the casting version is
optimal at all settings, while the union method is optimal when
optimisation is enabled and very sub-optimal when it is
disabled. And if anyone is interested (I know this is almost blasphemy
in this group), the casting version is also valid in C++ while the union
version is not. (This is the fault of the C++ committee - designated
initialisers should have been included in C++.)
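For reference, here is my own sketch of the two mechanisms being
compared; the function names are mine, and the union version is only an
assumed form (a compound literal with a designated initialiser, which is
what makes it C-only):

#include <stdint.h>

// Cast back: an unsigned->signed conversion, implementation-defined when
// the bit pattern does not fit in int32_t (i.e. for negative inputs).
static inline int32_t asr_cast(int32_t x) {
    uint32_t u = (uint32_t) x;
    uint32_t r = (u >> 1) | (u & UINT32_C(0x80000000));  // copy the sign bit
    return (int32_t) r;
}

// Union pun back: reinterprets the stored bytes rather than converting
// the value - valid C99/C11, but not valid C++, which lacks designated
// initialisers.
static inline int32_t asr_union(int32_t x) {
    uint32_t u = (uint32_t) x;
    uint32_t r = (u >> 1) | (u & UINT32_C(0x80000000));
    return (union { uint32_t u; int32_t i; }) { .u = r }.i;
}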
This is a specious argument. No real work is done under such
conditions. Your original example uses 'inline', which means the
compiler dates from after 1999. No contemporary compiler is going
to be lacking in such rudimentary transformations, and anyone
interested in good performance obviously is going to enable
them.
Unfortunately, that is not true. In the world of embedded programming,
there are some very poor compilers around. For many reasons, it is
sometimes necessary to use very old tools even for new code - and there
are plenty of "modern" compilers that do very little in the way of
optimisation.
The functions I wrote produced code that is just as fast
or faster than either of your alternate versions, using
the lowest levels of optimization available. That includes
test compiles targeting an embedded processor.
I tested on an 8-bit AVR, which is quite a common embedded processor.
Admittedly it was not the latest gcc available, but it was approximately
the same version as the amd64 gcc that I conveniently have on my system. (I
also tested for a 16-bit msp430, but as the results were the same as for
amd64, I didn't bother giving the details.)