single double precision question


ma740988

I've got an unpacker that unpacks a 32-bit word into three 10-bit samples.
Bits 0 and 1 are don't-cares. For the purposes of performing an FFT and
an inverse FFT, I cast the 10-bit values into doubles. I'm told:

"floats and doubles have bits for mantissa and exponent. (I knew that.)
If you subtract bits that partly belong to the mantissa and partly
to the exponent, that hardly makes sense. Anyway, looking at bits
only, a float has 32 bits and doesn't behave differently than an integer."

My advisor went on to say:
If I unpacked the samples into, say, an array of floats, it's quite
possible that I could get a left-side value whose bits were
*very* different from the right-hand integer *and* most likely wouldn't
match the integer exactly.

So now:
float[0] = 0x2AA;
or better:
float f = (float)0x2AA;
cout << f << endl;

would show perhaps
682.00000123

Makes absolutely no sense to me. All machines I ran the source on
produced 682.00000000000 (depending, of course, on your precision).

Am I being misled here?
 

David White

I've got an unpacker that unpacks a 32-bit word into three 10-bit
samples. Bits 0 and 1 are don't-cares. For the purposes of performing
an FFT and an inverse FFT, I cast the 10-bit values into doubles.
I'm told:

"floats and doubles have bits for mantissa and exponent. (I knew
that.) If you subtract bits that partly belong to the mantissa and
partly to the exponent, that hardly makes sense. Anyway, looking
at bits only, a float has 32 bits and doesn't behave differently than an
integer."

My advisor went on to say:
If I unpacked the samples into, say, an array of floats, it's quite
possible that I could get a left-side value whose bits were
*very* different from the right-hand integer *and* most likely wouldn't
match the integer exactly.

I don't know what you mean here by "left" side and "right" side.
So now:
float[0] = 0x2AA;
or better:
float f = (float)0x2AA;
cout << f << endl;

would show perhaps
682.00000123

Makes absolutely no sense to me. All machines I ran the source on
produced 682.00000000000 (depending, of course, on your precision).

All you are doing is a standard conversion from 0x2AA (682) to a float.
Floats are capable of representing integer values exactly, and that's what's
happening. There's no reason to expect a slight error.
Am I being misled here?

I don't know because I'm not sure what you are getting at. If you are
talking about reinterpreting the bit pattern of an integer as a float, then
you could end up with any nonsense value, but if you are just doing a
standard conversion, in which the compiler takes care of correctly mapping
the integer bit pattern to the float bit pattern of the equivalent value,
then you got the expected result.

DW
 

Jack Klein

I've got an unpacker that unpacks a 32-bit word into three 10-bit samples.
Bits 0 and 1 are don't-cares. For the purposes of performing an FFT and
an inverse FFT, I cast the 10-bit values into doubles. I'm told:

Why are you casting? Values with accessible bits are integer types,
and you can just assign them to doubles. A cast is redundant and has
no effect at all in this case.
"floats and doubles have bits for mantissa and exponent. (I knew that.)
If you subtract bits that partly belong to the mantissa and partly
to the exponent, that hardly makes sense. Anyway, looking at bits
only, a float has 32 bits and doesn't behave differently than an integer."

C++ does not say how many bits a float has. It may have 32, it may
have 64, but it can't have as few as 6 on a conforming implementation.
My advisor went on to say:
If I unpacked the samples into, say, an array of floats, it's quite
possible that I could get a left-side value whose bits were
*very* different from the right-hand integer *and* most likely wouldn't
match the integer exactly.

The C++ language defines no way to "unpack" bits into an array of
floats. You can extract some of the bits from an integer type into
another suitably sized integer type object. Or you can extract some
of the bits and assign them to a floating point type; that is defined.
So now:
float[0] = 0x2AA;

This is a syntax error; you can't have an array with a name identical
to a keyword. So let's assume that there is an array of floats named
'my_float'.

Now given:

my_float[0] = 0x2AA;

...then the integer literal '0x2AA' has type int and the value 682.
You assign this value to a float, which causes the compiler to convert
the value 682 into the float representation of 682.0 and assign it to
the float.
or better:
float f = (float)0x2AA;

This is no better; it is worse, because the cast is redundant and it
shows a lack of understanding.
cout << f << endl;

would show perhaps
682.00000123

No it wouldn't, not with the code you posted.
Makes absolutely no sense to me. All machines I ran the source on
produced 682.00000000000 (depending, of course, on your precision).

Am I being misled here?

There are two things you are missing here. The first is that your
advisor is discussing ideas that you are not yet ready for, and may
never need.

But the main thing that you are missing here is that in C++
conversions are based on value, not on differences in the internal
bitwise implementation of the data.

Your advisor is talking about the differences in the internal bitwise
implementation of the int value 682 and the float value 682.0. If
programs are designed correctly, very, very few of them will ever need
to be concerned about the difference.
 

ma740988

Thanks, gents. I have a hard time expressing my ideas sometimes, but
Jack, you pointed out what I was alluding to: internal representation.

|| Your advisor is talking about the differences in the internal bitwise
|| implementation of the int value 682 and the float value 682.0. If
|| programs are designed correctly, very, very few of them will ever need
|| to be concerned about the difference.

Now can you highlight a case where I would be concerned about the
difference? That was my 'real' question, because for some reason I
can't see it.

You see, in my mind and for my case:

my_float[0] = 0x2AA;

The value 682 converted from int to float representation would amount
to 682.0. I can't see how it will be 682.00000000000023 or ...
 

Christian Meier

Thanks, gents. I have a hard time expressing my ideas sometimes, but
Jack, you pointed out what I was alluding to: internal representation.

|| Your advisor is talking about the differences in the internal bitwise
|| implementation of the int value 682 and the float value 682.0. If
|| programs are designed correctly, very, very few of them will ever need
|| to be concerned about the difference.

Now can you highlight a case where I would be concerned about the
difference? That was my 'real' question, because for some reason I
can't see it.

You see, in my mind and for my case:

my_float[0] = 0x2AA;

The value 682 converted from int to float representation would amount
to 682.0. I can't see how it will be 682.00000000000023 or ...

Floating point datatypes have a deviation....
 

ma740988

Uhm, David and Jack just told me it doesn't. I'll get the correct
result always, i.e. 682.0000000000000000.

What deviation are you referring to?
 

Kai-Uwe Bux

Thanks, gents. I have a hard time expressing my ideas sometimes, but
Jack, you pointed out what I was alluding to: internal representation.

|| Your advisor is talking about the differences in the internal bitwise
|| implementation of the int value 682 and the float value 682.0. If
|| programs are designed correctly, very, very few of them will ever need
|| to be concerned about the difference.

Now can you highlight a case where I would be concerned about the
difference? That was my 'real' question, because for some reason I
can't see it.

You see, in my mind and for my case:

my_float[0] = 0x2AA;

The value 682 converted from int to float representation would amount
to 682.0. I can't see how it will be 682.00000000000023 or ...

It will not. Small integer values have exact representations in float or
double on any decent C++ implementation. (It is true that this is a
quality-of-implementation issue, as the standard is remarkably shy about
giving any guarantees for floating point arithmetic.) Floating point
arithmetic represents numbers by sign, mantissa, and exponent. Since a
float or a double uses only finitely many bits, only finitely many real
numbers are representable as a float. Usually, we deal with the missing
reals by taking a nearby float that approximates them as their
representation. However, as long as the bit length of the mantissa can
host your integer, it can be represented as a float exactly, without
being merely approximated.

Now, as for the bit patterns, they will generally look vastly different.
However, that should not concern you. The compiler will generate the
code that takes care of all the necessary bit-shuffling when converting
an int to a float.


Best

Kai-Uwe Bux
 

David White

Thanks, gents. I have a hard time expressing my ideas sometimes, but
Jack, you pointed out what I was alluding to: internal representation.

|| Your advisor is talking about the differences in the internal bitwise
|| implementation of the int value 682 and the float value 682.0. If
|| programs are designed correctly, very, very few of them will ever need
|| to be concerned about the difference.

Now can you highlight a case where I would be concerned about the
difference? That was my 'real' question, because for some reason I
can't see it.

You see, in my mind and for my case:

my_float[0] = 0x2AA;

The value 682 converted from int to float representation would amount
to 682.0. I can't see how it will be 682.00000000000023 or ...

I can't highlight a case concerning a direct conversion from integer 682,
or a similarly sized value, to a float, because on any implementation you
are likely to come across you'll get an exact conversion. I can, however,
describe a real case I had recently. I was reading the speed of a vacuum
pump that has a maximum speed of 1500 Hz. I needed to display it to the
user as a percentage of maximum speed using a Number object (our own
class) that was created with limits 0 to 100. I read the Hz value and
converted it to a percentage like this:

userDisplayParam.setValue(speedInHz / 15.f);

The problem was that if you started the program with the pump already at
full speed, it would sometimes take minutes to show anything but the
default zero value for the speed, even though it was being read every
second, and it strangely would never show 100. The highest you ever saw
was 99.9.

It turned out that the "division" in the expression above was turned into
a multiplication by the compiler, i.e., instead of speed / 15.f it was
doing speed * (1/15.f), where 1/15.f was pre-calculated by the compiler.
This makes sense because multiplications can be done faster than
divisions. Unfortunately, 1/15.f cannot be represented exactly in a
binary float value, so there was a slight error in the result even when
the speed in Hz was exactly divisible by 15. My assumed exact 100% result
for 1500/15 turned out to be something like 100.0001, which exceeded the
limit of the user-display Number, so it never showed 100% for the speed.

DW
 
