Integer to Floating-Point conversions

Kai Koehne · May 13, 2006

Hi,

I need an int to real32 conversion with IEEE-754 round-towards-zero mode
.... That means, a method x, so that

int i = 44554435
float f = x(i)

results in f = 3355432.0, and not f = 335546.0 .

Are you aware of any mathematical packages that provide such a function,
or a way to do this with pure Java?

Regards,

Kai Koehne

joseph_daniel_zukiger · May 13, 2006

Kai said:
Hi,

I need an int to real32 conversion with IEEE-754 round-towards-zero mode
... That means, a method x, so that

int i = 44554435
float f = x(i)

results in f = 3355432.0, and not f = 335546.0 .

Are you aware of any mathematical packages that provide such a function,
or a way to do this with pure Java?

Regards,

Kai Koehne

Huh?

I was going to mumble something automatic integer to floating point
promotion should work as advertised, but your question does not seem to
make sense.

If you absolutely must work with 32 bit floats instead of 64 bit
doubles or java.math.BigDecimal, you will lose precision when promoting
large integers. There's no way to squeeze 32 bits of precision into
less than 32 bits of precision, expecially when a 32 bit float needs a
few bits for the exponent.

But if you really need a method that converts 44554435 stored in an int
to either 3355432.0 or 335546.0 in 32 bit floating point
representation, you can always write your own. I'd prefer you give it a
reasonable name, like "mungeNumber()" or "hashAFloat()" or
my.own.strange.Promotions.conversion_x() so that if it ever gets close
to any of my code my code doesn't get infected. But you can definitely
build it. I'd point you to the xxxToYyyBits() and yyyBitsToXxx()
methods of java.lang.Double and java.lang.Float, but I wouldn't want to
be accused of contributing to the crime.

Kai Koehne · May 13, 2006

Kai said:
Kai said:

Hi,

I need an int to real32 conversion with IEEE-754 round-towards-zero mode
[...]
Kai Koehne

Click to expand...

Huh?

I was going to mumble something automatic integer to floating point
promotion should work as advertised, but your question does not seem to
make sense.

Just to clarify my motivation: I am building an Occam 2 Java compiler
.... and as it happens, occam supports two ways of converting an integer
to a floating-point variable:

REAL32 ROUND n -- conversion with round-towards-middle
REAL32 TRUNC n -- conversion with round-towards-zero

The first has the same semantics as (float)n , but I couldn't find a way
to implement the second conversion.

[...]
But if you really need a method that converts 44554435 stored in an int
to either 3355432.0 or 335546.0 in 32 bit floating point
representation, you can always write your own. I'd prefer you give it a
reasonable name, like "mungeNumber()" or "hashAFloat()" or
my.own.strange.Promotions.conversion_x() so that if it ever gets close
to any of my code my code doesn't get infected. But you can definitely
build it. I'd point you to the xxxToYyyBits() and yyyBitsToXxx()
methods of java.lang.Double and java.lang.Float, but I wouldn't want to
be accused of contributing to the crime.

Huh, this is the tough way ... but I will try to find my way through the
implementation details, thank you for the idea

Kind regards,

Kai Koehne

Chris Uppal · May 13, 2006

Kai said:
I need an int to real32 conversion with IEEE-754 round-towards-zero mode

Java doesn't include a spectacularly complete set of IEEE-754 operations,
that's one of the ones that's missing. I believe that more stuff will be added
in 1.6 (see http://download.java.net/jdk6/docs/api/java/lang/Math.html
) but I don't know if that will provide what you need.

If not, or if you can't wait, then I think you'll just have to program it
yourself. Math.ulp(float) may help.

-- chris

joseph_daniel_zukiger · May 13, 2006

Kai said:
Kai said:

Hi,

I need an int to real32 conversion with IEEE-754 round-towards-zero mode
[...]
Kai Koehne

Click to expand...

Huh?

I was going to mumble something automatic integer to floating point
promotion should work as advertised, but your question does not seem to
make sense.

Click to expand...

Just to clarify my motivation: I am building an Occam 2 Java compiler
... and as it happens, occam supports two ways of converting an integer
to a floating-point variable:

REAL32 ROUND n -- conversion with round-towards-middle
REAL32 TRUNC n -- conversion with round-towards-zero

The first has the same semantics as (float)n , but I couldn't find a way
to implement the second conversion.

Ahh. I see.

[...]
But if you really need a method that converts 44554435 stored in an int
to either 3355432.0 or 335546.0 in 32 bit floating point
representation, you can always write your own. I'd prefer you give it a
reasonable name, like "mungeNumber()" or "hashAFloat()" or
my.own.strange.Promotions.conversion_x() so that if it ever gets close
to any of my code my code doesn't get infected. But you can definitely
build it. I'd point you to the xxxToYyyBits() and yyyBitsToXxx()
methods of java.lang.Double and java.lang.Float, but I wouldn't want to
be accused of contributing to the crime.

Click to expand...

Huh, this is the tough way ... but I will try to find my way through the
implementation details, thank you for the idea

If what Chris suggests doesn't get there fast enough for you, you might
want to consider implementing it borrowing from java.math.BigDecimal
and BigInteger. If it turns out too slow, it could at least give you a
good baseline.

Patricia Shanahan · May 13, 2006

Kai Koehne wrote:
....

Huh, this is the tough way ... but I will try to find my way through the
implementation details, thank you for the idea

....

Here's my attempt. The test cases shown in the main method
are ALL the testing I've done.

public class FloatConvert {

public static void main(String[] args) {
// Original example
test(44554435, 44554432);
// Negation of original example
test(-44554435, -44554432);
// No rounding needed
test(44554432, 44554432);
// Rounding goes in correct direction
test(44554433, 44554432);
}

private static void test(int input, int expected) {
double actual = truncateIntToFloat(input);
if (actual == expected) {
System.out.printf("Success: input %d actual %f\n", input, actual);
} else {
System.out.printf("Failure: input %d actual %f expected %d\n", input,
actual, expected);
}
}

/**
* Convert int to float with truncation towards zero.
*
* @param input
* Number to convert
* @return Largest magnitude float that has the same sign as input
* and no greater absolute value.
*/
public static float truncateIntToFloat(int input) {
float rounded = (float) input;
if (Math.abs((double) rounded) <= Math.abs((double) input)) {
return rounded;
} else {
int roundedBits = Float.floatToIntBits(rounded);
float truncated = Float.intBitsToFloat(roundedBits - 1);
return truncated;
}
}
}

Chris Uppal · May 14, 2006

Patricia said:
Here's my attempt. The test cases shown in the main method
are ALL the testing I've done.

It works for the other 4294967292 inputs too. I've tried 'em all ;-)

int roundedBits = Float.floatToIntBits(rounded);
float truncated = Float.intBitsToFloat(roundedBits - 1);
return truncated;

That's clever. At least, I /hope/ it's clever because I don't understand why
it works. Do you care to expand a bit ? Please ?

I got interested in this myself and have tried out a number of approaches,
including an attempt to convert rounding into truncation by consideration of
ULPs (as I suggested to the OP earlier). The best, IMO, is to do the
truncation in the integer domain before converting into a float. Its a little
complicated to do, but it runs fast, and has the aesthetic advantage of not
requiring double-precision operations to produce a single-precision result
(which all the other things I tried do).

I've appended the code, including the test-harness, for anyone who's
interested.

-- chris

=========== Test.java ===============
/*
* contains several static implementations of truncating a 32-bit integer
* to a 32-bit float with rounding "towards zero" rather that to the nearest
* float (as in the Java language spec for casting an int to a float).
* Also contains an exhaustive test which checks all the implementations
* against the simplest one (presumed to be correct).
*
* Approximate times when run on a 1.5.0_06-b05 JVM, in "client" mode on
* a 1.5 GHz celeron WinXP Pro box. In each case the time is the time in
* nanoseconds to convert one value averaged over the full 32-bit range of
* integers.
*
* Slow: 130
* ULP: 63.5
* FastULP: 65
* Twiddle: 19
* Patricia: 74.5
*
* Approximate times for inputs in the resticted range [-10,000,000,
10,000,000),
* 83% of which is exactly representable as a float.
*
* Slow: 12
* ULP: 12
* FastULP: 20.5
* Twiddle: 18.5
* Patricia: 24
*/
public class Test
{
public static void
main(String[] args)
{
int i = 0;
int errors = 0;
for (;

{
if (i % 10000000 == 0)
{
// progress...
System.out.printf("%11d...\r", i);
System.out.flush();
}
float shouldBe = slowTruncate(i);
float is;

if ((is = ulpTruncate(i)) != shouldBe)
{
System.out.printf("Slow: %d: %f != %f%n", i, is, shouldBe);
errors++;
}
if ((is = fastTruncate(i)) != shouldBe)
{
System.out.printf("Fast: %d: %f != %f%n", i, is, shouldBe);
errors++;
}
if ((is = twiddleTruncate(i)) != shouldBe)
{
System.out.printf("Twiddle: %d: %f != %f%n", i, is, shouldBe);
errors++;
}
if ((is = patriciaTruncate(i)) != shouldBe)
{
System.out.printf("Patricia: %d: %f != %f%n", i, is, shouldBe);
errors++;
}

if (errors > 10)
break;
if (i == -1)
break; // we've wrapped all the way around
i++;
}
System.out.printf("%ndone, %d errors%n", errors);
}

/**
* Slow implementation of truncating an int to a 32-bit floating point.
* This implementation is intended to be the "so simple it can't possibly
* be wrong" benchmark against which other implementations can be compared
*/
public static float
slowTruncate(int i)
{
// since all +ve numbers have negations, but the reverse
// isn't true, we work with -ve numbers
if (i > 0)
return -slowTruncate(-i);

// loop up (towards zero) until we find the right answer
for (;

{
float f = (float)i;
if ((double)f >= (double)i)
return f;
i++;
}
}

/**
* Implementation of truncating an int to a 32-bit floating point based on
* ULP manipulation
*/
public static float
ulpTruncate(int i)
{
float f = (float)i;
double error = (double)f - (double)i;

// no error ?
if (error == 0)
return f;

// or rounded towards 0 anyway ?
if ((error < 0) == (i > 0))
return f;

// adjust by one ulp towards 0
// the double ulp() is because ulp is the distance to the next
// number away from 0, we need the distance to the next float
// nearest zero
if (i > 0)
{
float ulp = Math.ulp(f-Math.ulp(f));
f -= ulp;
}
else
{
float ulp = Math.ulp(f+Math.ulp(f));
f += ulp;
}

return f;
}

/**
* Faster implementation of truncating an int to a 32-bit floating point.
* This implementation uses a simple test to catch a useful range of
* "eaasy" cases. If the input is likely to be in that range then
* this provides a useful optimisation, otherwise it's marginally
* slower.
* NB1: this version uses ulpTruncate() for the difficult cases, but
* it could just as easily use a different one -- even slowTruncate().
* NB2: since this consists only of a simple test guarding simple
operations,
* it seems reasonable to hope that the JIT will inline it in callers
*/
public static float
fastTruncate(int i)
{
// these are all exact
if (i >= -8388608 && i <= 8388608)
return (float)i;
else
return ulpTruncate(i); // or any other xxxTruncate()
}

/**
* Faster implementation of truncating an int to a 32-bit floating point
based
* on bit-twiddling.
* The idea is to take the number, discover how many high-bits it has which
* don't fit into the 24-bit mantissa of a float, and to mask off that many
* low bits before converting to a float. I.e. we do the truncation in the
* integer domain before converting to a float, so the conversion is always
* is always exact.
* The actual implementation is complicated by sign issues, but that's the
* basic idea.
* Note that we don't actually mess with the physical layout of an IEEE
float,
* but only use the fact that it has exactly 24 bits in which to represent
the
* mantissa
*/
public static float
twiddleTruncate(int i)
{
if (i >= 0)
{
// because i is +ve the high bit is always zero,
// and so (i >>> 24) is always in the range [0, 128)
return (float)(i & MASKS[i >>> 24]);
}

// nasty special case
if (i == Integer.MIN_VALUE)
return -2147483648.0f;

return -twiddleTruncate(-i);
}

// helper data for twiddleTruncate(). A set of precomuted masks which
// will be indexed by bits [24, 30] (zero-based) of the input integer.
private static final int[] MASKS = new int[128];
static
{
for (int i = 0; i < 128; i++)
MASKS = -1;
for (int bit = 0; bit < 7; bit++)
{
int threshold = 1 << bit;
int mask = 1 << bit;
for (int i = threshold; i < 128; i++)
MASKS ^= mask;
}
}

/**
* Clever implementation of truncating an int to a 32-bit floating point
posted
* to comp.lang.java.programmer by Patricia Shanahan on 13 May 2006
* (Message-ID: <[email protected]> )
*/
public static float
patriciaTruncate(int i)
{
float rounded = (float) i;

if (Math.abs((double) rounded) <= Math.abs((double) i))
return rounded;

int roundedBits = Float.floatToIntBits(rounded);
float truncated = Float.intBitsToFloat(roundedBits - 1);

return truncated;
}
}
==========================

Patricia Shanahan · May 14, 2006

Chris said:
Patricia Shanahan wrote:

It works for the other 4294967292 inputs too. I've tried 'em all ;-)

That's clever. At least, I /hope/ it's clever because I don't
understand why it works. Do you care to expand a bit ? Please ?

Sure. First I split the problem into two cases. If the result did not
require rounding, or the rounding was towards zero, the rounded result
and truncated result are equal, so I just returned the rounded result.

The code you quote is for those cases that round away from zero. The
float we want is the one with the same sign as rounded, and the largest
magnitude that is less than rounded.

NaNs, infinities, and zeros would all be special cases in a general
attempt to make the smallest possible reduction in the magnitude of a
float, but can be ignored here. Zero can only result from converting int
zero, and getting an exact result. The others cannot result from int
conversion.

If rounded is not an exact power of two, the truncation has a mantissa
one less than rounded. If it is an exact power of two, it has an
exponent one less than rounded, and mantissa all ones. Either way,
subtracting one from the rounded bit pattern gets the bit pattern for
truncated.

I got interested in this myself and have tried out a number of
approaches, including an attempt to convert rounding into truncation
by consideration of ULPs (as I suggested to the OP earlier). The
best, IMO, is to do the truncation in the integer domain before
converting into a float. Its a little complicated to do, but it runs
fast, and has the aesthetic advantage of not requiring
double-precision operations to produce a single-precision result
(which all the other things I tried do).

If you prefer to avoid double, just do the abs in int:

public static float truncateIntToFloat(int input) {
float rounded = (float) input;
if (Math.abs((int) rounded) <= Math.abs(input)) {
return rounded;
} else {
int roundedBits = Float.floatToIntBits(rounded);
float truncated = Float.intBitsToFloat(roundedBits - 1);
return truncated;
}
}

Both numbers whose absolute values are to be compared can be exactly
represented in either int or double, so either type should work.

Patricia

Kai Koehne · May 14, 2006

Wow! Thank you all for this thorough discussion ... Two days ago I was
not even sure if truncating conversion can be done at all, and now I
have five working solutions + performance tests. I have the strong
feeling to owe you a beverage of your choice

Thank you!

Kai Koehne

Chris Uppal · May 15, 2006

Patricia said:
If rounded is not an exact power of two, the truncation has a mantissa
one less than rounded. If it is an exact power of two, it has an
exponent one less than rounded, and mantissa all ones. Either way,
subtracting one from the rounded bit pattern gets the bit pattern for
truncated.

Ah, I see. The IEEE fp layout is subtle.

Thank you.

If you prefer to avoid double, just do the abs in int:

public static float truncateIntToFloat(int input) {
float rounded = (float) input;
if (Math.abs((int) rounded) <= Math.abs(input)) {
return rounded;
} else {
int roundedBits = Float.floatToIntBits(rounded);
float truncated = Float.intBitsToFloat(roundedBits - 1);
return truncated;
}
}

Unfotunately, the comparison doesn't work for Integer.MAX_VALUE, nor for values
near Integer.MIN_VALUE. Two example failure cases are:

2147483647
-2147483638

Some test code (which uses a routine from yesterday's post):
=============
public class Test2
{
public static void
main(String[] args)
{
test(2147483647);
test(-2147483638);
}

static void
test(int input)
{
System.out.printf("input: %d%n", input);
System.out.printf("desired output: %f%n", Test.slowTruncate(input));

float rounded = (float)input;
System.out.printf("rounded: %f%n", rounded);

System.out.printf("(int)rounded: %d%n", (int)rounded);
System.out.printf("Math.abs((int)rounded): %d%n", Math.abs((int)rounded));
System.out.printf("Math.abs(input): %d%n", Math.abs(input));
System.out.println();
}
}
=============

Which produces output:

input: 2147483647
desired output: 2147483520.000000
rounded: 2147483648.000000
(int)rounded: 2147483647
Math.abs((int)rounded): 2147483647
Math.abs(input): 2147483647

input: -2147483638
desired output: -2147483520.000000
rounded: -2147483648.000000
(int)rounded: -2147483648
Math.abs((int)rounded): -2147483648
Math.abs(input): 2147483638

I find the negative output from Math.abs() particularly entertaining ;-)

-- chris

Patricia Shanahan · May 15, 2006

Chris said:
Patricia Shanahan wrote:

Ah, I see. The IEEE fp layout is subtle.

Thank you.

....
Unfotunately, the comparison doesn't work for Integer.MAX_VALUE, nor for values
near Integer.MIN_VALUE. Two example failure cases are: ....
I find the negative output from Math.abs() particularly entertaining ;-)

Good point. Shows the folly of not testing at least the maximum and
minimum value, as well as some ordinary values.

Patricia

C++ SSE and SSE2 compiler settings, and their Floating Point effects.	0	May 31, 2022
floating point conversions && how to read standards	4	Oct 7, 2011
Trouble with integer floating point conversion	49	Dec 12, 2007
converting floating point types round off error ....	13	Oct 4, 2008
Floating point load-store behaviour.	7	Aug 23, 2006
wierd floating point output	13	Sep 3, 2007
Floating-Point Constant with Payload	1	Apr 20, 2008
The floating point environment	4	Jan 17, 2006

Integer to Floating-Point conversions

Kai Koehne

joseph_daniel_zukiger

Kai Koehne

Chris Uppal

joseph_daniel_zukiger

Patricia Shanahan

Chris Uppal

Patricia Shanahan

Kai Koehne

Chris Uppal

Patricia Shanahan

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads