Fixed Point Rounding

Tricky

I'm using the new fixed/float library, and I'm getting a bit confused with the
resize function.

I'm using the following code:

video_us : out ufixed(7 downto 0)
...
variable weighted_pix0 : ufixed(7 downto -12);
variable weighted_pix1 : ufixed(7 downto -12);
.......
video_us <= resize(weighted_pix0 + weighted_pix1, 7, 0);

Now, when the sum of the two weighted pixels is X.5, the rounded output
(round style is defaulting to fixed_round) is always X, not X+1. Is
there a rule I'm missing? Is this not what the resize function is for?
Otherwise, what's the point of the "round_style" and
"overflow_style" function arguments? Do I have to go back to the old way
of rounding, that is, adding 0.5 and then truncating?

Thanks for help in advance
 
KJ

Now, when the sum of the two weighted pixels is X.5, the rounded output
(round style is defaulting to fixed_round) is always X, not X+1. Is
there a rule I'm missing?

What you're missing is that whether X.5 rounds up or down depends on
what X is.

From the user's guide:
"round_style" defaults to fixed_round (true) that turns on the
rounding routines. If false (fixed_truncate), the number is truncated.
Rounding is done by first looking to see if the MSB of the remainder
is a “1”, AND the LSB of the unrounded result is a “1” or the lower
bits of the remainder include a “1”, the result will be rounded. This
is similar to the floating-point “round_nearest” style. The down side
is that ALL of the bits are included in the decision to round.
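
The rule quoted above can be sketched as a small bit-level model. This is only an illustration in Python, not the package's VHDL source: the names `fixed_round_model`, `to_raw` and `FRAC_BITS` are made up for this sketch, values are treated as unsigned integers scaled by 2^12 (matching ufixed(7 downto -12)), and overflow handling is ignored.

```python
FRAC_BITS = 12  # assumption: matches ufixed(7 downto -12) from the thread


def fixed_round_model(raw: int, frac_bits: int = FRAC_BITS) -> int:
    """Round raw / 2**frac_bits to an integer using the quoted rule.

    Hypothetical model for illustration; requires frac_bits >= 1.
    """
    result = raw >> frac_bits                          # unrounded (truncated) result
    remainder = raw & ((1 << frac_bits) - 1)           # the bits being dropped
    msb = (remainder >> (frac_bits - 1)) & 1           # MSB of the remainder (the 0.5 bit)
    lower = remainder & ((1 << (frac_bits - 1)) - 1)   # remainder bits below the MSB
    lsb = result & 1                                   # LSB of the unrounded result
    # Round up only if the 0.5 bit is set AND (result is odd OR any lower bit is set)
    round_up = msb == 1 and (lsb == 1 or lower != 0)
    return result + 1 if round_up else result


def to_raw(value: float) -> int:
    """Scale a non-negative real value to the integer representation."""
    return int(value * (1 << FRAC_BITS))


print(fixed_round_model(to_raw(4.5)))      # 4: exactly X.5 with X even stays at X
print(fixed_round_model(to_raw(5.5)))      # 6: exactly X.5 with X odd goes to X+1
print(fixed_round_model(to_raw(4.5) + 1))  # 5: just above X.5 always rounds up
```

Under this model an exact X.5 only rounds up when X is odd, which would explain seeing X rather than X+1 for even X.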

Do I have to go back to the old way
of rounding, that is, +0.5 and then truncating?

I use the "+0.5 and then truncating" approach because it takes less
logic to implement (hence the 'down side' mentioned in the user's
guide) and my requirements so far haven't called for the floating-point
“round_nearest” style.
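
For comparison, here is a minimal sketch of the "+0.5 and then truncating" approach on the same kind of scaled integer. It is a Python model for illustration only; `round_half_up` and `FRAC_BITS` are hypothetical names, 12 fractional bits are assumed to match the thread's ufixed(7 downto -12), and overflow of the result is not handled.

```python
FRAC_BITS = 12  # assumption: matches ufixed(7 downto -12) from the thread


def round_half_up(raw: int, frac_bits: int = FRAC_BITS) -> int:
    """Add 0.5 (in raw units), then drop the fractional bits."""
    half = 1 << (frac_bits - 1)        # 0.5 expressed as a scaled integer
    return (raw + half) >> frac_bits   # truncation after the add


def to_raw(value: float) -> int:
    """Scale a non-negative real value to the integer representation."""
    return int(value * (1 << FRAC_BITS))


print(round_half_up(to_raw(4.5)))      # 5: X.5 goes to X+1 for even X
print(round_half_up(to_raw(5.5)))      # 6: and for odd X
print(round_half_up(to_raw(4.5) - 1))  # 4: just below X.5 stays at X
```

In this model the decision depends only on the carry out of the single add; neither the LSB of the result nor the lower remainder bits are examined separately.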

Kevin Jennings
 
KJ

Should've said "I *have* used +0.5...". I've also used 'fixed_round'.

Which causes data forking.
Take the numbers 1.5 to 8.5 and round by your method and add up the
error.

Depending on the application though, this additional error for certain
input combinations might still be acceptable. "+0.5 and truncate" is
generally intermediate between 'fixed_truncate' and 'fixed_round' both
in error and logic resources to implement. Which of the three methods
is 'best' will depend on the accuracy requirements of the particular
application.
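
The "1.5 to 8.5" exercise can be worked through numerically. The sketch below uses plain Python floats rather than fixed-point types, and `round_half_even` is only a stand-in for the 'fixed_round' behaviour as described in the user's guide excerpt; all function names are made up for this illustration.

```python
import math


def round_truncate(x: float) -> int:
    """Drop the fraction (stand-in for fixed_truncate on non-negative values)."""
    return math.floor(x)


def round_half_up(x: float) -> int:
    """Add 0.5 and truncate."""
    return math.floor(x + 0.5)


def round_half_even(x: float) -> int:
    """Round to nearest, exact ties to the even neighbour (assumed fixed_round model)."""
    lower = math.floor(x)
    frac = x - lower
    if frac > 0.5 or (frac == 0.5 and lower % 2 == 1):
        return lower + 1
    return lower


values = [n + 0.5 for n in range(1, 9)]  # 1.5, 2.5, ..., 8.5


def total_error(round_fn) -> float:
    """Sum of (rounded - exact) over the test values."""
    return sum(round_fn(v) - v for v in values)


print(total_error(round_truncate))   # -4.0
print(total_error(round_half_up))    # 4.0
print(total_error(round_half_even))  # 0.0
```

On these eight tie cases the two simpler methods are each off by 0.5 in the same direction every time, while the tie-to-even rule alternates direction and the errors cancel.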

"+0.5 and truncate" is a design tradeoff that should be evaluated
versus 'fixed_truncate' and 'fixed_round'...it's just another tool in
the toolbox.

As an example of resource usage, I took Tricky's code (actual code
posted below) and ran it through Quartus 9.0 to produce the following
results:

Rounding method    Logic resources
===============    ===============
fixed_round        44
+.5_and_trunc      39
fixed_truncate     29

Kevin Jennings

--- START OF CODE
library ieee_proposed;
use ieee_proposed.math_utility_pkg.all;
use ieee_proposed.fixed_pkg.all;

entity Resizer_Adder is port(
  weighted_pix0 : in  ufixed(7 downto -12);
  weighted_pix1 : in  ufixed(7 downto -12);
  video_us      : out ufixed(7 downto 0));
end Resizer_Adder;

architecture rtl of Resizer_Adder is
begin
  -- Uncomment the line you would like to evaluate

  -- fixed_round
  video_us <= resize(weighted_pix0 + weighted_pix1, 7, 0,
                     fixed_overflow_style, fixed_round);

  -- +0.5 and truncate
  -- video_us <= resize(weighted_pix0 + weighted_pix1 + to_ufixed(0.5, -1, -1), 7, 0,
  --                    fixed_overflow_style, fixed_truncate);

  -- fixed_truncate
  -- video_us <= resize(weighted_pix0 + weighted_pix1, 7, 0,
  --                    fixed_overflow_style, fixed_truncate);
end rtl;
--- END OF CODE
 
Tricky

KJ is correct.

4.5 rounds to 4
5.5 rounds to 6

Though carrying 12 bits of decimal is a bit overkill.



Which causes data forking.
Take the numbers 1.5 to 8.5 and round by your method and add up the
error. Then do the same for my method (I wrote the fixed point packages).

For me, data forking (if I understand correctly, is that like
compounding errors?) shouldn't be an issue because this is the final
output. The 12 bits are carried only because I have a previous divide
(by a constant 2^n, with n as a generic) with input data carrying only
4 fractional bits. 12 bits covers the worst possible case of n, and I
expect the synthesiser to clear up anything that's overkill. And
actually, for my output, I require X.5 to always round to X+1, so
there is no error.
 
