doing sqrt( ) for synthesis.

Luis Cupido · Oct 16, 2012

Hello.

I would like to have at some point of my code

MV= sqrt(I^2+Q^2);

but since the above I have no clue how to do in a single clock cycle
I have just I^2+Q^2 in the form:

pv <= i**i + q**q;

That uses a few embedded multipliers and works great.

The number of bits in the result is obviously in (2x)+1
Since I can't afford to manipulate that wide word further
down the chain I just truncated it to the same number of bits of the input.

All fine but a lot of dynamic range was lost
(this is some DSP coding as you might have guessed already).

Is there any compression function just gates, that could work
unclocked (for me to get done with one clock latency) that does the
SQRT( ) or anything that resembles this function just to get the same
dynamyc range preservation effect as the SQRT( ) does.

Pardon me for being a bit off topic.

Anyone has any idea ?

Luis C.

Anton Gunman · Oct 17, 2012

Hello,

That's an interesting one!

I think that you first need to decide, or at least estimate, how much precision you will need.
- How many bits can you truncate before doing the SQRT?
- How many bits will the SQRT output? (Typically it's a fixed point notation)
- How much resources can you afford to use?
- Is your design fully pipelined (perform a new calculation at every clock cycle), or can you afford to have some sort of "load"/"done" signals to perform your calculations?

Afterwards, here are some ideas you can look into:
1- Sometimes you don't really need to do the SQRT, you can just use the result prior to the SQRT and keep some of the MSBs (it is not as accurate).
2- Do you really need it to be done in a single clock cycle? Sometimes a pipelined design could be used (takes several clock cycles to output the first result, but can accept and output a new result at every other clock cycle).
3- You can try using your tool's core generator (e.g. Xilinx does have some SQRT cores that use the CORDIC algorithm, you can generate them using the wizard).
- You might be able to choose between a slow core (area efficient and several clock cycles per operation) and a fast core (fully pipelined and consumes more resources).
4- You can also take a look at the following links (I have not tested them):
- https://groups.google.com/forum/?fr...0root/comp.lang.vhdl/BQ9MypRSkhM/Qf7uLkIrmJQJ (see "function SquareRoot")
- http://vhdlguru.blogspot.ca/2010/03/vhdl-function-for-finding-square-root.html
- But these are COMBINATORIAL circuits (no clocks), they are not efficient in terms of resources and might affect timing, so be careful here, especially with large bit widths!
5- Depending on the number of bits you are using, you can think of any function (in your case the SQRT) as a large LUT and implement it using a single ROM with pre-calculated values.

The info below roughly applies to internal block RAM/ROMs.
Keep in mind that the FPGA RAM is limited (you can quickly check your FPGA's datasheet to get an estimate of how many of these block RAMs you can use).
If you are using Xilinx devices, typically a single BRAM (small internal FPGA block RAM) is either 18kbits 18*1024 bits (or 36 kbits).
These BRAM can be configured with various address/data configurations (for e.g. 10 address bits and 18 data bits, etc..., not all configurations might be valid)
You would typically code it as an array and let the tools worry about it (but you MUST check the synthesis result to see if the BRAM has been correctly inferred or if you are using way too many BRAM)
In this case you might need to change your approach.
Note: The ROM/RAM has to be synchronous, otherwise the synthesizer will try using registers instead of BRAM and you will see an explosion in resource utilization.

A rather "rough" way of approximating the number of BRAM that will be consumed is the following:
Num. BRAM = ceil(2^addr_width * data_width/1024/18) (use 36 instead of 18 depending on your device, but you should get the point)

The actual number will depend on the configuration possibilities, your coding style, and how "smart" the tools are.

In your case IF you can round your result (prior to the SQRT) to something like 10 bits (or just use a generic for that), this might work for you.
e.g.:

type type_my_rom is array(integer range 0 to 2**10-1) of unsigned(18-1 downto 0);
signal my_sqrt_rom : type_my_rom := MY_INIT_FUNCTION(); -- You can write an init function of some sort, or read it from file etc...
--...

-- clocked process
-- ...
pv <= i*i + q*q; -- Calculate
pv_10bit <= pv(pv'high downto pv'high-10); -- Truncate
mv <= my_sqrt_rom(pv_10bit); -- SQRT using a ROM, pv_10bit is used as the address

Note: this is really going to depend on your application's needs. I would suggest you try it out in software (or non-synthesizable code) and see how many bits you can truncate without losing too much precision.
10 bits might be too much of a truncation and if you increase your addr_width to 16 bits you will quickly run out of resources, so this might not be the best approach.

- I would probably go for the core generator approach, you can check the documentation online to see if it fits your requirements, or just try generating it. (It will still probably require a few clock cycles depending on the precision).
- If you can afford the LUT resources, maybe look into the combinatorial implementations #2 (you can simply try synthesizing it and checking your resource usage and timing estimate).

I hope this helps a bit,
Anton

Luis Cupido · Oct 17, 2012

I hope this helps a bit,
Anton

That helps a lot. Thanks.

I can narrow down the specs and add some info.

The the design can grow to the maximum
I can fit on the device or up to the point of start limiting
the speed.

-The system input has a pair of 16bit ADC.
I decide on an global design CONSTANT how
many bits I use, from (whatever) to 16bit

-At that point of the calculation I have a pair of max 17bit each
so here I have, from (whatever+1) to 17bit

- i^2+q^2 would make 35bit maximum
so here I have, from (whatever*2+1) to 35bit

Ideally I should get back to the same dynamic range and
the sqrt would get me to (whatever+1) to 17bit

now I'm truncating the 35bit to 16bit thus cutting dynamic range in
half which is very uninteresting :-(

since I do have a 16bit transfer bus at the end and since the ENOB of the
ADC's are below 15bit at that speed (120MS/s) I would aim
to 15bit at ADC and 16bit output therefore a sqrt(33bit).

----------

It is a big desgin, and YES I have it all pipelined. there are several
blocks
Each block even takes care of pipelining the address of the data so I don't
care how many clock cycles it takes, I just have to make a pipeline for
address to make it all consistent at the output.

(silly me that I asked for a 1 cycle thing, that was because I had the
i^2+q^2
in embedded multipliers so no need to pipeline that until now...).

The ROM... don't think so... it is a sqrt( ) with too many bits I'm afraid.

Tks again for your comments.
If you like to comment now on my refined description I would appreciate.
Thanks,

Luis C.

Gabor · Oct 17, 2012

Luis said:
That helps a lot. Thanks.

I can narrow down the specs and add some info.

The the design can grow to the maximum
I can fit on the device or up to the point of start limiting
the speed.

-The system input has a pair of 16bit ADC.
I decide on an global design CONSTANT how
many bits I use, from (whatever) to 16bit

-At that point of the calculation I have a pair of max 17bit each
so here I have, from (whatever+1) to 17bit

- i^2+q^2 would make 35bit maximum
so here I have, from (whatever*2+1) to 35bit

Ideally I should get back to the same dynamic range and
the sqrt would get me to (whatever+1) to 17bit

now I'm truncating the 35bit to 16bit thus cutting dynamic range in
half which is very uninteresting :-(

since I do have a 16bit transfer bus at the end and since the ENOB of the
ADC's are below 15bit at that speed (120MS/s) I would aim
to 15bit at ADC and 16bit output therefore a sqrt(33bit).

----------

It is a big desgin, and YES I have it all pipelined. there are several
blocks
Each block even takes care of pipelining the address of the data so I don't
care how many clock cycles it takes, I just have to make a pipeline for
address to make it all consistent at the output.

(silly me that I asked for a 1 cycle thing, that was because I had the
i^2+q^2
in embedded multipliers so no need to pipeline that until now...).

The ROM... don't think so... it is a sqrt( ) with too many bits I'm afraid.

Tks again for your comments.
If you like to comment now on my refined description I would appreciate.
Thanks,

Luis C.

You might think about ROM + linear interpolation. You can model
this in an Excel spreadsheet to check the accuracy at different
ROM bit depths. You can also do a normalization before using
the ROM, so you only need values 0.25 <= X < 1 in the table.

-- Gabor

Sven Lundgren · Oct 17, 2012

If you're looking for a relatively small, fast and simple solution and can deal with +/- 4% accuracy, check out the alpha max + beta min algorithm (http://en.wikipedia.org/wiki/Alpha_max_plus_beta_min_algorithm).

Otherwise, try a variation of Anton's suggestion #3, where a CORDIC core/algorithm is used to rotate your vector to 0 degrees, then MV = I.

Anton Gunman · Oct 17, 2012

These are some interesting suggestions by both Gabor and Sven!

If you are still considering using a fully pipelined SQRT implementation, here goes (excuse me for the long post).
I personally do not like using cores (although they are often more optimized than most of the stuff that you will code).

If you try synthesizing the fully combinatorial function your timing will probably be terrible (although it will be able to do it in a single clock cycle).
Here below is some code that used the combinatorial function from the previous links, you can try synthesizing it to check your resource/timing. I would not recommend using that for large bit widths (unless if your clock frequency is quite low).
I have added some I/O registers.

(This is a synthesis resource/timing estimate for INPUT_WIDTH := 32)
Number of Slice Registers : 78
Number of Slice LUTs : 731
Number of fully used LUT-FF pairs: 9
Minimum period: 42.849ns (Maximum Frequency: 23.338MHz)

-------------------------------------------------------------------------------
-- Combinatorial sqrt module
-- Note: I would not recommend using that for synthesis using large bit widths.
-------------------------------------------------------------------------------
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity sqrt_comb is
generic (
INPUT_WIDTH : integer := 32 -- Has to be even
);
port (
clk : in std_logic;
sync_reset : in std_logic;
data_in : in std_logic_vector(INPUT_WIDTH-1 downto 0);
data_out : out std_logic_vector(INPUT_WIDTH/2-1 downto 0)
);
end sqrt_comb;

architecture rtl of sqrt_comb is
-----------------------------------------------------------------------------
-- Function declaration
-----------------------------------------------------------------------------
function SquareRoot (Arg : unsigned) return unsigned is

constant AMSB : integer := Arg'length-1;
constant RMSB : integer := (Arg'length/2) - 1;
variable Root : unsigned(RMSB downto 0);
variable Test : unsigned(RMSB+1 downto 0);
variable Rest : unsigned(AMSB+1 downto 0);

begin

Root := (others => '0');
Rest := '0' & Arg;

for i in RMSB downto 0 loop

Test := Root(RMSB-1 downto 0) & "01";
if Test(RMSB-i+1 downto 0) >
Rest(AMSB+1 downto 2*i) then
Root := Root(RMSB-1 downto 0) & '0';
else
Root := Root(RMSB-1 downto 0) & '1';
Rest(AMSB downto i*2) := Rest(AMSB downto i*2) -
Test(RMSB-i+1 downto 0);
end if;

end loop;

return Root;
end;

-----------------------------------------------------------------------------
-- Signal declaration
-----------------------------------------------------------------------------
signal data_in_reg : unsigned(INPUT_WIDTH-1 downto 0);

begin

-----------------------------------------------------------------------------
-- Calculate sqrt
-- Note: Using I/O registers.
-----------------------------------------------------------------------------
calc_sqrt : process (clk)
begin
if rising_edge(clk) then
if sync_reset = '1' then
data_in_reg <= (others => '0');
data_out <= (others => '0');
else
data_in_reg <= unsigned(data_in);
data_out <= std_logic_vector(SquareRoot(data_in_reg));
end if;
end if;
end process calc_sqrt;

end rtl;
-------------------------------------------------------------------------------

Now if you take the previous function and add some pipeline stages, you can get a fully pipelined design, the timing should be MUCH better, but you will get a latency of INPUT_WIDTH/2.
I am not sure how it compares to a fully pipelined core and have not tried optimizing it.
I personally might require using a SQRT function later on, so I drafted the following implementation:
(You will get some warnings about constants being optimized, since not all the bits are used/set for every stage, but it should be fine).

(This is a synthesis resource/timing estimate for INPUT_WIDTH := 32)
Number of Slice Registers : 434
Number of Slice LUTs : 678
Number of fully used LUT-FF pairs: 294
Minimum period: 3.862ns (Maximum Frequency: 258.903MHz)

-------------------------------------------------------------------------------
-- Fully pipelined sqrt module
-------------------------------------------------------------------------------
-- * The data INPUT_WIDTH has to be even (The output width is INPUT_WIDTH/2).
-- * The latency is INPUT_WIDTH/2.
-- Note: I have not thoroughly tested this module.
-------------------------------------------------------------------------------
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity sqrt_pipelined is
generic (
INPUT_WIDTH : integer := 32 -- Has to be even
);
port (
clk : in std_logic;
sync_reset : in std_logic;
data_in : in std_logic_vector(INPUT_WIDTH-1 downto 0);
data_out : out std_logic_vector(INPUT_WIDTH/2-1 downto 0)
);
end sqrt_pipelined;

architecture rtl of sqrt_pipelined is

-----------------------------------------------------------------------------
-- Signal declaration
-----------------------------------------------------------------------------
constant AMSB : integer := INPUT_WIDTH - 1;
constant RMSB : integer := (INPUT_WIDTH/2) - 1;

-- These are arrays for pipelining. RMSB + 1 stages (which is the latency)
type type_root_array is array (0 to RMSB) of unsigned(RMSB downto 0);
type type_rest_array is array (0 to RMSB) of unsigned(AMSB + 1 downto 0);
signal root : type_root_array;
signal rest : type_rest_array;

begin
-----------------------------------------------------------------------------
-- Calculate sqrt
-----------------------------------------------------------------------------
calc_sqrt : process (clk)
variable test : unsigned(RMSB+1 downto 0); -- Temporary variable
begin
if rising_edge(clk) then
if sync_reset = '1' then
root <= (others => (others => '0'));
rest <= (others => (others => '0'));
else

for i in RMSB downto 0 loop
if i = RMSB then
-- First stage: shift in new data
root(i) <= (others => '0');
rest(i) <= '0' & unsigned(data_in);
else
-- Pipeline shift registers
root(i) <= root(i+1);
rest(i) <= rest(i+1);

test := root(i+1)(RMSB-1 downto 0) & "01";

if test(RMSB-i+1 downto 0) >
rest(i+1)(AMSB+1 downto 2*i) then
root(i) <= root(i+1)(RMSB-1 downto 0) & '0';
else
root(i) <= root(i+1)(RMSB-1 downto 0) & '1';
rest(i)(AMSB downto i*2) <= rest(i+1)(AMSB downto i*2) -
test(RMSB-i+1 downto 0);
end if;

end if;
end loop;
end if;
end if;
end process calc_sqrt;

-- The output is at the last stage of the pipe array
data_out <= std_logic_vector(root(0));

end rtl;
-------------------------------------------------------------------------------

Also here is a small testbench that you can use to test the above code.

-------------------------------------------------------------------------------
-- Testbench for sqrt_pipelined
-------------------------------------------------------------------------------
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity sqrt_pipelined_tb is
end sqrt_pipelined_tb;

architecture behavioral of sqrt_pipelined_tb is

component sqrt_pipelined
generic (
INPUT_WIDTH : integer);
port (
clk : in std_logic;
sync_reset : in std_logic;
data_in : in std_logic_vector(INPUT_WIDTH-1 downto 0);
data_out : out std_logic_vector(INPUT_WIDTH/2-1 downto 0));
end component;

-- component generics
constant INPUT_WIDTH : integer := 32;

-- component ports
signal sync_reset : std_logic;
signal data_in : std_logic_vector(INPUT_WIDTH-1 downto 0);
signal data_out : std_logic_vector(INPUT_WIDTH/2-1 downto 0);

-- clock
signal clk : std_logic := '1';

begin -- behavioral

-- component instantiation
DUT : sqrt_pipelined
generic map (
INPUT_WIDTH => INPUT_WIDTH)
port map (
clk => clk,
sync_reset => sync_reset,
data_in => data_in,
data_out => data_out);

-- clock generation
clk <= not clk after 10 ns;

-- waveform generation
WaveGen_Proc : process
begin
-- insert signal assignments here
sync_reset <= '1';
data_in <= (others => '0');
wait for 100 ns;
wait until rising_edge(clk);
sync_reset <= '0';
data_in <= std_logic_vector(to_unsigned(4, data_in'length));
wait until rising_edge(clk);
data_in <= std_logic_vector(to_unsigned(9, data_in'length));
wait until rising_edge(clk);
data_in <= std_logic_vector(to_unsigned(121, data_in'length));
wait until rising_edge(clk);
data_in <= (others => '0');
wait;
end process WaveGen_Proc;

end behavioral;

Luis Cupido · Oct 17, 2012

Thanks.

I cant really cope with the non pipelined suff,
from Anton's values it results way too slow.

So I will try the pipelined approach.
Many thanks for the code Anton.

In the mean time it doesn't hurt to make a component
using alpha max + beta min algorithm.
I have not evaluated the impact of the error but just to see
all this working it will be worth a try as coding effort
is very small.

Tks all for the great suggestions.

Luis C.

Luis Cupido · Oct 17, 2012

Anton

The pipeline code works fantastic but has a small bug in the algorithm
I could not figure out what it is.

it computes perfect up to 32767^2 and for any value above this
the result is always 32767=7FFF
For 32 bit input and 16bit out I should go up to 0xFFFF at the output.
In fact the compiler warns that data(15) is stuck low.

the code is from your message copy/paste.
I sure you can spot what is wrong in a eye blink.

Super thanks.
Luis Cupido.

Anton Gunman · Oct 17, 2012

Thanks for pointing that out !
Indeed there was a bug, I added an extra pipeline stage, which increases the latency by one but should fix the problem (I was not iterating over all the bits).
The current latency is INPUT_WIDTH/2 + 1.
Note: If you're going to test the corner cases make sure you don't reach the INTEGER limitations inside the testbench.
I was able to reproduce the problem by setting the generic INPUT_WIDTH := 8 in the testbench.
Below is the modified code, let me know if there's still a problem.

Anton

-------------------------------------------------------------------------------
-- Fully pipelined sqrt module
-------------------------------------------------------------------------------
-- * The data INPUT_WIDTH has to be even (The output width is INPUT_WIDTH/2).
-- * The latency is INPUT_WIDTH/2 + 1.
-- Note: I have not thoroughly tested this module.
-------------------------------------------------------------------------------
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity sqrt_pipelined is
generic (
INPUT_WIDTH : integer := 32 -- Has to be even
);
port (
clk : in std_logic;
sync_reset : in std_logic;
data_in : in std_logic_vector(INPUT_WIDTH-1 downto 0);
data_out : out std_logic_vector(INPUT_WIDTH/2-1 downto 0)
);
end sqrt_pipelined;

architecture rtl of sqrt_pipelined is

-----------------------------------------------------------------------------
-- Signal declaration
-----------------------------------------------------------------------------
constant AMSB : integer := INPUT_WIDTH - 1;
constant RMSB : integer := (INPUT_WIDTH/2) - 1;

-- These are arrays for pipelining. RMSB + 2 stages (which is the latency)
type type_root_array is array (0 to RMSB+1) of unsigned(RMSB downto 0);
type type_rest_array is array (0 to RMSB+1) of unsigned(AMSB + 1 downto 0);
signal root : type_root_array;
signal rest : type_rest_array;

begin
-----------------------------------------------------------------------------
-- Calculate sqrt
-----------------------------------------------------------------------------
calc_sqrt : process (clk)
variable test : unsigned(RMSB+1 downto 0); -- Temporary variable
begin
if rising_edge(clk) then
if sync_reset = '1' then
root <= (others =>(others => '0'));
rest <= (others =>(others => '0'));
else
-- Input register
root(RMSB + 1) <= (others => '0');
rest(RMSB + 1) <= '0' & unsigned(data_in);

for i in RMSB downto 0 loop
-- Pipeline shift registers
root(i) <= root(i+1);
rest(i) <= rest(i+1);

test := root(i+1)(RMSB-1 downto 0) & "01";

if test(RMSB-i+1 downto 0) >
rest(i+1)(AMSB+1 downto 2*i) then
root(i) <= root(i+1)(RMSB-1 downto 0) & '0';
else
root(i) <= root(i+1)(RMSB-1 downto 0) & '1';
rest(i)(AMSB downto i*2) <= rest(i+1)(AMSB downto i*2) -
test(RMSB-i+1 downto 0);
end if;
end loop;
end if;
end if;
end process calc_sqrt;

-- The output is at the last stage of the pipe array
data_out <= std_logic_vector(root(0));

end rtl;

Anton Gunman · Oct 17, 2012

Thanks for pointing that out !
Indeed there was a bug, I added an extra pipeline stage, which increases the latency by one but should fix the problem.
The current latency is INPUT_WIDTH/2 + 1.
Note: If you're going to test the corner cases make sure you don't reach the INTEGER limitations inside the testbench.
I was able to reproduce the problem by setting the generic INPUT_WIDTH := 8 in the testbench.
Below is the modified code, let me know if there's still a problem.

Anton

-------------------------------------------------------------------------------
-- Fully pipelined sqrt module
-------------------------------------------------------------------------------
-- * The data INPUT_WIDTH has to be even (The output width is INPUT_WIDTH/2).
-- * The latency is INPUT_WIDTH/2 + 1.
-- Note: I have not thoroughly tested this module.
-------------------------------------------------------------------------------
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity sqrt_pipelined is
generic (
INPUT_WIDTH : integer := 32 -- Has to be even
);
port (
clk : in std_logic;
sync_reset : in std_logic;
data_in : in std_logic_vector(INPUT_WIDTH-1 downto 0);
data_out : out std_logic_vector(INPUT_WIDTH/2-1 downto 0)
);
end sqrt_pipelined;

architecture rtl of sqrt_pipelined is

-----------------------------------------------------------------------------
-- Signal declaration
-----------------------------------------------------------------------------
constant AMSB : integer := INPUT_WIDTH - 1;
constant RMSB : integer := (INPUT_WIDTH/2) - 1;

-- These are arrays for pipelining. RMSB + 2 stages (which is the latency)
type type_root_array is array (0 to RMSB+1) of unsigned(RMSB downto 0);
type type_rest_array is array (0 to RMSB+1) of unsigned(AMSB + 1 downto 0);
signal root : type_root_array;
signal rest : type_rest_array;

begin
-----------------------------------------------------------------------------
-- Calculate sqrt
-----------------------------------------------------------------------------
calc_sqrt : process (clk)
variable test : unsigned(RMSB+1 downto 0); -- Temporary variable
begin
if rising_edge(clk) then
if sync_reset = '1' then
root <= (others =>(others => '0'));
rest <= (others =>(others => '0'));
else
-- Input register
root(RMSB + 1) <= (others => '0');
rest(RMSB + 1) <= '0' & unsigned(data_in);

for i in RMSB downto 0 loop
-- Pipeline shift registers
root(i) <= root(i+1);
rest(i) <= rest(i+1);

test := root(i+1)(RMSB-1 downto 0) & "01";

if test(RMSB-i+1 downto 0) >
rest(i+1)(AMSB+1 downto 2*i) then
root(i) <= root(i+1)(RMSB-1 downto 0) & '0';
else
root(i) <= root(i+1)(RMSB-1 downto 0) & '1';
rest(i)(AMSB downto i*2) <= rest(i+1)(AMSB downto i*2) -
test(RMSB-i+1 downto 0);
end if;
end loop;
end if;
end if;
end process calc_sqrt;

-- The output is at the last stage of the pipe array
data_out <= std_logic_vector(root(0));

end rtl;

Pontus · Oct 17, 2012

Hello.

I would like to have at some point of my code

MV= sqrt(I^2+Q^2);

I saw somewhere that using the fact: sqrt(x)=x*1/sqrt(x) could be
useful since 1/sqrt(x)
"behaves" better and multipliers are (sometimes) free...

Do you really want the sqrt(i^2+q^2), or will you anyway do log(x)
[x=i^2+q^2] later on?
In that case you could use log(sqrt(x))=1/2log(x) and concentrate on
log().

HTH -- pontus

Luis Cupido · Oct 17, 2012

Below is the modified code, let me know if there's still a problem.

Anton

Excellent. works fantastic all the range.
Super thanks...
Now lets keep coding all the rest of it... ;-)

Many thanks.
Luis C.

p.s. drop me a line to put in the header for credits please. tks.
If I get famous for this device I'm doing (I doubt) you
deserve a share of the fame ;-)

Luis Cupido · Oct 17, 2012

Hello.

I would like to have at some point of my code

MV= sqrt(I^2+Q^2);

Click to expand...

I saw somewhere that using the fact: sqrt(x)=x*1/sqrt(x) could be
useful since 1/sqrt(x)
"behaves" better and multipliers are (sometimes) free...

Do you really want the sqrt(i^2+q^2), or will you anyway do log(x)
[x=i^2+q^2] later on?
In that case you could use log(sqrt(x))=1/2log(x) and concentrate on
log().

HTH -- pontus

Hi Pontus.

data is coming from a radix2 pipelined FFT what follows is a massive
accumulating memory so there is not really a log( ).

lc.

rickman · Oct 19, 2012

Hello.

I would like to have at some point of my code

MV= sqrt(I^2+Q^2);

but since the above I have no clue how to do in a single clock cycle
I have just I^2+Q^2 in the form:

pv <= i**i + q**q;

That uses a few embedded multipliers and works great.

The number of bits in the result is obviously in (2x)+1
Since I can't afford to manipulate that wide word further
down the chain I just truncated it to the same number of bits of the input.

All fine but a lot of dynamic range was lost
(this is some DSP coding as you might have guessed already).

Is there any compression function just gates, that could work
unclocked (for me to get done with one clock latency) that does the
SQRT( ) or anything that resembles this function just to get the same
dynamyc range preservation effect as the SQRT( ) does.

Pardon me for being a bit off topic.

Anyone has any idea ?

Luis C.

From the way you express your question, I get the impression that your
real concern is the "dynamic" range of the squared result, or in other
words the number of bits. If it really is just the number of bits, you
could go with logarithmic compression, or in other terms, convert your
squared number to a floating point value. That should be pretty easy to
do in a single clock cycle, if needed the multiplier hardware can be
used to shift the mantissa.

Would that help you or do you really need the square root?

Rick

Synthesis of Concurrent Statements for FIR Filter	7	Mar 24, 2009
Some support for VHDL project	1	Jul 15, 2009
Request for help	22	Sep 20, 2007
For Peer Review	1	Apr 2, 2010
Still newbie question : Dialog between states machines.	7	Jan 30, 2004
Am I right in my VHDL code? Synopsys DC runs for ever...	8	Sep 23, 2003
extension_pack	0	Jan 5, 2006
comp.lang.vhdl FAQ part 1 of 4: general	0	Jul 8, 2003

doing sqrt( ) for synthesis.

Luis Cupido

Anton Gunman

Luis Cupido

Gabor

Sven Lundgren

Anton Gunman

Luis Cupido

Luis Cupido

Anton Gunman

Anton Gunman

Pontus

Luis Cupido

Luis Cupido

rickman

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads