doing sqrt( ) for synthesis.

Discussion in 'VHDL' started by Luis Cupido, Oct 17, 2012.

  1. Luis Cupido

    Luis Cupido Guest

    Hello.

    I would like to have at some point of my code

    MV= sqrt(I^2+Q^2);

    but since the above I have no clue how to do in a single clock cycle
    I have just I^2+Q^2 in the form:

    pv <= i**i + q**q;

    That uses a few embedded multipliers and works great.

    The number of bits in the result is obviously in (2x)+1
    Since I can't afford to manipulate that wide word further
    down the chain I just truncated it to the same number of bits of the input.

    All fine but a lot of dynamic range was lost
    (this is some DSP coding as you might have guessed already).

    Is there any compression function just gates, that could work
    unclocked (for me to get done with one clock latency) that does the
    SQRT( ) or anything that resembles this function just to get the same
    dynamyc range preservation effect as the SQRT( ) does.

    Pardon me for being a bit off topic.

    Anyone has any idea ?

    Luis C.
     
    Luis Cupido, Oct 17, 2012
    #1
    1. Advertising

  2. Luis Cupido

    Anton Gunman Guest

    Hello,

    That's an interesting one! :p
    I think that you first need to decide, or at least estimate, how much precision you will need.
    - How many bits can you truncate before doing the SQRT?
    - How many bits will the SQRT output? (Typically it's a fixed point notation)
    - How much resources can you afford to use?
    - Is your design fully pipelined (perform a new calculation at every clock cycle), or can you afford to have some sort of "load"/"done" signals to perform your calculations?

    Afterwards, here are some ideas you can look into:
    1- Sometimes you don't really need to do the SQRT, you can just use the result prior to the SQRT and keep some of the MSBs (it is not as accurate).
    2- Do you really need it to be done in a single clock cycle? Sometimes a pipelined design could be used (takes several clock cycles to output the first result, but can accept and output a new result at every other clock cycle).
    3- You can try using your tool's core generator (e.g. Xilinx does have some SQRT cores that use the CORDIC algorithm, you can generate them using the wizard).
    - You might be able to choose between a slow core (area efficient and several clock cycles per operation) and a fast core (fully pipelined and consumes more resources).
    4- You can also take a look at the following links (I have not tested them):
    - https://groups.google.com/forum/?fromgroups=#!searchin/comp.lang.vhdl/square$20root/comp.lang.vhdl/BQ9MypRSkhM/Qf7uLkIrmJQJ (see "function SquareRoot")
    - http://vhdlguru.blogspot.ca/2010/03/vhdl-function-for-finding-square-root.html
    - But these are COMBINATORIAL circuits (no clocks), they are not efficient in terms of resources and might affect timing, so be careful here, especially with large bit widths!
    5- Depending on the number of bits you are using, you can think of any function (in your case the SQRT) as a large LUT and implement it using a single ROM with pre-calculated values.

    The info below roughly applies to internal block RAM/ROMs.
    Keep in mind that the FPGA RAM is limited (you can quickly check your FPGA's datasheet to get an estimate of how many of these block RAMs you can use).
    If you are using Xilinx devices, typically a single BRAM (small internal FPGA block RAM) is either 18kbits 18*1024 bits (or 36 kbits).
    These BRAM can be configured with various address/data configurations (for e.g. 10 address bits and 18 data bits, etc..., not all configurations might be valid)
    You would typically code it as an array and let the tools worry about it (but you MUST check the synthesis result to see if the BRAM has been correctly inferred or if you are using way too many BRAM)
    In this case you might need to change your approach.
    Note: The ROM/RAM has to be synchronous, otherwise the synthesizer will try using registers instead of BRAM and you will see an explosion in resource utilization.

    A rather "rough" way of approximating the number of BRAM that will be consumed is the following:
    Num. BRAM = ceil(2^addr_width * data_width/1024/18) (use 36 instead of 18 depending on your device, but you should get the point)

    The actual number will depend on the configuration possibilities, your coding style, and how "smart" the tools are.

    In your case IF you can round your result (prior to the SQRT) to something like 10 bits (or just use a generic for that), this might work for you.
    e.g.:

    type type_my_rom is array(integer range 0 to 2**10-1) of unsigned(18-1 downto 0);
    signal my_sqrt_rom : type_my_rom := MY_INIT_FUNCTION(); -- You can write an init function of some sort, or read it from file etc...
    --...

    -- clocked process
    -- ...
    pv <= i*i + q*q; -- Calculate
    pv_10bit <= pv(pv'high downto pv'high-10); -- Truncate
    mv <= my_sqrt_rom(pv_10bit); -- SQRT using a ROM, pv_10bit is used as the address

    Note: this is really going to depend on your application's needs. I would suggest you try it out in software (or non-synthesizable code) and see how many bits you can truncate without losing too much precision.
    10 bits might be too much of a truncation and if you increase your addr_width to 16 bits you will quickly run out of resources, so this might not be the best approach.

    - I would probably go for the core generator approach, you can check the documentation online to see if it fits your requirements, or just try generating it. (It will still probably require a few clock cycles depending on the precision).
    - If you can afford the LUT resources, maybe look into the combinatorial implementations #2 (you can simply try synthesizing it and checking your resource usage and timing estimate).

    I hope this helps a bit,
    Anton
     
    Anton Gunman, Oct 17, 2012
    #2
    1. Advertising

  3. Luis Cupido

    Luis Cupido Guest


    > I hope this helps a bit,
    > Anton


    That helps a lot. Thanks.

    I can narrow down the specs and add some info.

    The the design can grow to the maximum
    I can fit on the device or up to the point of start limiting
    the speed.


    -The system input has a pair of 16bit ADC.
    I decide on an global design CONSTANT how
    many bits I use, from (whatever) to 16bit

    -At that point of the calculation I have a pair of max 17bit each
    so here I have, from (whatever+1) to 17bit

    - i^2+q^2 would make 35bit maximum
    so here I have, from (whatever*2+1) to 35bit

    Ideally I should get back to the same dynamic range and
    the sqrt would get me to (whatever+1) to 17bit

    now I'm truncating the 35bit to 16bit thus cutting dynamic range in
    half which is very uninteresting :-(

    since I do have a 16bit transfer bus at the end and since the ENOB of the
    ADC's are below 15bit at that speed (120MS/s) I would aim
    to 15bit at ADC and 16bit output therefore a sqrt(33bit).

    ----------

    It is a big desgin, and YES I have it all pipelined. there are several
    blocks
    Each block even takes care of pipelining the address of the data so I don't
    care how many clock cycles it takes, I just have to make a pipeline for
    address to make it all consistent at the output.

    (silly me that I asked for a 1 cycle thing, that was because I had the
    i^2+q^2
    in embedded multipliers so no need to pipeline that until now...).

    The ROM... don't think so... it is a sqrt( ) with too many bits I'm afraid.

    Tks again for your comments.
    If you like to comment now on my refined description I would appreciate.
    Thanks,

    Luis C.
     
    Luis Cupido, Oct 17, 2012
    #3
  4. Luis Cupido

    Gabor Guest

    Luis Cupido wrote:
    >
    >> I hope this helps a bit,
    >> Anton

    >
    > That helps a lot. Thanks.
    >
    > I can narrow down the specs and add some info.
    >
    > The the design can grow to the maximum
    > I can fit on the device or up to the point of start limiting
    > the speed.
    >
    >
    > -The system input has a pair of 16bit ADC.
    > I decide on an global design CONSTANT how
    > many bits I use, from (whatever) to 16bit
    >
    > -At that point of the calculation I have a pair of max 17bit each
    > so here I have, from (whatever+1) to 17bit
    >
    > - i^2+q^2 would make 35bit maximum
    > so here I have, from (whatever*2+1) to 35bit
    >
    > Ideally I should get back to the same dynamic range and
    > the sqrt would get me to (whatever+1) to 17bit
    >
    > now I'm truncating the 35bit to 16bit thus cutting dynamic range in
    > half which is very uninteresting :-(
    >
    > since I do have a 16bit transfer bus at the end and since the ENOB of the
    > ADC's are below 15bit at that speed (120MS/s) I would aim
    > to 15bit at ADC and 16bit output therefore a sqrt(33bit).
    >
    > ----------
    >
    > It is a big desgin, and YES I have it all pipelined. there are several
    > blocks
    > Each block even takes care of pipelining the address of the data so I don't
    > care how many clock cycles it takes, I just have to make a pipeline for
    > address to make it all consistent at the output.
    >
    > (silly me that I asked for a 1 cycle thing, that was because I had the
    > i^2+q^2
    > in embedded multipliers so no need to pipeline that until now...).
    >
    > The ROM... don't think so... it is a sqrt( ) with too many bits I'm afraid.
    >
    > Tks again for your comments.
    > If you like to comment now on my refined description I would appreciate.
    > Thanks,
    >
    > Luis C.
    >
    >

    You might think about ROM + linear interpolation. You can model
    this in an Excel spreadsheet to check the accuracy at different
    ROM bit depths. You can also do a normalization before using
    the ROM, so you only need values 0.25 <= X < 1 in the table.

    -- Gabor
     
    Gabor, Oct 17, 2012
    #4
  5. If you're looking for a relatively small, fast and simple solution and can deal with +/- 4% accuracy, check out the alpha max + beta min algorithm (http://en.wikipedia.org/wiki/Alpha_max_plus_beta_min_algorithm).

    Otherwise, try a variation of Anton's suggestion #3, where a CORDIC core/algorithm is used to rotate your vector to 0 degrees, then MV = I.
     
    Sven Lundgren, Oct 17, 2012
    #5
  6. Luis Cupido

    Anton Gunman Guest

    These are some interesting suggestions by both Gabor and Sven!

    If you are still considering using a fully pipelined SQRT implementation, here goes (excuse me for the long post).
    I personally do not like using cores (although they are often more optimized than most of the stuff that you will code).

    If you try synthesizing the fully combinatorial function your timing will probably be terrible (although it will be able to do it in a single clock cycle).
    Here below is some code that used the combinatorial function from the previous links, you can try synthesizing it to check your resource/timing. I would not recommend using that for large bit widths (unless if your clock frequency is quite low).
    I have added some I/O registers.

    (This is a synthesis resource/timing estimate for INPUT_WIDTH := 32)
    Number of Slice Registers : 78
    Number of Slice LUTs : 731
    Number of fully used LUT-FF pairs: 9
    Minimum period: 42.849ns (Maximum Frequency: 23.338MHz)

    -------------------------------------------------------------------------------
    -- Combinatorial sqrt module
    -- Note: I would not recommend using that for synthesis using large bit widths.
    -------------------------------------------------------------------------------
    library ieee;
    use ieee.std_logic_1164.all;
    use ieee.numeric_std.all;

    entity sqrt_comb is
    generic (
    INPUT_WIDTH : integer := 32 -- Has to be even
    );
    port (
    clk : in std_logic;
    sync_reset : in std_logic;
    data_in : in std_logic_vector(INPUT_WIDTH-1 downto 0);
    data_out : out std_logic_vector(INPUT_WIDTH/2-1 downto 0)
    );
    end sqrt_comb;

    architecture rtl of sqrt_comb is
    -----------------------------------------------------------------------------
    -- Function declaration
    -----------------------------------------------------------------------------
    function SquareRoot (Arg : unsigned) return unsigned is

    constant AMSB : integer := Arg'length-1;
    constant RMSB : integer := (Arg'length/2) - 1;
    variable Root : unsigned(RMSB downto 0);
    variable Test : unsigned(RMSB+1 downto 0);
    variable Rest : unsigned(AMSB+1 downto 0);

    begin

    Root := (others => '0');
    Rest := '0' & Arg;

    for i in RMSB downto 0 loop

    Test := Root(RMSB-1 downto 0) & "01";
    if Test(RMSB-i+1 downto 0) >
    Rest(AMSB+1 downto 2*i) then
    Root := Root(RMSB-1 downto 0) & '0';
    else
    Root := Root(RMSB-1 downto 0) & '1';
    Rest(AMSB downto i*2) := Rest(AMSB downto i*2) -
    Test(RMSB-i+1 downto 0);
    end if;

    end loop;

    return Root;
    end;

    -----------------------------------------------------------------------------
    -- Signal declaration
    -----------------------------------------------------------------------------
    signal data_in_reg : unsigned(INPUT_WIDTH-1 downto 0);

    begin

    -----------------------------------------------------------------------------
    -- Calculate sqrt
    -- Note: Using I/O registers.
    -----------------------------------------------------------------------------
    calc_sqrt : process (clk)
    begin
    if rising_edge(clk) then
    if sync_reset = '1' then
    data_in_reg <= (others => '0');
    data_out <= (others => '0');
    else
    data_in_reg <= unsigned(data_in);
    data_out <= std_logic_vector(SquareRoot(data_in_reg));
    end if;
    end if;
    end process calc_sqrt;

    end rtl;
    -------------------------------------------------------------------------------

    Now if you take the previous function and add some pipeline stages, you can get a fully pipelined design, the timing should be MUCH better, but you will get a latency of INPUT_WIDTH/2.
    I am not sure how it compares to a fully pipelined core and have not tried optimizing it.
    I personally might require using a SQRT function later on, so I drafted the following implementation:
    (You will get some warnings about constants being optimized, since not all the bits are used/set for every stage, but it should be fine).


    (This is a synthesis resource/timing estimate for INPUT_WIDTH := 32)
    Number of Slice Registers : 434
    Number of Slice LUTs : 678
    Number of fully used LUT-FF pairs: 294
    Minimum period: 3.862ns (Maximum Frequency: 258.903MHz)

    -------------------------------------------------------------------------------
    -- Fully pipelined sqrt module
    -------------------------------------------------------------------------------
    -- * The data INPUT_WIDTH has to be even (The output width is INPUT_WIDTH/2).
    -- * The latency is INPUT_WIDTH/2.
    -- Note: I have not thoroughly tested this module.
    -------------------------------------------------------------------------------
    library ieee;
    use ieee.std_logic_1164.all;
    use ieee.numeric_std.all;

    entity sqrt_pipelined is
    generic (
    INPUT_WIDTH : integer := 32 -- Has to be even
    );
    port (
    clk : in std_logic;
    sync_reset : in std_logic;
    data_in : in std_logic_vector(INPUT_WIDTH-1 downto 0);
    data_out : out std_logic_vector(INPUT_WIDTH/2-1 downto 0)
    );
    end sqrt_pipelined;

    architecture rtl of sqrt_pipelined is

    -----------------------------------------------------------------------------
    -- Signal declaration
    -----------------------------------------------------------------------------
    constant AMSB : integer := INPUT_WIDTH - 1;
    constant RMSB : integer := (INPUT_WIDTH/2) - 1;

    -- These are arrays for pipelining. RMSB + 1 stages (which is the latency)
    type type_root_array is array (0 to RMSB) of unsigned(RMSB downto 0);
    type type_rest_array is array (0 to RMSB) of unsigned(AMSB + 1 downto 0);
    signal root : type_root_array;
    signal rest : type_rest_array;

    begin
    -----------------------------------------------------------------------------
    -- Calculate sqrt
    -----------------------------------------------------------------------------
    calc_sqrt : process (clk)
    variable test : unsigned(RMSB+1 downto 0); -- Temporary variable
    begin
    if rising_edge(clk) then
    if sync_reset = '1' then
    root <= (others => (others => '0'));
    rest <= (others => (others => '0'));
    else

    for i in RMSB downto 0 loop
    if i = RMSB then
    -- First stage: shift in new data
    root(i) <= (others => '0');
    rest(i) <= '0' & unsigned(data_in);
    else
    -- Pipeline shift registers
    root(i) <= root(i+1);
    rest(i) <= rest(i+1);

    test := root(i+1)(RMSB-1 downto 0) & "01";

    if test(RMSB-i+1 downto 0) >
    rest(i+1)(AMSB+1 downto 2*i) then
    root(i) <= root(i+1)(RMSB-1 downto 0) & '0';
    else
    root(i) <= root(i+1)(RMSB-1 downto 0) & '1';
    rest(i)(AMSB downto i*2) <= rest(i+1)(AMSB downto i*2) -
    test(RMSB-i+1 downto 0);
    end if;

    end if;
    end loop;
    end if;
    end if;
    end process calc_sqrt;

    -- The output is at the last stage of the pipe array
    data_out <= std_logic_vector(root(0));

    end rtl;
    -------------------------------------------------------------------------------


    Also here is a small testbench that you can use to test the above code.

    -------------------------------------------------------------------------------
    -- Testbench for sqrt_pipelined
    -------------------------------------------------------------------------------
    library ieee;
    use ieee.std_logic_1164.all;
    use ieee.numeric_std.all;

    entity sqrt_pipelined_tb is
    end sqrt_pipelined_tb;

    architecture behavioral of sqrt_pipelined_tb is

    component sqrt_pipelined
    generic (
    INPUT_WIDTH : integer);
    port (
    clk : in std_logic;
    sync_reset : in std_logic;
    data_in : in std_logic_vector(INPUT_WIDTH-1 downto 0);
    data_out : out std_logic_vector(INPUT_WIDTH/2-1 downto 0));
    end component;

    -- component generics
    constant INPUT_WIDTH : integer := 32;

    -- component ports
    signal sync_reset : std_logic;
    signal data_in : std_logic_vector(INPUT_WIDTH-1 downto 0);
    signal data_out : std_logic_vector(INPUT_WIDTH/2-1 downto 0);

    -- clock
    signal clk : std_logic := '1';

    begin -- behavioral

    -- component instantiation
    DUT : sqrt_pipelined
    generic map (
    INPUT_WIDTH => INPUT_WIDTH)
    port map (
    clk => clk,
    sync_reset => sync_reset,
    data_in => data_in,
    data_out => data_out);

    -- clock generation
    clk <= not clk after 10 ns;

    -- waveform generation
    WaveGen_Proc : process
    begin
    -- insert signal assignments here
    sync_reset <= '1';
    data_in <= (others => '0');
    wait for 100 ns;
    wait until rising_edge(clk);
    sync_reset <= '0';
    data_in <= std_logic_vector(to_unsigned(4, data_in'length));
    wait until rising_edge(clk);
    data_in <= std_logic_vector(to_unsigned(9, data_in'length));
    wait until rising_edge(clk);
    data_in <= std_logic_vector(to_unsigned(121, data_in'length));
    wait until rising_edge(clk);
    data_in <= (others => '0');
    wait;
    end process WaveGen_Proc;

    end behavioral;
    -------------------------------------------------------------------------------

    Cheers !

    By the way, does anyone know if we can paste source code in the group with proper formatting (or at least courier fonts)? Thanks !
     
    Anton Gunman, Oct 17, 2012
    #6
  7. Luis Cupido

    Luis Cupido Guest

    Thanks.

    I cant really cope with the non pipelined suff,
    from Anton's values it results way too slow.

    So I will try the pipelined approach.
    Many thanks for the code Anton.

    In the mean time it doesn't hurt to make a component
    using alpha max + beta min algorithm.
    I have not evaluated the impact of the error but just to see
    all this working it will be worth a try as coding effort
    is very small.

    Tks all for the great suggestions.

    Luis C.
     
    Luis Cupido, Oct 17, 2012
    #7
  8. Luis Cupido

    Luis Cupido Guest

    Anton

    The pipeline code works fantastic but has a small bug in the algorithm
    I could not figure out what it is.

    it computes perfect up to 32767^2 and for any value above this
    the result is always 32767=7FFF
    For 32 bit input and 16bit out I should go up to 0xFFFF at the output.
    In fact the compiler warns that data(15) is stuck low.

    the code is from your message copy/paste.
    I sure you can spot what is wrong in a eye blink.

    Super thanks.
    Luis Cupido.
     
    Luis Cupido, Oct 17, 2012
    #8
  9. Luis Cupido

    Anton Gunman Guest

    Thanks for pointing that out !
    Indeed there was a bug, I added an extra pipeline stage, which increases the latency by one but should fix the problem (I was not iterating over all the bits).
    The current latency is INPUT_WIDTH/2 + 1.
    Note: If you're going to test the corner cases make sure you don't reach the INTEGER limitations inside the testbench.
    I was able to reproduce the problem by setting the generic INPUT_WIDTH := 8 in the testbench.
    Below is the modified code, let me know if there's still a problem.

    Anton

    -------------------------------------------------------------------------------
    -- Fully pipelined sqrt module
    -------------------------------------------------------------------------------
    -- * The data INPUT_WIDTH has to be even (The output width is INPUT_WIDTH/2).
    -- * The latency is INPUT_WIDTH/2 + 1.
    -- Note: I have not thoroughly tested this module.
    -------------------------------------------------------------------------------
    library ieee;
    use ieee.std_logic_1164.all;
    use ieee.numeric_std.all;

    entity sqrt_pipelined is
    generic (
    INPUT_WIDTH : integer := 32 -- Has to be even
    );
    port (
    clk : in std_logic;
    sync_reset : in std_logic;
    data_in : in std_logic_vector(INPUT_WIDTH-1 downto 0);
    data_out : out std_logic_vector(INPUT_WIDTH/2-1 downto 0)
    );
    end sqrt_pipelined;

    architecture rtl of sqrt_pipelined is

    -----------------------------------------------------------------------------
    -- Signal declaration
    -----------------------------------------------------------------------------
    constant AMSB : integer := INPUT_WIDTH - 1;
    constant RMSB : integer := (INPUT_WIDTH/2) - 1;

    -- These are arrays for pipelining. RMSB + 2 stages (which is the latency)
    type type_root_array is array (0 to RMSB+1) of unsigned(RMSB downto 0);
    type type_rest_array is array (0 to RMSB+1) of unsigned(AMSB + 1 downto 0);
    signal root : type_root_array;
    signal rest : type_rest_array;

    begin
    -----------------------------------------------------------------------------
    -- Calculate sqrt
    -----------------------------------------------------------------------------
    calc_sqrt : process (clk)
    variable test : unsigned(RMSB+1 downto 0); -- Temporary variable
    begin
    if rising_edge(clk) then
    if sync_reset = '1' then
    root <= (others =>(others => '0'));
    rest <= (others =>(others => '0'));
    else
    -- Input register
    root(RMSB + 1) <= (others => '0');
    rest(RMSB + 1) <= '0' & unsigned(data_in);

    for i in RMSB downto 0 loop
    -- Pipeline shift registers
    root(i) <= root(i+1);
    rest(i) <= rest(i+1);

    test := root(i+1)(RMSB-1 downto 0) & "01";

    if test(RMSB-i+1 downto 0) >
    rest(i+1)(AMSB+1 downto 2*i) then
    root(i) <= root(i+1)(RMSB-1 downto 0) & '0';
    else
    root(i) <= root(i+1)(RMSB-1 downto 0) & '1';
    rest(i)(AMSB downto i*2) <= rest(i+1)(AMSB downto i*2) -
    test(RMSB-i+1 downto 0);
    end if;
    end loop;
    end if;
    end if;
    end process calc_sqrt;

    -- The output is at the last stage of the pipe array
    data_out <= std_logic_vector(root(0));

    end rtl;
     
    Anton Gunman, Oct 17, 2012
    #9
  10. Luis Cupido

    Anton Gunman Guest

    Thanks for pointing that out !
    Indeed there was a bug, I added an extra pipeline stage, which increases the latency by one but should fix the problem.
    The current latency is INPUT_WIDTH/2 + 1.
    Note: If you're going to test the corner cases make sure you don't reach the INTEGER limitations inside the testbench.
    I was able to reproduce the problem by setting the generic INPUT_WIDTH := 8 in the testbench.
    Below is the modified code, let me know if there's still a problem.

    Anton

    -------------------------------------------------------------------------------
    -- Fully pipelined sqrt module
    -------------------------------------------------------------------------------
    -- * The data INPUT_WIDTH has to be even (The output width is INPUT_WIDTH/2).
    -- * The latency is INPUT_WIDTH/2 + 1.
    -- Note: I have not thoroughly tested this module.
    -------------------------------------------------------------------------------
    library ieee;
    use ieee.std_logic_1164.all;
    use ieee.numeric_std.all;

    entity sqrt_pipelined is
    generic (
    INPUT_WIDTH : integer := 32 -- Has to be even
    );
    port (
    clk : in std_logic;
    sync_reset : in std_logic;
    data_in : in std_logic_vector(INPUT_WIDTH-1 downto 0);
    data_out : out std_logic_vector(INPUT_WIDTH/2-1 downto 0)
    );
    end sqrt_pipelined;

    architecture rtl of sqrt_pipelined is

    -----------------------------------------------------------------------------
    -- Signal declaration
    -----------------------------------------------------------------------------
    constant AMSB : integer := INPUT_WIDTH - 1;
    constant RMSB : integer := (INPUT_WIDTH/2) - 1;

    -- These are arrays for pipelining. RMSB + 2 stages (which is the latency)
    type type_root_array is array (0 to RMSB+1) of unsigned(RMSB downto 0);
    type type_rest_array is array (0 to RMSB+1) of unsigned(AMSB + 1 downto 0);
    signal root : type_root_array;
    signal rest : type_rest_array;

    begin
    -----------------------------------------------------------------------------
    -- Calculate sqrt
    -----------------------------------------------------------------------------
    calc_sqrt : process (clk)
    variable test : unsigned(RMSB+1 downto 0); -- Temporary variable
    begin
    if rising_edge(clk) then
    if sync_reset = '1' then
    root <= (others =>(others => '0'));
    rest <= (others =>(others => '0'));
    else
    -- Input register
    root(RMSB + 1) <= (others => '0');
    rest(RMSB + 1) <= '0' & unsigned(data_in);

    for i in RMSB downto 0 loop
    -- Pipeline shift registers
    root(i) <= root(i+1);
    rest(i) <= rest(i+1);

    test := root(i+1)(RMSB-1 downto 0) & "01";

    if test(RMSB-i+1 downto 0) >
    rest(i+1)(AMSB+1 downto 2*i) then
    root(i) <= root(i+1)(RMSB-1 downto 0) & '0';
    else
    root(i) <= root(i+1)(RMSB-1 downto 0) & '1';
    rest(i)(AMSB downto i*2) <= rest(i+1)(AMSB downto i*2) -
    test(RMSB-i+1 downto 0);
    end if;
    end loop;
    end if;
    end if;
    end process calc_sqrt;

    -- The output is at the last stage of the pipe array
    data_out <= std_logic_vector(root(0));

    end rtl;
     
    Anton Gunman, Oct 17, 2012
    #10
  11. Luis Cupido

    Pontus Guest

    On 17 Okt, 05:02, Luis Cupido <> wrote:
    > Hello.
    >
    > I would like to have at some point of my code
    >
    > MV= sqrt(I^2+Q^2);
    >


    I saw somewhere that using the fact: sqrt(x)=x*1/sqrt(x) could be
    useful since 1/sqrt(x)
    "behaves" better and multipliers are (sometimes) free...

    Do you really want the sqrt(i^2+q^2), or will you anyway do log(x)
    [x=i^2+q^2] later on?
    In that case you could use log(sqrt(x))=1/2log(x) and concentrate on
    log().

    HTH -- pontus
     
    Pontus, Oct 17, 2012
    #11
  12. Luis Cupido

    Luis Cupido Guest

    On 10/17/2012 9:15 PM, Anton Gunman wrote:
    > Below is the modified code, let me know if there's still a problem.
    >
    > Anton


    Excellent. works fantastic all the range.
    Super thanks...
    Now lets keep coding all the rest of it... ;-)

    Many thanks.
    Luis C.

    p.s. drop me a line to put in the header for credits please. tks.
    If I get famous for this device I'm doing (I doubt) you
    deserve a share of the fame ;-)
     
    Luis Cupido, Oct 18, 2012
    #12
  13. Luis Cupido

    Luis Cupido Guest

    On 10/17/2012 10:45 PM, Pontus wrote:
    > On 17 Okt, 05:02, Luis Cupido<> wrote:
    >> Hello.
    >>
    >> I would like to have at some point of my code
    >>
    >> MV= sqrt(I^2+Q^2);
    >>

    >
    > I saw somewhere that using the fact: sqrt(x)=x*1/sqrt(x) could be
    > useful since 1/sqrt(x)
    > "behaves" better and multipliers are (sometimes) free...
    >
    > Do you really want the sqrt(i^2+q^2), or will you anyway do log(x)
    > [x=i^2+q^2] later on?
    > In that case you could use log(sqrt(x))=1/2log(x) and concentrate on
    > log().
    >
    > HTH -- pontus


    Hi Pontus.

    data is coming from a radix2 pipelined FFT what follows is a massive
    accumulating memory so there is not really a log( ).

    lc.
     
    Luis Cupido, Oct 18, 2012
    #13
  14. Luis Cupido

    rickman Guest

    On 10/16/2012 11:02 PM, Luis Cupido wrote:
    > Hello.
    >
    > I would like to have at some point of my code
    >
    > MV= sqrt(I^2+Q^2);
    >
    > but since the above I have no clue how to do in a single clock cycle
    > I have just I^2+Q^2 in the form:
    >
    > pv <= i**i + q**q;
    >
    > That uses a few embedded multipliers and works great.
    >
    > The number of bits in the result is obviously in (2x)+1
    > Since I can't afford to manipulate that wide word further
    > down the chain I just truncated it to the same number of bits of the input.
    >
    > All fine but a lot of dynamic range was lost
    > (this is some DSP coding as you might have guessed already).
    >
    > Is there any compression function just gates, that could work
    > unclocked (for me to get done with one clock latency) that does the
    > SQRT( ) or anything that resembles this function just to get the same
    > dynamyc range preservation effect as the SQRT( ) does.
    >
    > Pardon me for being a bit off topic.
    >
    > Anyone has any idea ?
    >
    > Luis C.


    From the way you express your question, I get the impression that your
    real concern is the "dynamic" range of the squared result, or in other
    words the number of bits. If it really is just the number of bits, you
    could go with logarithmic compression, or in other terms, convert your
    squared number to a floating point value. That should be pretty easy to
    do in a single clock cycle, if needed the multiplier hardware can be
    used to shift the mantissa.

    Would that help you or do you really need the square root?

    Rick
     
    rickman, Oct 19, 2012
    #14
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. walala
    Replies:
    4
    Views:
    2,177
    Ralf Hildebrandt
    Sep 8, 2003
  2. walala
    Replies:
    4
    Views:
    1,276
    Technology Consultant
    Sep 9, 2003
  3. valentin tihomirov

    sqrt in HW

    valentin tihomirov, Sep 12, 2004, in forum: VHDL
    Replies:
    1
    Views:
    1,635
    David Bishop
    Sep 13, 2004
  4. Tim Ward

    sqrt(negative)?

    Tim Ward, Dec 17, 2004, in forum: Java
    Replies:
    2
    Views:
    567
  5. Prune Tracy
    Replies:
    6
    Views:
    1,861
    Ioannis Vranos
    Oct 14, 2004
Loading...

Share This Page