Hello,
That's an interesting one!
I think that you first need to decide, or at least estimate, how much precision you will need.
- How many bits can you truncate before doing the SQRT?
- How many bits will the SQRT output? (Typically the result is in fixed-point notation.)
- How many resources can you afford to use?
- Is your design fully pipelined (perform a new calculation at every clock cycle), or can you afford to have some sort of "load"/"done" signals to perform your calculations?
Afterwards, here are some ideas you can look into:
1- Sometimes you don't really need to do the SQRT, you can just use the result prior to the SQRT and keep some of the MSBs (it is not as accurate).
2- Do you really need it to be done in a single clock cycle? Sometimes a pipelined design could be used (it takes several clock cycles to output the first result, but after that it can accept a new input and output a new result at every clock cycle).
3- You can try using your tool's core generator (e.g. Xilinx does have some SQRT cores that use the CORDIC algorithm, you can generate them using the wizard).
- You might be able to choose between a slow core (area efficient and several clock cycles per operation) and a fast core (fully pipelined and consumes more resources).
4- You can also take a look at the following links (I have not tested them):
- https://groups.google.com/forum/?fr...0root/comp.lang.vhdl/BQ9MypRSkhM/Qf7uLkIrmJQJ (see "function SquareRoot")
- http://vhdlguru.blogspot.ca/2010/03/vhdl-function-for-finding-square-root.html
- But these are COMBINATORIAL circuits (no clocks), they are not efficient in terms of resources and might affect timing, so be careful here, especially with large bit widths!
5- Depending on the number of bits you are using, you can think of any function (in your case the SQRT) as a large LUT and implement it using a single ROM with pre-calculated values.
The info below roughly applies to internal block RAM/ROMs.
Keep in mind that the FPGA RAM is limited (you can quickly check your FPGA's datasheet to get an estimate of how many of these block RAMs you can use).
If you are using Xilinx devices, a single BRAM (small internal FPGA block RAM) is typically either 18 kbits (18*1024 bits) or 36 kbits.
These BRAMs can be configured with various address/data widths (e.g. 10 address bits and 18 data bits; not every combination is valid).
You would typically code it as an array and let the tools worry about it, but you MUST check the synthesis result to see whether the BRAM has been inferred correctly and whether you are using way too many BRAMs.
If you are, you might need to change your approach.
Note: The ROM/RAM read has to be synchronous (registered), otherwise the synthesizer cannot map it to BRAM and will build it out of registers/LUTs instead, and you will see an explosion in resource utilization.
A rather "rough" way of approximating the number of BRAM that will be consumed is the following:
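Num. BRAM = ceil(2^addr_width * data_width / (18*1024)) (use 36 instead of 18 depending on your device, but you should get the point)
For example, 10 address bits and 18 data bits give 2^10 * 18 / (18*1024) = 1 BRAM, while 16 address bits with the same data width give 64 BRAMs.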
The actual number will depend on the configuration possibilities, your coding style, and how "smart" the tools are.
In your case IF you can round your result (prior to the SQRT) to something like 10 bits (or just use a generic for that), this might work for you.
e.g.:
type type_my_rom is array(integer range 0 to 2**10-1) of unsigned(18-1 downto 0);
constant my_sqrt_rom : type_my_rom := MY_INIT_FUNCTION; -- You can write an init function of some sort, or read it from file etc...
--...
-- clocked process
-- ...
pv <= i*i + q*q; -- Calculate (pv must be wide enough to hold the sum)
pv_10bit <= pv(pv'high downto pv'high-9); -- Truncate: keep the 10 MSBs (pv_10bit is unsigned)
mv <= my_sqrt_rom(to_integer(pv_10bit)); -- SQRT using a ROM, pv_10bit is used as the address
-- The ROM contents must account for the truncation, i.e. entry N holds the SQRT of N shifted back to pv's scale
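To make the ROM idea above a bit more concrete, here is a minimal, self-contained sketch. Everything in it is an assumption for illustration: the entity name sqrt_rom, the generics, i and q being signed 12-bit values, and the choice to store the square root of each address bin's midpoint with DATA_WIDTH - IN_WIDTH fractional bits. It also assumes your synthesis tool accepts ieee.math_real inside a function that only initializes a constant, which you should check for your tool.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;
use ieee.math_real.all;  -- used only to compute the ROM contents at elaboration

entity sqrt_rom is
  generic (
    IN_WIDTH   : positive := 12;  -- assumed width of i and q (signed)
    ADDR_WIDTH : positive := 10;  -- number of MSBs of pv used as ROM address
    DATA_WIDTH : positive := 18   -- ROM word width
  );
  port (
    clk : in  std_logic;
    i   : in  signed(IN_WIDTH-1 downto 0);
    q   : in  signed(IN_WIDTH-1 downto 0);
    mv  : out unsigned(DATA_WIDTH-1 downto 0)  -- fixed point, FRAC_BITS fractional bits
  );
end entity sqrt_rom;

architecture rtl of sqrt_rom is
  constant PV_WIDTH  : positive := 2 * IN_WIDTH;            -- i*i + q*q fits in 2*IN_WIDTH bits
  constant SHIFT     : natural  := PV_WIDTH - ADDR_WIDTH;   -- LSBs dropped by the truncation
  constant FRAC_BITS : natural  := DATA_WIDTH - IN_WIDTH;   -- sqrt(pv) needs IN_WIDTH integer bits

  type rom_t is array (0 to 2**ADDR_WIDTH - 1) of unsigned(DATA_WIDTH-1 downto 0);

  -- Entry a holds sqrt of the midpoint of the pv range that truncates to a.
  function init_rom return rom_t is
    variable rom : rom_t;
    variable mid : real;
    variable val : integer;
  begin
    for a in rom'range loop
      mid := (real(a) + 0.5) * 2.0**SHIFT;
      val := integer(round(sqrt(mid) * 2.0**FRAC_BITS));
      if val > 2**DATA_WIDTH - 1 then  -- saturate instead of wrapping
        val := 2**DATA_WIDTH - 1;
      end if;
      rom(a) := to_unsigned(val, DATA_WIDTH);
    end loop;
    return rom;
  end function init_rom;

  constant ROM_CONTENTS : rom_t := init_rom;

  signal pv : unsigned(PV_WIDTH-1 downto 0) := (others => '0');
begin
  process (clk)
  begin
    if rising_edge(clk) then
      -- Stage 1: squared magnitude (both squares are non-negative)
      pv <= unsigned(i * i) + unsigned(q * q);
      -- Stage 2: registered ROM read, addressed by the ADDR_WIDTH MSBs of pv
      mv <= ROM_CONTENTS(to_integer(pv(pv'high downto pv'high - ADDR_WIDTH + 1)));
    end if;
  end process;
end architecture rtl;
```

The result appears two clock cycles after i and q are presented, and a new pair can be applied at every clock cycle. The read is registered on purpose so that the tools have a chance to infer a BRAM; as said above, check the synthesis report to confirm that this actually happened.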
Note: this is really going to depend on your application's needs. I would suggest you try it out in software (or non-synthesizable code) and see how many bits you can truncate without losing too much precision.
10 bits might be too much of a truncation and if you increase your addr_width to 16 bits you will quickly run out of resources, so this might not be the best approach.
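As one possible way to "try it out" in non-synthesizable code, the sketch below is a small simulation-only check that reports the worst-case error caused by keeping only the top ADDR_WIDTH bits. The entity name sqrt_trunc_check, the 24-bit pv width, and the assumption that the ROM stores the square root of each bin's midpoint are all mine; adapt them to your own numbers.

```vhdl
library ieee;
use ieee.math_real.all;

entity sqrt_trunc_check is
  generic (
    PV_WIDTH   : positive := 24;  -- assumed width of i*i + q*q
    ADDR_WIDTH : positive := 10   -- number of MSBs kept as the ROM address
  );
end entity sqrt_trunc_check;

architecture sim of sqrt_trunc_check is
begin
  check : process
    -- Number of distinct pv values that map to the same ROM address
    constant BIN     : real := 2.0**(PV_WIDTH - ADDR_WIDTH);
    variable lo, hi  : real;
    variable approx  : real;
    variable err     : real;
    variable max_err : real    := 0.0;
    variable worst   : natural := 0;
  begin
    for a in 0 to 2**ADDR_WIDTH - 1 loop
      lo     := real(a) * BIN;          -- smallest pv in this bin
      hi     := lo + BIN - 1.0;         -- largest pv in this bin
      approx := sqrt(lo + 0.5 * BIN);   -- value the ROM would return (before rounding)
      -- sqrt is monotonic, so the largest error in a bin is at one of its edges
      err    := realmax(approx - sqrt(lo), sqrt(hi) - approx);
      if err > max_err then
        max_err := err;
        worst   := a;
      end if;
    end loop;
    report "ADDR_WIDTH = " & integer'image(ADDR_WIDTH) &
           ": worst-case error = " & real'image(max_err) &
           " at address " & integer'image(worst);
    wait;  -- stop after one pass
  end process check;
end architecture sim;
```

Running it with a few different ADDR_WIDTH values shows how quickly the error shrinks as the ROM grows. Expect the largest errors for small pv values, where the square root is steepest, so if your application cares about small magnitudes the truncation will hurt more there.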
- I would probably go for the core generator approach, you can check the documentation online to see if it fits your requirements, or just try generating it. (It will still probably require a few clock cycles depending on the precision).
- If you can afford the LUT resources, maybe look into the combinatorial implementations #2 (you can simply try synthesizing it and checking your resource usage and timing estimate).
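If you end up on the "load"/"done" side of the question at the top (several clock cycles per result, few resources), a bit-serial integer square root is another thing you could try. The sketch below uses the classic digit-by-digit method; it is not taken from the links or from any vendor core, and the entity name isqrt_seq, the port names and the 24-bit default width are assumptions for illustration. It returns floor(sqrt(din)) after WIDTH/2 iterations, and WIDTH is assumed to be even.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity isqrt_seq is
  generic (
    WIDTH : positive := 24  -- input width, assumed even
  );
  port (
    clk  : in  std_logic;
    load : in  std_logic;                        -- pulse to start a calculation
    din  : in  unsigned(WIDTH-1 downto 0);
    done : out std_logic;                        -- high for one cycle when dout is valid
    dout : out unsigned(WIDTH/2-1 downto 0)      -- floor(sqrt(din))
  );
end entity isqrt_seq;

architecture rtl of isqrt_seq is
  signal op   : unsigned(WIDTH-1 downto 0) := (others => '0');  -- remainder
  signal res  : unsigned(WIDTH-1 downto 0) := (others => '0');  -- partial root
  signal one  : unsigned(WIDTH-1 downto 0) := (others => '0');  -- current power of 4
  signal busy : std_logic := '0';
begin
  process (clk)
  begin
    if rising_edge(clk) then
      done <= '0';
      if load = '1' then
        op   <= din;
        res  <= (others => '0');
        one  <= shift_left(to_unsigned(1, WIDTH), WIDTH - 2);  -- highest power of 4
        busy <= '1';
      elsif busy = '1' then
        if one = 0 then
          -- All digits processed: res now holds the integer square root
          dout <= res(WIDTH/2-1 downto 0);
          done <= '1';
          busy <= '0';
        else
          if op >= res + one then
            op  <= op - (res + one);
            res <= shift_right(res, 1) + one;
          else
            res <= shift_right(res, 1);
          end if;
          one <= shift_right(one, 2);
        end if;
      end if;
    end if;
  end process;
end architecture rtl;
```

This only needs a few registers, an adder and a comparator, at the cost of roughly WIDTH/2 + 1 clock cycles per result, so it only makes sense if you do not need a new result at every clock cycle.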
I hope this helps a bit,
Anton