newbie with timing problem, is adding pipeline stages the only optionto speed up?

F

festi.giorgio

i am implementing a fir filter and it has to be(in my naive opinion) very fast.
when i grow up the number of coefficient some timing problem arise, at first i have tried to add some pipeline stages and it get something better.
Without going too much in the detail.. is adding pipeline stages the only option that i have to speed up the all system?

thanks :)
 
R

rickman

i am implementing a fir filter and it has to be(in my naive opinion) very fast.
when i grow up the number of coefficient some timing problem arise, at first i have tried to add some pipeline stages and it get something better.
Without going too much in the detail.. is adding pipeline stages the only option that i have to speed up the all system?

thanks :)


No, there are other options.

Rick
 
A

antongunman

It first depends on what you mean by "very fast".

In some devices there are dedicated multipliers or MACC (e.g. DSP48) that are supposed to work up to ~400MHz-~600MHz.
Although this is true, you would have to carefully tweak your architecture,by either correctly inferring the multiplier components and cascading themwith appropriate register (pipelining) stages. Also the max achievable speed might be limited by the actual physical number of multipliers in an FPGAcolumn e.g. 7 (cascading to another column might reduce the theoretical max speed).
You could also use tools like the Core Generator (Xilinx) to generate such filters (I have not done that).

If on the other hand you do not care too much about such details, and wouldrather want to speed-up your design's timing some rules of thumb are:
- Trim your bit-width as soon as possible. (e.g. some dedicated multipliersor MACC like the DSP48 are typically SIGNED 18x25in and 48out, with a carry in of 48).
- Add pipeline stages. At the input of the multipliers, at the output, and at the adder stages. (For most applications, these latencies are consideredinsignificant).
- Consider using an "adder chain" instead of an "adder tree" for the final summation. (An adder is included in the DSP48 and can be efficiently cascaded using an adder chain).
- Use a synchronous reset??? (Depending on the manufacturer, sometimes dedicated multipliers can only be inferred if you use a sync-reset, otherwise you'll end up using the slower FPGA fabric).

For your case, it might be that you have a huge combinatorial adder tree at the output (e.g. SUM = A+B+C+D...+Y+Z), this is why when you increase the number of coefficients, the timing fails (but I’m just guessing at this point, I could not tell without looking at the code).

You do not have to know all the details and inner workings, just a bit of understanding of your current platform can help you code efficiently and in a way that the synthesizer can better understand.

You might also want to take a look at the synthesizer/platform user guides (e.g. the "XST User Guide") these sometimes contain VHDL templates for specific cases.

I hope this helps ! :D
 
R

Robert Miles

i am implementing a fir filter and it has to be(in my naive opinion) veryfast.

when i grow up the number of coefficient some timing problem arise, at first i have tried to add some pipeline stages and it get something better.

Without going too much in the detail.. is adding pipeline stages the onlyoption that i have to speed up the all system?



thanks :)

Adding carefully controlled delays in the appropriate places is POSSIBLE, but hard to design, partly since ASIC and FPGAs seldom have any way other than clocked stages to supply delays with a well-controlled value range. This would mean longer to get the result, but usually less than pipelining.

Do ypu have a way to do more of the first part of the calculations in parallel, then combine them in another stage?
 
B

Benjamin Couillard

Off the top of my head here are the options

1 - pipelining (you can pipeline the outputs and inputs of the multipliers,and also pipeline the tree adder stage)

The output for a generic fir filter

y[n] = b[0]*x[0] + b[1]*x[1] + ... b[n]*x[n];

you can pipeline the outputs of b[0]*x[0] to b[n]*x[n]. After, you need to add all the intermediate ouputs.

Suppose, that your filter order is 15, so you have 16 coefficients in all.

stage 1 :

8 adders (b[0]*x[0] + b[1]*x[1], b[2]*x[2] + b[3]*x[3], ...), then pipeline

stage 2 : 4 adders

stage 3 : 2 adders

stage 4 : 1 adder

The number of adder stages stages is ceil(log2(order+1)). Don't forget to pipeline after each stage.

If you pipelined the input before the multiplier, you'd have 1 pipeline delay, if you pipelined the multipliers outputs, you'd have another pipeline delays. Adding the pipelined adder tree would add another 4 pipeline delays.

In total, you would have 6 pipeline delays for this implementation. This isnot usually a big deal and would give you a speed boost.

2 - If your filter is symmetrical (or anti-symmetrical), you can get rid ofhalf the multipliers.

y[n] = b[0]*x[0] + b[1]*x[1] + ... b[n]*x[n];

if b[n] = b[0], b[1] = b[n-1], then you could rewrite the FIR equation as

y[n] = b[0]*(x[0]+x[n]) + b[1]*(x[1]+x[n-1]) +

3 - If you have access to the quantizer toolbox in Matlab you can simulate how many bits you actually need to implement the filter. You could then reduce the bitwidth of some intermediate signals and thus reduce the combinatorial delays. You don't actually need the fixed-point toolbox to perform this analysis, but it's a good tool if you have it.

There might be some errors in my answer, but I think that the general ideasare present.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Staff online

Members online

Forum statistics

Threads
473,755
Messages
2,569,534
Members
45,007
Latest member
obedient dusk

Latest Threads

Top