Pipelining a large mathematical equation

Carlos · Oct 12, 2012

Hello,

This is a bit of a general question concerning coding styles.
Suppose that you want to implement the following equation in a fully pipelined manner:
((a + b) + (c + d)*e)*f

Ideally, to achieve good timing, I would register the result of almost every arithmetic operation as its own intermediate value.
Afterwards I would combine the intermediate results, for example as shown below (this is just pseudocode, so I might have made some errors).

process (my_clk)
begin
  if rising_edge(my_clk) then
    if my_reset = '1' then
      ab1     <= (others => '0');
      ab2     <= (others => '0');
      cd1     <= (others => '0');
      cde2    <= (others => '0');
      abcde3  <= (others => '0');
      abcdef4 <= (others => '0');

      e1 <= (others => '0');
      f1 <= (others => '0');
      f2 <= (others => '0');
      f3 <= (others => '0');
    else
      -- Stage (1)
      ab1 <= a + b;
      cd1 <= c + d;
      e1  <= e;   -- Delay
      f1  <= f;   -- Delay
      -- Stage (2)
      cde2 <= cd1 * e1;
      f2   <= f1;  -- Delay
      ab2  <= ab1; -- Delay
      -- Stage (3)
      abcde3 <= ab2 + cde2;
      f3     <= f2; -- Delay
      -- Stage (4)
      abcdef4 <= abcde3 * f3;
    end if;
  end if;
end process;

For now, I keep a number in each name to indicate the pipeline stage it belongs to, since I don't want to add an intermediate result to an "older" one.
This might be fine, but for a large interpolation formula consisting of 20 arithmetic terms it becomes quite cumbersome, especially since you have to give each intermediate value a meaningful name.

My questions are the following:
#1- Is there a better coding style to keep the code clean and easily modifiable? Maybe use variables instead of signals, etc. (I'm interested in knowing how you guys would implement that.)
#2- Is there a way to just write the whole equation and give the synthesis tool enough pipeline registers to use/duplicate/balance/insert in order to achieve something neat?
(In this case latency is not an issue; we are more concerned with throughput.)
For example, could we do something like the code below and let the synthesis tool do its magic?
#3- If #2 is doable, will this code be more or less portable to other devices, or will I have to manually tweak the synthesis parameters for different devices?

my_eq1 <= ((a + b) + (c + d)*e)*f;
my_eq2 <= my_eq1;
my_eq3 <= my_eq2;
my_eq4 <= my_eq3;
-- ...

In reality we are only going to read the signal "my_eq4"; the others are just pipeline stages that we don't care about.

Notes:
- I am assuming that all vector widths are correctly set.
- I also understand that, depending on the platform and available resources, one architecture might be better suited than another.

Any cool coding tips would be appreciated. Thanks!

Gabor · Oct 13, 2012

I would suggest trying your proposal in your synthesis tool to see
what becomes of it. XST, at least in more recent versions, is pretty
good at "balancing registers" to insert the pipeline stages where it
makes sense to meet timing. Specifically, you can code wide
multiplications that need more than one DSP unit, follow the output
with a number of pipeline stages, and XST will move the pipelining
into the DSP registers for intermediate results. It's not clear
how well XST will perform on other, more general equations, but
again it couldn't hurt to try it out. I imagine Synplicity will
do at least as well. Can't comment on other tools.
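
A minimal sketch of the coding style described above: one wide multiplication followed by a chain of plain registers that the tool is free to rebalance. The entity name, generics, and widths are hypothetical, and the data-path registers are left without a reset on the assumption that none is needed. Whether the trailing registers actually end up inside the DSP blocks depends on the tool and on its register-balancing/retiming settings.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical example entity: names and widths are illustrative only.
entity wide_mult_pipe is
  generic (
    WIDTH  : positive := 32;  -- operand width (assumed)
    STAGES : positive := 4    -- total registers after the multiply (assumed)
  );
  port (
    clk : in  std_logic;
    a   : in  unsigned(WIDTH-1 downto 0);
    b   : in  unsigned(WIDTH-1 downto 0);
    p   : out unsigned(2*WIDTH-1 downto 0)
  );
end entity wide_mult_pipe;

architecture rtl of wide_mult_pipe is
  type pipe_t is array (1 to STAGES) of unsigned(2*WIDTH-1 downto 0);
  signal pipe : pipe_t;
begin
  process (clk)
  begin
    if rising_edge(clk) then
      -- The whole multiplication is written as one expression ...
      pipe(1) <= a * b;
      -- ... followed by plain delay registers for the tool to move.
      for i in 2 to STAGES loop
        pipe(i) <= pipe(i-1);
      end loop;
    end if;
  end process;

  p <= pipe(STAGES);
end architecture rtl;
```

The latency is STAGES clock cycles regardless of where the tool finally places the registers, so the surrounding design does not change if the balancing result differs between devices.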

-- Gabor

Andy · Oct 15, 2012

Anytime I see variables or signals named xx1, xx2, xx3, etc., alarm bells and flashing lights go off in my head. You can use an array, and a shift register to accomplish what you want, and the length of the shift register determines the number of pipeline stages (and you could use a generic for that).

generic(stages: positive := 1);

....

type pipe_t is array(1 to stages) of unsigned(output'range);
signal pipe: pipe_t;

....

pipe <= pipe(2 to stages) & (((a + b) + (c + d)*e)*f); -- registered

....

output <= pipe(1); -- combinatorial

Keep in mind, you may want to manage intermediate result size/resolution. If so, take a look at the fixed point package (VHDL-2008), even if you need zero fractional bits. Each operation's result grows to handle the maximum range/precision, and then you use resize() (intermediately and/or at the end) to specify how much range/precision you want to keep. This would be a good place for a function that handles all that; then just call the function in the concatenation for the shift register.
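
A rough sketch of how the function plus shift register could fit together. To stay with widely supported packages it uses numeric_std and resize() rather than the VHDL-2008 fixed point package, so the bit growth is written out by hand. The entity name, the function name `equation`, the port name `result`, and the assumption that all six inputs are unsigned and equally wide are illustrative only.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical example entity: names and widths are illustrative only.
entity eq_pipe is
  generic (
    WIDTH  : positive := 8;  -- width of every input (assumed equal)
    STAGES : positive := 4   -- number of pipeline registers
  );
  port (
    clk              : in  std_logic;
    a, b, c, d, e, f : in  unsigned(WIDTH-1 downto 0);
    result           : out unsigned(3*WIDTH+1 downto 0)
  );
end entity eq_pipe;

architecture rtl of eq_pipe is

  -- Computes ((a + b) + (c + d)*e)*f with full bit growth.
  -- Assumes all arguments have the same length W.
  function equation (a, b, c, d, e, f : unsigned) return unsigned is
    constant W   : natural := a'length;
    variable ab  : unsigned(W downto 0);        -- W+1 bits
    variable cd  : unsigned(W downto 0);        -- W+1 bits
    variable cde : unsigned(2*W downto 0);      -- 2W+1 bits
    variable sum : unsigned(2*W+1 downto 0);    -- 2W+2 bits
  begin
    ab  := resize(a, W+1) + resize(b, W+1);
    cd  := resize(c, W+1) + resize(d, W+1);
    cde := cd * e;
    sum := resize(ab, 2*W+2) + resize(cde, 2*W+2);
    return sum * f;                              -- 3W+2 bits
  end function equation;

  type pipe_t is array (1 to STAGES) of unsigned(3*WIDTH+1 downto 0);
  signal pipe : pipe_t;

begin

  process (clk)
  begin
    if rising_edge(clk) then
      -- New result enters at index STAGES and shifts toward index 1.
      pipe <= pipe(2 to STAGES) & equation(a, b, c, d, e, f);
    end if;
  end process;

  result <= pipe(1);  -- combinatorial tap of the oldest entry

end architecture rtl;
```

Changing the pipeline depth is then a matter of changing the STAGES generic; how the registers get distributed through the arithmetic is left to the synthesis tool's retiming.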

Andy

HT-Lab · Oct 17, 2012

Perhaps another solution to look at is High-Level Synthesis. Thanks to
Xilinx Vivado HLS this is now an affordable solution (comes free with
ISE14.2?).

There are some nice videos here:
http://www.xilinx.com/training/vivado/index.htm

HLS tools are ideally suited to perform architectural exploration
(number of pipelines, number/type of resources etc) on a large un-timed
block of logic. After you have crafted your perfect architecture the
tool generates the VHDL or Verilog for you.

Obviously Vivado HLS is not as capable as the big boys (CatapultC,
Cynthesizer, etc.) but I suspect you get quite a lot of HLS for your money.

Just a thought,

Hans
www.ht-lab.com

carloshyneman · Oct 17, 2012

Hello,

Gabor, thanks, I will probably test it out with a more concrete design.

Andy, I do agree with you concerning the incremental numbers; they are quite annoying and I do not typically use them unless it's a single delay pipe.
Concerning the use of a function: do you mean that you would do the whole calculation (with resizes etc.) in a combinatorial fashion (or in a single equation) and then add some number of pipeline stages to improve the performance?
Or would you manually separate the calculations into several stages rather than rely on the tools to do the work?

I will also have a look at Xilinx Vivado HLS, thanks!
C.

Andy · Oct 24, 2012

Sorry it took me a while to get back to you...

I would try the single expression followed (or preceded) by pipeline stages, and enable retiming/pipelining optimizations in synthesis first. If that gets you where you need to be (speed, area, etc.) then you're done. Only if that does not work satisfactorily with your tool would I break it up manually into individual pipeline stages. The former is much more readable, writable and maintainable than the latter.

Note that, if you need to use resize() to manage the size/precision of intermediate results, you can break it up into several statements using variables, then shift the final variable result into the pipeline. Just remember to write before read on a variable to retain the combinatorial behavior before it gets shifted into the pipeline, all in the same clocked process.

Note also that breaking the expression up is not absolutely necessary if you want to use resize() on intermediate results; you can embed resize() in the expression.
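
The variable-based version could look roughly like this. It is only a sketch: the entity name, port names, and the assumption of equally wide unsigned inputs are hypothetical. Every variable is assigned before it is read within the same clock edge, so the variables describe combinatorial logic and only `pipe` becomes registers.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical example entity: names and widths are illustrative only.
entity eq_pipe_var is
  generic (
    WIDTH  : positive := 8;  -- width of every input (assumed equal)
    STAGES : positive := 4   -- number of pipeline registers
  );
  port (
    clk              : in  std_logic;
    a, b, c, d, e, f : in  unsigned(WIDTH-1 downto 0);
    result           : out unsigned(3*WIDTH+1 downto 0)
  );
end entity eq_pipe_var;

architecture rtl of eq_pipe_var is
  type pipe_t is array (1 to STAGES) of unsigned(3*WIDTH+1 downto 0);
  signal pipe : pipe_t;
begin

  process (clk)
    variable ab, cd : unsigned(WIDTH downto 0);      -- WIDTH+1 bits
    variable cde    : unsigned(2*WIDTH downto 0);    -- 2*WIDTH+1 bits
    variable sum    : unsigned(2*WIDTH+1 downto 0);  -- 2*WIDTH+2 bits
  begin
    if rising_edge(clk) then
      -- Write before read: each variable is assigned here before any
      -- later statement uses it, so no register is inferred for it.
      ab  := resize(a, WIDTH+1) + resize(b, WIDTH+1);
      cd  := resize(c, WIDTH+1) + resize(d, WIDTH+1);
      cde := cd * e;
      sum := resize(ab, 2*WIDTH+2) + resize(cde, 2*WIDTH+2);

      -- Only the final result is shifted into the register pipeline.
      pipe <= pipe(2 to STAGES) & (sum * f);
    end if;
  end process;

  result <= pipe(1);

end architecture rtl;
```

If a variable were read before being written in the process, it would hold its value from the previous clock edge and an extra register would be inferred, which would misalign the pipeline.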

Andy

Gerhard Hoffmann · Dec 9, 2012

On 12.10.2012 at 16:47, Carlos wrote:

Hello,

This is a bit of a general question concerning coding styles.
Suppose that you want to implement the following equation in a fully pipelined manner: ......
Any cool coding tips would be appreciated. Thanks!

Take a look at my sine / cos function on opencores.org,
and the pipeline entity in particular.

http://opencores.org/project,sincos

regards, Gerhard
