Pipelining a large mathematical equation

Discussion in 'VHDL' started by Carlos, Oct 12, 2012.

  1. Carlos

    Carlos Guest

    Hello,

    This is a bit of a general question concerning coding styles.
    Suppose that you want to implement the following equation in a fully pipelined manner:
    ((a + b) + (c + d)*e)*f

    Ideally, in order to achieve good timing (more or less), I would tend to register almost every arithmetic operation on its own as an intermediate result.
    Afterwards I would combine intermediate results, for example as shown below (this is just some pseudo code, I might have made some errors).

    process (my_clk)
    begin
      if rising_edge(my_clk) then
        if my_reset = '1' then
          ab1     <= (others => '0');
          ab2     <= (others => '0');
          cd1     <= (others => '0');
          cde2    <= (others => '0');
          abcde3  <= (others => '0');
          abcdef4 <= (others => '0');

          e1 <= (others => '0');
          f1 <= (others => '0');
          f2 <= (others => '0');
          f3 <= (others => '0');
        else
          -- Stage (1)
          ab1 <= a + b;
          cd1 <= c + d;
          e1  <= e; -- Delay
          f1  <= f; -- Delay
          -- Stage (2)
          cde2 <= cd1 * e1;
          f2   <= f1;  -- Delay
          ab2  <= ab1; -- Delay
          -- Stage (3)
          abcde3 <= ab2 + cde2;
          f3     <= f2; -- Delay
          -- Stage (4)
          abcdef4 <= abcde3 * f3;
        end if;
      end if;
    end process;

    For now, I append a number to each name to indicate the pipeline stage it belongs to, since I don't want to add an intermediate result to an "older" one.
    This might be fine here, but for a large interpolation formula consisting of 20 arithmetic terms it becomes quite cumbersome, especially since you have to give a meaningful name to every intermediate value.

    My questions are the following:
    #1- Is there a better coding style to keep the code clean and easily modifiable? Maybe use variables instead of signals, etc. (I'm interested in knowing how you guys would implement that.)
    #2- Is there a way to just insert the whole equation and give the synthesis tool enough pipelining registers to use/duplicate/balance/insert in order to achieve something neat?
    (In this case latency is not an issue; we are more concerned with throughput.)
    For example, could we write something like the code below and let the synthesis tool do its magic?

    my_eq1 <= ((a + b) + (c + d)*e)*f;
    my_eq2 <= my_eq1;
    my_eq3 <= my_eq2;
    my_eq4 <= my_eq3;
    -- ...

    #3- If #2 is doable, will this code be more or less portable to other devices, or will I have to manually tweak the synthesis parameters for different devices?

    In reality we are only going to read the signal "my_eq4"; the others are just pipeline stages that we don't care about.

    Notes:
    - I am assuming that all vector widths are correctly set.
    - I also understand that, depending on the platform and available resources, one architecture might be better suited than another.

    Any cool coding tips would be appreciated. Thanks!
    Carlos, Oct 12, 2012
    #1

  2. Carlos

    Gabor Guest

    On 10/12/2012 10:47 AM, Carlos wrote:
    > Hello,
    >
    > This is a bit of a general question concerning coding styles.
    > Suppose that you want to implement the following equation in a fully pipelined manner:
    > ((a + b) + (c + d)*e)*f

    I would suggest trying your idea in your synthesis tool to see
    what becomes of it. XST, at least in more recent versions, is pretty
    good at "balancing registers" to insert the pipeline stages where it
    makes sense to meet timing. Specifically, you can code wide
    multiplications that need more than one DSP unit, follow the output
    with a number of pipeline stages, and XST will move the pipelining
    into the DSP registers for intermediate results. It's not clear
    how well XST will perform on other, more general equations, but
    again it couldn't hurt to try it out. I imagine Synplicity will
    do at least as well. Can't comment on other tools.
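
As a rough illustration of the wide-multiplier case described above, here is a minimal sketch. The entity name, port names and the 35-bit operand width are assumptions made for this example (the width is only chosen to exceed a single DSP multiplier on many devices). Resets are left off the pipeline registers on the assumption that this gives the tool more freedom to move them; whether the registers are actually absorbed into the DSP blocks depends on the tool and on its register-balancing/retiming settings.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical example entity: a wide multiply followed by spare registers.
entity wide_mult is
  port (
    clk : in  std_logic;
    a   : in  unsigned(34 downto 0);
    b   : in  unsigned(34 downto 0);
    p   : out unsigned(69 downto 0)
  );
end entity wide_mult;

architecture rtl of wide_mult is
  -- Three register stages; the synthesis tool may redistribute them.
  signal p1, p2, p3 : unsigned(69 downto 0) := (others => '0');
begin
  process (clk)
  begin
    if rising_edge(clk) then
      p1 <= a * b;  -- full 70-bit product, written as one operation
      p2 <= p1;     -- spare pipeline register
      p3 <= p2;     -- spare pipeline register
    end if;
  end process;

  p <= p3;          -- result appears three clocks after the inputs
end architecture rtl;
```

The synthesis report is the place to check whether the extra registers ended up inside the multiplier or stayed as a plain delay line behind it.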

    -- Gabor
    Gabor, Oct 13, 2012
    #2

  3. Carlos

    Andy Guest

    Anytime I see variables or signals named xx1, xx2, xx3, etc., alarm bells and flashing lights go off in my head. You can use an array, and a shift register to accomplish what you want, and the length of the shift register determines the number of pipeline stages (and you could use a generic for that).

    generic(stages: positive := 1);

    ....

    type pipe_t is array(1 to stages) of unsigned(output'range);
    signal pipe: pipe_t;

    ....

    pipe <= pipe(2 to stages) & (((a + b) + (c + d)*e)*f); -- registered

    ....

    output <= pipe(1); -- combinatorial
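
For reference, the fragments above could be assembled into a complete design unit roughly as follows. This is only a sketch: the entity name, the `width` generic and the port widths are assumptions, and the operands are widened with resize() from numeric_std so that the full-precision result (3*width + 1 bits) is kept.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical wrapper around the shift-register idea.
entity eq_pipe is
  generic (
    stages : positive := 4;  -- number of pipeline registers
    width  : positive := 8   -- operand width (assumed equal for all inputs)
  );
  port (
    clk              : in  std_logic;
    a, b, c, d, e, f : in  unsigned(width - 1 downto 0);
    output           : out unsigned(3 * width downto 0)
  );
end entity eq_pipe;

architecture rtl of eq_pipe is
  type pipe_t is array (1 to stages) of unsigned(output'range);
  signal pipe : pipe_t := (others => (others => '0'));
begin
  process (clk)
  begin
    if rising_edge(clk) then
      -- New result enters at index 'stages' and moves towards index 1.
      -- Widening one operand of each sum keeps the carry bit, because
      -- numeric_std "+" returns the length of its longer operand.
      pipe <= pipe(2 to stages)
              & (((resize(a, width + 1) + b)
                  + (resize(c, width + 1) + d) * e) * f);
    end if;
  end process;

  output <= pipe(1);  -- oldest entry, 'stages' clocks after the inputs
end architecture rtl;
```

Changing the latency is then a matter of changing the `stages` generic; none of the signal names have to be touched.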

    Keep in mind, you may want to manage intermediate result size/resolution. If so, take a look at the fixed point package (VHDL-2008), even if you need zero fractional bits. Each operation's result grows to handle the maximum range/precision, and then you use resize() (at intermediate steps and/or at the end) to specify how much range/precision you want to keep. This would be a good place for a function that handles all that; then just call the function in the concatenation for the shift register.
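
A sketch of what that could look like with the VHDL-2008 package ieee.fixed_pkg, using zero fractional bits. The entity name, the port widths and the choice of a 16-bit result are assumptions for illustration, and tool support for the package varies. With 8-bit operands the package's sizing rules grow the full expression to ufixed(25 downto 0); resize() then cuts it back to the range that is kept, using the package's default overflow and rounding behaviour.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.fixed_pkg.all;  -- VHDL-2008

-- Hypothetical example entity using unsigned fixed point (ufixed).
entity eq_pipe_fixed is
  generic (stages : positive := 4);
  port (
    clk              : in  std_logic;
    a, b, c, d, e, f : in  ufixed(7 downto 0);  -- integers, no fraction
    output           : out ufixed(15 downto 0)
  );
end entity eq_pipe_fixed;

architecture rtl of eq_pipe_fixed is
  subtype result_t is ufixed(15 downto 0);
  type pipe_t is array (1 to stages) of result_t;
  signal pipe : pipe_t := (others => (others => '0'));

  -- All size handling lives in one place: the operators grow the result
  -- automatically and resize() selects the range that is kept.
  function equation (a, b, c, d, e, f : ufixed) return result_t is
  begin
    return resize(((a + b) + (c + d) * e) * f,
                  result_t'high, result_t'low);
  end function equation;
begin
  process (clk)
  begin
    if rising_edge(clk) then
      pipe <= pipe(2 to stages) & equation(a, b, c, d, e, f);
    end if;
  end process;

  output <= pipe(1);
end architecture rtl;
```

Keeping the resize() call inside the function means the clocked process stays a one-line shift, whatever the formula looks like.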

    Andy
    Andy, Oct 15, 2012
    #3
  4. Carlos

    HT-Lab Guest

    On 12/10/2012 15:47, Carlos wrote:
    > Hello,
    >
    > This is a bit of a general question concerning coding styles.
    > Suppose that you want to implement the following equation in a fully pipelined manner:
    > ((a + b) + (c + d)*e)*f



    Perhaps another solution to look at is High-Level Synthesis. Thanks to
    Xilinx Vivado HLS this is now an affordable solution (comes free with
    ISE14.2?).

    There are some nice videos here:
    http://www.xilinx.com/training/vivado/index.htm

    HLS tools are ideally suited to perform architectural exploration
    (number of pipelines, number/type of resources etc) on a large un-timed
    block of logic. After you have crafted your perfect architecture the
    tool generates the VHDL or Verilog for you.

    Obviously Vivado HLS is not as capable as the big boys (CatapultC,
    Cynthesizer, etc.) but I suspect you get quite a lot of HLS for your money.

    Just a thought,

    Hans
    www.ht-lab.com


    HT-Lab, Oct 17, 2012
    #4
  5. Carlos

    Guest

    Hello,

    Gabor, thanks I might be testing it out with a more concrete design.

    Andy, I do agree with you concerning the incremental numbers, they are quite annoying and I do not typically use them unless it's a single delay pipe.
    Concerning the use of a function: do you mean that you would do the whole calculation (with resizes etc.) in a combinatorial fashion (or in a single equation) and then add some number of pipeline stages to improve the performance?
    Or would you manually separate the calculations into several stages rather than rely on the tools to do the work?

    I will also have a look at Xilinx Vivado HLS, thanks!
    C.
    , Oct 17, 2012
    #5
  6. Carlos

    Andy Guest

    On Wednesday, October 17, 2012 12:03:14 PM UTC-5, wrote:
    > Andy, I do agree with you concerning the incremental numbers, they are quite annoying and I do not typically use them unless it's a single delay pipe. Concerning using a function. Do you mean that you would do the whole calculation (with resizes etc...) in a combinatorial fashion (or in single equation) and then add some amount of pipelining stages to improve the performance. Or would you manually separate the calculations into several stages rather than rely on the tools to do the work? I will also have a look at Xilinx Vivado HLS, thanks! C.


    Sorry it took me a while to get back to you...

    I would first try the single expression followed (or preceded) by pipeline stages, with retiming/pipelining optimizations enabled in synthesis. If that gets you where you need to be (speed, area, etc.) then you're done. Only if that does not work satisfactorily with your tool would I break it up manually into individual pipeline stages. The former is much more readable, writable and maintainable than the latter.

    Note that, if you need to use resize() to manage the size/precision of intermediate results, you can break the expression up into several statements using variables, then shift the final variable result into the pipeline. Just remember to write each variable before reading it, so that it retains its combinatorial behavior before the result gets shifted into the pipeline, all in the same clocked process.
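
A minimal sketch of that variable-based style, using numeric_std. The entity name, the 8-bit operands and the decision to keep the 16 most significant bits of the 25-bit result are assumptions for this example. Every variable is assigned before it is read on each clock edge, so only `pipe` should infer registers.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical example entity: variables for the maths, one shift register.
entity eq_pipe_var is
  generic (stages : positive := 4);
  port (
    clk              : in  std_logic;
    a, b, c, d, e, f : in  unsigned(7 downto 0);
    output           : out unsigned(15 downto 0)
  );
end entity eq_pipe_var;

architecture rtl of eq_pipe_var is
  type pipe_t is array (1 to stages) of unsigned(output'range);
  signal pipe : pipe_t := (others => (others => '0'));
begin
  process (clk)
    variable ab, cd : unsigned(8 downto 0);
    variable sum    : unsigned(16 downto 0);
    variable full   : unsigned(24 downto 0);
    variable kept   : unsigned(output'range);
  begin
    if rising_edge(clk) then
      -- Each variable is written before it is read: purely combinatorial.
      ab   := resize(a, 9) + resize(b, 9);  -- 9 bits, keeps the carry
      cd   := resize(c, 9) + resize(d, 9);
      sum  := resize(ab, 17) + cd * e;      -- cd * e is 9 + 8 = 17 bits
      full := sum * f;                      -- 17 + 8 = 25 bits
      kept := full(24 downto 9);            -- keep the top 16 bits (assumption)
      -- Only the final value is shifted into the registered pipeline.
      pipe <= pipe(2 to stages) & kept;
    end if;
  end process;

  output <= pipe(1);
end architecture rtl;
```

If a variable were read before being written in the process, it would hold its value from the previous clock and infer an extra register, which is exactly what the write-before-read rule avoids.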

    Note also that breaking the expression up is not absolutely necessary if you want to use resize() on intermediate results; you can embed resize() in the expression.

    Andy
    Andy, Oct 24, 2012
    #6
  7. On 12.10.2012 16:47, Carlos wrote:
    > Hello,
    >
    > This is a bit of a general question concerning coding styles.
    > Suppose that you want to implement the following equation in a fully pipelined manner:

    ......
    > Any cool coding tips would be appreciated. Thanks!
    >


    Take a look at my sine / cos function on opencores.org,
    and the pipeline entity in particular.

    http://opencores.org/project,sincos

    regards, Gerhard
    Gerhard Hoffmann, Dec 9, 2012
    #7
