Count bits in VHDL, with loop and unrolled loop produces different results

Discussion in 'VHDL' started by a s, Mar 1, 2011.

  1. a s

    a s Guest

    Dear all,

    I have come up with 2 solutions in VHDL, how to count number of bits
    in input data.
    The thing I don't understand is why the 2 solutions produce different
    results, at least with Xilinx ISE and its XST.
    There is quite a substantial difference in required number of slices/
    LUTs.

    1. solution with unrolled loop: 41 slices, 73 LUTs
    2. solution with loop: 54 slices, 100 LUTs

    The entity of both architectures is the same:

    entity one_count is
    Port ( din : in STD_LOGIC_vector(31 downto 0);
    dout : out STD_LOGIC_vector(5 downto 0)
    );
    end one_count;


    The architecture with an unrolled loop is the following:

    library IEEE;
    use IEEE.STD_LOGIC_1164.ALL;
    use IEEE.NUMERIC_STD.ALL;

    entity one_count is
    Port ( din : in STD_LOGIC_vector(31 downto 0);
    dout : out STD_LOGIC_vector(5 downto 0)
    );
    end one_count;

    architecture one_count_unrolled_arch of one_count is

    signal cnt : integer range 0 to 32;

    begin

    cnt <= to_integer(unsigned(din( 0 downto 0))) +
    to_integer(unsigned(din( 1 downto 1))) +
    to_integer(unsigned(din( 2 downto 2))) +
    to_integer(unsigned(din( 3 downto 3))) +
    to_integer(unsigned(din( 4 downto 4))) +
    to_integer(unsigned(din( 5 downto 5))) +
    to_integer(unsigned(din( 6 downto 6))) +
    to_integer(unsigned(din( 7 downto 7))) +
    to_integer(unsigned(din( 8 downto 8))) +
    to_integer(unsigned(din( 9 downto 9))) +
    to_integer(unsigned(din(10 downto 10))) +
    to_integer(unsigned(din(11 downto 11))) +
    to_integer(unsigned(din(12 downto 12))) +
    to_integer(unsigned(din(13 downto 13))) +
    to_integer(unsigned(din(14 downto 14))) +
    to_integer(unsigned(din(15 downto 15))) +
    to_integer(unsigned(din(16 downto 16))) +
    to_integer(unsigned(din(17 downto 17))) +
    to_integer(unsigned(din(18 downto 18))) +
    to_integer(unsigned(din(19 downto 19))) +
    to_integer(unsigned(din(20 downto 20))) +
    to_integer(unsigned(din(21 downto 21))) +
    to_integer(unsigned(din(22 downto 22))) +
    to_integer(unsigned(din(23 downto 23))) +
    to_integer(unsigned(din(24 downto 24))) +
    to_integer(unsigned(din(25 downto 25))) +
    to_integer(unsigned(din(26 downto 26))) +
    to_integer(unsigned(din(27 downto 27))) +
    to_integer(unsigned(din(28 downto 28))) +
    to_integer(unsigned(din(29 downto 29))) +
    to_integer(unsigned(din(30 downto 30))) +
    to_integer(unsigned(din(31 downto 31)));

    dout <= std_logic_vector(to_unsigned(cnt,6));

    end one_count_unrolled_arch ;


    And the architecture with a loop is the following:

    architecture one_count_loop_arch of one_count_loop is

    signal cnt : integer range 0 to 32;

    begin

    process(din) is
    variable tmp : integer range 0 to 32;
    begin
    tmp := to_integer(unsigned(din(0 downto 0)));
    for i in 1 to 31 loop
    tmp := tmp + to_integer(unsigned(din(i downto i)));
    end loop;
    cnt <= tmp;
    end process;

    dout <= std_logic_vector(to_unsigned(cnt,6));

    end one_count_loop_arch ;


    I would be really grateful if somebody could point out what I did
    wrong with the 2. solution with loop.
    It certainly must be my mistake, but I can not find it...

    Additionally, I know that this "brute-force" one counting might not be
    the optimal approach,
    but this is just my first attempt to get the job done. If somebody has
    a better solution, I would
    appreciate it if you could share it.

    Regards,
    Peter
    a s, Mar 1, 2011
    #1
    1. Advertising

  2. a s

    Tricky Guest

    Re: Count bits in VHDL, with loop and unrolled loop producesdifferent results

    On Mar 1, 12:59 pm, a s <> wrote:
    > Dear all,
    >
    > I have come up with 2 solutions in VHDL, how to count number of bits
    > in input data.
    > The thing I don't understand is why the 2 solutions produce different
    > results, at least with Xilinx ISE and its XST.
    > There is quite a substantial difference in required number of slices/
    > LUTs.
    >
    > 1. solution with unrolled loop:        41 slices,  73 LUTs
    > 2. solution with loop:                    54 slices, 100 LUTs
    >
    > The entity of both architectures is the same:
    >
    > entity one_count is
    >   Port ( din : in  STD_LOGIC_vector(31 downto 0);
    >          dout : out  STD_LOGIC_vector(5 downto 0)
    >         );
    > end one_count;
    >
    > The architecture with an unrolled loop is the following:
    >
    > library IEEE;
    > use IEEE.STD_LOGIC_1164.ALL;
    > use IEEE.NUMERIC_STD.ALL;
    >
    > entity one_count is
    >   Port ( din : in  STD_LOGIC_vector(31 downto 0);
    >          dout : out  STD_LOGIC_vector(5 downto 0)
    >         );
    > end one_count;
    >
    > architecture one_count_unrolled_arch of one_count is
    >
    >   signal  cnt : integer range 0 to 32;
    >
    > begin
    >
    >    cnt <= to_integer(unsigned(din( 0 downto  0))) +
    >           to_integer(unsigned(din( 1 downto  1))) +
    >           to_integer(unsigned(din( 2 downto  2))) +
    >           to_integer(unsigned(din( 3 downto  3))) +
    >           to_integer(unsigned(din( 4 downto  4))) +
    >           to_integer(unsigned(din( 5 downto  5))) +
    >           to_integer(unsigned(din( 6 downto  6))) +
    >           to_integer(unsigned(din( 7 downto  7))) +
    >           to_integer(unsigned(din( 8 downto  8))) +
    >           to_integer(unsigned(din( 9 downto  9))) +
    >           to_integer(unsigned(din(10 downto 10))) +
    >           to_integer(unsigned(din(11 downto 11))) +
    >           to_integer(unsigned(din(12 downto 12))) +
    >           to_integer(unsigned(din(13 downto 13))) +
    >           to_integer(unsigned(din(14 downto 14))) +
    >           to_integer(unsigned(din(15 downto 15))) +
    >           to_integer(unsigned(din(16 downto 16))) +
    >           to_integer(unsigned(din(17 downto 17))) +
    >           to_integer(unsigned(din(18 downto 18))) +
    >           to_integer(unsigned(din(19 downto 19))) +
    >           to_integer(unsigned(din(20 downto 20))) +
    >           to_integer(unsigned(din(21 downto 21))) +
    >           to_integer(unsigned(din(22 downto 22))) +
    >           to_integer(unsigned(din(23 downto 23))) +
    >           to_integer(unsigned(din(24 downto 24))) +
    >           to_integer(unsigned(din(25 downto 25))) +
    >           to_integer(unsigned(din(26 downto 26))) +
    >           to_integer(unsigned(din(27 downto 27))) +
    >           to_integer(unsigned(din(28 downto 28))) +
    >           to_integer(unsigned(din(29 downto 29))) +
    >           to_integer(unsigned(din(30 downto 30))) +
    >           to_integer(unsigned(din(31 downto 31)));
    >
    >    dout <= std_logic_vector(to_unsigned(cnt,6));
    >
    > end one_count_unrolled_arch ;
    >
    > And the architecture with a loop is the following:
    >
    > architecture one_count_loop_arch of one_count_loop is
    >
    > signal  cnt : integer range 0 to 32;
    >
    > begin
    >
    >   process(din) is
    >     variable  tmp : integer range 0 to 32;
    >     begin
    >       tmp := to_integer(unsigned(din(0 downto 0)));
    >       for i in 1 to 31 loop
    >           tmp := tmp + to_integer(unsigned(din(i downto i)));
    >       end loop;
    >       cnt <= tmp;
    >   end process;
    >
    >   dout <= std_logic_vector(to_unsigned(cnt,6));
    >
    > end one_count_loop_arch ;
    >
    > I would be really grateful if somebody could point out what I did
    > wrong with the 2. solution with loop.
    > It certainly must be my mistake, but I can not find it...
    >
    > Additionally, I know that this "brute-force" one counting might not be
    > the optimal approach,
    > but this is just my first attempt to get the job done. If somebody has
    > a better solution, I would
    > appreciate it if you could share it.
    >
    > Regards,
    > Peter


    see what you get with this function instead (a function I have used
    before):

    function count_ones(slv : std_logic_vector) return natural is
    varaible n_ones : natural := 0;
    begin
    for i in slv'range loop
    if slv(i) = '1' then
    n_ones := n_ones + 1;
    end if;
    end loop;

    return n_ones;
    end function count_ones;

    .....

    inside architecture, no process needed:

    dout <= std_logic_vector( to_unsigned(count_ones(din), dout'length) );


    The beauty with this is the function will work with a std_logic_vector
    of any length.
    Tricky, Mar 1, 2011
    #2
    1. Advertising

  3. a s

    Andy Guest

    Re: Count bits in VHDL, with loop and unrolled loop producesdifferent results

    A good synthesis tool should be able to optimize either version to the
    same implementation. But there are semantic differences that Xilinx
    may be getting hung up on.

    In the unrolled version, you have a long expression, and there is
    freedom within vhdl to evaluate it in different orders or groups of
    operations. In the loop version, since you are continually updating
    tmp, you are describing an explicitly sequential order in which the
    calculation takes place. Like I said, a good synthesis tool should be
    able to handle either one equally well, but you get what you pay for
    in synthesis tools.

    If you are looking for a general solution to the problem for any size
    vector, try a recursive function implentation of a binary tree and see
    what happens.

    Just for kicks, you might also put the whole calculation in the loop
    (0 to 31), with temp set to 0 before the loop. Shouldn't make any
    difference, but then again, we're already seeing differences where
    there should be none.

    On the other hand, if what you have works (fits and meets timing), use
    the most maintainable, understandable version. It will save you time
    (=money) in the long run. It is often interesting to find out what is
    "the optimal" way to code something such that it results in the
    smallest/fastest circuit. But in the big picture, it most often does
    not matter, especially when you write some cryptic code to squeeze the
    last pS/LUT out of it, and you had plenty of slack and space to spare
    anyway. Nevertheless, knowing how to squeeze every pS/LUT comes in
    handy every once in a while.

    Andy
    Andy, Mar 1, 2011
    #3
  4. a s

    a s Guest

    Re: Count bits in VHDL, with loop and unrolled loop producesdifferent results

    Dear Andy, Tricky,

    thank you both for your valuable input. Please find my comments below.

    On Mar 1, 3:49 pm, Andy <> wrote:
    > A good synthesis tool should be able to optimize either version to the
    > same implementation. But there are semantic differences that Xilinx
    > may be getting hung up on.


    Aha, that's a good thing, it means that I did not make some obvious
    mistake. ;-)

    > If you are looking for a general solution to the problem for any size
    > vector, try a recursive function implentation of a binary tree and see
    > what happens.


    OK, I didn't quite get that but will consider it again.

    > Just for kicks, you might also put the whole calculation in the loop
    > (0 to 31), with temp set to 0 before the loop. Shouldn't make any
    > difference, but then again, we're already seeing differences where
    > there should be none.


    Sorry I didn't tell you before. I have already tried that and in this
    case XST produces the same result.

    > On the other hand, if what you have works (fits and meets timing), use
    > the most maintainable, understandable version. It will save you time
    > (=money) in the long run. It is often interesting to find out what is
    > "the optimal" way to code something such that it results in the
    > smallest/fastest circuit. But in the big picture, it most often does
    > not matter, especially when you write some cryptic code to squeeze the
    > last pS/LUT out of it, and you had plenty of slack and space to spare
    > anyway. Nevertheless, knowing how to squeeze every pS/LUT comes in
    > handy every once in a while.


    Andy, I completely agree with what you have written above.
    One should strive for maintainable and understandable version.
    Although, on my particular case, I have to find a good solution
    in terms of LUT resources, because I need 8 instances of
    one counters with 64-bit input data. And the device is getting full...


    Tricky, your approach does indeed look very neat. I like it.
    Although it is far less efficient than mine. For the same input/output
    ports, your version with function requires 171 Slices in 313 LUTs.
    (The minimum that I get with unrolled version is 41 Slices and 73
    LUTs).
    a s, Mar 1, 2011
    #4
  5. a s

    johnp Guest

    Re: Count bits in VHDL, with loop and unrolled loop producesdifferent results

    Here's a slightly different approach to your problem.... It tries to
    take advantage
    of the fact that the LUTs are pretty good as lookup tables. It's in
    Verilog, but
    you should easily be able to convert it to VHDL.

    `define V1

    module tst (
    input [31:0] data,
    output [5:0] cnt
    );


    `ifdef V1

    /*
    Device utilization summary:
    ---------------------------

    Selected Device : 3s50pq208-5

    Number of Slices: 35 out of 768
    4%
    Number of 4 input LUTs: 62 out of 1536
    4%
    Number of IOs: 38
    Number of bonded IOBs: 0 out of 124
    0%
    */


    function [1:0] cnt3;
    input [2:0] d_in;
    begin
    case (d_in)
    3'h0: cnt3 = 2'h0;
    3'h1: cnt3 = 2'h1;
    3'h2: cnt3 = 2'h1;
    3'h3: cnt3 = 2'h2;
    3'h4: cnt3 = 2'h1;
    3'h5: cnt3 = 2'h2;
    3'h6: cnt3 = 2'h2;
    3'h7: cnt3 = 2'h3;
    endcase
    end
    endfunction

    assign cnt = cnt3(data[2:0])
    + cnt3(data[5:3])
    + cnt3(data[8:6])
    + cnt3(data[11:9])
    + cnt3(data[14:12])
    + cnt3(data[17:15])
    + cnt3(data[20:18])
    + cnt3(data[23:21])
    + cnt3(data[26:24])
    + cnt3(data[29:27])
    + cnt3({1'b0, data[31:30]})
    ;

    `endif


    `ifdef V2
    /*
    Selected Device : 3s50pq208-5

    Number of Slices: 44 out of 768
    5%
    Number of 4 input LUTs: 79 out of 1536
    5%
    Number of IOs: 38
    Number of bonded IOBs: 0 out of 124
    0%
    */

    function [2:0] cnt4;
    input [3:0] d_in;
    begin
    case (d_in)
    4'h0: cnt4 = 3'h0;
    4'h1: cnt4 = 3'h1;
    4'h2: cnt4 = 3'h1;
    4'h3: cnt4 = 3'h2;
    4'h4: cnt4 = 3'h1;
    4'h5: cnt4 = 3'h2;
    4'h6: cnt4 = 3'h2;
    4'h7: cnt4 = 3'h3;

    4'h8: cnt4 = 3'h1;
    4'h9: cnt4 = 3'h2;
    4'ha: cnt4 = 3'h2;
    4'hb: cnt4 = 3'h3;
    4'hc: cnt4 = 3'h2;
    4'hd: cnt4 = 3'h3;
    4'he: cnt4 = 3'h3;
    4'hf: cnt4 = 3'h4;

    endcase
    end
    endfunction

    assign cnt = cnt4(data[3:0])
    + cnt4(data[7:4])
    + cnt4(data[11:8])
    + cnt4(data[15:12])
    + cnt4(data[19:16])
    + cnt4(data[23:20])
    + cnt4(data[27:24])
    + cnt4(data[31:28])
    ;


    `endif
    endmodule


    John Providenza
    johnp, Mar 1, 2011
    #5
  6. a s

    Andy Guest

    Re: Count bits in VHDL, with loop and unrolled loop producesdifferent results

    > > If you are looking for a general solution to the problem for any size
    > > vector, try a recursive function implentation of a binary tree and see
    > > what happens.

    >
    > OK, I didn't quite get that but will consider it again.
    >


    The synthesis tool may not be able to figure out that it need not
    carry all the bits of the sum through every caculation in the loop. A
    binary tree implementation can manage the sum width at every stage.

    For a recursive binary tree implementation, define a function that
    divides the vector into two ~equal parts (i.e. n/2 and n/2 + n mod 2),
    calls itself on each one, and returns the sum of the two results. Stop
    recursion when the size of the incoming vector is 1 (just return 1 if
    the bit is set, and 0 if not). This is synthesizeable as long as the
    recursion is statically bound (which it is, by the size of the
    original vector).

    It should work out pretty close to what johnp's approach does, except
    work for any size input vector.

    Andy
    Andy, Mar 1, 2011
    #6
  7. In comp.arch.fpga Andy <> wrote:
    (snip)

    > The synthesis tool may not be able to figure out that it need not
    > carry all the bits of the sum through every caculation in the loop. A
    > binary tree implementation can manage the sum width at every stage.


    Is that the same as a Carry Save Adder tree?

    Maybe not. The CSA has three inputs and two outputs, so it
    isn't exactly binary.

    -- glen
    glen herrmannsfeldt, Mar 2, 2011
    #7
  8. a s

    Brian Davis Guest

    Re: Count bits in VHDL, with loop and unrolled loop producesdifferent results

    Peter wrote:
    >
    > If somebody has a better solution, I would appreciate it if you could share it.
    >

    When I looked at this some years back, XST worked well enough
    at creating an adder cascade using the old "mask and add" software
    trick that I never bothered writing something more optimal:

    gen_bcnt32: if ( ALU_WIDTH = 32 ) generate
    begin

    process(din)
    -- multiple variables not needed, but make intermediate steps
    visible in simulation
    variable temp : unsigned (ALU_MSB downto 0);
    variable temp1 : unsigned (ALU_MSB downto 0);
    variable temp2 : unsigned (ALU_MSB downto 0);
    variable temp3 : unsigned (ALU_MSB downto 0);
    variable temp4 : unsigned (ALU_MSB downto 0);
    variable temp5 : unsigned (ALU_MSB downto 0);

    begin
    temp := unsigned(din);

    temp1 := (temp AND X"5555_5555") + ( ( temp srl 1) AND
    X"5555_5555"); -- 0..2 out x16

    temp2 := (temp1 AND X"3333_3333") + ( ( temp1 srl 2) AND
    X"3333_3333"); -- 0..4 out x8

    temp3 := (temp2 AND X"0707_0707") + ( ( temp2 srl 4) AND
    X"0707_0707"); -- 0..8 out x4

    temp4 := (temp3 AND X"001f_001f") + ( ( temp3 srl 8) AND
    X"001f_001f"); -- 0..16 out x2

    temp5 := (temp4 AND X"0000_003f") + ( ( temp4 srl 16) AND
    X"0000_003f"); -- 0..32 out

    cnt <= std_logic_vector(temp5(5 downto 0));

    end process;

    end generate gen_bcnt32;


    Brian
    Brian Davis, Mar 2, 2011
    #8
  9. In comp.arch.fpga Brian Davis <> wrote:
    (snip)

    > When I looked at this some years back, XST worked well enough
    > at creating an adder cascade using the old "mask and add" software
    > trick that I never bothered writing something more optimal:


    (snip)

    > temp1 := (temp AND X"5555_5555") + ( ( temp srl 1) AND
    > X"5555_5555"); -- 0..2 out x16
    >
    > temp2 := (temp1 AND X"3333_3333") + ( ( temp1 srl 2) AND
    > X"3333_3333"); -- 0..4 out x8


    OK, that would be a binary tree. I believe the CSA adder tree
    is slightly more efficient, though it might depend on the number
    of inputs.

    The binary tree cascade works especially well on word oriented
    machines, and can be easily written in many high-level languages
    (with the assumption of knowing the word size).

    The first stage of a CSA tree starts with N inputs, and generates
    N/3 two bit outputs that are the sums and carries from N/3 full adders.
    (If N isn't a multiple of three, then one bit may bypass the stage,
    and two bits go into a half adder.)

    The next stage takes the N/3 ones and N/3 twos, and generates
    N/9 ones, 2N/9 twos, and N/9 fours. You can continue until
    there is only one bit of each, or sometimes there are other
    optimizations near the end. Last time I did one, I only needed
    to know zero, one, two, three, or more than three, which simplifies
    it slightly.

    It also pipelines well, but then so does the binary tree.

    -- glen
    glen herrmannsfeldt, Mar 2, 2011
    #9
  10. a s

    a s Guest

    Re: Count bits in VHDL, with loop and unrolled loop producesdifferent results

    Andy nailed it again when he said you get what you pay for
    regarding synthesis tool. Initially I was synthesising with
    ISE12.4 and the results were, hm, inconsistent. After
    switching the synthesis tool to SynplifyPro v2010.03 the
    results were as expected and, of course, even better than that.

    Please see the table below. Tricky's version is denoted "funct",
    where there are major differences between the 2 tools:

    ---------- 32-bit input data --------
    unrolled: XST 74 LUTs, 41 slices
    unrolled: SynplifyPro 57 LUTs, 34 slices

    loop: XST 100 LUTs, 54 slices
    loop: SynplifyPro 57 LUTs, 34 slices

    funct: XST 317 LUTs, 161 slices
    funct: SynplifyPro 58 LUTs, 34 slices

    ---------- 64-bit input data --------
    unrolled: XST 174 LUTs, 96 slices
    unrolled: SynplifyPro 129 LUTs, 80 slices

    loop: XST 227 LUTs, 121 slices
    loop: SynplifyPro 129 LUTs, 80 slices

    funct: XST 813 LUTs, 436 slices
    funct: SynplifyPro 130 LUTs, 82 slices



    Peter
    a s, Mar 2, 2011
    #10
  11. a s

    a s Guest

    Re: Count bits in VHDL, with loop and unrolled loop producesdifferent results

    Johnp, Brian, thank you too for your input! Much appreciated.

    I have ran your code through 2 synthesisers and have updated the table
    of required resources.

    -------------- 32-bit input data --------------
    unrolled: XST 74 LUTs, 41 slices
    unrolled: SynplifyPro 57 LUTs, 34 slices

    loop: XST 100 LUTs, 54 slices
    loop: SynplifyPro 57 LUTs, 34 slices

    funct: XST 317 LUTs, 161 slices
    funct: SynplifyPro 58 LUTs, 34 slices

    JohnpV1: XST 62 LUTs, 35 slices
    JohnpV1: SynplifyPro 57 LUTs, 33 slices

    JohnpV2: XST 78 LUTs, 43 slices
    JohnpV2: SynplifyPro 54 LUTs, 32 slices

    Brian: XST 57 LUTs, 39 slices
    Brian: SynplifyPro 57 LUTs, 41 slices


    The latest 3 pairs of results are interesting because even
    XST produces good results, especially in Brian's version
    where XST is surprisingly even slightly better. But anyway,
    it's not that XST is so clever, it is a clever coding of the design.

    Regards,
    Peter
    a s, Mar 2, 2011
    #11
  12. a s

    Andy Guest

    Re: Count bits in VHDL, with loop and unrolled loop producesdifferent results

    Thanks for publishing your results!

    It is interesting how little variance there is in the SynplifyPro
    results from widely different RTL implementations. This allows you,
    within limits, to code something so that it makes sense functionally
    to you and others, and the synthesis tool will still get pretty darn
    close to "the optimal" implementation so that it will work even in
    demanding environments (speed and/or space constrained). This also
    allows reuse of the same code across more environments.

    Andy
    Andy, Mar 2, 2011
    #12
  13. In comp.arch.fpga Andy <> wrote:

    > It is interesting how little variance there is in the SynplifyPro
    > results from widely different RTL implementations. This allows you,
    > within limits, to code something so that it makes sense functionally
    > to you and others, and the synthesis tool will still get pretty darn
    > close to "the optimal" implementation so that it will work even in
    > demanding environments (speed and/or space constrained). This also
    > allows reuse of the same code across more environments.


    As far as I know, yes, the tools are pretty good at optimizing
    combinatorial logic. This problem can be pipelined, though, and
    as far as I know the tools don't have a way to optimize that.

    You would at least have to specify the number of pipeline stages.
    It would be nice to have a tool that would optimize the partition
    between stages. Even more, given the clock frequency, it would be
    nice to have a tool that would find the optimal number of pipeline
    stages.

    -- glen
    glen herrmannsfeldt, Mar 2, 2011
    #13
  14. a s

    Andy Guest

    Re: Count bits in VHDL, with loop and unrolled loop producesdifferent results

    On Mar 2, 1:31 pm, glen herrmannsfeldt <> wrote:
    > As far as I know, yes, the tools are pretty good at optimizing
    > combinatorial logic.  This problem can be pipelined, though, and
    > as far as I know the tools don't have a way to optimize that.
    >
    > You would at least have to specify the number of pipeline stages.
    > It would be nice to have a tool that would optimize the partition
    > between stages.  Even more, given the clock frequency, it would be
    > nice to have a tool that would find the optimal number of pipeline
    > stages.  
    >
    > -- glen


    Depending on which tools to which you are referring, yes they can be
    very good. But there is a large difference between some tools even in
    their ability to optimize purely combinatorial circuits, as shown in
    Peter's results.

    The Synplicity and Mentor tools have the capability to optimize logic
    across register boundaries (re-balancing). Some do it over more than
    one boundary (moving logic more than one clock cycle forward/back).

    You still have to model the pipeline stages, but that can be a simple
    register array tacked onto the beginning or end of the logic, and you
    just let the tool redistribute the logic amongst them.

    Latency often affects other portions of the design, so unless the
    entire design is "floated" WRT to latency (clock cycles or pipeline
    stages), it makes little sense for a local optimizing routine to pick
    the number of stages.

    The behavioral C synthesis tools (Catapult-C and others) take a
    completely untimed C algorithm in the form of a function, and allow
    you to trade resources, clock speed, latency, etc. with the help of
    different views including a Gant chart of various resources and their
    utilization. They also can synthesize different types of hardware
    interfaces, including registers, streaming data, memories (single and
    dual port), etc. around the function.

    Andy
    Andy, Mar 2, 2011
    #14
  15. In comp.arch.fpga Andy <> wrote:

    (after I wrote)
    >> As far as I know, yes, the tools are pretty good at optimizing
    >> combinatorial logic.  This problem can be pipelined, though, and
    >> as far as I know the tools don't have a way to optimize that.

    >
    >> You would at least have to specify the number of pipeline stages.
    >> It would be nice to have a tool that would optimize the partition
    >> between stages.  Even more, given the clock frequency, it would be
    >> nice to have a tool that would find the optimal number of pipeline
    >> stages.  


    > Depending on which tools to which you are referring, yes they can be
    > very good. But there is a large difference between some tools even in
    > their ability to optimize purely combinatorial circuits, as shown in
    > Peter's results.


    > The Synplicity and Mentor tools have the capability to optimize logic
    > across register boundaries (re-balancing). Some do it over more than
    > one boundary (moving logic more than one clock cycle forward/back).


    > You still have to model the pipeline stages, but that can be a simple
    > register array tacked onto the beginning or end of the logic, and you
    > just let the tool redistribute the logic amongst them.


    That does sound pretty nice of them. Lately I mostly use Xilinx
    ISE which, as far as I know, doesn't do that.

    > Latency often affects other portions of the design, so unless the
    > entire design is "floated" WRT to latency (clock cycles or pipeline
    > stages), it makes little sense for a local optimizing routine to pick
    > the number of stages.


    I have done systolic array designs that do, as you say, float.
    The constraint is on the clock period. Though in addition there
    is the question of the number of unit cells that can fit into
    one FPGA. Throughput is clock frequency times (cells/FPGA).

    > The behavioral C synthesis tools (Catapult-C and others) take a
    > completely untimed C algorithm in the form of a function, and allow
    > you to trade resources, clock speed, latency, etc. with the help of
    > different views including a Gant chart of various resources and their
    > utilization. They also can synthesize different types of hardware
    > interfaces, including registers, streaming data, memories (single and
    > dual port), etc. around the function.


    Are there tools that will convert a sequential C code
    dynamic programming algorithm to a systolic array?
    That would be a pretty amazing optimization.

    -- glen
    glen herrmannsfeldt, Mar 3, 2011
    #15
  16. a s

    JustJohn Guest

    Re: Count bits in VHDL, with loop and unrolled loop producesdifferent results

    On Mar 1, 8:41 am, Peter wrote:
    > Andy, I completely agree with what you have written above.
    > One should strive for maintainable and understandable version.
    > Although, on my particular case, I have to find a good solution
    > in terms of LUT resources, because I need 8 instances of
    > one counters with 64-bit input data. And the device is getting full...


    Peter,
    It looks like you've solved your problem simply by moving to a
    better synthesis tool, so this may not be of interest anymore.
    However, in addition to space, finding a more compact implementation
    often leads to a speed increase as well, AND power savings.
    Additionally, I like this counting bits problem, it turns up often
    enough that it deserves some attention. As to maintainability, nothing
    promotes that more than good comments. Once the function is written it
    can be stuffed away in a package, never dealt with again (If anyone
    copies the code, please include credit).

    See my other post for the most compact/fastest way to implement the 32-
    bit sum using 4-LUTs and taking advantage of carry logic fabric.
    Gabor's post of the 35 LUT number when using 6-LUTs got me looking at
    that case. Here are the results (Spartan 3 for the 4-LUTs, Spartan 6
    for the 6-LUTs, XST for both, I'd be curious if any other synthesis
    tool does better). Synthesizers continually improve, but nothing beats
    a good look at the problem, as the 6-LUT case illustrates with a
    better than 2:1 savings:

    vec32_sum2: 4-input LUTs: 53 Slices: 31
    vec32_sum3: 6-input LUTs: 15 Slices: 4

    Finally, this is a neat problem because it's nice to make the little
    things count.

    Best regards all,
    John L. Smith

    Code for 6-LUT based 32 bit sum:
    function vec32_sum3( in_vec : in UNSIGNED ) return UNSIGNED is
    type Leaf_type is array ( 0 to 63 ) of UNSIGNED( 2 downto
    0 );
    -- Each ROM entry is sum of address bits:
    constant Leaf_ROM : Leaf_type := (
    "000", "001", "001", "010", "001", "010", "010", "011",
    "001", "010", "010", "011", "010", "011", "011", "100",
    "001", "010", "010", "011", "010", "011", "011", "100",
    "010", "011", "011", "100", "011", "100", "100", "101",
    "001", "010", "010", "011", "010", "011", "011", "100",
    "010", "011", "011", "100", "011", "100", "100", "101",
    "010", "011", "011", "100", "011", "100", "100", "101",
    "011", "100", "100", "101", "100", "101", "101", "110" );
    type S3_Type is array ( 0 to 4 ) of UNSIGNED( 2 downto 0 );
    variable S3 : S3_Type;
    variable result : UNSIGNED( 5 downto 0 );
    variable Leaf_Bits : natural range 0 to 63;
    variable S4_1 : UNSIGNED( 3 downto 0 );
    variable S4_2 : UNSIGNED( 3 downto 0 );
    variable S5_1 : UNSIGNED( 4 downto 0 );
    begin
    -- Form five 3-bit sums using three 6-LUTs each:
    for i in 0 to 4 loop
    Leaf_Bits := TO_INTEGER( UNSIGNED( in_vec( 6 * i + 5 downto 6 *
    i ) ) );
    S3( i ) := Leaf_ROM( Leaf_Bits );
    end loop;
    -- Add two 3-bit sums + leftover leaf bits as a carry in to get 4
    bit sums:
    S4_1 := ( "0" & S3( 0 ) ) + ( "0" & S3( 1 ) )
    + ( "000" & in_vec( 30 ) );
    S4_2 := ( "0" & S3( 2 ) ) + ( "0" & S3( 3 ) )
    + ( "000" & in_vec( 31 ) );
    -- Add 4 bit sums to get 5 bit sum:
    S5_1 := ( "0" & S4_1 ) + ( "0" & S4_2 );
    -- Add leftover 3 bit sum to get 5 bit result:
    result := ( "0" & S5_1 )
    + ( "000" & S3( 4 ) );
    return result;
    end vec32_sum3;
    JustJohn, Mar 7, 2011
    #16
  17. a s

    JustJohn Guest

    Re: Count bits in VHDL, with loop and unrolled loop producesdifferent results

    On Mar 6, 11:30 pm, JustJohn <> wrote:
    > On Mar 1, 8:41 am, Peter wrote:
    >
    > > Andy, I completely agree with what you have written above.
    > > One should strive for maintainable and understandable version.
    > > Although, on my particular case, I have to find a good solution
    > > in terms of LUT resources, because I need 8 instances of
    > > one counters with 64-bit input data. And the device is getting full...

    >
    > Peter,
    >   It looks like you've solved your problem simply by moving to a
    > better synthesis tool, so this may not be of interest anymore.
    > However, in addition to space, finding a more compact implementation
    > often leads to a speed increase as well, AND power savings.
    > Additionally, I like this counting bits problem, it turns up often
    > enough that it deserves some attention. As to maintainability, nothing
    > promotes that more than good comments. Once the function is written it
    > can be stuffed away in a package, never dealt with again (If anyone
    > copies the code, please include credit).
    >
    > See my other post for the most compact/fastest way to implement the 32-
    > bit sum using 4-LUTs and taking advantage of carry logic fabric.
    > Gabor's post of the 35 LUT number when using 6-LUTs got me looking at
    > that case. Here are the results (Spartan 3 for the 4-LUTs, Spartan 6
    > for the 6-LUTs, XST for both, I'd be curious if any other synthesis
    > tool does better). Synthesizers continually improve, but nothing beats
    > a good look at the problem, as the 6-LUT case illustrates with a
    > better than 2:1 savings:
    >
    >   vec32_sum2:  4-input LUTs: 53  Slices: 31
    >   vec32_sum3:  6-input LUTs: 15  Slices: 4
    >
    > Finally, this is a neat problem because it's nice to make the little
    > things count.
    >
    > Best regards all,
    > John L. Smith
    >

    Chagrinned OOPS!, synthesizer was throwing the leaf ROMs into BRAMs
    for the 6-LUT case. Actual numbers for the 6-LUT case without BRAMs
    are (and this is not a synthesis benchmark, it is an illustraion of
    efficient circuit structure expressed via coding style to reduce LUT
    usage, applicable across all synthesis tools, thanks for the warning
    Philippe, and BTW I AGREE, EULAs forbidding benchmarks are evil):

    vec32_sum3: 6-input LUTs: 30 Slices: 4

    Still tops 35, although some may consider the code too complex for
    marginal improvement.

    Regards,
    John
    JustJohn, Mar 8, 2011
    #17
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. harryajh
    Replies:
    9
    Views:
    626
    harryajh
    Mar 27, 2008
  2. Mark B
    Replies:
    1
    Views:
    399
    Göran Andersson
    Feb 11, 2009
  3. AAaron123
    Replies:
    1
    Views:
    738
    AAaron123
    Sep 13, 2009
  4. a s
    Replies:
    2
    Views:
    1,689
    JustJohn
    Mar 4, 2011
  5. Gabor Sz
    Replies:
    0
    Views:
    849
    Gabor Sz
    Mar 5, 2011
Loading...

Share This Page