Xilinx WebPack 8.1i "desoptimization"

Discussion in 'VHDL' started by Rafal Pietrak, Apr 11, 2006.

  1. Hi,

    I've been playing with Tim Böscke's "Minimal 8 Bit CPU" code, which I
    found some time ago at http://www.tu-harburg.de/~setb0209/cpu/ (currently
    inaccessible, but maybe Google can reveal other repositories).

    My code is now significantly different from the original, but the basic
    functionality remains: the CPU has 2 bits of instruction code and 6 bits
    of address space.

    All was well until I tried to change the instruction encoding (the
    complete source of the CPU is at the tail of this post). The CPU
    synthesizes to 35 macrocells of an xc9536xl CPLD.

    But when the encoding changes to the "new instruction encoding" listed in
    the source header below, Xilinx WebPack is not able to fit it in fewer
    than 45 macrocells!

    The instruction encoding values are used solely in a few comparisons at
    the end of the source (marked by "HERE" comments), so this should be
    easily implemented within the product-term block. But it isn't!! And the
    encoding change alters the synthesis result dramatically!

    Can someone shed some light on possible reasons? I've just started to use
    enumerated types wherever 'state decoding' applies, but this example
    shows that the synthesizer may not be entirely trusted to synthesize
    arbitrary 'code' encoding/decoding logic optimally.

    So:
    1. Is there an explanation for why the synthesizer runs away after such
    slight (and seemingly unimportant) source changes?
    2. Is there a way to know that the synthesizer 'fell into disoptimization'?
    (This is a *very* simple design, so the problem is easily spotted, yet I
    would have had trouble seeing that it existed if I hadn't been lucky
    enough to choose the *correct* encoding on the first run.)
    3. Is there a way to hint the synthesizer (e.g. with VHDL statements)
    towards a 'proper' encoding that leads to optimized synthesis? How would I
    know which encoding is 'fine' for the synthesizer and which isn't?

    Any comments appreciated!

    -R

    ---------------------------------------------------------------
    --
    -- Minimal 8 Bit CPU
    --
    -- new instruction encoding (for OPCODE):
    -- 00 - JCC (branch if carry clear, clear carry)
    -- 01 - ADD
    -- 10 - NOR
    -- 11 - STA
    --

    library ieee;
    use ieee.std_logic_1164.all;
    use ieee.std_logic_unsigned.all;

    entity CPU8BIT2 is
      port ( data:     inout std_logic_vector(7 downto 0);
             adress:   out   std_logic_vector(5 downto 0);
             oe, we:   out   std_logic; -- Asynchronous memory interface
             clk, rst: in    std_logic);
    end;

    architecture CPU_ARCH of CPU8BIT2 is
      signal akku, state: std_logic_vector(8 downto 0); -- akku(8) is carry !
      signal pc: std_logic_vector(5 downto 0);
      alias execute: std_logic is state(8);
      alias opcode: std_logic_vector(1 downto 0) is state(7 downto 6);
      alias adreg: std_logic_vector(5 downto 0) is state(5 downto 0);
    begin
      process(clk,rst)
      begin
        if (rst = '0') then
          state <= (others => '0');
          akku <= (others => '0');
          pc <= (others => '0'); -- start execution at memory location 0
        elsif rising_edge(clk) then
          if (execute = '0') then -- instruction fetch
            pc <= adreg + 1;
            state <= '1' & data; -- fetch the instruction/address
          else -- instruction execution
            state <= "000" & pc;
            if (opcode = "00") then
              if (akku(8) = '1') then
                akku(8) <= '0'; -- ... branch NOT taken, just clear CARRY
              else
                pc <= adreg + 1; -- branch taken... fetch instruction there
                state <= '1' & data;
              end if;
            elsif (opcode = "10") then -- HERE!!
              akku <= ("0" & akku(7 downto 0)) + data + akku(8);
            elsif (opcode = "11") then -- HERE!!
              akku(7 downto 0) <= akku(7 downto 0) nor data;
            end if;
          end if;
        end if;
      end process;

      -- combinational logic (.... HERE!! ... changes to OPCODE encoding)
      adress <= adreg;
      data <= "ZZZZZZZZ" when opcode /= "01" else akku(7 downto 0);
      we <= '1' when (clk='1' or opcode /= "01" or rst='0') else '0'; -- state "101" (branch not taken)
      oe <= '1' when (clk='1' or opcode = "01" or rst='0') else '0'; -- no memory access during reset and

    end CPU_ARCH;
    Rafal Pietrak, Apr 11, 2006
    #1

  2. Rafal Pietrak wrote:

    > But when encoding changes to "new instruction encoding" listed below in
    > source header, Xilinx WebPack is not able to put it in fewer than 45
    > macrocells!


    I don't have the before picture, but let's put your code
    on the Quartus viewer:
    http://home.comcast.net/~mike_treseler/cpu8bit2.pdf
    ______________________________________________
    Total logic elements                        50
    -- Combinational with no register           26
    -- Register only                             1
    -- Combinational with a register            23
    Logic element usage by number of LUT inputs
    -- 4 input functions                         4
    -- 3 input functions                        23
    -- 2 input functions                        19
    -- 1 input functions                         3
    ______________________________________________

    Comments:
    23 registers bypassed for combinational functions.
    No output register.
    Gated clock on WE and OE.

    > The instruction encoding values are used solely in a few comparisons at
    > the end of the source (marked by "HERE" comments), so this should be
    > easily implemented within the product-term block. But it isn't!! And the
    > encoding change alters the synthesis result dramatically!


    Looks like you added more logic than you thought.

    > Can someone shed some light on possible reasons? I've just started to use
    > enumerated types wherever 'state decoding' applies, but this example
    > shows that the synthesizer may not be entirely trusted to synthesize
    > arbitrary 'code' encoding/decoding logic optimally.


    The synthesizer did what it was told.
    You have declared no enumerated type.
    The state signal is completely specified.
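
    A minimal sketch of one way to keep such an encoding experiment
    manageable. It assumes a hypothetical package named cpu_opcodes; neither
    the package nor the OP_* names appear in the posted source. The opcode
    values are named once, so a change of encoding touches one place instead
    of every comparison.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical package (not part of the posted source): the single place
-- where the opcode bit patterns are written down.
package cpu_opcodes is
  subtype opcode_t is std_logic_vector(1 downto 0);

  -- Values follow the "new instruction encoding" comment in the source header.
  constant OP_JCC : opcode_t := "00";  -- branch if carry clear, clear carry
  constant OP_ADD : opcode_t := "01";
  constant OP_NOR : opcode_t := "10";
  constant OP_STA : opcode_t := "11";
end package cpu_opcodes;
```

    With "use work.cpu_opcodes.all;" placed ahead of the entity, a comparison
    such as opcode = "01" would read opcode = OP_ADD. This only centralizes
    the encoding; it does not by itself change what the fitter produces.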

    > Any comments appreciated!


    Put the changes into the synchronous process instead of
    adding a blob of asynchronous logic afterward.
    Synchronize the outputs and add pipeline to make use of the
    orphaned registers. Run a sim that uses the output ports.
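
    A simulation that exercises the output ports could be sketched roughly as
    follows. This is an assumption-laden testbench, not something posted in
    the thread: the name cpu8bit2_tb, the 64-byte RAM model, the 100 ns clock
    and the 10 us run time are all invented for illustration, and oe/we are
    treated as active low because that is how the posted output equations
    read. It expects the CPU8BIT2 entity from the first post to be compiled
    into library work.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity cpu8bit2_tb is
end entity cpu8bit2_tb;

architecture sim of cpu8bit2_tb is
  signal data   : std_logic_vector(7 downto 0);
  signal adress : std_logic_vector(5 downto 0);
  signal oe, we : std_logic;
  signal clk    : std_logic := '0';
  signal rst    : std_logic := '0';

  -- Hypothetical 64 x 8 asynchronous RAM model; load a program here.
  type ram_t is array (0 to 63) of std_logic_vector(7 downto 0);
  signal ram : ram_t := (others => (others => '0'));
begin
  -- Device under test: the entity from the first post.
  dut : entity work.CPU8BIT2
    port map (data => data, adress => adress,
              oe => oe, we => we, clk => clk, rst => rst);

  -- 100 ns clock for 10 us, then stop so the simulation can end.
  clock_gen : process
  begin
    while now < 10 us loop
      clk <= '0'; wait for 50 ns;
      clk <= '1'; wait for 50 ns;
    end loop;
    wait;
  end process clock_gen;

  -- Release the active-low reset after a little more than one clock period.
  rst <= '1' after 120 ns;

  -- RAM read: drive the bus only while oe is low, otherwise release it.
  data <= ram(to_integer(unsigned(adress))) when oe = '0'
          else (others => 'Z');

  -- RAM write: transparent while we is low (a simplification of real SRAM timing).
  ram_write : process (we, adress, data)
  begin
    if we = '0' then
      ram(to_integer(unsigned(adress))) <= data;
    end if;
  end process ram_write;
end architecture sim;
```

    With the RAM left at all zeros the CPU only ever sees opcode "00", so a
    useful run needs a small program in the ram initializer. Watching data,
    oe and we in the waveform viewer then shows whether the memory interface
    behaves as intended.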

    -- Mike Treseler
    Mike Treseler, Apr 11, 2006
    #2

  3. On Tue, 11 Apr 2006 10:39:12 -0700, Mike Treseler wrote:

    > Rafal Pietrak wrote:
    >
    >> But when encoding changes to "new instruction encoding" listed below in
    >> source header, Xilinx WebPack is not able to put it in fewer than 45
    >> macrocells!
    >
    > I don't have the before picture, but let's put your code


    I have a copy here: http://poczta.homelinux.com/~rafal/cpu8bit2.vhd

    But the exact equivalence is not an issue here. (well, it will be in a
    moment, just not now).

    I made changes to see whether I could 'tell the author's story in my own
    words'. I presume that to some extent I could, and this is my code now.
    After some significant changes to the original source, I have just
    altered the opcode *semantics*, and the design suddenly does not fit
    into a 36-macrocell CPLD: with this slight change it's 50% larger.

    Those changes are literally:
    -----------------------
    --- ./snap2/tb02cpu2.vhd 2006-04-11 09:48:03.000000000 +0200
    +++ ./snap3/tb02cpu2.vhd 2006-04-11 10:31:28.000000000 +0200
    @@ -60,9 +60,9 @@
    pc <= adreg + 1;
    state <= '1' & data;
    end if;
    - elsif (opcode = "10") then -- add
    + elsif (opcode = "01") then -- add
    akku <= ("0" & akku(7 downto 0)) + data + akku(8);
    - elsif (opcode = "11") then -- nor
    + elsif (opcode = "10") then -- nor
    akku(7 downto 0) <= akku(7 downto 0) nor data;
    end if;
    end if;
    @@ -71,9 +71,9 @@

    -- output
    adress <= adreg;
    - data <= "ZZZZZZZZ" when opcode /= "01" else akku(7 downto 0);
    - we <= '1' when (clk='1' or opcode /= "01" or rst='0') else '0';
    - oe <= '1' when (clk='1' or opcode = "01" or rst='0') else '0';
    + data <= "ZZZZZZZZ" when opcode /= "11" else akku(7 downto 0);
    + we <= '1' when (clk='1' or opcode /= "11" or rst='0') else '0';
    + oe <= '1' when (clk='1' or opcode = "11" or rst='0') else '0';

    end CPU_ARCH;
    ---------------------------------------------------------------

    The source from SNAP2 (lines with a minus sign in front) goes through
    synthesize/translate/place&route and fits in 35 macrocells of an xc95*xl,
    while SNAP3 (lines with a plus sign in front) synthesizes into 45
    macrocells of the same Xilinx CPLD.


    > on the quartus viewer:
    > http://home.comcast.net/~mike_treseler/cpu8bit2.pdf
    > __________________________


    Hmmm. This RTL diagram looks cleaner than what I get from WebPack. Maybe
    I should do these exercises in Quartus...

    Anyway.

    > Looks like you added more logic than you thought.


    Yes. And I'm trying to figure out where it popped into existence. To
    my naive eye, the changes to the source (opcode encoding, highlighted
    above in the form of a diff) are *NOT* significant for product-term based
    hardware (like a CPLD). Apparently, they are. Why?

    >> Can someone shed some light on possible reasons? I've just started to use
    >> enumerated types wherever 'state decoding' applies, but this example
    >> shows that the synthesizer may not be entirely trusted to synthesize
    >> arbitrary 'code' encoding/decoding logic optimally.
    >
    > The synthesizer did what it was told.
    > You have declared no enumerated type.


    That's correct. I haven't.

    > The state signal is completely specified.


    Yes. Exactly.

    The exercise here is to get a feel for what happens with an
    'inappropriate' state encoding. That's why the encoding is stated
    explicitly.

    When I declare an enumerated type, the synthesizer picks some encoding for
    me (I've seen a 4-state FSM encoded as 2 bits (register + demux) or as
    4 bits (shift register), depending on synthesizer/brand/whatever).
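
    For reference, a minimal sketch of how an encoding can be requested for
    an enumerated type. The entity fsm4, its ports and its state names are
    invented for illustration and have nothing to do with the CPU above.
    enum_encoding is a synthesis attribute that several tools recognize, but
    whether and under which options a given synthesizer honors it is
    tool-dependent, so treat this as something to verify against the tool's
    documentation rather than a guarantee.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical 4-state FSM used only to show the attribute syntax.
entity fsm4 is
  port (clk, rst, go : in  std_logic;
        busy         : out std_logic);
end entity fsm4;

architecture rtl of fsm4 is
  type state_t is (IDLE, LOAD, RUN, DONE);

  -- Request one 2-bit code per state, listed in declaration order.
  -- Without this attribute the tool is free to pick binary, one-hot, etc.
  attribute enum_encoding : string;
  attribute enum_encoding of state_t : type is "00 01 11 10";

  signal state : state_t;
begin
  process (clk, rst)
  begin
    if rst = '0' then                 -- active-low asynchronous reset
      state <= IDLE;
    elsif rising_edge(clk) then
      case state is
        when IDLE => if go = '1' then state <= LOAD; end if;
        when LOAD => state <= RUN;
        when RUN  => state <= DONE;
        when DONE => state <= IDLE;
      end case;
    end if;
  end process;

  busy <= '0' when state = IDLE else '1';
end architecture rtl;
```

    Synthesizing the same source with two different encoding strings and
    comparing the fitter reports is one way to repeat the experiment from
    this thread with an enumerated type.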

    With this example I can see that a bad choice of state encoding results
    in significant degradation of the synthesis result.

    So I wonder what will happen if my future design is close to 2000 logic
    slices (like the Spartan family) or more, not just a mere 36 macrocells:
    there will be vast space for similar 'disoptimisation' by the synthesizer.

    Here I've learned that it is hard enough to spot such a 'disoptimisation'
    within a 35-macrocell design. I can't imagine walking through a 200 000
    gate RTL schematic in the future. Proving the functionality by correct
    simulation will be hard enough. Yet two or three such 'disoptimisations'
    within a 200k gate design, and I get a 500k gate design instead, and the
    BOM is suddenly 50% up.

    Still, I'm not quite sure whether keeping tight cost constraints on the
    design is actually required in a real engineer's life. I'm just learning.

    But now I'd really like to know what actually happened within the design
    at hand: why did the slight changes highlighted in the diff above
    trigger such a vast change in macrocell usage?

    -R
    Rafal Pietrak, Apr 11, 2006
    #3
  4. Rafal Pietrak wrote:

    > I have a copy here: http://poczta.homelinux.com/~rafal/cpu8bit2.vhd

    OK, let's view that one:
    http://home.comcast.net/~mike_treseler/cpu8bit.pdf
    Yes, the RTL is much simpler for the raw version.

    > After some significant changes I did to the original source, I've
    > just altered opcode *semantics* - and the design suddenly does not fit
    > into 36-macrocell CPLD - with this slight change it's 50% larger.
    > Those changes are literally:
    > . . .


    No. Something else is different.
    When I back just *that* out,
    I get almost the same usage as your other version:
    http://home.comcast.net/~mike_treseler/cpu8bit_undiff.pdf

    > With this example I can see that a bad choice of state encoding results
    > in significant degradation of the synthesis result.


    No. State encoding is normally a small effect.
    Something else is going on.
    Check your version control logs.

    > Here, I've learned, that it is hard enough to spot such 'disoptimisation'
    > within a 35 macrocell design. I can't imagine walking through 200 000
    > gates RTL schematic in the future.


    A very good point.
    Spend time on design rules, simulation and source control.
    Let synthesis do its job.

    > Still, I'm not quite sure whether keeping tight cost constraints on the
    > design is actually required in a real engineer's life.


    Unexpected synthesis problems are almost
    always due to a missing or unenforced design rule.
    Eventually there are no surprises.
    Good luck

    -- Mike Treseler
    Mike Treseler, Apr 11, 2006
    #4
  5. On Tue, 11 Apr 2006 15:52:40 -0700, Mike Treseler wrote:

    > Rafal Pietrak wrote:
    >
    >> With this example I can see that a bad choice of state encoding results
    >> in significant degradation of the synthesis result.
    >
    > No. State encoding is normally a small effect.
    > Something else is going on.
    > Check your version control logs.


    I double-checked it before the original post. I'm currently 100% sure
    that the diff contains the segment that makes the difference.
    Surprisingly, the RTL does not really differ that much.

    But while checking this again and again, I noticed the following
    warning:
    ----------------------------------------------
    Mapping a total of 36 equations into 2 function blocks...........................................................................
    ERROR:Cpld:892 - Cannot place signal Madd__n0000__n0001/Madd__n0000__n0001_D2.
    Consider reducing the collapsing input limit or the product term limit to
    prevent the fitter from creating high input and/or high product term
    functions.
    Considering device XC9572XL-5-PC44.
    Density optimization.........
    All timespecs have been ignored. Please select "Use Timing Constraints" in the
    implementation options if the timespecs are to be considered.
    General global resource optimization........
    Re-checking device resources ...
    Mapping a total of 45 equations into 4 function blocks...................................
    Design CPU8BIT2 has been optimized and fit into device XC9572XL-5-PC44.
    ---------------------------------------------

    So maybe the 'not-so-optimal' state encoding really resulted in a 36
    macrocell requirement, which somehow mysteriously didn't fit into a 36
    macrocell CPLD. Then the retry on a 72 macrocell CPLD wasn't exhaustive
    enough: the fitter became a bit lazy and left the 'space optimisation' at
    45 macrocells, since it normally shouldn't matter anyway when we take the
    whole 72 macrocell CPLD in one single piece?

    The bottom line is, maybe I shouldn't worry so much:
    1. Under ordinary circumstances, an 'optimisation loss' at the level of
    35 macrocells vs. 36 macrocells (-> 2-3%) is not really a big deal.
    2. For real-world production, I shouldn't leave designs with just 2-3%
    of space left, since after-sale usage may come back with modification
    requests that will not fit into the remaining 3% and thus may require a
    board redesign instead of just a flash upgrade.

    -R
    Rafal Pietrak, Apr 12, 2006
    #5
  6. Rafal Pietrak wrote:

    > The bottom line is, maybe I shouldn't worry so much:


    Yes, getting a design simulating correctly is priority one.
    With working code, I can compare the fit of many devices quickly.

    > 1. Under ordinary circumstances, an 'optimisation loss' at the level of
    > 35 macrocells vs. 36 macrocells (-> 2-3%) is not really a big deal.


    Yes, unless the structure is
    duplicated many times, just let it go.

    > 2. for real world production, I shouldn't leave designs with just 2-3%
    > space left, since after sale usage may come back with modification
    > requests that will not fit into remaining 3% space and thus may require
    > board redesign instead of just flash upgrade.


    Well said.
    If I'm down to my last 2%, I'm on thin ice.
    I work ten times as hard to get a fragile fit,
    and duplicate this effort with each design change.

    -- Mike Treseler
    Mike Treseler, Apr 12, 2006
    #6
