Xilinx WebPack 8.1i "desoptimization"

Rafal Pietrak

Hi,

I'm playing with Tim Böscke's "Minimal 8 Bit CPU" code, which I found
some time ago at: http://www.tu-harburg.de/~setb0209/cpu/ (currently
inaccessible, but maybe Google can reveal other repositories).

My code is now significantly different from the original, but the basic
functionality remains: the CPU has 2 bits of instruction code and 6 bits
of address space.

And all was well until I tried to change the instruction encoding (the
complete source of the CPU is at the tail of the post). The CPU
synthesizes to 35 macrocells of an xc9536xl CPLD.

But when the encoding changes to the "new instruction encoding" listed
below in the source header, Xilinx WebPack is not able to put it in fewer
than 45 macrocells!

The instruction encoding values are used solely in a few comparisons at
the end of the source (marked by the "HERE" comments) - this should be
easily implemented within the product-term block. But it isn't!! And the
encoding change alters the synthesis result dramatically!

Can someone shed some light on possible reasons? I've just started to use
enumerated types wherever 'state decoding' applies, but this example
shows that the synthesizer may not be entirely trusted to synthesize
arbitrary 'code' encoding/decoding logic optimally.

So:
1. Is there an explanation of why the synthesizer runs away on such slight
(and seemingly unimportant) source changes?
2. Is there a way to know that the synthesizer 'fell into disoptimization'?
(This is a *very* simple design, so the problem is easily spotted, yet I
would have had trouble seeing that it existed if I hadn't been lucky
enough to choose the *correct* encoding on the first run.)
3. Is there a way to hint the synthesizer (e.g. with VHDL statements)
towards a 'proper' encoding that leads to optimized synthesis? How would
I know which encoding is 'fine' for the synthesizer, and which isn't?

Any comments appreciated!

-R

---------------------------------------------------------------
--
-- Minimal 8 Bit CPU
--
-- new instruction encoding (for OPCODE):
-- 00 - JCC (branch if carry clear, clear carry)
-- 01 - ADD
-- 10 - NOR
-- 11 - STA
--

library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_unsigned.all;

entity CPU8BIT2 is
  port ( data:     inout std_logic_vector(7 downto 0);
         adress:   out   std_logic_vector(5 downto 0);
         oe, we:   out   std_logic; -- Asynchronous memory interface
         clk, rst: in    std_logic);
end;

architecture CPU_ARCH of CPU8BIT2 is
  signal akku, state: std_logic_vector(8 downto 0); -- akku(8) is carry !
  signal pc: std_logic_vector(5 downto 0);
  alias execute: std_logic is state(8);
  alias opcode: std_logic_vector(1 downto 0) is state(7 downto 6);
  alias adreg: std_logic_vector(5 downto 0) is state(5 downto 0);
begin
  process(clk,rst)
  begin
    if (rst = '0') then
      state <= (others => '0');
      akku <= (others => '0');
      pc <= (others => '0'); -- start execution at memory location 0
    elsif rising_edge(clk) then
      if (execute = '0') then -- instruction fetch
        pc <= adreg + 1;
        state <= '1' & data; -- fetch the instruction/address
      else -- instruction execution
        state <= "000" & pc;
        if (opcode = "00") then
          if (akku(8) = '1') then
            akku(8) <= '0'; -- ... branch NOT taken, just clear CARRY
          else
            pc <= adreg + 1; -- branch taken... fetch instruction there
            state <= '1' & data;
          end if;
        elsif (opcode = "10") then -- HERE!!
          akku <= ("0" & akku(7 downto 0)) + data + akku(8);
        elsif (opcode = "11") then -- HERE!!
          akku(7 downto 0) <= akku(7 downto 0) nor data;
        end if;
      end if;
    end if;
  end process;

  -- combinational logic (.... HERE!! ... changes to OPCODE encoding)
  adress <= adreg;
  data <= "ZZZZZZZZ" when opcode /= "01" else akku(7 downto 0);
  we <= '1' when (clk='1' or opcode /= "01" or rst='0') else '0'; -- state "101" (branch not taken)
  oe <= '1' when (clk='1' or opcode = "01" or rst='0') else '0'; -- no memory access during reset and

end CPU_ARCH;
 
Mike Treseler

Rafal said:
> But when encoding changes to "new instruction encoding" listed below in
> source header, Xilinx WebPack is not able to put it in fewer than 45
> macrocells!

I don't have the before picture, but let's put your code
on the quartus viewer:
http://home.comcast.net/~mike_treseler/cpu8bit2.pdf
__________________________
Total logic elements 50
-- Combinational with no register 26
-- Register only 1
-- Combinational with a register 23
Logic element usage by number of LUT inputs
-- 4 input functions 4
-- 3 input functions 23
-- 2 input functions 19
-- 1 input functions 3
__________________________

Comments:
23 registers bypassed for combinational functions.
No output register.
Gated clock on WE and OE.

> The instruction encoding values are used solely in a few comparisons at
> the end of the source (marked by "HERE" comment) - this should be easily
> implemented within product terms block. But it isn't!! And that encoding
> changes the synthesis result dramatically!

Looks like you added more logic than you thought.

> Can someone shed some light on possible reasons? I've just started to use
> enumerated types wherever 'state decoding' applies, but this example
> shows that the synthesizer may not be entirely trusted with optimal
> synthesis of arbitrary 'codes' encoding/decoding logic.

The synthesizer did what it was told.
You have declared no enumerated type.
The state signal is completely specified.
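
For reference, a minimal sketch of what an enumerated opcode type could look like. The package and literal names (opcode_pkg, I_JCC and so on) are made up for illustration and do not appear in the posted source. The enum_encoding attribute is also an assumption: it is a synthesis attribute described in IEEE 1076.6, but whether a particular tool honours it has to be checked in that tool's manual.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

package opcode_pkg is
  -- Hypothetical names; the posted CPU compares raw 2-bit literals instead.
  type opcode_t is (I_JCC, I_ADD, I_NOR, I_STA);

  -- Assumption: the synthesizer honours enum_encoding. Without this
  -- attribute the tool is free to pick its own encoding for opcode_t.
  attribute enum_encoding : string;
  attribute enum_encoding of opcode_t : type is "00 01 10 11";

  -- Maps the two opcode bits taken from the data bus onto the enumeration,
  -- following the "new instruction encoding" from the source header.
  function to_opcode(bits : std_logic_vector(1 downto 0)) return opcode_t;
end package opcode_pkg;

package body opcode_pkg is
  function to_opcode(bits : std_logic_vector(1 downto 0)) return opcode_t is
  begin
    case bits is
      when "00"   => return I_JCC;
      when "01"   => return I_ADD;
      when "10"   => return I_NOR;
      when others => return I_STA;  -- "11", and any metavalue in simulation
    end case;
  end function to_opcode;
end package body opcode_pkg;
```

With such a package, a comparison in the CPU would read to_opcode(opcode) = I_ADD, and the bit patterns would live in one place only.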

> Any comments appreciated!

Put the changes into the synchronous process instead of
adding a blob of asynchronous logic afterward.
Synchronize the outputs and add pipeline to make use of the
orphaned registers. Run a sim that uses the output ports.
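
As an illustration of the "run a sim that uses the output ports" advice, here is a minimal testbench sketch. It assumes the CPU8BIT2 entity from the first post is compiled into library work. The testbench name, the 100 ns clock period, the all-zero memory contents, the 5 us run time and the active-low reading of oe and we are assumptions made for the sketch, not details taken from this thread.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity cpu8bit2_tb is
end entity cpu8bit2_tb;

architecture sim of cpu8bit2_tb is
  signal data   : std_logic_vector(7 downto 0);
  signal adress : std_logic_vector(5 downto 0);
  signal oe, we : std_logic;
  signal clk    : std_logic := '0';
  signal rst    : std_logic := '0';  -- CPU8BIT2 resets while rst = '0'

  type ram_t is array (0 to 63) of std_logic_vector(7 downto 0);
  -- Placeholder contents (all zeros), not a real program.
  signal ram : ram_t := (others => (others => '0'));
begin
  clk <= not clk after 50 ns;   -- free-running clock, 100 ns period (assumed)
  rst <= '1' after 120 ns;      -- release reset after a little over one cycle

  dut : entity work.CPU8BIT2
    port map (data => data, adress => adress, oe => oe, we => we,
              clk => clk, rst => rst);

  -- Memory read: drive the shared bus only while oe is low (assumed active low).
  data <= ram(to_integer(unsigned(adress))) when oe = '0'
          else (others => 'Z');

  -- Memory write: capture the bus when the write strobe is released.
  write_port : process (we)
  begin
    if rising_edge(we) then
      ram(to_integer(unsigned(adress))) <= data;
    end if;
  end process write_port;

  -- End the run after a fixed time (an arbitrary choice for this sketch).
  stop : process
  begin
    wait for 5 us;
    assert false report "end of simulation" severity failure;
    wait;
  end process stop;
end architecture sim;
```

Because the memory model only sees adress, data, oe and we, any change in the behaviour of those ports between two versions of the CPU shows up in the waveforms or in the final ram contents.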

-- Mike Treseler
 
Rafal Pietrak

> I don't have the before picture, but lets put your code

I have a copy here: http://poczta.homelinux.com/~rafal/cpu8bit2.vhd

But the exact equivalence is not an issue here. (well, it will be in a
moment, just not now).

I have made changes to see if I can 'tell the author's story in my own
words'. I presume that to some extent I could. But this is my code now.
After some significant changes I made to the original source, I've
just altered the opcode *semantics* - and the design suddenly does not fit
into a 36-macrocell CPLD - with this slight change it's 50% larger.

Those changes are literally:
-----------------------
--- ./snap2/tb02cpu2.vhd 2006-04-11 09:48:03.000000000 +0200
+++ ./snap3/tb02cpu2.vhd 2006-04-11 10:31:28.000000000 +0200
@@ -60,9 +60,9 @@
pc <= adreg + 1;
state <= '1' & data;
end if;
- elsif (opcode = "10") then -- add
+ elsif (opcode = "01") then -- add
akku <= ("0" & akku(7 downto 0)) + data + akku(8);
- elsif (opcode = "11") then -- nor
+ elsif (opcode = "10") then -- nor
akku(7 downto 0) <= akku(7 downto 0) nor data;
end if;
end if;
@@ -71,9 +71,9 @@

-- output
adress <= adreg;
- data <= "ZZZZZZZZ" when opcode /= "01" else akku(7 downto 0);
- we <= '1' when (clk='1' or opcode /= "01" or rst='0') else '0';
- oe <= '1' when (clk='1' or opcode = "01" or rst='0') else '0';
+ data <= "ZZZZZZZZ" when opcode /= "11" else akku(7 downto 0);
+ we <= '1' when (clk='1' or opcode /= "11" or rst='0') else '0';
+ oe <= '1' when (clk='1' or opcode = "11" or rst='0') else '0';

end CPU_ARCH;
---------------------------------------------------------------

The source from SNAP2 (lines with a minus sign in front) goes through
synthesize/translate/place&route and fits in 35 macrocells of an xc95*xl,
while SNAP3 (lines with a plus sign in front) synthesizes into 45
macrocells of that same Xilinx CPLD.
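
The diff above touches the same bit patterns in five places. A minimal sketch of keeping them in one place, assuming a hypothetical package cpu_opcodes (none of these names exist in the posted source); the values shown are the SNAP3 ones:

```vhdl
library ieee;
use ieee.std_logic_1164.all;

package cpu_opcodes is
  subtype opcode_bits is std_logic_vector(1 downto 0);

  -- SNAP3 ("new") encoding; the SNAP2 value is noted where it differs.
  constant OP_JCC : opcode_bits := "00";
  constant OP_ADD : opcode_bits := "01";  -- "10" in SNAP2
  constant OP_NOR : opcode_bits := "10";  -- "11" in SNAP2
  constant OP_STA : opcode_bits := "11";  -- "01" in SNAP2
end package cpu_opcodes;
```

The comparisons would then read, for example, elsif (opcode = OP_ADD) then, and switching between the two encodings becomes an edit to three constants rather than to five separate lines.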

> on the quartus viewer:
> http://home.comcast.net/~mike_treseler/cpu8bit2.pdf
> __________________________

Hmmm. This RTL diagram looks cleaner than what I get from WebPack - maybe
I should do these exercises in quartus...

Anyway.

> Looks like you added more logic than you thought.

Yes. And I'm trying to figure out where it popped into existence. To
my naive eye, the changes to the source (opcode encoding, highlighted
above in the form of a diff) are *NOT* significant for product-term based
hardware (like a CPLD). Apparently, they are. Why?

> The synthesizer did what it was told.
> You have declared no enumerated type.

That's correct. I haven't.

> The state signal is completely specified.

Yes. Exactly.

The exercise here is to get a feel for what happens with an
'inappropriate' state encoding. That's why the encoding is stated
explicitly.

When I declare an enumerated type, the synthesizer picks some encoding for
me (I've seen a 4-state FSM being encoded as 2 bits (register + demux) or
as 4 bits (shift register), depending on synthesizer/brand/whatever).

With this example I can see that a bad choice of state encoding results in
significant degradation of the synthesis result.

So I wonder what will happen if my future design is close to 2000 logic
slices (like the spartan family) or more, not just a mere 36 macrocells -
there will be vast room for similar 'disoptimisation' by the synthesizer.

Here, I've learned that it is hard enough to spot such 'disoptimisation'
within a 35 macrocell design. I can't imagine walking through a 200 000
gate RTL schematic in the future. Proving the functionality by correct
simulation will be hard enough. Yet two or three such 'disoptimisations'
within a 200k gate design, and I get a 500k gate design instead - the BOM
is suddenly 50% up.

Still, I'm not quite sure if keeping tight cost constraints on the design
is actually required in a real engineer's life - I'm just learning.

But now, I'd really like to know what actually happened within this design
at hand - why the slight changes highlighted in the diff above
triggered such a vast change in macrocell usage.

-R
 
Mike Treseler

Rafal said:
OK, let's view that one:
http://home.comcast.net/~mike_treseler/cpu8bit.pdf
Yes, the RTL is much simpler for the raw version.

> After some significant changes I did to the original source, I've
> just altered opcode *semantics* - and the design suddenly does not fit
> into 36-macrocell CPLD - with this slight change it's 50% larger.
> Those changes are literally:
> . . .

No. Something else is different.
When I back just *that* out,
I get almost the same usage as your other version:
http://home.comcast.net/~mike_treseler/cpu8bit_undiff.pdf

> With this example I can see that a bad choice of state encoding results in
> significant degradation of the synthesis result.

No. State encoding is normally a small effect.
Something else is going on.
Check your version control logs.

> Here, I've learned, that it is hard enough to spot such 'disoptimisation'
> within a 35 macrocell design. I can't imagine walking through 200 000
> gates RTL schematic in the future.

A very good point.
Spend time on design rules, simulation and source control.
Let synthesis do its job.

> Still, I'm not quite sure if keeping tight cost constraints on the design
> is actually required in real engineer's life.

Unexpected synthesis problems are almost
always due to a missing or unenforced design rule.
Eventually there are no surprises.
Good luck

-- Mike Treseler
 
Rafal Pietrak

> No. State encoding is normally a small effect.
> Something else is going on.
> Check your version control logs.

I double-checked it before the original post. Currently I'm 100% sure
the diffs contain the segment that makes the difference. Surprisingly, the
RTL does not really differ that much.

But while checking this again and again, I noticed the following
warning:
----------------------------------------------
Mapping a total of 36 equations into 2 function blocks...........................................................................ERROR:Cpld:892 - Cannot place signal Madd__n0000__n0001/Madd__n0000__n0001_D2.
Consider reducing the collapsing input limit or the product term limit to
prevent the fitter from creating high input and/or high product term
functions.
Considering device XC9572XL-5-PC44.
Density optimization.........
All timespecs have been ignored. Please select "Use Timing Constraints" in the
implementation options if the timespecs are to be considered.
General global resource optimization........
Re-checking device resources ...
Mapping a total of 45 equations into 4 function blocks...................................
Design CPU8BIT2 has been optimized and fit into device XC9572XL-5-PC44.
---------------------------------------------

So maybe the 'not-so-optimal' state encoding really resulted in a 36
macrocell requirement ..... which somehow mysteriously didn't fit into a
36 macrocell CPLD. But then the reattempt to fit it into a 72 macrocell
CPLD wasn't exhaustive enough: the fitter became a bit lazy and left the
'space optimisation' at 45 macrocells, since it normally shouldn't matter
anyway - we take the whole 72 macrocell CPLD in one single piece?

The bottom line is, maybe I shouldn't worry so much:
1. Under ordinary circumstances, an 'optimisation loss' at the level of
35 vs. 36 macrocells (-> 2-3%) is not really a big deal.
2. For real-world production, I shouldn't leave designs with just 2-3%
space left, since after-sale usage may come back with modification
requests that will not fit into the remaining 3% and thus may require a
board redesign instead of just a flash upgrade.

-R
 
Mike Treseler

Rafal said:
> The bottom line is, maybe I shouldn't worry so much:

Yes, getting a design simulating correctly is priority one.
With working code, I can compare the fit of many devices quickly.

> 1. under ordinary circumstances, 'optimisation loss' at the level of
> 35 macrocells / 36 macrocells (-> 2-3%) is not really a big deal.

Yes, unless the structure is
duplicated many times, just let it go.

> 2. for real world production, I shouldn't leave designs with just 2-3%
> space left, since after sale usage may come back with modification
> requests that will not fit into remaining 3% space and thus may require
> board redesign instead of just flash upgrade.

Well said.
If I'm down to my last 2%, I'm on thin ice.
I work ten times as hard to get a fragile fit,
and duplicate this effort with each design change.

-- Mike Treseler
 
