Describing pipelined hardware

  • Thread starter Jonathan Bromley

KJ

Kim Enkovaara said:
The problem is to make the functional description clear and accurate enough that
it describes all the functionality in the pipeline. At least in my
opinion, tracking dependencies between pipeline stages is hard.

I'll certainly agree that tracking dependencies between pipeline stages is
hard...so don't design it that way. Use a good flow control specification
that scales well all the way up and down as your basis for getting things in
and out of sub-blocks, and I think you'll find that darn near all of your
dependency tracking is no longer required.

Those descriptions may not be as terribly hard to produce as you imagine.
Keeping the boundaries in mind, you would be describing the function of your
'Stage 3' only in terms of the inputs to stage 3, and the function of
'Stage 25' only in terms of the inputs to stage 25. Using a common flow
control specification (i.e. Avalon, Wishbone, etc.) as the basis for getting
stuff in and out of each block, you would not really need to talk about flow
control of any of the stages other than to cite (and then of course design
to) that flow control specification.
By dependencies I mean something like this, as an example:

"Stages 3 and 25 share a common memory and stalls in the pipeline are
not allowed. When the format of the incoming data is known, we know that
when s3 is accessing the memory, s25 has propagated data that doesn't need
that access".
OK, now if all the stages are designed to adhere to, say, Altera's Avalon
specification (as an example, not a sales pitch), then both stage 3 and stage
25 would be designed with a master interface for accessing the slave memory,
and if your above statement is true you would simply find that the stage 3
read/write output signals never happen to be set at the same time as the
stage 25 read/write output signals. That being the case, one can:
- Simply add an assert to validate during simulation that this condition is
never violated (a minimal sketch follows this list).
- Or, detect and report the condition in a status bit.
- Or, cover yourself, realize that stage 3 and stage 25 are competing for
a shared resource, and add a simple arbiter.
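
For example, a simulation-only check of that assumption could be as small as
this sketch (s3_read/s3_write/s25_read/s25_write are hypothetical stand-ins
for the real Avalon master signals):

-- Simulation-only sanity check: stages 3 and 25 are assumed never to
-- access the shared memory in the same clock cycle.  Signal names are
-- hypothetical; substitute the real master interface signals.
CheckNoSimultaneousAccess : process (clock)
begin
    if rising_edge(clock) then
        assert not (((s3_read = '1') or (s3_write = '1')) and
                    ((s25_read = '1') or (s25_write = '1')))
            report "Stage 3 and stage 25 hit the shared memory in the same cycle"
            severity error;
    end if;
end process;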

If you did the design with this approach, you'd find that while you're
working on getting the stage 3 functionality up you would not need to know
or care about stage 25 (or any other stage). The same can be said for stage
25. When it comes time to write the logic that ties them all together (for
the most part the 'logic' is simply connecting the outputs of one stage to
the inputs of the next), the simple arbiter that you would need would cost at
most a single logic cell.
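
To put a number on 'simple': under the assumption that simultaneous requests
never occur, a fixed-priority arbiter for the two masters reduces to a couple
of gates. A sketch only, with hypothetical signal names and the data-path
multiplexing shown just for the address:

-- Stage 3 wins if both masters somehow request in the same cycle; per the
-- assumption above, that should never happen.  A real arbiter would also
-- drive the losing master's 'wait' input.  All names are placeholders.
s3_grant    <= s3_request;
s25_grant   <= s25_request and not s3_request;

mem_read    <= (s3_read  and s3_grant) or (s25_read  and s25_grant);
mem_write   <= (s3_write and s3_grant) or (s25_write and s25_grant);
mem_address <= s3_address when s3_grant = '1' else s25_address;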
Now suppose a change is needed and one pipeline stage is added.
What are all the dependencies that also have to change, what are
the new hazards, does the new condition add new hazards?
If your above statement about stages 3 and 25 and the memory is still
inherently true, then I don't need to think about any changes in
dependencies. I simply design the new stage X to the same flow control
specification, connect it up, and everything works out just fine. If adding
the new stage possibly alters the relationship between stages 3 and 25 and
the memory, then I've already designed in the arbiter, so the only thing to
consider is whether the arbiter now needs some buffering on the data path
from each stage to avoid stalling the input to the pipeline.
Those dependencies are really hard to handle.
Then don't do it the hard way ;)
Formal model checking can
be a good tool to prove that hazards are not possible given the constraints
on the incoming data.

I've never happened to use them though. How good are they in practice and
how much work are they to use?
Of course there can be error conditions, and
the design must get over them and heal itself, or at least indicate that
a pipeline reset is needed.

A pure dataflow pipeline without dependencies is not so hard to
document; it is just defined transactions between stages.

So then I think we agree that documenting the function of the sub-blocks is
not the issue; it is the dependency documentation.

But if the dependencies you're tracking are all flow control related, then
design everything to a good flow control specification that is designed to
handle stalls. If your requirement is for no stalls, and you've done your
design properly, then all that will happen is that the 'wait' input to the
start of the pipeline will never get set, even though the 'wait' input to
some of the individual sub-blocks might. The overhead in logic for using
something like Avalon is almost non-existent, and the headaches avoided by
not having to worry about how stages 3 and 25 and the memory all work
together are worth it.
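
As a concrete illustration of how small that overhead is, here is a sketch of
one pipeline stage wrapped in a generic valid/wait handshake (Avalon-ST-like
back-pressure). The signal names are hypothetical, and the stage's real
function is reduced to a pass-through:

-- One stage of a stallable pipeline.  Signal names are hypothetical.
StageN : process (clock)
begin
    if rising_edge(clock) then
        if wait_in = '0' then          -- downstream is able to accept data
            data_out  <= data_in;      -- the stage's real function goes here
            valid_out <= valid_in;
        end if;
    end if;
end process;
wait_out <= wait_in;                   -- propagate back-pressure upstream

If no sub-block ever asserts its 'wait' output, the pipeline never stalls and
the handshake logic should synthesize away to very little.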

Even if there are other dependencies outside of flow control to be
considered, I'm betting that flow control related dependencies are actually
far and away the biggest thing.
Also, if
the stages can stall sometimes, and there are FIFOs to handle that,
then hazard handling becomes much easier.
If the individual stages can possibly stall but the overall function cannot,
then there had better be some FIFOs somewhere ;)

Kevin Jennings
 

Ben Jones

KJ said:
Minor error in the equation for 'c' in the 'Y' process,

You're right, of course - whoops! I had an async interrupt half-way through
writing that post and I obviously suffered some stack corruption on the way
back.
but simple enough to try on a few different tools...

That would be really interesting - do post your results if you can.

You might also try:

Z: process (clock)
begin
    if rising_edge(clock) then
        c <= (d and a and b) or (c and not (a and b));
    end if;
end process;

(Almost certainly just a 4-input function generator plus register).

Oh, and apologies to anyone reading this on comp.lang.verilog and wondering
why some weirdo keeps posting code snippets in a superior HDL ;-) I'm sure
the equivalent Verilog constructs would be treated similarly by the
synthesizer.

Cheers,

-Ben-
 

Andy

Weird synthesis tricks...

"Hold my beer and watch this!"

process (clk)
    variable a, b : std_logic;
begin
    if rising_edge(clk) then
        a := b;
        b := input;
        out1 <= a xor b; -- registered xor of combo a, b
    end if;
    out2 <= a xor b; -- combo xor of registered a, b
end process;

Both out1 and out2 simulate _exactly_ the same (including down to the
delta cycle).

If I comment out the out1 assignment, out2 is a combinatorial xor of
registered a and b values.

If I comment out the out2 assignment, out1 is a registered xor of
combinatorial values for a and b (i.e. b and input).

If I leave both in, Synplicity recognizes them as being the same, and
makes both of them share the registered xor implementation from out1.
Note that retiming was not turned on for this exercise, and there was
no mention of retiming out2.

Andy
 

Mike Treseler

Andy said:
Weird synthesis tricks...
"Hold my beer and watch this!"

Sorry, I got distracted and finished it.
process (clk)
    variable a, b : std_logic;
begin
    if rising_edge(clk) then
        a := b;
        b := input;
        out1 <= a xor b; -- registered xor of combo a, b
    end if;
    out2 <= a xor b; -- combo xor of registered a, b
end process;

Both out1 and out2 simulate _exactly_ the same (including down to the
delta cycle).
If I comment out the out1 assignment, out2 is a combinatorial xor of
registered a and b values.

To maintain compatibility with my a_rst template,
I keep all logic inside the main IF
and only wires outside. This also eliminates
the possibility of unregistered outputs.

I would code your example as:
....
    variable a_v, b_v, out_v : std_logic;
begin
    if rising_edge(clk) then
        a_v := b_v;
        b_v := input; -- expect input-[dq]-
        out_v := a_v xor b_v;
    end if;
    out <= out_v;
end process;

-- Mike Treseler
 

Kim Enkovaara

KJ said:
OK, now if all the stages are designed say to adhere to Altera's Avalon
specification (as an example, not a sales pitch) then both stage 3 and stage
25 would be designed with a master interface for accessing the slave memory
and if your above statement is true then you would simply find that the
stage 3 read/write output signals do not happen to be set at the same time
as the stage 25 read/write output signals. That being the case, one can
- Simply add an assert to validate during simulation that this condition is
never violated.

That is too late a stage to detect that kind of problem if the violation should
not be there. It should have been detected already at the documentation phase.
At the simulation stage the code is already written; if assumptions in the
documentation were incorrect, the block needs to be recoded, and that
translates to a slip in the schedule.

Also it is hard to be sure that there is 100% coverage in the simulation.
Of course assertion+formal check handles that side, if the tools can handle
the block.
- Or, detect and report the condition in a status bit

Reporting it in a status bit does not help. Maybe the implementation is in an
ASIC and there is no way to fix it anymore. One worrying trend with FPGAs is
sloppiness of design coming from the SW side: "Let's just code this quickly,
and why should we simulate, we can test this in the lab, we can always update
the image."
- Or, cover yourself and realize that stage 3 and stage 25 are competing for
a shared resource and add a simple arbiter.

And if arbitration is needed, one of the stages has to stall, and that has to
be handled with buffering. Then we run into the interesting question of
stall probabilities and the needed buffer sizes, etc.
If you did the design with this approach, you'd find that while you're
working on getting the stage 3 functionality up you would not need to know
or care about stage 25 (or any other stage). Same can be said for stage 25.
When it comes time to writing the logic that ties them all together (for the
most part the 'logic' is simply connecting the outputs of one stage to the
inputs to the next) the simple arbiter that you would need would cost at
most a single logic cell.

Plus the buffer memory to store the data during stalls if memory access is
arbitrated. And the system-level simulations of the performance penalties.
Then don't do it the hard way ;)

Sometimes there is no way to code the functionality without resorting to
very tight control of the pipeline and its resources.
I've never happened to use them though. How good are they in practice and
how much work are they to use?

The tools are usable nowadays, but the design style matters quite a lot. If
the state is stored in big memories the tools are quite bad, but if the
amount of state information is small and kept in FFs the tools have an easier
time. Model checkers are still purely block-level tools in terms of capacity.

The constraints for the design can also be problematic to write, and the
initial state might be hard to get right with model checkers. Some tools
address that area nowadays with hybrid approaches: they have a simulator
engine and formal tools integrated, and the formal tool runs forward from the
simulated states.
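
For what it's worth, when the state lives in FFs the properties and the
environment constraints can be quite small. A sketch in PSL embedded in VHDL
comments (the signal names and the particular constraint are hypothetical,
just to show the flavour):

-- psl default clock is rising_edge(clock);
-- psl assume always (start_of_packet = '1' -> s25_mem_req = '0');  -- constraint from the incoming data format (hypothetical)
-- psl assert never (s3_mem_req = '1' and s25_mem_req = '1');       -- the hazard to be proven impossible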

--Kim
 

KJ

Kim Enkovaara said:
That is too late a stage to detect that kind of problem if the violation
should not be there. It should have been detected already at the
documentation phase. At the simulation stage the code is already written; if
assumptions in the documentation were incorrect, the block needs to be
recoded, and that translates to a slip in the schedule.
Well, personally I don't think it's ever too late to detect a
problem....that's the first step in fixing it. And that would be the whole
point of adding the minimalist assert....to detect the incorrect assumption
that was not caught until simulation. Otherwise, what mechanism do you use
to 'catch' it in simulation? Looking at the waveforms? Adding the assert
to verify that something you know to be true really is true is the way to
go here.

As for the recoding, in this case it would not be either of the individual
blocks that needs recoding (if you had followed my approach of using a
standardized I/O model for each block's interfaces) but the interconnect
logic that interfaces stages 3 and 25 to the memory....in other words, the
arbitration logic to the memory. The error is not that stages 3 and 25 need
access to a shared resource; the error was believing that they don't happen
to need simultaneous access to the memory and then basing design decisions
on that....and getting burned.
And if arbitration is needed, one of the stages has to stall, and that has to
be handled with buffering. Then we run into the interesting question of
stall probabilities and the needed buffer sizes, etc.
Well, the statement that you gave early on as an example of 'complex'
dependencies said that this won't happen, so in theory an arbiter wouldn't be
needed; my point here was that you could cover the case where you discover
late that it really does happen by having the arbiter. In other words,
recognize the shared-resource architecture right up front and design for it.

As for the stall probabilities and needed buffer size, I agree, figuring
that out is a normal part of designing any circuit that has multiple masters
competing for a shared resource, as it is in the example that you posed.
Better to accept that this is what you have right up front and deal with it
properly. You seem to be implying that perhaps with a 'clever' design you
can avoid having the stages compete for the memory at the same time, and
perhaps in certain situations that is true, but doing so generally adds
undue risk (i.e. the uncaught condition that isn't found until too late)
and is certainly not something I would recommend for an ASIC design, where
the cost of fixing it later is far larger than with an FPGA. The need to
possibly stall and/or figure out buffer sizes is not a consequence of the I/O
model; it is a result of having two stages compete for a shared resource,
which was a given from your example.
Plus the buffer memory to store the data during stalls if memory access is
arbitrated. And the system-level simulations of the performance penalties.
Not sure what your point here is. If the basic architecture requires stages
3 and 25 to access a shared resource (i.e. memory in this case) then you
have to arbitrate between the two. The a priori knowledge that the stages
'shouldn't need' simultaneous access would simply mean that the arbiter
itself would not need to be very fancy at all.
Sometimes there is no way to code the functionality without resorting to
very tight control of the pipeline and its resources.
And standardizing on a good and scalable I/O model for getting data in and
out will not hinder that in any way.

Kevin Jennings
 

KJ

Re-labelled your three proposals as 'W', 'X' and 'Y' and added yet
another form, 'Z'. Of the four forms used as input, two different forms
popped out after synthesis when varying the tool and the targeted device.

Forms 'W', 'X' and 'Z' as input always got synthesized to a netlist of the
form of 'W'.

Form 'Y' tended to get synthesized to the structure written as 'Z' (in fact
that is why I added 'Z', to make it easier to describe the results, even
though using 'Z' as input always produced 'W' as a result...go figure).
Sometimes the tools did actually see that 'Y' really is equivalent to
'W' and implemented it in that fashion.

Form 'Z' as written implies two flip-flops, and using Synplify targeting
Spartan 3, that is what popped out. All the other times that form 'Y'
produced a netlist of the form 'Z', the tool was able to figure out
that only one flip-flop was needed.

Using Synplify 8.1 to target Xilinx Spartan, Virtex, XC3000,
Altera Stratix, Stratix II, Lattice ISPXPGA, Actel PA, or ProASIC3E all
produced the same results: outputs 'c1', 'c2' and 'c4' all implemented
in the form of 'W'; output 'c3' implemented in the form of 'Z'.

Using Synplify 8.1 to target Spartan 3E all four outputs were
implemented in the form of 'W'.

Using Quartus 5.0 to target Altera Stratix, Stratix II, Cyclone or
Cyclone II all four outputs were implemented in the form of 'W'.

I was having trouble getting ISE going so I didn't try using that.

So I guess at this point one can conclude that with Synplify 8.1 the
'style' of the code can affect the synthesis results. One could
also conclude that Quartus 5.0 seems not to let the 'style' affect the
synthesis results. Both statements carry the caveat that I only tried
this over a limited set of devices.

Kevin Jennings

See below for the actual code that I used
------- START OF VHDL ----------
library ieee;
use ieee.std_logic_1164.all;

entity Simple is
    port(
        clock : in  std_ulogic;
        a     : in  std_ulogic;
        b     : in  std_ulogic;
        d     : in  std_ulogic;
        c1    : out std_ulogic;
        c2    : out std_ulogic;
        c3    : out std_ulogic;
        c4    : out std_ulogic);
end Simple;

architecture RTL of Simple is
    signal c2_int         : std_ulogic;
    signal c3_int         : std_ulogic;
    signal c4_int         : std_ulogic;
    signal c4_int_delayed : std_ulogic;
begin
    W : process (clock)
    begin
        if rising_edge(clock) then
            if (a and b) = '1' then
                c1 <= d;
            end if;
        end if;
    end process;

    X : process (clock)
    begin
        if rising_edge(clock) then
            if a = '1' then
                c2_int <= (d and b) or (c2_int and not b);
            end if;
        end if;
    end process;
    c2 <= c2_int;

    Y : process (clock)
    begin
        if rising_edge(clock) then
            c3_int <= (d and a and b) or (c3_int and not (a and b));
        end if;
    end process;
    c3 <= c3_int;

    Z : process (clock)
    begin
        if rising_edge(clock) then
            c4 <= c4_int;
            c4_int_delayed <= c4_int;
        end if;
    end process;
    c4_int <= d when ((a and b) = '1') else c4_int_delayed;
end RTL;
------- END OF VHDL ----------
 

Jonathan Bromley

So, here's my question: When writing pipelined designs,
what do all you experts out there do to make the overall
data and control flow as clear and obvious as possible?

Thanks to all the contributors for some fascinating responses
and insights. At least you've given me some level of
confidence that I'm not missing something painfully
obvious...
--
Jonathan Bromley, Consultant

DOULOS - Developing Design Know-how
VHDL * Verilog * SystemC * e * Perl * Tcl/Tk * Project Services

Doulos Ltd., 22 Market Place, Ringwood, BH24 1AW, UK
(e-mail address removed)
http://www.MYCOMPANY.com

The contents of this message may contain personal views which
are not the views of Doulos Ltd., unless specifically stated.
 

Frank A. Vorstenbosch

Ben said:
For timing diagrams I have a neat web-based tool I wrote myself (based on an
idea I stole shamelessly from Frank Vorstenbosch). It's not actually on the

Some people have no shame.

Frank
 

john

Hi

A trick that will work in some cases (but not all) is to use concurrent
processes rather than a pipeline. So you would create a state machine
for the process, instantiate several of them, and allow them to iterate
concurrently.

Obviously, this will tend to duplicate any costly resources, e.g.
multipliers, but you can separate those out into separate modules and
arrange for accesses to be sequentialised via an arbiter. So you end up
with the control logic in the processes, the critical resources being
out-boarded, and a rather complex dataflow that goes back and forth
between the processes and the resources.

This way, however, the "program logic" of the entire process is
captured in a single state machine implementation, which is about as
clean as you can get for arbitrary algorithms.
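
For illustration, here is a sketch of what one such "worker" might look like,
with the costly multiplier out-boarded behind a request/grant interface. The
entity name, ports, widths and handshake are all assumptions, and the operand
multiplexing onto the shared multiplier is omitted; several of these workers
would be instantiated side by side, with an arbiter sequentialising their
access to the one real multiplier.

library ieee;
use ieee.std_logic_1164.all;

entity worker is
    port(
        clk       : in  std_ulogic;
        start     : in  std_ulogic;
        mul_grant : in  std_ulogic;                      -- from the shared-multiplier arbiter
        mul_p     : in  std_ulogic_vector(15 downto 0);  -- shared multiplier product
        mul_req   : out std_ulogic;                      -- to the arbiter
        result    : out std_ulogic_vector(15 downto 0);
        done      : out std_ulogic);
end worker;

architecture rtl of worker is
    type state_t is (IDLE, WAIT_GRANT, CAPTURE, FINISH);
    signal state : state_t := IDLE;
begin
    -- Hypothetical worker FSM: all names and the handshake are assumptions.
    process (clk)
    begin
        if rising_edge(clk) then
            done <= '0';
            case state is
                when IDLE =>
                    mul_req <= '0';
                    if start = '1' then
                        mul_req <= '1';      -- ask for the shared multiplier
                        state   <= WAIT_GRANT;
                    end if;
                when WAIT_GRANT =>
                    if mul_grant = '1' then  -- our operands reach the multiplier this cycle
                        state <= CAPTURE;
                    end if;
                when CAPTURE =>
                    result  <= mul_p;        -- registered product is now valid
                    mul_req <= '0';
                    state   <= FINISH;
                when FINISH =>
                    done  <= '1';
                    state <= IDLE;
            end case;
        end if;
    end process;
end rtl;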

The really big problem here is that there's no longer a guarantee that
processes will exhibit side effects or complete in the same order in which
you kicked them off.

Just some food for thought.

Cheers, John
 

Ben Jones

Hi, Frank. :)

Frank A. Vorstenbosch said:

Some people have no shame.

Frank

If you have no shame, but you freely admit that you have no shame, does it
really count? :)

FWIW, the main difference was that I used a single image to hold all the
glyphs and some CSS to change the background offset within each table cell,
rather than having a sea of tiny image files. Using the images as
backgrounds, rather than <img>-type objects, has the advantage that you can
add (small) annotations to show the current value of a bus (or what-have-you).

Cheers,

-Ben-
 
