What is the basis on which flip-flops can be replaced by latches?


Weng Tianxiang

Hi,
I finally understand the reason why a flip-flop can be replaced by a
latch.

Here is the excerpt from the paper "Atom Processor Core Made FPGA
Synthesizable"
Optimized for a frequency range from 800 MHz to 1.86 GHz,
the original Atom design makes extensive use of latches
to support time borrowing along the critical timing paths.
With level-sensitive latches, a signal may have a delay larger
than the clock period and may flush through the latches
without causing incorrect data propagation, whereas the delay
of a signal in designs with edge-triggered flip-flops must
be smaller than the clock period to ensure the correctness of
data propagation across flip-flop stages [3]. It is well known
that the static timing analysis of latch-based pipeline designs
with level-sensitive latches is challenging due to two
salient characteristics of time borrowing [2, 3, 14]: (1) a
delay in one pipeline stage depends on the delays in the previous
pipeline stage. (2) in a pipeline design, not only do
the longest and shortest delays from a primary input to a
primary output need to be propagated through the pipeline
stages, but also the critical probabilities that the delays on
latches violate setup-time and hold-time constraints. Such
high dependency across the pipeline stages makes it very
difficult to gauge the impact of correlations among delay
random variables, especially the correlations resulting from
reconvergent fanouts. Due to this innate difficulty, synthesis
tools like DC-FPGA simply do not support latch analysis
and synthesis correctly."

In short, a pipeline with several FFs can be replaced with a pipeline
with two FFs at the ends and normal latches inserted between them to
steal time slack.

FF1 ---> FF2 ---> FF3 ---> FF4
FF1 ---------> L2 ---------> L3 ---> FF4
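
The time-borrowing mechanism in the diagram can be sketched numerically. Below is a toy Python model (the period, stage delays, and half-period transparency window are illustrative assumptions, not figures from the paper, and setup/hold times are ignored): with edge-triggered FFs every stage delay must fit in one period on its own, while a latch that stays transparent for half a period lets a late signal borrow slack from the next stage.

```python
# Time borrowing through level-sensitive latches: a toy model.
# Latch i opens at i * T and stays transparent for half a period,
# so a late signal may borrow up to T/2 from the following stage.
# With edge-triggered flip-flops, every stage delay must fit in one
# clock period on its own.

def check_ff_pipeline(delays, period):
    # Each combinational stage must meet setup at the next FF edge.
    return all(d <= period for d in delays)

def check_latch_pipeline(delays, period):
    # delays[:-1] end at latches; the last delay ends at the final FF.
    t = 0.0  # data launched from the first FF at t = 0
    for i, d in enumerate(delays[:-1], start=1):
        t += d
        if t > i * period + period / 2:  # missed the latch closing edge
            return False
        t = max(t, i * period)           # early data waits for the open edge
    t += delays[-1]
    return t <= len(delays) * period     # setup at the final FF edge

T = 10.0
delays = [12.0, 7.0, 9.0]   # FF1->L2, L2->L3, L3->FF4 stage delays

print(check_ff_pipeline(delays, T))     # False: 12 ns > 10 ns period
print(check_latch_pipeline(delays, T))  # True: latches absorb the overrun
```

Here the 12 ns first stage fails the flip-flop check against a 10 ns clock, but passes through the latch pipeline because the downstream stages are fast enough to repay the borrowed time.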

I had seen the circuits before but had not realized what the basic
reason was. From the above paper, I now know that the technique is
not new; it originated in the 1980s.

Weng
 

Patrick Maupin

Yes, latch-based design is much older than flop-based design, for the
simple reason that it can be cheaper. Think about it -- every flop is
really two latches! (At least for static designs that can be clocked
down to DC...) Where I work (at a chip company), we're still
occasionally converting latch-based designs into flop-based ones.

But (and this is a big but) FPGAs themselves (not just the design
tools) are designed for flop-based design, so if you use latch-based
designs with FPGAs you are not only stressing the timing tools, you
are also avoiding the nice, packaged, back-to-back dedicated latches
they give you called flops.

Pat
 

glen herrmannsfeldt

In comp.arch.fpga Patrick Maupin said:
Yes, latch-based design is much older than flop-based design, for the
simple reason that it can be cheaper. Think about it -- every flop is
really two latches! (At least for static designs that can be clocked
down to DC...) Where I work (at a chip company), we're still
occasionally converting latch-based designs into flop-based ones.

Often using a two (or more) phase clock. Some latches work on
one phase, some on the other. With appropriately non-overlapping
phases, one avoids race conditions and the timing isn't so hard to get right.
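
The non-overlapping two-phase scheme can be illustrated with a toy waveform check (the duty cycles and period below are made-up numbers): if the two phases are never simultaneously high, no phase-1 latch and phase-2 latch are ever transparent at the same time, so data cannot race through two consecutive latches in one pass.

```python
# Non-overlapping two-phase clock, as a toy waveform check.
# phi1 is high in [0, 0.4T), phi2 in [0.5T, 0.9T); the gaps are the
# non-overlap margin.  Duty cycles and period are illustrative.

T = 10.0

def phi1(t):
    return (t % T) < 0.4 * T

def phi2(t):
    return 0.5 * T <= (t % T) < 0.9 * T

# Sample one period finely: the phases are never both high, so no
# phi1 latch and phi2 latch are ever transparent at the same time.
overlap = any(phi1(t / 100.0) and phi2(t / 100.0) for t in range(1000))
print(overlap)  # False
```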
But (and this is a big but) FPGAs themselves (not just the design
tools) are designed for flop-based design, so if you use latch-based
designs with FPGAs you are not only stressing the timing tools, you
are also avoiding the nice, packaged, back-to-back dedicated latches
they give you called flops.

Well, you could use a sequence of FF's, clocking on different clock
edges, or the same edge of two clocks.

That allows for some of the advantages. If there was enough demand,
I suppose FPGA companies would build transparent latch based devices.
(Who remembers the 7475?)

In pipelined processors of years past the Earle latch combined one
level of logic with the latch logic, reducing the latch delay.

-- glen
 

rickman

Hi,
I finally understand the reason why a flip-flop can be replaced by a
latch.

(snip)

In short, a pipeline with several FFs can be replaced with a pipeline
with two FFs at the ends and normal latches inserted between them to
steal time slack.

FF1 ---> FF2 ---> FF3 ---> FF4
FF1 ---------> L2 ---------> L3 ---> FF4

(snip)

Weng

I'm a little unclear on how this works. Is this just a matter of the
outputs of the latches settling earlier when the logic path is faster,
so that the next stage actually has more setup time? This requires
that there be a minimum delay in any given path, so that the correct
data is latched on the current clock cycle while the result for the
next clock cycle is still propagating through the logic. I can see
where this might be helpful, but it would be a nightmare for timing
analysis, mainly because of the wide range of delays with process,
voltage and temperature (PVT). I have been told you need to allow a
2:1 range when considering all three.

I think similar issues are involved when considering async design (or,
more accurately termed, self-timed design). In that design method the
variations in delay affect the timing of both the data path and the
clock path, so they are largely nulled out, and the min delays do not
need to include the full 2:1 range compared to the max. Some amount
of slack time must be allowed so the clock arrives after the data, but
otherwise all the speed of the logic is utilized at all times. This
is also supposed to provide for lower-noise designs, because there is
no chip-wide clock giving rise to simultaneous switching noise. Self-
timed logic does not really result in significant increases in
processing speed because, although the max speed can be higher, an
application can never rely on that faster speed being available. But
for applications where there is optional processing that can be done
using the left-over clock cycles (a poor term in this case, but you
know what I mean) it can be useful.

In the case of using latches in place of registers, the speed gains
are always usable. But can't the same sort of gains be made by
register leveling? If you have logic that is slower than a clock
cycle followed by logic that is faster than a clock cycle, why not
just move some of the slow logic across the register to the faster
logic section?
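
The register-leveling idea can be sketched as a simple balancing problem. The stage delays and period below are illustrative, and this is only a toy model: real retiming moves whole gates, so delay shifts in discrete chunks, and Tco/Tsu are ignored.

```python
# Register leveling (retiming) as a balancing move: shift excess
# delay from a stage that misses timing into a neighbor with slack.

def retime(stages, period):
    s = list(stages)
    for i in range(len(s) - 1):
        excess = s[i] - period       # how badly stage i misses
        slack = period - s[i + 1]    # room left in the next stage
        if excess > 0 and slack > 0:
            move = min(excess, slack)
            s[i] -= move             # logic moved across the register
            s[i + 1] += move
    return s

print(retime([12.0, 6.0], 10.0))  # [10.0, 8.0] -- both stages now fit
```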

Rick
 

Patrick Maupin

Well, you could use a sequence of FF's, clocking on different clock
edges, or the same edge of two clocks.  

I actually did this in Xilinx FPGAs back in 1999. The specific
problem I was solving was an insufficient number of global clocks (a
lot of interconnects with source-based clocking). Xilinx has
solutions for this now (regional clocks), but not back then. So I
used regular interconnect for clocking, and that was very high skew,
so that you couldn't guarantee that the same edge was, in fact, the
same edge for all the flops on the clock.

The solution was to do as you said -- the inputs to every flop were
from flops clocked on the opposite edge. That, and reducing the
amount of logic in that clock domain and clock-crossing to a "real"
clock domain as soon as possible.
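
The opposite-edge trick can be quantified with a back-of-the-envelope sketch (the period and flop parameters below are illustrative, not from any datasheet):

```python
# Opposite-edge launching as a skew guard: data launched at the
# negedge (t = T/2) is captured at the next posedge (t = T), so the
# data changes half a period before the capturing edge, and roughly
# T/2 - Tco - Tsu of clock skew can be tolerated.

T = 20.0             # clock period (ns), illustrative
Tco, Tsu = 1.0, 1.0  # flop clock-to-out and setup, illustrative

margin = T / 2 - Tco - Tsu  # skew tolerance left on the half-cycle path
print(margin)               # 8.0 ns of tolerable skew
```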
 

Patrick Maupin

In the case of using latches in place of registers, the speed gains
are always usable.  But can't the same sort of gains be made by
register leveling?  If you have logic that is slower than a clock
cycle followed by logic that is faster than a clock cycle, why not
just move some of the slow logic across the register to the faster
logic section?

That's a similar technique, to be sure, for speed-gains. But as I
wrote in an earlier post, I think the primary motivation for latch-
based design was originally cost. For example, since each flop is
really two latches, if you are going to have logic which ANDs together
the output of two flops, you could replace that with ANDing the output
of two latches, and outputting that result through another latch, for
a net savings of 75% of the latches.
 

Weng Tianxiang

That's a similar technique, to be sure, for speed-gains.  But as I
wrote in an earlier post, I think the primary motivation for latch-
based design was originally cost.  For example, since each flop is
really two latches, if you are going to have logic which ANDs together
the output of two flops, you could replace that with ANDing the output
of two latches, and outputting that result through another latch, for
a net savings of 75% of the latches.

Your method's target and the target CPU designers have when inserting
latches in a pipeline are totally different.

They use latches because a combinational signal's delay is too long to
fit within one clock cycle but too short to need two clock cycles in
the pipeline, not at any arbitrary place you may want.

Weng
 

John_H

In the case of using latches in place of registers, the speed gains
are always usable.  But can't the same sort of gains be made by
register leveling?  If you have logic that is slower than a clock
cycle followed by logic that is faster than a clock cycle, why not
just move some of the slow logic across the register to the faster
logic section?

Rick

I argued with my coworker for a few days about the benefit of latches
versus registers before I finally realized the advantage of latch
based designs. Not only is granularity less of a problem (e.g., only
able to fit 2 logic delays in a level rather than the maximum 2.8
available, losing nearly 30%) but synchronous delays are different.
Rather than accounting for Tco+Tsu for every register in a chain of a
few clock cycles where register leveling is helpful, only the Tito
transparent latch delay (minus the Tilo LUT delay) needs to be added
for each latch in the chain [using Xilinx timing nomenclature].
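
That accounting can be put into numbers with a small sketch; the values below are illustrative placeholders, not real Xilinx datasheet figures:

```python
# Multi-cycle path budget: each register boundary costs Tco + Tsu,
# while a transparent latch boundary costs only its propagation
# delay.  All values are illustrative.

T = 5.0               # clock period (ns)
cycles = 4            # path spans four clock cycles
Tco, Tsu = 0.6, 0.4   # register clock-to-out and setup
Tito = 0.5            # propagation delay through a transparent latch

# three internal boundaries on a four-cycle path
reg_logic_budget = cycles * T - (cycles - 1) * (Tco + Tsu)
latch_logic_budget = cycles * T - (cycles - 1) * Tito

print(reg_logic_budget)    # 17.0 ns of logic across register stages
print(latch_logic_budget)  # 18.5 ns with latch boundaries instead
```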

I agree that the register based FPGAs are probably designed (and
tested) to minimize Tsu and Tco without strong consideration for Tito
and that the timing analysis is NOT set up to do a good job with
"latch leveled" timing analysis.

When I do use latches (when transferring data between rising/falling
time domains for a fast clock, for instance) I have to specify false
values around the latch for synchronous analysis rather than the
precise values through the latch because the analysis wants to see
registers at each stage even with the proper analysis flag turned on.
If the analyzer would recognize a chain of rise/fall/rise/fall
controlled latches and automatically increase the timing constraint by
a half period for each stage, we'd potentially have a powerful tool at
our disposal. But they don't so we don't. At least not in FPGAs.

- John_H
 

glen herrmannsfeldt

(snip)
I argued with my coworker for a few days about the benefit of latches
versus registers before I finally realized the advantage of latch
based designs. Not only is granularity less of a problem (e.g., only
able to fit 2 logic delays in a level rather than the maximum 2.8
available, losing nearly 30%) but synchronous delays are different.
Rather than accounting for Tco+Tsu for every register in a chain of a
few clock cycles where register leveling is helpful, only the Tito
transparent latch delay (minus the Tilo LUT delay) needs to be added
for each latch in the chain [using Xilinx timing nomenclature].

I would have thought that they were fast enough now for that
not to matter so much. My thought would be that clock skew,
even with the fancy clock distribution system, would be the important
factor.

If the granularity is the problem then you might try clocking
some on rising and some on falling edge (if available) or having
two clocks with known phase difference. That would be especially
true if the DLL's could generate the appropriate clocks.
I agree that the register based FPGAs are probably designed (and
tested) to minimize Tsu and Tco without strong consideration for Tito
and that the timing analysis is NOT set up to do a good job with
"latch leveled" timing analysis.
When I do use latches (when transferring data between rising/falling
time domains for a fast clock, for instance) I have to specify false
values around the latch for synchronous analysis rather than the
precise values through the latch because the analysis wants to see
registers at each stage even with the proper analysis flag turned on.
If the analyzer would recognize a chain of rise/fall/rise/fall
controlled latches and automatically increase the timing constraint by
a half period for each stage, we'd potentially have a powerful tool at
our disposal. But they don't so we don't. At least not in FPGAs.

That sounds useful. If it gets popular enough, maybe they
will add it.

-- glen
 

John_H

Rather than accounting for Tco+Tsu for every register in a chain of a
few clock cycles where register leveling is helpful, only the Tito
transparent latch delay (minus the Tilo LUT delay) needs to be added
for each latch in the chain [using Xilinx timing nomenclature].

I would have thought that they were fast enough now for that
not to matter so much.  My thought would be that clock skew,
even with the fancy clock distribution system, would be the important
factor.

Clock skew becomes entirely unimportant in the latch scheme as I know
it unless CLK and CLK180 are used instead of normal and inverted
versions of the same clock. The latches are explicitly alternated
posedge/negedge/posedge/negedge, effectively decomposing a conceptual
register into its two latches and balancing the logic between them.
For clock skew to be an issue, two consecutive latches would have to
be transparent long enough for the logic path plus delays to sneak
through; that won't happen when using the normal and invert of the
*same* clock net unless things are very, very wrong in the latch
design.
If the granularity is the problem then you might try clocking
some on rising and some on falling edge (if available) or having
two clocks with known phase difference.  That would be especially
true if the DLL's could generate the appropriate clocks.

Some... registers? Using the posedge and negedge in a registered
arrangement would simply exacerbate the granularity problem, able to
fit fewer whole delays into the same clock period by dividing the
logic into two phases. The latches allow longer delays to move the
valid data further toward the end of the transparent window and
shorter delays to move it back, always with the safeguard that data
for the next (half) cycle isn't allowed to be valid any sooner than
the front edge of the transparent window.

The description comes out a little muddy which is why it took me a few
days to buy in to the whole concept. It's sweet! It just takes some
timing diagrams and head scratching. And it's certainly not set up
for proper analysis especially in the Xilinx tools where I
experimented with the phase domain changes.

- John_H
 

Patrick Maupin

Your method's target and the target CPU designers have when inserting
latches in a pipeline are totally different.

They use latches because a combinational signal's delay is too long to
fit within one clock cycle but too short to need two clock cycles in
the pipeline, not at any arbitrary place you may want.

I was agreeing with rickman that in many cases, register retiming can
achieve similarly satisfactory results, while pointing out there were
originally other reasons besides timing to use latches.

I agree that latches are used for speed reasons, as well as cost
reasons. But, as the paper you cite points out, the timing tools
aren't very good at analyzing the speed, and I don't know about the
specifics of the atom, but these days, if a chip designer wants
something that goes faster, he'll just as often use some domino logic
on a few paths rather than using simple latches -- same concept but
even more complicated.

In any case, you have to get your timing information somewhere -- a
latch really is just half a flop, and you have to decide when to close
it, so often you're either doing some fancy self-timing or your local
clock tree gets a lot more complicated when you are doing the
described time-borrowing.

Regards,
Pat
 

Patrick Maupin

The description comes out a little muddy which is why it took me a few
days to buy in to the whole concept.  It's sweet!  It just takes some
timing diagrams and head scratching.  And it's certainly not set up
for proper analysis especially in the Xilinx tools where I
experimented with the phase domain changes.

It's not just FPGA tools. Many of the high-end chip tools don't
support this very well, and to do it you need a PhD in the tool.
 

John_H

It's not just FPGA tools.  Many of the high-end chip tools don't
support this very well, and to do it you need a PhD in the tool.

The sad thing is it *shouldn't* be difficult. For each stage of latch
traversed with an opposite clock edge, one more half cycle is added to
the overall timing spec for the path. By analyzing up to each stage,
a logic delay short enough to change the input of a latch that's still
not transparent starts the timing path fresh from this intermediate
latch.
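
That rule is indeed easy to state in code; here is a minimal sketch of the half-cycle-per-stage accounting (a toy formula, not a real STA engine):

```python
# Each opposite-edge transparent latch traversed adds half a period
# to a path's overall timing spec.

def path_spec(period, latches_traversed):
    return period + latches_traversed * period / 2.0

print(path_spec(8.0, 0))  # 8.0  -- plain register-to-register path
print(path_spec(8.0, 2))  # 16.0 -- through two alternating latches
```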

It's such a "pretty" cascade of logic delays that I have to research
what you mean by "domino logic" to make sure we're not talking about
the same thing. It truly would be simple to analyze, no PhD required.
 

Weng Tianxiang

The sad thing is it *shouldn't* be difficult.  For each stage of latch
traversed with an opposite clock edge, one more half cycle is added to
the overall timing spec for the path.  By analyzing up to each stage,
a logic delay short enough to change the input of a latch that's still
not transparent starts the timing path fresh from this intermediate
latch.

It's such a "pretty" cascade of logic delays that I have to research
what you mean by "domino logic" to make sure we're not talking about
the same thing.  It truly would be simple to analyze, no PhD required.

John_H,
Read this paper first, then make your conclusion.

"Timing Verification and Optimal Clocking of Synchronous Digital
Circuits", published by 3 professors in University of Michigan in
1990, known as SMO algorithm.

Weng
 
