What is the basis on which flip-flops can be replaced by latches?

Discussion in 'VHDL' started by Weng Tianxiang, Feb 11, 2010.

  1. Hi,
    I finally understand the reason why a flip-flop can be replaced by a
    latch.

    Here is an excerpt from the paper "Intel Atom Processor Core Made
    FPGA-Synthesizable": "Optimized for a frequency range from 800 MHz
    to 1.86 GHz,
    the original Atom design makes extensive use of latches
    to support time borrowing along the critical timing paths.
    With level-sensitive latches, a signal may have a delay larger
    than the clock period and may flush through the latches
    without causing incorrect data propagation, whereas the delay
    of a signal in designs with edge-triggered flip-flops must
    be smaller than the clock period to ensure the correctness of
    data propagation across flip-flop stages [3]. It is well known
    that the static timing analysis of latch-based pipeline designs
    with level-sensitive latches is challenging due to two
    salient characteristics of time borrowing [2, 3, 14]: (1) a
    delay in one pipeline stage depends on the delays in the previous
    pipeline stage. (2) in a pipeline design, not only do
    the longest and shortest delays from a primary input to a
    primary output need to be propagated through the pipeline
    stages, but also the critical probabilities that the delays on
    latches violate setup-time and hold-time constraints. Such
    high dependency across the pipeline stages makes it very
    difficult to gauge the impact of correlations among delay
    random variables, especially the correlations resulting from
    reconvergent fanouts. Due to this innate difficulty, synthesis
    tools like DC-FPGA simply do not support latch analysis
    and synthesis correctly."

    In short, a pipeline built from several FFs can be replaced by a
    pipeline with FFs at the two ends and transparent latches inserted
    between them to borrow time slack across stages:

    FF1 ---> FF2 ---> FF3 ---> FF4
    FF1 -------> l2 --------> l3 --> FF4
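To make the time-borrowing idea concrete, here is a rough Python model of the second pipeline (a sketch with made-up numbers; a real analysis must also handle setup/hold margins, clock skew, and PVT variation). Each internal latch is assumed transparent for the half cycle after its nominal capture edge, so a slow stage can run past the edge as long as the borrowed time is repaid downstream:

```python
def latch_pipeline_ok(stage_delays, period):
    """Timing check for FF1 -> l2 -> ... -> FFn with transparent latches.

    Latch i (1-based) is modeled as transparent from i*period to
    (i + 0.5)*period, so data may arrive up to half a cycle "late" at
    each latch; the borrowed time must be repaid before the final
    flip-flop's capture edge.  Setup/hold margins are ignored.
    """
    arrival = 0.0                            # data leaves FF1 at t = 0
    for i, delay in enumerate(stage_delays[:-1], start=1):
        arrival += delay                     # through stage-i logic
        if arrival > (i + 0.5) * period:     # latch i already opaque
            return False
        arrival = max(arrival, i * period)   # latch holds data until it opens
    return arrival + stage_delays[-1] <= len(stage_delays) * period


def ff_pipeline_ok(stage_delays, period):
    """Edge-triggered version: every stage must fit in one period."""
    return max(stage_delays) <= period
```

With stage delays of 13, 7 and 8 ns and a 10 ns clock, the flip-flop pipeline fails (13 > 10), but the latch pipeline passes because the 3 ns borrowed in the first stage is repaid by the 7 ns stage.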

    I had seen such circuits before but never realized what the basic
    reason was. From the above paper, I now know that the technique is
    not new; it originated in the 1980s.

    Weng Tianxiang, Feb 11, 2010

  2. Yes, latch-based design is much older than flop-based design, for the
    simple reason that it can be cheaper. Think about it -- every flop is
    really two latches! (At least for static designs that can be clocked
    down to DC...) Where I work (at a chip company), we're still
    occasionally converting latch-based designs into flop-based ones.

    But (and this is a big but) FPGAs themselves (not just the design
    tools) are designed for flop-based design, so if you use latch-based
    designs with FPGAs you are not only stressing the timing tools, you
    are also avoiding the nice, packaged, back-to-back dedicated latches
    they give you called flops.

    Patrick Maupin, Feb 12, 2010

  3. Latch-based designs often use a two (or more) phase clock: some
    latches work on one phase, some on the other. With appropriately
    non-overlapping phases, one avoids race conditions and the timing
    isn't so hard to get right.
    Well, you could use a sequence of FF's, clocking on different clock
    edges, or the same edge of two clocks.
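Patrick's point that every flop is really two latches, and the role of non-overlapping phases, can be sketched together in a few lines of Python (a behavioral toy, not synthesizable code; all the names are my own):

```python
class Latch:
    """Level-sensitive latch: q follows d while the enable is high."""
    def __init__(self):
        self.q = 0

    def step(self, d, enable):
        if enable:
            self.q = d
        return self.q


# A flip-flop is just a master latch and a slave latch clocked on
# opposite, non-overlapping phases.
master, slave = Latch(), Latch()

def ff_step(d, phi1, phi2):
    """phi1 and phi2 must never be high at the same time; if they
    overlap, d races straight through both latches in one step."""
    return slave.step(master.step(d, phi1), phi2)
```

Driving phi1 and then phi2 produces edge-triggered behavior: the new value appears at the output only on the second phase.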

    That allows for some of the advantages. If there was enough demand,
    I suppose FPGA companies would build transparent latch based devices.
    (Who remembers the 7475?)

    In pipelined processors of years past the Earle latch combined one
    level of logic with the latch logic, reducing the latch delay.

    -- glen
    glen herrmannsfeldt, Feb 12, 2010
  4. rickman Guest

    I'm a little unclear on how this works. Is this just a matter of the
    outputs of the latches settling earlier if the logic path is faster so
    that the next stage actually has more setup time? This requires that
    there be a minimum delay in any given path so that the correct data is
    latched on the current clock cycle while the result for the next clock
    cycle is still propagating through the logic. I can see where this
    might be helpful, but it would be a nightmare to analyze in timing,
    mainly because of the wide range of delays with process, voltage and
    temperature (PVT). I have been told you need to allow 2:1 range when
    considering all three.

    I think similar issues are involved when considering async design (or
    more accurately termed self-timed). In that design method the
    variations in delay affect the timing of both the data path and the
    clock path, so they largely cancel out, and the min delays do not
    need to include the full 2:1 range relative to the max. Some amount
    of slack time must be given so the clock arrives after the data, but
    otherwise all the speed of the logic is utilized at all times. This
    also is supposed to provide for lower noise designs because there is
    no chip wide clock giving rise to simultaneous switching noise. Self-
    timed logic does not really result in significant increases in
    processing speed because although the max speed can be faster, an
    application can never rely on that faster speed being available. But
    for applications where there is optional processing that can be done
    using the left over clock cycles (poor term in this case, but you know
    what I mean) it can be useful.

    In the case of using latches in place of registers, the speed gains
    are always usable. But can't the same sort of gains be made by
    register leveling? If you have logic that is slower than a clock
    cycle followed by logic that is faster than a clock cycle, why not
    just move some of the slow logic across the register to the faster
    logic section?
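rickman's register-leveling suggestion can be sketched as a small search (brute force over register placements; `unit_delays` is a hypothetical ordered chain of indivisible logic elements). It also shows the granularity limit: because logic elements can't be split, the best achievable period can exceed the ideal total/stages.

```python
from itertools import combinations

def best_period_after_leveling(unit_delays, n_stages):
    """Slide pipeline registers between indivisible logic elements to
    minimize the worst stage delay (brute force; fine for short chains)."""
    n = len(unit_delays)
    best = sum(unit_delays)                  # worst case: one giant stage
    for cuts in combinations(range(1, n), n_stages - 1):
        bounds = (0, *cuts, n)
        worst = max(sum(unit_delays[a:b])
                    for a, b in zip(bounds, bounds[1:]))
        best = min(best, worst)
    return best
```

For a chain with element delays [3, 4, 3, 2, 2, 2] split into two stages, the ideal balance would be 8 ns per stage, but the best register placement gives stages of 7 ns and 9 ns; a transparent latch between them could recover that last nanosecond by letting the slow stage borrow from the fast one.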

    rickman, Feb 12, 2010
  5. I actually did this in Xilinx FPGAs back in 1999. The specific
    problem I was solving was an insufficient number of global clocks (a
    lot of interconnects with source-based clocking). Xilinx has
    solutions for this now (regional clocks), but not back then. So I
    used regular interconnect for clocking, and that was very high skew,
    so that you couldn't guarantee that the same edge was, in fact, the
    same edge for all the flops on the clock.

    The solution was to do as you said -- the inputs to every flop were
    from flops clocked on the opposite edge. That, and reducing the
    amount of logic in that clock domain and clock-crossing to a "real"
    clock domain as soon as possible.
    Patrick Maupin, Feb 13, 2010
  6. That's a similar technique, to be sure, for speed-gains. But as I
    wrote in an earlier post, I think the primary motivation for latch-
    based design was originally cost. For example, since each flop is
    really two latches, if you are going to have logic which ANDs together
    the output of two flops, you could replace that with ANDing the output
    of two latches, and outputting that result through another latch, for
    a net savings of 75% of the latches.
    Patrick Maupin, Feb 13, 2010
  7. Your method's target and the target of CPU designers inserting
    latches into a pipeline are totally different.

    They use latches because a combinational signal's delay is too long
    to fit within one clock cycle but fits within two clock cycles in a
    pipeline; the latches cannot be inserted at arbitrary places.

    Weng Tianxiang, Feb 13, 2010
  8. John_H Guest

    I argued with my coworker for a few days about the benefit of latches
    versus registers before I finally realized the advantage of latch
    based designs. Not only is granularity less of a problem (e.g., being
    able to fit only 2 whole logic delays in a stage when 2.8 would fit,
    losing nearly 30%), but the synchronous overhead is also different.
    Rather than accounting for Tco+Tsu for every register in a chain of a
    few clock cycles where register leveling is helpful, only the Tito
    transparent latch delay (minus the Tilo LUT delay) needs to be added
    for each latch in the chain [using Xilinx timing nomenclature].
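John_H's accounting can be put into two one-liners (the parameter values below are illustrative, not datasheet numbers; Tco, Tsu, Tito and Tilo follow the Xilinx nomenclature he uses):

```python
def register_chain_overhead(n_regs, tco, tsu):
    """Overhead of an N-register chain: each register adds a
    clock-to-out (Tco) plus a setup time (Tsu) at the next stage."""
    return n_regs * (tco + tsu)

def latch_chain_overhead(n_latches, tito, tilo):
    """With transparent latches, only the transparent D-to-Q delay
    (Tito) beyond the plain LUT delay (Tilo) is added per stage."""
    return n_latches * (tito - tilo)
```

With, say, Tco = 0.5 ns, Tsu = 0.4 ns, Tito = 0.8 ns and Tilo = 0.6 ns, a 4-deep chain pays 3.6 ns of register overhead but only 0.8 ns of latch overhead.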

    I agree that the register based FPGAs are probably designed (and
    tested) to minimize Tsu and Tco without strong consideration for Tito
    and that the timing analysis is NOT set up to do a good job with
    "latch leveled" timing analysis.

    When I do use latches (when transferring data between rising/falling
    time domains for a fast clock, for instance) I have to specify false
    values around the latch for synchronous analysis rather than the
    precise values through the latch because the analysis wants to see
    registers at each stage even with the proper analysis flag turned on.
    If the analyzer would recognize a chain of rise/fall/rise/fall
    controlled latches and automatically increase the timing constraint by
    a half period for each stage, we'd potentially have a powerful tool at
    our disposal. But they don't so we don't. At least not in FPGAs.

    - John_H
    John_H, Feb 13, 2010
  9. (snip)
    I would have thought that they were fast enough now for that
    not to matter so much. My thought would be that clock skew,
    even with the fancy clock distribution system, would be the
    important factor.
    If the granularity is the problem then you might try clocking
    some on rising and some on falling edge (if available) or having
    two clocks with known phase difference. That would be especially
    true if the DLL's could generate the appropriate clocks.
    That sounds useful. If it gets popular enough, maybe they
    will add it.

    -- glen
    glen herrmannsfeldt, Feb 13, 2010
  10. John_H Guest

    Clock skew becomes entirely unimportant in the latch scheme as I know
    it unless CLK and CLK180 are used instead of normal and inverted
    versions of the same clock. The latches are explicitly alternated
    posedge/negedge/posedge/negedge effectively decomposing a conceptual
    register into its two latches and balancing the logic between them.
    For clock skew to be an issue, two consecutive latches would have to
    be transparent long enough for the logic path plus delays to sneak
    through; that won't happen when using the normal and invert of the
    *same* clock net unless things are very, very wrong in the latch
    hardware.

    Some... registers? Using the posedge and negedge in a registered
    arrangement would simply exacerbate the granularity problem, able to
    fit fewer whole delays into the same clock period by dividing the
    logic into two phases. The latches allow longer delays to move the
    valid data further toward the end of the transparent window and
    shorter delays to move it back, always with the safeguard that data
    for the next (half) cycle isn't allowed to be valid any sooner than
    the front edge of the transparent window.

    The description comes out a little muddy which is why it took me a few
    days to buy in to the whole concept. It's sweet! It just takes some
    timing diagrams and head scratching. And it's certainly not set up
    for proper analysis especially in the Xilinx tools where I
    experimented with the phase domain changes.

    - John_H
    John_H, Feb 14, 2010
  11. I was agreeing with rickman that in many cases, register retiming can
    achieve similarly satisfactory results, while pointing out there were
    originally other reasons besides timing to use latches.

    I agree that latches are used for speed reasons, as well as cost
    reasons. But, as the paper you cite points out, the timing tools
    aren't very good at analyzing the speed, and I don't know about the
    specifics of the atom, but these days, if a chip designer wants
    something that goes faster, he'll just as often use some domino logic
    on a few paths rather than using simple latches -- same concept but
    even more complicated.

    In any case, you have to get your timing information somewhere -- a
    latch really is just half a flop, and you have to decide when to close
    it, so often you're either doing some fancy self-timing or your local
    clock tree gets a lot more complicated when you are doing the
    described time-borrowing.

    Patrick Maupin, Feb 14, 2010
  12. It's not just FPGA tools. Many of the high-end chip tools don't
    support this very well, and to do it you need a PhD in the tool.
    Patrick Maupin, Feb 14, 2010
  13. John_H Guest

    The sad thing is it *shouldn't* be difficult. For each stage of latch
    traversed with an opposite clock edge, one more half cycle is added to
    the overall timing spec for the path. By analyzing up to each stage,
    a logic delay short enough to change the input of a latch that's still
    not transparent starts the timing path fresh from that intermediate
    point.
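John_H's half-cycle rule can be written down directly (a sketch of the constraint a tool would have to generate; per the discussion above, no vendor tool actually exposes it this way):

```python
def latch_path_budget(period, n_latch_stages):
    """Timing budget for a path through n alternating-edge transparent
    latches: one base clock cycle plus half a cycle per latch stage."""
    return period * (1 + 0.5 * n_latch_stages)
```

A path through three such latches with a 10 ns clock would get a 25 ns budget instead of 10 ns.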

    It's such a "pretty" cascade of logic delays that I have to research
    what you mean by "domino logic" to make sure we're not talking about
    the same thing. It truly would be simple to analyze, no PhD required.
    John_H, Feb 14, 2010
  14. John_H,
    Read this paper first, then make your conclusion.

    "Timing Verification and Optimal Clocking of Synchronous Digital
    Circuits", published by 3 professors in University of Michigan in
    1990, known as SMO algorithm.

    Weng Tianxiang, Feb 15, 2010