binary search has a longer combinational delay than linear search

Discussion in 'VHDL' started by sanborne, Dec 14, 2009.

  1. sanborne

    sanborne Guest

    Hello,

    I am posting to this group because I am seeing some non-intuitive
    behavior from one of my synthesized designs, and I would really like
    to understand what is going on. I hope that someone will have some
    insight that they would be willing to share with me.

    I spent quite a lot of time building an HDL design that I thought
    would be very efficient in hardware. Essentially, I need a fixed
    point lookup table. I am approximating a fixed point natural
    logarithm and some other unusual functions. I am using a spline
    approximation to generate a set of "breaks" over a given interval
    where I want these functions defined. An important part of my solution
    is that I need to find the appropriate "break" for a given input
    value. For clarification, see the following pseudo-code:

    --begin code (assuming x is the input value, y is the output, and a
    --first-order polynomial approximation):
    if x < B0 then
        A <= to_sfixed(...);
        B <= to_sfixed(...);
    elsif x < B1 then
        A <= to_sfixed(...);
        B <= to_sfixed(...);
    elsif ...
        ....
    elsif x < B9 then
        A <= to_sfixed(...);
        B <= to_sfixed(...);
    else
        ....
    end if;
    y <= A*x + B;
    --end code (note that my actual code is slightly different...this is
    --only here to illustrate my question)

    I assumed from my schooling that a binary search for the correct
    coefficients A and B would be better than a linear search (as is
    illustrated in my code above). But strangely enough, when I synthesize
    both designs, one implemented with a linear search and one with a
    binary search, the linear search has a shorter combinational delay and therefore can
    actually be clocked faster! I think there is some kind of optimization
    that is happening in the synthesis tool (I am using Mentor Graphics
    Precision, but have seen similar behavior in Quartus II). But this
    result seems to render unnecessary a ton of work that I did to write
    binary search code instead of linear search.

    Here is an example of how my binary search would look (note that I
    have simulated both approaches and get identical results):

    --begin pseudo-code
    if x < B4 then
        if x < B1 then
            if x < B0 then
                A <= to_sfixed(...);
                B <= to_sfixed(...);
            else
                A <= ...
                B <= ...
            end if;
        elsif x > B2 then
            A <= ...
            B <= ...
        else
            ...
        end if;
    else
        ....
        --etc
    end if;
    --end code

    Is there something wrong with my thinking here? Why would a linear
    search have a shorter delay than a binary search? Is the synthesis
    tool automatically pipelining my design? Has anyone else seen results
    like this? Do you believe the results, or do you suspect I am doing
    something wrong in my code?

    Note that I could post more code details, but I am using MATLAB and
    the Simulink HDL Coder, and it would be difficult for me to remove the
    dependency of my code on these tools. I have a hand written design,
    but it is old, and I would probably need permission from my company to
    post specifics anyway. I understand that I may need to put together a
    "toy" example that can be used to more clearly characterize this
    question, but I wonder first if anyone has any general insight.

    I would be very interested and appreciative of any comments.

    Sean Little
     
    sanborne, Dec 14, 2009
    #1

  2. On Dec 14, 6:01 pm, sanborne <> wrote:

    > I assumed from my schooling that a binary search for the correct
    > coefficients A and B would be better than a linear search


    If you can access only one search key at a time, sure. But
    in hardware you can access multiple search keys in parallel;
    the cost is not time, but area (more comparators).
    By the look of your code, both your binary and linear searches
    are in fact comparing the input value with all keys
    simultaneously - so there is no added area cost in going
    to linear search.

    > But strangely enough, when I synthesize
    > both a design that is implemented using linear and a binary search,
    > the linear search has a shorter combinational delay and therefore can
    > actually be clocked faster! I think there is some kind of optimization
    > that is happening in the synthesis tool (I am using Mentor Graphics
    > Precision, but have seen similar behavior in Quartus II).


    This is interesting, but perhaps not very surprising. A linear
    string of comparisons is a very regular structure and it's quite
    possible that the synth tool will make the same optimization that
    a designer might (more details below). Your hand-crafted binary
    search, though, makes a rather irregular hardware structure and
    it's probably not optimized - so you end up with a cascade of
    comparators (three in your example) on the critical path.

    > But this
    > result seems to render unnecessary a ton of work that I did
    > to write binary search code instead of linear search.


    Synthesis is like that :) It's often best to use a simple
    iterative algorithm and let the tool reshape it. Trying to be
    too clever can defeat the tool's ability to see important
    optimization opportunities.

    Of course, that doesn't mean human ingenuity is useless.
    Particularly when a computation is spread across many
    clock cycles, there may be algorithmic improvements that
    you can see but a synthesis tool wouldn't even get close
    to achieving.

    > Why would a linear
    > search have a shorter delay than a binary search?


    If there are few enough search keys for you to do all
    comparisons simultaneously, and if the keys are in
    order (easy if they're the values in a monotonic table)
    then the "linear" search can be nicely parallelized:
    - Provide one magnitude comparator per search key.
    - Compare the input value with each key. This gives
    you a "thermometer code" of single-bit comparison
    results - something like 1111100000000.
    - Locate the 1/0 boundary of the thermometer code, and
    use its location to decide which table entry to use.

    Converting a thermometer code into a location number is
    a simple problem with well-known, efficient solutions
    (actually they're a kind of binary search, which is ironic)
    and I've often seen synthesis tools doing precisely this
    kind of arrangement.
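    For what it's worth, that parallel arrangement can be sketched in
    VHDL roughly as follows. This is a sketch only: the widths, names,
    and break values are placeholders, not Sean's actual table.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Sketch: one comparator per break produces a thermometer code; a
-- priority scan then locates the 0/1 boundary. The break values here
-- are placeholders -- the real ones would come from the spline fit.
entity break_select is
  port (
    x       : in  signed(15 downto 0);
    segment : out integer range 0 to 10
  );
end entity break_select;

architecture rtl of break_select is
  type break_array is array (0 to 9) of signed(15 downto 0);
  constant BREAKS : break_array := (others => (others => '0'));  -- placeholder
begin
  process (x)
    variable therm : std_logic_vector(0 to 9);
    variable sel   : integer range 0 to 10;
  begin
    -- all ten comparisons happen in parallel; for ascending breaks the
    -- results form a thermometer code such as "0001111111"
    for i in BREAKS'range loop
      if x < BREAKS(i) then
        therm(i) := '1';
      else
        therm(i) := '0';
      end if;
    end loop;
    -- locate the boundary: the lowest break that x is below
    sel := 10;  -- default: x is at or above the last break
    for i in BREAKS'reverse_range loop
      if therm(i) = '1' then
        sel := i;
      end if;
    end loop;
    segment <= sel;
  end process;
end architecture rtl;
```

    Both loops unroll into pure combinational logic; the second one is
    exactly the priority encoder the tools are good at optimizing.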

    > Is the synthesis
    > tool automatically pipelining my design?


    From the code snippet you gave, I very much doubt it.

    As a matter of interest, I've used a similar technique in
    the past in image processing designs to keep a sorted list
    of the largest N values seen in a data stream (for
    small-ish N, up to about 10).
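    That sorted-list trick looks something like this as a sketch; the
    array type, width, and length are made up for illustration:

```vhdl
-- Sketch: keep the N largest values seen so far, sorted in descending
-- order. Every slot compares against the new sample in parallel; below
-- the insertion point everything shifts down one place per clock.
-- Assumed declarations (illustrative):
--   type value_array is array (0 to 9) of signed(15 downto 0);
--   signal list   : value_array := (others => (others => '0'));
--   signal sample : signed(15 downto 0);
process (clk)
begin
  if rising_edge(clk) then
    for i in list'range loop
      if sample > list(i) then
        if i = list'low then
          list(i) <= sample;           -- new largest value
        elsif sample <= list(i - 1) then
          list(i) <= sample;           -- insertion point
        else
          list(i) <= list(i - 1);      -- shift down one slot
        end if;
      end if;
    end loop;
  end if;
end process;
```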

    At a more abstract level, you can think of it like this:
    Linear search for a single value in a collection of N
    keys is an O(N) problem in software - the time taken to
    do it scales linearly with N. In hardware it is still an O(N)
    problem, but now the hardware size scales linearly with N (you
    have N comparators) and the whole thing completes in constant
    time.

    In a similar way, simple sort algorithms that are O(N^2)
    time in software can easily be re-jigged in hardware to
    be - for example - O(N) area and O(N) clock cycles, or
    O(N^2) area and single-cycle. For large N the software
    approach will always eventually win, given the low cost of
    large memories. But if N is quite small, smearing the
    algorithm across more hardware is an attractive option.
    --
    Jonathan Bromley
     
    Jonathan Bromley, Dec 16, 2009
    #2

  3. sanborne

    Andy Guest

    Depending on the target FPGA architecture (you mentioned Quartus), the
    linear search may be optimized to use the "carry logic" resources and
    dedicated interconnect, which are faster than standard LUT resources
    and regular interconnect. The binary search may not be able to use
    these as effectively.

    I completely agree with Jonathan's statement about trying to outsmart
    the synthesis tool. Always code your algorithm in the behavioral
    style that is easiest to understand. Then let the synthesis
    tool take a crack at it. If it does not fit or does not meet timing,
    then you can get more creative and construct the behavior to encourage
    a better implementation. But staying with the simple to read/write/
    understand algorithm will pay big dividends down the road in
    maintenance, etc. A complex description that meets area/timing, but
    with a hidden functional bug is worse than a simple description that
    will not meet area/timing, but is functionally correct.

    Andy
     
    Andy, Dec 16, 2009
    #3
  4. sanborne

    sanborne Guest

    On Dec 16, 9:48 am, Andy <> wrote:
    > Depending on the target FPGA architecture (you mentioned Quartus), the
    > linear search may be optimized to use the "carry logic" resources and
    > dedicated interconnect, which are faster than standard LUT resources
    > and regular interconnect. The binary search may not be able to use
    > these as effectively.
    >
    > I completely agree with Jonathan's statement about trying to outsmart
    > the synthesis tool. Always code your algorithm in the behavioral
    > style that is easiest to understand. Then let the synthesis
    > tool take a crack at it. If it does not fit or does not meet timing,
    > then you can get more creative and construct the behavior to encourage
    > a better implementation. But staying with the simple to read/write/
    > understand algorithm will pay big dividends down the road in
    > maintenance, etc. A complex description that meets area/timing, but
    > with a hidden functional bug is worse than a simple description that
    > will not meet area/timing, but is functionally correct.
    >
    > Andy


    Very interesting.

    Thanks for the responses, Jonathan and Andy. I feel a bit defeated.
    Clearly I have been under-estimating the power of the synthesis tools.

    Sean
     
    sanborne, Dec 16, 2009
    #4
  5. sanborne

    KJ Guest

    On Dec 16, 1:53 pm, sanborne <> wrote:

    >
    > Thanks for the responses, Jonathan and Andy. I feel a bit defeated.
    > Clearly I have been under-estimating the power of the synthesis tools.
    >


    Not to be a noodge, but algorithm improvements on the source code
    side can often lead to far better results than any synthesis tool
    could ever come up with... In this particular case, given how you
    happened to code your binary search, it didn't work out.

    A simple example of a better algorithm that a synthesizer wouldn't
    find is division. Y <= A / B; will synthesize... into something large
    and slow for any 'A', 'B' of a useful length (assuming no special
    hardware, just logic blocks). But choose instead to implement the
    division using lpm_divide, or whatever your favorite multi-clock
    cycle divider is, and presto! You have a divider that is not nearly
    so large and runs very fast, the tradeoff being that you have several
    clock cycles of latency before you get the result.

    The reason a synthesis tool usually can't come up with that is because
    it is not allowed to change the function...delaying the availability
    of some result at an I/O pin by even one clock cycle changes the
    function. You on the other hand, as the designer, are free to make
    other compensating changes to flag to the receiver when the results of
    the division are ready. In other words you *can* choose to implement
    something where the final result pops out of a pin one clock cycle
    later.

    A synthesis tool could, in principle, make this transformation if the
    end result still arrives at an I/O pin after the correct number of
    clock cycles; in practice it likely will not.
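    As a sketch of that compensating change (the entity name, latency,
    and port names are all assumptions, and the divider core itself --
    lpm_divide or similar -- is instantiated separately and not shown):

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Sketch: carry a "valid" flag alongside a multi-cycle divider so the
-- downstream logic knows when the quotient is ready. DIV_LATENCY must
-- match the pipeline depth of whatever divider core is instantiated.
entity div_valid is
  generic (DIV_LATENCY : positive := 5);  -- assumed depth
  port (
    clk       : in  std_logic;
    in_valid  : in  std_logic;  -- asserted when A and B are presented
    out_valid : out std_logic   -- asserted when Y = A / B appears
  );
end entity div_valid;

architecture rtl of div_valid is
  signal shift : std_logic_vector(DIV_LATENCY - 1 downto 0) := (others => '0');
begin
  process (clk)
  begin
    if rising_edge(clk) then
      -- delay the valid flag through the same number of stages
      shift <= shift(DIV_LATENCY - 2 downto 0) & in_valid;
    end if;
  end process;
  out_valid <= shift(DIV_LATENCY - 1);
end architecture rtl;
```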

    Another example is long if/elsif...endif statements. To implement the
    'else', there must be logic and the amount of logic grows as you work
    your way further down the 'elsif' tree. The solution here is to look
    for ways to parallelize the computations so they can be done
    independently. One way is to see if the big 'if' can instead be
    handled as a case statement. 'Case' is less expensive than 'if/elsif'
    because all case choices are mutually exclusive, so no priority logic
    is needed. Another way is to compute 'partial' solutions
    unconditionally and combine them all at the end.
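    For instance (a toy fragment, not from the original design), a case
    statement on a discrete selector synthesizes to a flat multiplexer
    with no priority chain:

```vhdl
-- Sketch: 'sel' chooses one of four operations. All choices are
-- mutually exclusive by construction, so no priority logic results.
-- Assumed: signal sel : std_logic_vector(1 downto 0);
--          signal x, y : signed(15 downto 0);
process (sel, x)
begin
  case sel is
    when "00"   => y <= x + 1;
    when "01"   => y <= x - 1;
    when "10"   => y <= -x;
    when others => y <= x;
  end case;
end process;
```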

    Lastly, the most powerful mechanism for improving performance is
    usually pipelining. Usually this leads to faster logic (with longer
    latency), but sometimes also less logic (as it did with the division
    mentioned above).

    Andy is on target: make the first cut a simple, easy-to-understand
    description. If that is fast enough and fits, it might also be good
    enough. Most anything can be improved upon, so if down the road the
    simple approach falls a bit short, you can improve it at that time.

    Bottom line though is, besides respecting the ability of the synthesis
    tool to do a good job at synthesizing logic, you shouldn't feel
    'beaten' by the tool to the point of not trying to improve on the
    results.

    Kevin Jennings
     
    KJ, Dec 17, 2009
    #5
  6. sanborne

    Andy Guest

    KJ has given you excellent advice.

    Some synthesis tools include optional optimizations that can spread
    logic out over multiple clocks (across multiple flops), but you must
    have allowed for the total number of flops/clock cycles in the first
    place (the synthesis tool cannot alter the observable clock cycle
    based behavior of your description). For instance, if you followed a
    division or other operation (in time) with several empty pipeline
    stages, some tools can spread that division or other operation over
    some of those pipeline stages automatically. Some tools can only move
    logic up to one clock/flop in either direction.

    Any synthesis tool worth its salt does at least a rudimentary
    reachability analysis on statements in a process. If the conditions on
    a long if-then-elsif tree are all demonstrably mutually exclusive to
    the synthesizer, then it will optimize out the priority logic inherent
    in most if-elsif trees. Keep in mind that multiple, sequential if-then-
    end statements can also generate priority logic, because the last
    assignment to a signal in a process wins (during the same clock
    cycle), trumping previous assignments.

    One place where this is handy is that you can step through a long if-then-
    end-if sequence with a for-loop; you can't use a loop that way with a
    case statement. If the conditions are mutually exclusive, you avoid
    the priority logic. If they are not, you cannot use a case statement
    for them anyway. Synthesis unrolls the loop automatically for you.

    You can also emulate a long if-then-elsif tree with a for-loop and
    an exit statement.
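    A sketch of that for-loop idiom, using illustrative names (BREAKS
    and the coefficient tables are stand-ins, not the original design):

```vhdl
-- Sketch: the loop unrolls at synthesis into the same chain of
-- comparisons as a hand-written if/elsif tree; the exit makes the
-- first matching break win.
-- Assumed: BREAKS is a constant array (0 to 9) of ascending break
-- values; COEFF_A/COEFF_B are arrays (0 to 10), the last entry being
-- the default segment beyond the final break.
process (x)
begin
  -- default: x is at or above the last break
  A <= COEFF_A(BREAKS'length);
  B <= COEFF_B(BREAKS'length);
  for i in BREAKS'range loop
    if x < BREAKS(i) then
      A <= COEFF_A(i);
      B <= COEFF_B(i);
      exit;  -- first match wins, just like if/elsif
    end if;
  end loop;
end process;
```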

    Andy
     
    Andy, Dec 18, 2009
    #6
