binary search has a longer combinational delay than linear search

sanborne

Hello,

I am posting to this group because I am seeing some non-intuitive
behavior in one of my synthesized designs, and I would really like to
understand what is going on. I hope that someone will have some
insight that they would be willing to share with me.

I spent quite a lot of time building an HDL design that I thought
would be very efficient in hardware. Essentially, I need a fixed-point
lookup table. I am approximating a fixed-point natural logarithm and
some other unusual functions. I am using a spline approximation to
generate a set of "breaks" over a given interval where I want these
functions defined. An important part of my solution is that I need to
find the appropriate "break" for a given input value. For
clarification, see the following pseudo-code:

--begin code (assuming x is the input value, y is the output, and a
--first-order polynomial approximation):
if x < B0 then
  A <= to_sfixed(...);
  B <= to_sfixed(...);
elsif x < B1 then
  A <= to_sfixed(...);
  B <= to_sfixed(...);
elsif ...
  ....
elsif x < B9 then
  A <= to_sfixed(...);
  B <= to_sfixed(...);
else
  ....
end if;
y <= A*x + B;
--end code (note that my actual code is slightly different...this is
--only here to illustrate my question)

I assumed from my schooling that a binary search for the correct
coefficients A and B would be better than a linear search (which is
what the code above illustrates). But strangely enough, when I
synthesize two designs, one implemented with a linear search and one
with a binary search, the linear search has a shorter combinational
delay and can therefore actually be clocked faster! I think some kind
of optimization is happening in the synthesis tool (I am using Mentor
Graphics Precision, but have seen similar behavior in Quartus II). But
this result seems to render unnecessary a ton of work that I did to
write binary search code instead of linear search.

Here is an example of how my binary search would look (note that I
have simulated both approaches and get identical results):

--begin pseudo-code
if x < B4 then
  if x < B1 then
    if x < B0 then
      A <= to_sfixed(...);
      B <= to_sfixed(...);
    else
      A <= ...
      B <= ...
    end if;
  elsif x > B2 then
    A <= ...
    B <= ...
  else
    ...
    ...
  end if;
elsif x > B4 then
  ....
--etc
--end code

Is there something wrong with my thinking here? Why would a linear
search have a shorter delay than a binary search? Is the synthesis
tool automatically pipelining my design? Has anyone else seen results
like this? Do you believe the results, or do you suspect I am doing
something wrong in my code?

Note that I could post more code details, but I am using MATLAB and
the Simulink HDL Coder, and it would be difficult for me to remove my
code's dependency on these tools. I have a hand-written design, but it
is old, and I would probably need permission from my company to post
specifics anyway. I understand that I may need to put together a "toy"
example that can be used to characterize this question more clearly,
but I wonder first if anyone has any general insight.

I would be very interested in, and appreciative of, any comments.

Sean Little
 

Jonathan Bromley

> I assumed from my schooling that a binary search for the correct
> coefficients A and B would be better than a linear search

If you can access only one search key at a time, sure. But
in hardware you can access multiple search keys in parallel;
the cost is not of time, but of area (more comparators).
By the look of your code, both your binary and linear searches
are in fact comparing the input value with all keys
simultaneously - so there is no added area cost in going
to linear search.
> But strangely enough, when I synthesize two designs, one implemented
> with a linear search and one with a binary search, the linear search
> has a shorter combinational delay and can therefore actually be
> clocked faster! I think some kind of optimization is happening in the
> synthesis tool (I am using Mentor Graphics Precision, but have seen
> similar behavior in Quartus II).

This is interesting, but perhaps not very surprising. A linear
string of comparisons is a very regular structure and it's quite
possible that the synth tool will make the same optimization that
a designer might (more details below). Your hand-crafted binary
search, though, makes a rather irregular hardware structure and
it's probably not optimized - so you end up with a cascade of
comparators (three in your example) on the critical path.
> But this result seems to render unnecessary a ton of work that I did
> to write binary search code instead of linear search.

Synthesis is like that :) It's often best to use a simple
iterative algorithm and let the tool reshape it. Trying to be
too clever can defeat the tool's ability to see important
optimization opportunities.

Of course, that doesn't mean human ingenuity is useless.
Particularly when a computation is spread across many
clock cycles, there may be algorithmic improvements that
you can see but a synthesis tool wouldn't even get close
to achieving.
> Why would a linear search have a shorter delay than a binary search?

If there are few enough search keys for you to do all
comparisons simultaneously, and if the keys are in
order (easy if they're the values in a monotonic table)
then the "linear" search can be nicely parallelized:
- Provide one magnitude comparator per search key.
- Compare the input value with each key. This gives
you a "thermometer code" of single-bit comparison
results - something like 1111100000000.
- Locate the 1/0 boundary of the thermometer code, and
use its location to decide which table entry to use.

Converting a thermometer code into a location number is
a simple problem with well-known, efficient solutions
(actually they're a kind of binary search, which is ironic),
and I've often seen synthesis tools produce precisely this
kind of arrangement.
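
Something along these lines (only a sketch: the entity name, widths,
and placeholder break values are invented here, not taken from Sean's
design):

--begin sketch
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity break_select is
  generic (NBREAKS : positive := 10);
  port (
    x   : in  signed(15 downto 0);
    seg : out integer range 0 to NBREAKS  -- index of the selected segment
  );
end entity break_select;

architecture rtl of break_select is
  type break_array is array (0 to NBREAKS-1) of signed(15 downto 0);
  -- placeholder break values; the real ones would come from the spline fit
  constant BREAKS : break_array := (others => (others => '0'));
  signal therm : std_logic_vector(NBREAKS-1 downto 0);
begin
  -- one magnitude comparator per break, all working in parallel; with
  -- the breaks sorted, therm is a thermometer code such as "1111100000"
  gen_cmp : for i in 0 to NBREAKS-1 generate
    therm(i) <= '1' when x < BREAKS(i) else '0';
  end generate gen_cmp;

  -- locate the 1/0 boundary (a priority encode); synthesis unrolls the loop
  find_seg : process (therm)
  begin
    seg <= NBREAKS;               -- default: x is at or above every break
    for i in NBREAKS-1 downto 0 loop
      if therm(i) = '1' then      -- last assignment wins, so the lowest
        seg <= i;                 -- '1' position is the one that survives
      end if;
    end loop;
  end process find_seg;
end architecture rtl;
--end sketch

The seg index would then select A and B from small coefficient ROMs
(or constant arrays) before the multiply-add.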
> Is the synthesis tool automatically pipelining my design?

From the code snippet you gave, I very much doubt it.

As a matter of interest, I've used a similar technique in
the past in image processing designs to keep a sorted list
of the largest N values seen in a data stream (for
small-ish N, up to about 10).
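
The core of that trick looked roughly like this (a from-memory sketch;
the names, widths, and list length here are all invented):

--begin sketch
-- in the declarative region (assumed context: clk, d_valid : std_logic;
-- d : unsigned(15 downto 0) is the incoming sample):
constant N : positive := 8;
type sample_array is array (0 to N-1) of unsigned(15 downto 0);
signal top : sample_array := (others => (others => '0'));

-- keep the N largest values seen so far, sorted in descending order;
-- every entry is compared with the new sample in parallel, and the
-- insert-and-shift happens in a single clock
process (clk)
begin
  if rising_edge(clk) then
    if d_valid = '1' then
      for i in 0 to N-1 loop
        if i = 0 then
          if d > top(0) then
            top(0) <= d;          -- new overall maximum
          end if;
        elsif d > top(i) and d <= top(i-1) then
          top(i) <= d;            -- the new sample lands exactly here
        elsif d > top(i-1) then
          top(i) <= top(i-1);     -- entries below the insertion point
        end if;                   -- shift down one slot; the rest hold
      end loop;
    end if;
  end if;
end process;
--end sketch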

At a more abstract level, you can think of it like this:
linear search for a single value in a collection of N
keys is an O(N) problem in software - the time taken to
do it scales linearly with N. In hardware you still
have an O(N) problem, but now it is the hardware size
that scales linearly with N (you have N comparators), and
the whole thing can be done in constant time.

In a similar way, simple sort algorithms that are O(N^2)
time in software can easily be re-jigged in hardware to
be - for example - O(N) area and O(N) clock cycles, or
O(N^2) area and single-cycle. For large N the software
approach will always eventually win, given the low cost of
large memories. But if N is quite small, smearing the
algorithm across more hardware is an attractive option.
 

Andy

Depending on the target FPGA architecture (you mentioned Quartus), the
linear search may be optimized to use the "carry logic" resources and
dedicated interconnect, which are faster than standard LUT resources
and regular interconnect. The binary search may not be able to use
these as effectively.

I completely agree with Jonathan's statement about trying to outsmart
the synthesis tool. Always code your algorithm in the easiest
behavioral way to understand what it is doing. Then let the synthesis
tool take a crack at it. If it does not fit or does not meet timing,
then you can get more creative and construct the behavior to encourage
a better implementation. But staying with the simple to read/write/
understand algorithm will pay big dividends down the road in
maintenance, etc. A complex description that meets area/timing, but
with a hidden functional bug is worse than a simple description that
will not meet area/timing, but is functionally correct.

Andy
 

sanborne

> Depending on the target FPGA architecture (you mentioned Quartus), the
> linear search may be optimized to use the "carry logic" resources and
> dedicated interconnect, which are faster than standard LUT resources
> and regular interconnect. The binary search may not be able to use
> these as effectively.
>
> I completely agree with Jonathan's statement about trying to outsmart
> the synthesis tool. Always code your algorithm in the easiest
> behavioral way to understand what it is doing. Then let the synthesis
> tool take a crack at it. If it does not fit or does not meet timing,
> then you can get more creative and construct the behavior to encourage
> a better implementation. But staying with the simple to read/write/
> understand algorithm will pay big dividends down the road in
> maintenance, etc. A complex description that meets area/timing, but
> with a hidden functional bug is worse than a simple description that
> will not meet area/timing, but is functionally correct.
>
> Andy

Very interesting.

Thanks for the responses, Jonathan and Andy. I feel a bit defeated.
Clearly I have been under-estimating the power of the synthesis tools.

Sean
 

KJ

> Thanks for the responses, Jonathan and Andy. I feel a bit defeated.
> Clearly I have been under-estimating the power of the synthesis tools.

Not to be a noodge, but algorithm improvements on the source code side
can often lead to far better results than any synthesis tool could
ever come up with... In this particular case, given how you happened
to code your binary search, it didn't work out.

A simple example of a better algorithm that a synthesizer wouldn't get
would be division. Y <= A / B; will synthesize...into something large
and slow for any 'A', 'B' that is of a useful length (and assuming no
special hardware, just logic blocks). But instead choose to implement
the division using lpm_divide or whatever your favorite multi-clock
cycle divider is and Presto! you have a divider that is not nearly so
large and runs very fast, the tradeoff being that you have several
clock cycles of latency before you get the result.
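
To make that concrete, here is roughly what a garden-variety
multi-cycle divider looks like (a sketch of a bit-serial restoring
divider, not the actual lpm_divide implementation; names and widths
are invented):

--begin sketch
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity serial_div is
  generic (WIDTH : positive := 16);
  port (
    clk   : in  std_logic;
    start : in  std_logic;
    a, b  : in  unsigned(WIDTH-1 downto 0);  -- dividend, divisor
    q     : out unsigned(WIDTH-1 downto 0);  -- quotient, WIDTH clocks later
    done  : out std_logic
  );
end entity serial_div;

architecture rtl of serial_div is
  -- remainder develops in the high half, quotient in the low half
  signal rem_q : unsigned(2*WIDTH-1 downto 0) := (others => '0');
  signal count : integer range 0 to WIDTH := 0;
begin
  process (clk)
    variable v : unsigned(2*WIDTH-1 downto 0);
  begin
    if rising_edge(clk) then
      if start = '1' then
        rem_q <= resize(a, 2*WIDTH);      -- remainder = 0, load dividend
        count <= WIDTH;
      elsif count > 0 then
        v := shift_left(rem_q, 1);        -- one restoring step per clock:
        if v(2*WIDTH-1 downto WIDTH) >= b then
          v(2*WIDTH-1 downto WIDTH) := v(2*WIDTH-1 downto WIDTH) - b;
          v(0) := '1';                    -- this quotient bit is a 1
        end if;
        rem_q <= v;
        count <= count - 1;
      end if;
    end if;
  end process;

  q    <= rem_q(WIDTH-1 downto 0);
  done <= '1' when count = 0 else '0';    -- (also high before first start)
end architecture rtl;
--end sketch

The per-clock logic is just one compare-and-subtract, so it runs fast;
the price is WIDTH clocks of latency instead of one giant combinational
array.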

The reason a synthesis tool usually can't come up with that is because
it is not allowed to change the function...delaying the availability
of some result at an I/O pin by even one clock cycle changes the
function. You on the other hand, as the designer, are free to make
other compensating changes to flag to the receiver when the results of
the division are ready. In other words you *can* choose to implement
something where the final result pops out of a pin one clock cycle
later.

A synthesis tool could, in principle, make this transformation if the
end result still arrives at an I/O pin after the correct number of
clock cycles; in practice it likely will not.

Another example is long if/elsif...end if statements. To implement the
'else', there must be logic, and the amount of logic grows as you work
your way further down the 'elsif' tree. The solution here is to look
for ways to parallelize the computations so they can be done
independently. One way is to see if the big 'if' can instead be
handled as a case statement. 'Case' is less expensive than 'if/elsif'
because all case conditions are mutually exclusive. Another way can
be to compute 'partial' solutions unconditionally and combine all of
the partial solutions at the end, independent of conditions.
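
A trivial illustration of the difference (signal names are invented;
the two processes drive separate outputs so the sketch stands alone):

--begin sketch
-- assume sel : std_logic_vector(1 downto 0); a, b, c, d, y1, y2 : std_logic
process (sel, a, b, c, d)
begin
  -- priority structure: each branch is qualified by all the earlier tests
  if    sel = "00" then y1 <= a;
  elsif sel = "01" then y1 <= b;
  elsif sel = "10" then y1 <= c;
  else                  y1 <= d;
  end if;
end process;

process (sel, a, b, c, d)
begin
  -- parallel structure: the choices are mutually exclusive by construction
  case sel is
    when "00"   => y2 <= a;
    when "01"   => y2 <= b;
    when "10"   => y2 <= c;
    when others => y2 <= d;
  end case;
end process;
--end sketch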

Lastly, the most powerful mechanism for improving performance is
usually pipelining. Usually this leads to faster logic (with longer
latency), but sometimes also less logic (as it did with the division
example above).
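
In your lookup-table design, for example, adding a register between
the break search and the multiply-add might look like this (a sketch;
the register names are invented and the sfixed resizing details are
glossed over):

--begin sketch
-- a_r, b_r, x_r, y are sfixed registers (ieee.fixed_pkg)
process (clk)
begin
  if rising_edge(clk) then
    -- stage 1: register the selected coefficients and the input
    a_r <= A;
    b_r <= B;
    x_r <= x;
    -- stage 2: the multiply-add no longer sits behind the comparators
    y <= resize(a_r * x_r + b_r, y'high, y'low);
  end if;
end process;
--end sketch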

Andy is on target: a first cut should be a simple, easy-to-understand
description. If that is fast enough and fits, then it might also be
good enough. Most anything can be improved upon, so if down the road
the simple approach falls a bit short, then you can improve it at that
time.

Bottom line though is, besides respecting the ability of the synthesis
tool to do a good job at synthesizing logic, you shouldn't feel
'beaten' by the tool to the point of not trying to improve on the
results.

Kevin Jennings
 

Andy

KJ has given you excellent advice.

Some synthesis tools include optional optimizations that can spread
logic out over multiple clocks (across multiple flops), but you must
have allowed for the total number of flops/clock cycles in the first
place (the synthesis tool cannot alter the observable clock cycle
based behavior of your description). For instance, if you followed a
division or other operation (in time) with several empty pipeline
stages, some tools can spread that division or other operation over
some of those pipeline stages automatically. Some tools can only move
logic up to one clock/flop in either direction.
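
For instance, the "empty stages" pattern might look like this (a
sketch; whether a given tool actually balances the logic depends on
its retiming options):

--begin sketch
-- a, b : unsigned inputs; p0..p2, y sized to hold the full product
process (clk)
begin
  if rising_edge(clk) then
    p0 <= a * b;    -- one big combinational lump as written...
    p1 <= p0;       -- ...followed by deliberately empty stages that a
    p2 <= p1;       -- retiming-capable tool is free to fill with logic
    y  <= p2;
  end if;
end process;
--end sketch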

Any synthesis tool worth its salt does at least a rudimentary
reachability analysis on statements in a process. If the conditions on
a long if-then-elsif tree are all demonstrably mutually exclusive to
the synthesizer, then it will optimize out the priority logic inherent
in most if-elsif trees. Keep in mind that multiple sequential
if-then-end-if statements can also generate priority logic, because
the last assignment to a signal in a process wins (during the same
clock cycle), trumping previous assignments.

One place where this is handy: you can step through a long
if-then-end-if sequence with a for-loop; you can't use a loop that way
with a case statement. If the conditions are mutually exclusive, you
avoid the priority logic. If they are not, you could not have used a
case statement for them anyway. Synthesis unrolls the loop
automatically for you.

You can also emulate a long if-then-elsif tree with a for-loop and an
exit statement.
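
For instance (a sketch with invented names, along the lines of the
break search discussed earlier; BREAKS is a sorted constant array and
seg is an integer signal):

--begin sketch
process (x)
begin
  seg <= NBREAKS;               -- default when x is above every break
  for i in 0 to NBREAKS-1 loop
    if x < BREAKS(i) then
      seg <= i;
      exit;                     -- behaves like the first matching elsif arm
    end if;
  end loop;
end process;
--end sketch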

Andy
 
