Fast/low area Sorting hardware.

john.deepu · Jun 7, 2005

Hi all ,
I wanted to implement a fast and low area sorting algorithm in Verilog
RTL, does anyone have any suggestions?
Any links to IEEE papers, articles are higly welcome...

regards
Deepu John

Stephane · Jun 7, 2005

The question is how many values you want to sort... basically, this is
iterative algorithm, so for a large set, a big memory and controller
will be small but long.

as you posted on comp.arch.fpga, you might like this one:
http://www.xilinx.com/xcell/xl23/xl23_16.pdf

else googlize "rank order filter"

john.deepu · Jun 7, 2005

I wanted to sort 48 8bit unsigned numbers

JJ · Jun 7, 2005

Depends on how fast you want it and what the input order usually is.

Do it in C 1st and cost it so that every read and write into data array
is your memory access cost function and every compare done is your hw
compare cost function.

For a 2 input a,b = sort x,y
a,b <= (x<y)? x,y : y,x;

Arrange this into a merge sort should take 24*6*2 clocks using a DP
ram. Thats around 288 clocks worst case. If no swap, you could save
some clocks. If data is always sorted the time can be halved.

Quicksort would be a waste of time & HW design effort here.
Even bubble sorts are ok if you know the data is already mostly in
order most of the time. Sort only as fast as you actually need.

Since your data is also only 8bits wide the other obvious candidate is
counter sorting. Fill an array of 256 counters with 0.
Then scan the 48 inputs and do cntr[data]++; Then scan the cntr list
and for each cntr value emit that index the value no of times. This
will take maybe 256 clocks to 0, 48 clocks to do the ++, then 256
clocks to scan the counts plus 48 more to emit the values. With dual
port you could get around 300 clocks with out using any compares. This
algorithm works very well when you have lots of data inputs with low
precisions, but not so well other way around. 48 inputs is a bit low
for this design.

johnjakson at usa dot com

john.deepu · Jun 7, 2005

Hi Stephane.
I wanted to sort 48 8bit unsigned numbers

thanks
Deepu

John_H · Jun 7, 2005

I wanted to sort 48 8bit unsigned numbers

And do you want the 384 bit result for a new set of numbers every 10 ns?
Can you wait 1 ms for the result?
Is it fine to keep the inputs and outputs in a BlockRAM or are you
presenting them parallel?

Falk Brunner · Jun 7, 2005

Hi Stephane.
I wanted to sort 48 8bit unsigned numbers

Just some ideas. Hold the numbers in a BRAM. Create a small FSM to pull out
the numbers, store them in a temporary register and compare. Bubble sort is
usually the easiest but slowest approach in software. Maybe this is also
true for an FPGA. Misc sort is fastest, so it sould be for hardware too. You
could use both ports aof a BRAm to simultaneously access two pieces of data
to increase speed.

Regards
Falk

c d saunter · Jun 7, 2005

: : > Hi Stephane.
: > I wanted to sort 48 8bit unsigned numbers

: Just some ideas. Hold the numbers in a BRAM. Create a small FSM to pull out
: the numbers, store them in a temporary register and compare. Bubble sort is
: usually the easiest but slowest approach in software. Maybe this is also
: true for an FPGA. Misc sort is fastest, so it sould be for hardware too. You
: could use both ports aof a BRAm to simultaneously access two pieces of data
: to increase speed.

Bubble sort should actually be quite fast - you can store all 48 values in
registers, then compare and swap if necessary odd pairs on odd clock cycles
and even pairs on even cycles. After 48 cycles the registers should hold
a sorted data set.

Probably this aproach is about as fast as you can get? The speedup comes
from the fact that many many bubble sort opperations can occour in
parallel. Is this the most hardware efficient sort that runs at this speed
though?

If possible data should be shifted sequentially into and out of (or at least
out of) the registers as a readout mux would be a resource hog. Using a
Generate statement in VHDL (or equiv in Verilog) I'm guessing the sort will come
to less than 100 lines of code.

cheers
cds

Falk Brunner · Jun 7, 2005

Bubble sort should actually be quite fast - you can store all 48 values in
registers, then compare and swap if necessary odd pairs on odd clock cycles
and even pairs on even cycles. After 48 cycles the registers should hold
a sorted data set.

This sounds like a almost full parallel approach. Could be quite fast, but
also quite resouce hungry.

Probably this aproach is about as fast as you can get? The speedup comes
from the fact that many many bubble sort opperations can occour in
parallel. Is this the most hardware efficient sort that runs at this speed
though?

If possible data should be shifted sequentially into and out of (or at least
out of) the registers as a readout mux would be a resource hog. Using a

Thats why a BRAM is quite handy. Using both ports you can pull two datas out
in one cycle, compare, and write them back on a second cycle.

Regards
Falk

c d saunter · Jun 7, 2005

Falk Brunner ([email protected]) wrote:
:
: > Bubble sort should actually be quite fast - you can store all 48 values in
: > registers, then compare and swap if necessary odd pairs on odd clock
: cycles
: > and even pairs on even cycles. After 48 cycles the registers should hold
: > a sorted data set.

: This sounds like a almost full parallel approach. Could be quite fast, but
: also quite resouce hungry.

Yup. Unless the OP posts some details about desired performance it's impossible
to know which one is best...

: Thats why a BRAM is quite handy. Using both ports you can pull two datas out
: in one cycle, compare, and write them back on a second cycle.

Indeed. There is a nice intermediate level of parallelism availible by using
LUT RAMs to build units 16 words deep, each of which sort sequentially, with
every other set of sorts crossing the LUT RAM boundaries. This would also be
the most complicated case to code

cheers
cds

Kolja Sulimma · Jun 8, 2005

I wanted to sort 48 8bit unsigned numbers

As others suggested the big question is: How are you numbers presented?
If you have 48 bit serial inputs that present the data MSB first you can
build a systolic 48x6 array of units containing two LUTs and two
registers . As each cell connects only to three neighbours it can run at
extremely high clock rates. The throughput in your case is 7 words per
clock.

If your data arrives in words, one word at a time consider a systolic
priority queue. (Similar to parallel heap sort). The simplest version
has about 48x8 registers and can still run at fairly high speeds. The
throughput is one word per clock.

If your data is allready in RAM (on or of chip) sort as you would on a
CPU. Counting sort comes to mind. In your case it uses a single BRAM and
is guaranteed to complete in 256+2N clock cycles on a single BRAM port.
If you use both ports the number of cycles is halved.
3.7 clock cycles per word is not much better than other sort algorithms
for a small data set as yours, but the implementation of counting sort
is very simple and the number of clock cycles per word will actually
decrease if the number of words increases.

Kolja Sulimma

Stephane · Jun 8, 2005

a BRAM-based "software like" solution will surely be the most area
efficient;
anyway, for signal processing, John might need real time?

here is another solution to find the MEDIAN of M values of N bits.

inputs are stored in M registers. (You might want to shift them along
your sampled values)
design a M bits inputs, 1 bit output module. The output value is 1 if
the number of ones is greater than the number of zeroes.
Plug this on all MSb, the output is the MSb of the final result.

Now design quite the same module, but with Mx2+1 inputs. The Mx2 are
from your M values, starting with D_M(N downto N-1), the lonely input
'i' is the previous module's output.
for each pair, if i=D_M(N), keep D_M(N-1) as is.
if not, take not(D_M(N-1)).
plug the eventually modified M bits, and plug them into a M-->1 module
just like in step 1.
you instanciate N-1 of such modules, and chain them. N outputs make the
median value requested.

for small M, you can do it in one cycle. with 48, you'll get poor
performance, so you'll design a sequential counter, or pipeline the
modules. If you try, give feedback, it's interresting.

-------------------------------
Now consider this: the method was patented in 92 by Thomson. I don't
know if it is still pending... How can they dare patent such a method
for a given algorithm? Would you dare patenting for example a fast carry
adder??? I think it is a weak patent (haven't checked the claims),
because just an implementation of a well know algorithm, and an
anteriority shall be findable.

Weng Tianxiang · Jun 11, 2005

Hi,
Here you are for 'carry adder' key word in all patents title field:

PAT. NO. Title
1 5,045,681 Optoelectric ripple carry adder
2 4,931,981 Multi-place ripple-carry adder
3 4,899,305 Manchester carry adder circuit
4 4,839,849 Ripple-carry adder
5 4,704,701 Conditional carry adder for a multibit digital computer
6 4,675,838 Conditional-carry adder for multibit digital computer
7 4,660,165 Pyramid carry adder circuit

Weng Tianxiang · Jun 11, 2005

Stephane,
Can you give a digital example on the median value algorithm or any
references to its original paper.

Thank you.

Weng

[ANN] LOOPGEN-Fast hardware looping VHDL IPs	0	Jun 13, 2013
Onchip SRAM Vs Registers	1	Mar 25, 2005
How to clarify algorithm of hardware(HDL)?	2	Apr 16, 2006
JSA - Special Issue on Hardware/Software Co-Design	0	Mar 15, 2009
CFP - Journal of Systems Architecture, Embedded Software Design(Elsevier), Special Issue on Hardware	0	Aug 24, 2009
Problem with starting value during simulation	5	May 30, 2010
comp.lang.vhdl FAQ part 3 of 4: products & services	0	Jul 8, 2003
Duck Shoot Game	0	Aug 1, 2010

Fast/low area Sorting hardware.

john.deepu

Stephane

john.deepu

JJ

john.deepu

John_H

Falk Brunner

c d saunter

Falk Brunner

c d saunter

Kolja Sulimma

Stephane

Weng Tianxiang

Weng Tianxiang

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads