Weng said:
Hi,
Can I find a website that lists the time for sorting 1 million
64-bit/32-bit random values using the fastest software sorting
algorithm on a PC?
Thank you.
Weng
A post to c.p, c.a.f, c.l.v for 3 different sets of answers.
Wikipedia is probably a good place to start for a software-only solution.
It largely depends on overall memory performance, since very little work
is done per memory access, and also on what you really want to sort:
whether the data is already mostly sorted or always in random order, and
whether you only need frequency counts rather than a true sort.
Read up on quicksort and mergesort, plus radix sort and counting sorts
for the larger cases, and bubble or insertion sort for the tiny cases.
If you have Knuth's 3-volume set, that will help; the material has been
repeated many times in every CS text. Most give you the answer in
terms of ideal machines using big-O notation. Knuth's reference books
were written in the 60s, when machines ran below 1 MIPS with memory in
the same ballpark as the CPU, so no caches.
For the PC case.
For very small sorts, say up to many thousands of items, the entire sort
can be done from the L1 and L2 caches and is likely to be a good test of
the CPU and cache system. Bigger sorts can use these smaller sorts in
succession with some form of merging. Sorting 1 million random items
isn't even remotely going to take longer than a blink, since L2 caches
are big enough to hold a good fraction of the 1M dataset.
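For those tiny cache-resident runs, insertion sort is about as simple as
it gets; a minimal C sketch (the function name is mine, just for
illustration):

#include <stddef.h>
#include <stdint.h>

/* Insertion sort: fine for small runs that sit entirely in L1. */
static void insertion_sort(uint32_t *a, size_t n)
{
    for (size_t i = 1; i < n; i++) {
        uint32_t key = a[i];
        size_t j = i;
        while (j > 0 && a[j - 1] > key) {
            a[j] = a[j - 1];   /* shift larger items right */
            j--;
        }
        a[j] = key;
    }
}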
On any *nix system you already have a basic quicksort on the command
line, but IIRC it's not well behaved for data that is already nearly
sorted. If you code up quicksort, you can randomize the input order (or
the pivot choice) so that quicksort almost surely stays at O(N log N)
time; otherwise it can degrade badly, to O(N^2) in the worst case.
Mergesort carries a bigger constant on its O(N log N) time since it does
more work, but it doesn't degrade with data patterns. A quicksort with
pre-randomization probably comes out even.
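Here's a rough C sketch of quicksort with a randomized pivot, one way to
get that pre-randomization effect (names and structure are mine, not any
particular library's):

#include <stdint.h>
#include <stdlib.h>

static void swap_u32(uint32_t *x, uint32_t *y)
{
    uint32_t t = *x; *x = *y; *y = t;
}

/* Quicksort with a random pivot: the random choice makes the O(N^2)
   worst case vanishingly unlikely whatever the input order.
   Call as: quicksort_rand(a, 0, (long)n - 1); */
static void quicksort_rand(uint32_t *a, long lo, long hi)
{
    while (lo < hi) {
        long p = lo + rand() % (hi - lo + 1);  /* random pivot index */
        swap_u32(&a[p], &a[hi]);
        uint32_t pivot = a[hi];
        long i = lo;
        for (long j = lo; j < hi; j++)         /* Lomuto partition */
            if (a[j] < pivot)
                swap_u32(&a[i++], &a[j]);
        swap_u32(&a[i], &a[hi]);
        if (i - lo < hi - i) {                 /* recurse on the smaller
                                                  side, loop on the rest */
            quicksort_rand(a, lo, i - 1);
            lo = i + 1;
        } else {
            quicksort_rand(a, i + 1, hi);
            hi = i - 1;
        }
    }
}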
Now if computers still had a form of memory with the performance of
SRAM but the cost and size of DRAM, the answer would be much easier to
predict, and radix sort would come out handily, since its sort time
follows (no. of partitions performed on the word) * (no. of items).
Most radix sorts use byte-wide digits, so 32-bit keys would mean 4
passes of testing a byte and writing each item to 1 of 256 respective
buckets, which can get very large, especially if the data is mostly the
same value. If you have enough of this flat memory to hold the data on
both input and output, then you can do it in 2 passes with 65536
buckets, though that will break modern PC caches.
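A sketch of the 4-pass byte version for 32-bit keys, in C; the second
buffer stands in for the flat output memory:

#include <stdint.h>
#include <string.h>

/* LSD radix sort of n 32-bit keys: 4 passes, one byte per pass,
   256 buckets per pass.  tmp must hold n keys. */
static void radix_sort_u32(uint32_t *a, uint32_t *tmp, size_t n)
{
    for (int shift = 0; shift < 32; shift += 8) {
        size_t count[256] = {0};

        for (size_t i = 0; i < n; i++)      /* histogram this byte */
            count[(a[i] >> shift) & 0xFF]++;

        size_t offset = 0;                  /* bucket start offsets */
        for (int b = 0; b < 256; b++) {
            size_t c = count[b];
            count[b] = offset;
            offset += c;
        }

        for (size_t i = 0; i < n; i++)      /* scatter into buckets */
            tmp[count[(a[i] >> shift) & 0xFF]++] = a[i];

        memcpy(a, tmp, n * sizeof *a);      /* next pass reads a again */
    }
}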
Since most PCs don't have SRAM-like memory but instead simulate it with
a cache hierarchy, the answer gets more complicated to predict. Fully
random uncached accesses into PC main memory are surprisingly slow, up
towards several hundred ns. That's the so-called memory wall.
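You can measure that latency yourself with a pointer chase; a rough
sketch (the 256 MB working set is arbitrary, and Sattolo's shuffle
forces one long random cycle so the caches and prefetchers can't help):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void)
{
    size_t n = (256u << 20) / sizeof(size_t);   /* ~256 MB of indices */
    size_t *next = malloc(n * sizeof *next);

    for (size_t i = 0; i < n; i++)
        next[i] = i;
    for (size_t i = n - 1; i > 0; i--) {        /* Sattolo's shuffle:
                                                   one n-long cycle */
        size_t j = (size_t)rand() % i;          /* crude, but enough */
        size_t t = next[i]; next[i] = next[j]; next[j] = t;
    }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    size_t p = 0;
    for (size_t i = 0; i < n; i++)              /* each load waits on
                                                   the previous one */
        p = next[p];
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("%.1f ns per dependent access (p=%zu)\n", ns / n, p);
    free(next);
    return 0;
}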
If you sort N randomly ordered items and sweep N from very small to
large to enormous, quicksort and the others follow O(N log N) time, but
with abrupt steps in the constant factor that grow as the hardware works
ever harder to simulate flat memory access times. IIRC that constant can
vary by over a factor of 10; it isn't anywhere near fixed except on
ideal machines with something like RLDRAM or infinitely big caches. I
believe that mergesort will have a much more constant factor.
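To see those steps yourself, time a sort while sweeping N and divide out
the N log N factor; a rough harness using the library qsort (the sweep
range is my arbitrary choice):

#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static int cmp_u32(const void *x, const void *y)
{
    unsigned a = *(const unsigned *)x, b = *(const unsigned *)y;
    return (a > b) - (a < b);
}

int main(void)
{
    for (size_t n = 1u << 10; n <= 1u << 24; n <<= 2) {
        unsigned *a = malloc(n * sizeof *a);
        for (size_t i = 0; i < n; i++)
            a[i] = (unsigned)rand() ^ ((unsigned)rand() << 16);

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        qsort(a, n, sizeof *a, cmp_u32);
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double ns = (t1.tv_sec - t0.tv_sec) * 1e9
                  + (t1.tv_nsec - t0.tv_nsec);
        /* watch this figure step up as N outgrows L1, then L2,
           then main memory friendliness */
        printf("N=%8zu  %.2f ns per N log2 N\n",
               n, ns / (n * log2((double)n)));
        free(a);
    }
    return 0;
}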
For the FPGA case.
Each compare-and-swap could be done as an atomic operation needing 2
reads and possibly 2 conditional writes in mergesort. With a dual-port
memory that might take 1 or 2 clocks in pipelined fashion for sorts
that fit into BlockRAM. Now if the data is spread over many sort
engines, the parallelism may make up for the much lower clock rate.
I might favor a 3-pass radix sort, since you could effectively use
BlockRAMs as temporary buckets. Basically stream all the DRAM values
through to 2048 distinct buckets, selected in turn by the upper, middle,
and lower 11-bit fields. As each bucket fills, save it back to DRAM and
reset it to empty. The same can be done more slowly as 4 passes with 256
buckets, or even faster as 2 passes with 65536 buckets on a really big
FPGA. Since each pass reads and writes every key once, the sort time
will approach 6M (or respectively 8M, 4M) memory cycles at the fastest
possible rate for just the keys. Some complexity comes in managing the
buckets in the output stream so they end up seemingly contiguous.
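In software terms, one pass of that bucket-streaming scheme might be
modeled like this (a sketch only: the 512-entry bucket depth stands in
for a BlockRAM, and an up-front counting pass plays the part of whatever
the hardware would do to pre-allocate each bucket's DRAM region):

#include <stdint.h>
#include <string.h>

#define DIGITS 2048            /* 2^11 buckets, one per 11-bit field */
#define DEPTH  512             /* entries per on-chip bucket         */

/* One radix pass: extract an 11-bit field, gather keys into small
   buckets, flush each bucket to its region of "DRAM" when it fills. */
static void radix_pass_11(const uint32_t *in, uint32_t *out,
                          size_t n, int shift)
{
    static size_t start[DIGITS];          /* region offsets per digit */
    static uint32_t bucket[DIGITS][DEPTH];
    static size_t fill[DIGITS];

    memset(start, 0, sizeof start);
    memset(fill, 0, sizeof fill);

    for (size_t i = 0; i < n; i++)        /* counting pass */
        start[(in[i] >> shift) & (DIGITS - 1)]++;
    size_t off = 0;
    for (int d = 0; d < DIGITS; d++) {
        size_t c = start[d]; start[d] = off; off += c;
    }

    for (size_t i = 0; i < n; i++) {      /* streaming pass */
        int d = (in[i] >> shift) & (DIGITS - 1);
        bucket[d][fill[d]++] = in[i];
        if (fill[d] == DEPTH) {           /* bucket full: flush */
            memcpy(out + start[d], bucket[d], DEPTH * sizeof **bucket);
            start[d] += DEPTH;
            fill[d] = 0;
        }
    }
    for (int d = 0; d < DIGITS; d++)      /* flush partial buckets */
        if (fill[d])
            memcpy(out + start[d], bucket[d], fill[d] * sizeof **bucket);
}

Three calls with shift = 0, 11, 22, ping-ponging between the two
buffers, give the full sort of 32-bit keys.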
If you are just doing a counting sort of keys limited to x distinct
values, things get much simpler: just funnel a memory read stream
through x counters. Does x have to be 2^32?
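For small x the whole thing is just a histogram plus a rewrite; a C
sketch, assuming the keys lie in [0, x) and the counters fit in memory:

#include <stdint.h>
#include <stdlib.h>

/* Counting sort of n keys known to lie in [0, x): one read pass
   through x counters, one pass to rewrite in order. No compares. */
static int counting_sort(uint32_t *a, size_t n, uint32_t x)
{
    size_t *count = calloc(x, sizeof *count);
    if (!count)
        return -1;
    for (size_t i = 0; i < n; i++)       /* funnel the read stream */
        count[a[i]]++;                   /* through the x counters  */
    size_t k = 0;
    for (uint32_t v = 0; v < x; v++)
        for (size_t c = count[v]; c > 0; c--)
            a[k++] = v;                  /* emit v, count[v] times */
    free(count);
    return 0;
}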
John Jakson
transputer guy