How much time does it take to sort 1 million random 64-bit/32-bit integers?


Weng Tianxiang

Hi,
Can I find a website that lists the time for sorting 1 million random
64-bit/32-bit integers using the fastest software sorting algorithm
on a PC?

Thank you.

Weng
 

Richard Heathfield

Weng Tianxiang said:
Hi,
Can I find a website that lists the time for sorting 1 million random
64-bit/32-bit integers using the fastest software sorting algorithm
on a PC?

Apparently not.

Note that, even if you could, such results would be highly
hardware-dependent (although the /relative/ ordering of various sorting
algorithms should be fairly consistent on any given platform). PCs vary
wildly in performance characteristics. The first PC I used was an Atari ST,
running at - what was it, 8 MHz? The first IBM-compatible I used was an
8088 running at - er, I forget, but it was in the 8-12MHz range, I think.
The first PC I /owned/ was an 80386-SX running at 20MHz. Whoa, that baby
was *fast*. Currently, I'm using a PIII which is probably a brazillion
times faster than the 386. The term "a PC" doesn't mean a lot when it comes
to benchmarking.

I suggest you just write up the algorithms yourself, or grab them off the
Net, and try them on your own hardware.

Beware of the results being corrupted by caching. Do not measure I/O times.
Do each test a few hundred times.
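For concreteness, a minimal C harness along those lines might look like the
sketch below. The library qsort() stands in for whichever algorithm you are
testing, and the run count and PRNG are arbitrary choices:

/* Minimal timing sketch: fill an array with pseudo-random 64-bit values,
   sort a fresh copy each run with qsort(), and time only the sort (no I/O),
   repeating to smooth out cache and scheduler noise. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <stdint.h>

#define N    1000000
#define RUNS 10

static int cmp_u64(const void *a, const void *b)
{
    uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;
    return (x > y) - (x < y);   /* avoids overflow of a plain subtraction */
}

int main(void)
{
    uint64_t *orig = malloc(N * sizeof *orig);
    uint64_t *work = malloc(N * sizeof *work);
    if (!orig || !work) return 1;

    srand((unsigned)time(NULL));
    for (size_t i = 0; i < N; i++)      /* stitch a 64-bit value from two rand() calls */
        orig[i] = ((uint64_t)rand() << 32) ^ (uint64_t)rand();

    double best = 1e30;
    for (int r = 0; r < RUNS; r++) {
        memcpy(work, orig, N * sizeof *work);   /* same input every run */
        clock_t t0 = clock();
        qsort(work, N, sizeof *work, cmp_u64);
        clock_t t1 = clock();
        double secs = (double)(t1 - t0) / CLOCKS_PER_SEC;
        if (secs < best) best = secs;
    }
    printf("best of %d runs: %.3f s\n", RUNS, best);
    free(orig);
    free(work);
    return 0;
}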
 

Jonathan Bromley

Weng Tianxiang said:
Can I find a website that lists the time for sorting 1 million random
64-bit/32-bit integers using the fastest software sorting algorithm

Weng,

Just about every "good" sorting algorithm has an execution time that
scales roughly as N.log(N) with the number of items to be sorted.
There are, of course, many tradeoffs about speed vs. space
vs. hardware cost, but for any given algorithm that's not too
far from right.

A million data items will fit into main memory easily on a modern PC,
so there's no need to worry about file I/O time.

Donald Knuth's book "Sorting and Searching" is about as good as
it gets on this topic.
--
Jonathan Bromley, Consultant

DOULOS - Developing Design Know-how
VHDL * Verilog * SystemC * e * Perl * Tcl/Tk * Project Services

Doulos Ltd., 22 Market Place, Ringwood, BH24 1AW, UK
(e-mail address removed)
http://www.MYCOMPANY.com

The contents of this message may contain personal views which
are not the views of Doulos Ltd., unless specifically stated.
 

Kolja Sulimma

Weng said:
Hi,
Can I find a website that lists the time for sorting 1 million random
64-bit/32-bit integers using the fastest software sorting algorithm
on a PC?

That depends on a lot of parameters, but I can give you one datapoint as
an example:
- Sorting 16777216 pairwise different 24-bit numbers takes no time at
all (that is every 24-bit value exactly once, so the sorted output is
already known).
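A rough C sketch of the corresponding bit-array "sort" (the function name is
my own; it also works when only some of the distinct values are present):

/* With pairwise-distinct keys from a small range (here 24 bits), one pass
   to mark presence in a bit array plus one sweep over the bit array yields
   the sorted order: O(N + 2^24) with a tiny constant. */
#include <stdlib.h>
#include <stdint.h>
#include <stddef.h>

#define RANGE (1u << 24)

/* Writes the sorted keys back into keys[]; assumes all keys are distinct
   and less than RANGE. */
void bitmap_sort_24(uint32_t *keys, size_t n)
{
    uint8_t *present = calloc(RANGE / 8, 1);    /* 2 MB bit array */
    if (!present) return;

    for (size_t i = 0; i < n; i++)              /* mark each key as present */
        present[keys[i] >> 3] |= (uint8_t)(1u << (keys[i] & 7));

    size_t out = 0;
    for (uint32_t v = 0; v < RANGE; v++)        /* emit set bits in order */
        if (present[v >> 3] & (1u << (v & 7)))
            keys[out++] = v;

    free(present);
}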

Kolja Sulimma
 

Weng Tianxiang

Hi,
Actually, I once read an article on a website that listed the timings
measured by the author, but I couldn't find it again.

For example, he listed times alongside the number of data items, the
clock frequency of his computer and all related parameters that might
affect performance; he also listed results for different algorithms for
comparison and drew some conclusions.

Thank you.

Weng
 

John_H

Weng Tianxiang said:
Hi,
Can I find a website that lists the time for sorting 1 million random
64-bit/32-bit integers using the fastest software sorting algorithm
on a PC?

Thank you.

Weng

Since you posted to the FPGA and VHDL newsgroups in addition to
programming, are you interested in an FPGA solution? Most of the
information that one *can* find on the web deals with serial
implementations. Parallel implementations for an FPGA, such as the
bitonic sort, can speed things up significantly.
 

JJ

Weng said:
Hi,
Can I find a website that lists the time for sorting 1 million random
64-bit/32-bit integers using the fastest software sorting algorithm
on a PC?

Thank you.

Weng

A post to c.p, c.a.f, c.l.v for 3 different sets of answers.

Wikipedia is probably a good place to start for a software-only solution.

It largely depends on overall memory performance, since very little work
is done relative to the cost of each memory access, and also on what you
really want to sort: whether the data is already mostly sorted or always
randomly ordered, and whether you are only counting frequencies rather
than doing a true sort.

Research quicksort and mergesort, plus radix sort and counting sorts
for the larger cases, and bubble or insertion sort for the tiny cases.
If you have Knuth's 3-volume set that will help; the material has been
repeated many times in every CS text. Most give you the answer in
terms of ideal machines using big-O notation. Knuth's reference books
were written in the 60s, when machines ran below 1 MIPS with memory in
the same ballpark as the CPU, so no caches.

For the PC case:
For very small sorts, say up to many thousands of items, the entire sort
can be done from the L1 and L2 caches and is likely to be a good test of
the CPU and cache system. Bigger sorts can use these smaller sorts in
succession with some form of merging. 1 million random items isn't even
remotely going to take longer than a blink, since L2 caches are big
enough to hold a good fraction of the 1M dataset.

On any *nix system you already have a basic quicksort on the command
line, but IIRC its running time is not stable for data that is already
nearly sorted. If you code up quicksort yourself, you can randomize the
input order to keep quicksort at its O(N log N) time; otherwise it can
degrade badly. Mergesort always takes a somewhat larger O(N log N) time
since it does more work, but it doesn't degrade with data patterns. A
quicksort with pre-randomization probably comes out about even.
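A sketch of that pre-randomization idea in C, with the library qsort()
standing in for your own quicksort and rand() as a good-enough source of
randomness:

/* Pre-randomization sketch: Fisher-Yates shuffle the array before handing
   it to a quicksort, so nearly-sorted input cannot provoke the quadratic
   worst case of a naive pivot choice. */
#include <stdlib.h>
#include <stdint.h>
#include <stddef.h>

static int cmp_u32(const void *a, const void *b)
{
    uint32_t x = *(const uint32_t *)a, y = *(const uint32_t *)b;
    return (x > y) - (x < y);
}

void shuffled_qsort_u32(uint32_t *a, size_t n)
{
    for (size_t i = n; i > 1; i--) {            /* Fisher-Yates shuffle */
        size_t j = (size_t)rand() % i;
        uint32_t tmp = a[i - 1]; a[i - 1] = a[j]; a[j] = tmp;
    }
    qsort(a, n, sizeof *a, cmp_u32);            /* stand-in for "your" quicksort */
}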

Now if computers still had a form of memory with the performance of
SRAM but the cost and size of DRAM, the answer would be much easier to
predict, and radix sort would come out handily, since its sort time
follows the number of partitioning passes over the word times the number
of items. Most radix sorts use byte digits, so that would mean 4 passes
of testing a byte and writing each item to 1 of 256 respective buckets,
which can get very large, especially if the data is mostly the same. If
you have enough of this flat memory to hold the data on input and
output, then you can do it in 2 passes with 65536 buckets, though that
will break modern PC caches.
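For illustration, a minimal LSD (least-significant-digit-first) radix sort
in C along those lines for 32-bit keys, with the 256 buckets realized as
counting/prefix-sum offsets into one scratch array rather than 256 physical
buckets:

/* LSD radix sort sketch: four passes over 8-bit digits. Each pass builds a
   histogram of the current digit, turns it into bucket start offsets, then
   scatters the keys stably into a scratch array. O(N) per pass. */
#include <stdlib.h>
#include <string.h>
#include <stdint.h>
#include <stddef.h>

void radix_sort_u32(uint32_t *a, size_t n)
{
    uint32_t *tmp = malloc(n * sizeof *tmp);
    if (!tmp) return;

    for (int shift = 0; shift < 32; shift += 8) {
        size_t count[256] = {0};

        for (size_t i = 0; i < n; i++)                 /* histogram this digit */
            count[(a[i] >> shift) & 0xFF]++;

        size_t offset = 0;                             /* prefix sums -> bucket starts */
        for (int d = 0; d < 256; d++) {
            size_t c = count[d];
            count[d] = offset;
            offset += c;
        }

        for (size_t i = 0; i < n; i++)                 /* stable scatter into tmp */
            tmp[count[(a[i] >> shift) & 0xFF]++] = a[i];

        memcpy(a, tmp, n * sizeof *a);                 /* copy back for next pass */
    }
    free(tmp);
}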

Since most PCs don't have SRAM-like memory and instead simulate it with
a cache hierarchy, the answer gets more complicated to predict. Fully
random uncached accesses into PC main memory are surprisingly slow,
towards several hundred nanoseconds. That's called the memory wall.

If you sort N randomly ordered items and sweep N from very small to
large to enormous, quicksort and the others follow O(N log N) time, but
with abrupt steps in the constant factor as the hardware works ever
harder to simulate flat memory access times. IIRC the constant can vary
by over a factor of 10; it isn't anywhere near fixed except on ideal
machines with something like RLDRAM or infinitely big caches. I believe
that mergesort will have a much more consistent constant factor.



For the FPGA case:
Each compare-and-swap could be done as an atomic operation needing 2
reads and possibly 2 conditional writes in merge sort. With a dual-port
memory that might take 1 or 2 clocks in a pipelined fashion for sorts
that fit into block RAM. Now if the data is spread over many sort
engines, the parallelism may make up for the much lower clock.

I might favor a 3-pass radix sort, since you could effectively use
block RAMs as temporary buckets. Basically, stream all the DRAM values
through to 2048 distinct buckets, keyed in turn on the upper, middle and
lower 11-bit fields. As each bucket fills, save it back to DRAM and
reset it to empty. The same can be done more slowly in 4 passes with 256
buckets, or even faster in 2 passes with 65536 buckets on a really big
FPGA. The sort time will approach 6M (or respectively 8M, 4M) memory
cycles at the fastest possible rate for just the keys. Some complexity
comes in managing the buckets so the output stream lands in seemingly
contiguous buckets.

If you are just doing a counting sort of keys limited to x distinct
values, things get much simpler: just funnel a memory read stream
through x counters. Does x have to be 2^32?
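A counting-sort sketch of that idea, assuming the keys are known to lie in
[0, range) for some modest range (the function name is my own):

/* Counting ("counter") sort sketch: one streaming pass to bump counters,
   one pass over the counters to emit the sorted output. Only sensible when
   range is modest; for full 32-bit keys the counter array alone would need
   2^32 entries. */
#include <stdlib.h>
#include <stdint.h>
#include <stddef.h>

void counting_sort_u32(uint32_t *a, size_t n, uint32_t range)
{
    size_t *count = calloc(range, sizeof *count);
    if (!count) return;

    for (size_t i = 0; i < n; i++)          /* funnel the read stream through counters */
        count[a[i]]++;

    size_t out = 0;
    for (uint32_t v = 0; v < range; v++)    /* emit each value count[v] times */
        for (size_t c = count[v]; c > 0; c--)
            a[out++] = v;

    free(count);
}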


John Jakson
transputer guy
 

Joe Wright

Weng said:
Hi,
Can I find a website that lists the time for sorting 1 million random
64-bit/32-bit integers using the fastest software sorting algorithm
on a PC?

Thank you.

Weng
About a half second on my machine. Is that what you really wanted to know?
 

Weng Tianxiang

Hi,
What I really want to know is whether the following formula for the
best sorting time is correct (it is copied from Knuth's "Sorting and
Searching") and not too far from what a real PC would do:
14.5 * N * (lg N).

There are about 20 algorithms, and I really don't know which formula
should be selected as representative of the best software algorithm
running on 1 million random data items.

14.5 * N * (lg N) is likely the best one.

Thank you.

Weng
 

Oliver Wong

Weng Tianxiang said:
Hi,
Can I find a website that lists the time for sorting 1 million random
64-bit/32-bit integers using the fastest software sorting algorithm
on a PC?

Here's a ballpark figure:

Let n = 1000000

n*log2(n) / 3 GHz = 6.64385619 milliseconds
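Spelling out the assumption behind that figure (one comparison per clock
cycle at 3 GHz, which is of course optimistic), here is a trivial C check
of the arithmetic:

#include <stdio.h>
#include <math.h>

int main(void)
{
    double n = 1e6;
    double ops = n * log2(n);               /* ~19.93 million comparisons */
    printf("%.2f ms\n", ops / 3e9 * 1e3);   /* ~6.64 ms at one comparison per cycle */
    return 0;
}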

- Oliver
 

Dann Corbit

Weng Tianxiang said:
Hi,
What I really want to know is whether the following formula for the
best sorting time is correct (it is copied from Knuth's "Sorting and
Searching") and not too far from what a real PC would do:
14.5 * N * (lg N).

There are about 20 algorithms, and I really don't know which formula
should be selected as representative of the best software algorithm
running on 1 million random data items.

14.5 * N * (lg N) is likely the best one.

That will not be optimal.

Integers would probably be a good candidate for Knuth's 'merge insertion'
algorithm.

This article, "Implementing HEAPSORT with (n log n - 0.9n) and QUICKSORT
with (n log n + 0.2n) comparisons",

may be of some interest to you. He also supplies the source code and driver
program for the article. It would be fairly simple to test on your
platform, I imagine.

In addition, this algorithm (especially for integer data type) may be of
special interest for your particular case:
http://www.nada.kth.se/~snilsson/public/code/nloglogn.c

Of course, it is a distribution based sort (as a comparison sort operating
that quickly is provably impossible).
 

Weng Tianxiang

Hi Wong,
n*log2(n) is roughly the number of comparison operations that must be
done by a comparison sorting algorithm.

On top of that there is overhead to run the loop and to swap the
compared items into place.

That is the reason why the formula has a 14.5 coefficient.

Weng
 

Logan Shaw

Weng said:
What I really want to know is that the following formula for best
sorting timing is correct (it is copied from Knuth's "Sorting and
Searching") and it is not too far away from the PC computer reality:
14.5* N * (lg N).

No, the formula is not correct. The formula does not have any units;
therefore, the constant factor is meaningless and the formula cannot
be correct.

Also, it's not clear whether lg() is a natural logarithm (the usual
symbol for that is ln()) or a base-2 logarithm or some other base.
So that adds in a second unknown constant factor.

Consider, for a moment, if you sort 1 million numbers, each 32 bits
in size. Should I let N = 1 million? If so, then when I evaluate
that formula, I get some number like 200324903, or about 200 million.
But what does this number mean? Is it clock cycles? Microseconds?
Memory fetches? Microjoules of electrical energy used to power
the computer while it performs this computation?

- Logan
 

Thomas Stanka

Hi,

Weng said:
What I really want to know is whether the following formula for the
best sorting time is correct (it is copied from Knuth's "Sorting and
Searching") and not too far from what a real PC would do:
14.5 * N * (lg N).

There are about 20 algorithms, and I really don't know which formula
should be selected as representative of the best software algorithm
running on 1 million random data items.

14.5 * N * (lg N) is likely the best one.

That's totally wrong for 1 million 32-bit integers. In a hardware group
you should be aware of space-time tradeoffs.
A simple FPGA with enough memory (4 Gbit) would need only O(N) to sort
that many numbers, just by storing each one at the address given by its
value. You may need to handle folding together numbers with the same
value; that would increase your memory to a maximum of 120 Gbit (a
30-bit counter per value gives a 2^32 x 30 memory) and would increase
your runtime from O(N) to O(N)+1 for a pipelined lookup.

Knuth is applicable for really large N, but nowadays we should consider
1 million numbers as small.

bye Thomas
 

Kolja Sulimma

Weng said:
Hi Wong,
n*log2(n) is roughly the number of comparison operations that must be
done by a comparison sorting algorithm.

Forget about comparisons on PC hardware. You are interested in the
number of loads/stores, because these are a lot slower than comparisons.
A typical comparison sort will have two loads and two stores per
comparison.

Radix sort with 11-bit digits requires three O(N) passes for a total of
about 12M loads and 6M stores.
If your cache is big enough you could also do only two passes using
16-bit digits.

Kolja Sulimma
 

robertwessel2

Weng said:
Hi,
What I really want to know is whether the following formula for the
best sorting time is correct (it is copied from Knuth's "Sorting and
Searching") and not too far from what a real PC would do:
14.5 * N * (lg N).

There are about 20 algorithms, and I really don't know which formula
should be selected as representative of the best software algorithm
running on 1 million random data items.

14.5 * N * (lg N) is likely the best one.


If you're referring to the table I think you are, you're completely
misinterpreting it. Knuth compares a number of different internal sort
algorithms, and gives average and worst case timing formulas based on
MIX assembler implementations of the algorithms.

So you can, at best, get estimates of the average and maximum runtimes
for a MIX implementation. And then you need to define "best." The
table shows a number of algorithms that have better time complexity
than (14.5n*lg n), although they may impose requirements on the input
data or require extra space. Do you define best as average or maximum
run times, stable or not, using minimum memory or not...?

As others pointed out, there are plenty of sorting algorithms which are
not bound by the (n*lg n) canard; it's just that they tend not to be as
useful as the old standbys.

FWIW, for 32 or 64 bit integers, some extra memory (a bit more than
double what you need to hold the data) and a high-order radix sort will
trivially lead to O(N). Is that "best" in some scenario?
 

Weng Tianxiang

Hi Kolja,
I am talking about real-world measurements, not theoretical ones:
1. The input data are random;
2. The input data are 32-bit or 64-bit;
3. The sorting must be stable.

Radix sort cannot be applied to my case: how can you sort by 11 bits for
a data set of 32-bit values? Imagine a set of data that all share the
same lowest 11 bits but differ in the upper 21 bits.

Now I have a clearer idea of what I really want:
As indicated by someone above, 14.5*N*(lg N) is for Knuth's MIX machine
and is not useful for estimating real PC performance.

So now I want to know the coefficient in the following formula for the
real PC world, replacing Knuth's figure:
Coefficient * N * (lg N) for a real PC environment.

PC clock frequency and cache size are not my concern, because once a
coefficient is given, a PC's frequency and cache size can be factored
into the result, and everyone knows their relations.

If a table of coefficients were available, indexed by frequency and
cache size, it would be convenient for everyone to get a feel for how
fast a sorting algorithm is.

Thank you.

Weng
 

Thomas Stanka

Logan said:
No, the formula is not correct. The formula does not have any units;
therefore, the constant factor is meaningless and the formula cannot
be correct.

Maybe you should google a bit for Landau symbols and complexity
theory.
In general, complexity theory ignores constants, but for more detailed
(or practically oriented) analyses they are still used.

bye Thomas
 

Oliver Wong

Thomas Stanka said:
Maybe you should google a bit for Landau symbols and complexity
theory.
In general, complexity theory ignores constants, but for more detailed
(or practically oriented) analyses they are still used.

I think the point Logan is making is that the original question was "How
long will this algorithm take to run?" and someone else said (essentially)
"200 million." To which Logan responds "200 million what? Milliseconds?
Clock cycles? Something else?"

Therefore, "200 million" is an "incorrect" answer in that it does not
actually answer the question asked.

- Oliver
 

Kolja Sulimma

Weng said:
Hi Kolja,
I am talking about real-world measurements, not theoretical ones:
1. The input data are random;
2. The input data are 32-bit or 64-bit;
3. The sorting must be stable.

Radix sort cannot be applied to my case: how can you sort by 11 bits for
a data set of 32-bit values? Imagine a set of data that all share the
same lowest 11 bits but differ in the upper 21 bits.

Maybe this would be the time to read up on how radix sort works?
Hint: Try Google.

Radix sort is stable. It can be applied to all sorting problems; it just
might not be the fastest choice. E.g. unlike other methods it will run
at half the speed for 64-bit data compared to 32-bit.
Hint: In my post I spoke about three passes with 11-bit digits or two
passes with 16-bit digits. How do these relate to the 32 bits of the
values?

PC clock frequency and cache size are not my concern, because once a
coefficient is given, a PC's frequency and cache size can be factored
into the result, and everyone knows their relations.
Definitely not.
Cache access time is access-pattern dependent.

Kolja Sulimma
 
