indirect sort

T

tnorgd

Dear Group,

I would like to sort an array of object in respect to one of the
fields, say of type double. I wrote a Comparator (I don't want to use
Comparable in order to be able to choose the field I am sorting by)
and ... it seems to be slow. I checked with profiler and my
Comparator.compare() method takes most of the execution time (well,
this sounds obvious). But I pretty sure that sorting double[] would be
much faster than sorting MyObject[] (and calling compare()). Is there
a way to create a double[] array of my fields I want to sort by, sort
that array and then recover the permutation?

For example GNU Scientific Library has a method gsl_heapsort_index()
which doesn't actually sort a given array, but creates a permutation
(which is of a type int[]) which says what is order of my objects to
make them sorted. I think this is exactly what I need. Do I have to
write it on my own or somebody has already did it?


Best regards,
Dominik
 
K

Knute Johnson

Dear Group,

I would like to sort an array of object in respect to one of the
fields, say of type double. I wrote a Comparator (I don't want to use
Comparable in order to be able to choose the field I am sorting by)
and ... it seems to be slow. I checked with profiler and my
Comparator.compare() method takes most of the execution time (well,
this sounds obvious). But I pretty sure that sorting double[] would be
much faster than sorting MyObject[] (and calling compare()). Is there
a way to create a double[] array of my fields I want to sort by, sort
that array and then recover the permutation?

For example GNU Scientific Library has a method gsl_heapsort_index()
which doesn't actually sort a given array, but creates a permutation
(which is of a type int[]) which says what is order of my objects to
make them sorted. I think this is exactly what I need. Do I have to
write it on my own or somebody has already did it?


Best regards,
Dominik

I doubt that will improve the speed of the sort significantly if at all.
Maybe you could post your class and comparator code.
 
T

Tom Anderson

I would like to sort an array of object in respect to one of the fields,
say of type double. I wrote a Comparator (I don't want to use Comparable
in order to be able to choose the field I am sorting by) and ... it
seems to be slow. I checked with profiler and my Comparator.compare()
method takes most of the execution time (well, this sounds obvious). But
I pretty sure that sorting double[] would be much faster than sorting
MyObject[] (and calling compare()). Is there a way to create a double[]
array of my fields I want to sort by, sort that array and then recover
the permutation?

There's nothing like that it standard java.

As i'm sure you've realised, you could write something of your own without
too much trouble - just do a normal sort, but keep a side array of the
objects, intitially set to the range from 0 to N-1, and every time you
swap doubles, swap objects too. It would be a chore to write, but not a
nightmare.

Alternatively, build a hashmap from doubles to objects, put the doubles in
an array, sort, then build the resulting object array by putting the
doubles through the map. This would be disguisting, but would work as long
as there were no duplicate doubles (and if there were, you could handle
that without too much trouble).

tom
 
M

Mark Space

For example GNU Scientific Library has a method gsl_heapsort_index()
which doesn't actually sort a given array, but creates a permutation


I'm with Knute. There's no reason to suspect that this would be faster
than the built in Array.sort(), which only moves references.

Post up a SSCCE of your program and we'll take a look at, there could be
some issue I suppose.

http://sscce.org/
 
T

tnorgd

So I did the actual test. The code is pasted below. For me it looks
like the version that iterates through the array of objects, extracts
double data, puts it into an array and sorts this array is faster than
just sorting the objects. About 4-5 times faster. It looks like it is
worthy to have my own implementation of sort(), that swaps integers in
a second array while sorting the double[]

Dominik

PS. I would think that the test favors slightly sorting objects, i.e.
in my real case sorting double will be relatively even faster. The
argument for that is the following: Now both array are filled in a
single loop. It is likely that objects are located close to each
other, which in the real case won't happen. The array of doubles will
always be a solid chunk of memory, allocated and filled just before
sorting. But I am not sure about that.

//----------------------------- TestArraySort.java -------------------
import java.util.Arrays;
import java.util.Comparator;
import java.util.Random;

public class TestArraySort {

public static void main(String[] args) {

for(int k = 0;k < 10;k++) { // Let's make the test 10 times to
gather some statistics
// Prepare data for the test
int N = 300000;
DataRecord[] objects = new DataRecord[N];
double[] justDoubles = new double[N];
for(int i = 0;i < N;i++)
objects = new DataRecord();


//copy double fields and sort these
long t3 = System.nanoTime();
for(int i = 0;i < N;i++)
justDoubles = objects.d;
Arrays.sort(justDoubles);
long t4 = System.nanoTime();

// Sort objects
long t1 = System.nanoTime();
Arrays.sort(objects,new DataComparator());
long t2 = System.nanoTime();

System.err.printf("%f %f\n",(t2 - t1) / 1000000000.0,(t4 - t3) /
1000000000.0);
}
}

public static void sortObjects(DataRecord[] records) {

Arrays.sort(records,new DataComparator());
}

// The comparator used to sort objects
public static class DataComparator implements Comparator<DataRecord>
{

public int compare(DataRecord d1, DataRecord d2) {

return Double.compare(d1.d,d2.d);
}
}

// Data type for objects being sorted. Strings are to make objects
"bulky"
public static class DataRecord {

public final String firstLine;
public final String secondLine;
public final double d;
public final int i1;
public final int i2;

public DataRecord() {

for(int i = 0;i < 20;i++) {
s1 = chars[gen.nextInt(chars.length)];
s2 = chars[gen.nextInt(chars.length)];
}
firstLine = new String(s1);
secondLine = new String(s1);
i1 = gen.nextInt(10000);
i2 = gen.nextInt(10000);
d = gen.nextDouble();
}

private static final Random gen = new Random();

private static final char[] s1 = new char[20];
private static final char[] s2 = new char[20];
}

private static final char[] chars =
("ACDEFGHIJKLMNOPQRSDTUWVXYZ1234567890!@#$%^&*())
qwertyuiopplkjhgfdsazxcvbnm,.")
.toCharArray();
}
 
L

Lew

Tom said:
Alternatively, build a hashmap from doubles to objects,

Strictly speaking, that would be a map from Doubles to Objects (or better,
from Doubles to the type under sort), but autoboxing will mask that somewhat.
 
R

Roedy Green

I would like to sort an array of object in respect to one of the
fields, say of type double. I wrote a Comparator (I don't want to use
Comparable in order to be able to choose the field I am sorting by)
and ... it seems to be slow. I checked with profiler and my
Comparator.compare() method takes most of the execution time (well,
this sounds obvious). But I pretty sure that sorting double[] would be
much faster than sorting MyObject[] (and calling compare()). Is there
a way to create a double[] array of my fields I want to sort by, sort
that array and then recover the permutation?

If you sort an array of double vs Double it will sort the data rather
than an array of pointers to the objects. I would expect double to be
faster than Double, the reverse of what you anticipate.

See http://mindprod.com/jgloss/sort.html
--
Roedy Green Canadian Mind Products
http://mindprod.com

"It wasn’t the Exxon Valdez captain’s driving that caused the Alaskan oil spill. It was yours."
~ Greenpeace advertisement New York Times 1990-02-25
 
L

Lew

So I did the actual test. The code is pasted below. For me it looks
like the version that iterates through the array of objects, extracts
double data, puts it into an array and sorts this array is faster than
just sorting the objects. About 4-5 times faster. It looks like it is
worthy to have my own implementation of sort(), that swaps integers in
a second array while sorting the double[]

     For what it's worth, I ran your test on my system and got
similar results: Arrays.sort(double[]) took about 1/4 the time
of Arrays.sort(Whatever[],Comparator).  I guess that's the price
you pay for the flexibility of a Comparator versus a hard-wired
comparison buried inside the sorting code.

More likely it's the difference between primitive comparisons and
object comparisons based on attributes of the type. A 'double'
comparator can simple subtract one argument from the other and cast
the result to 'int'. Attribute comparisons generally involve multiple
'if' tests.

More telling would be a comparison of Arrays.sort( Whatever /* extends
Comparable */) to Arrays.sort( Whatever, Comparator ), where the
innate Whatever#compareTo() and the Comparator#compare() implement the
same ordering.

I wonder how it would affect the speed if the Comparator implemented
the double subtraction directly instead of doing

public static class DataComparator implements Comparator<DataRecord>
{
public int compare(DataRecord d1, DataRecord d2) {
return Double.compare(d1.d,d2.d);
}
}

Also, was this timing test done with the "-server" option to Java?
Benchmark runs without a complete disclosure of benchmark conditions
are pretty useless.

Also also, I note that there's no warmup run to let Hotspot do its
magic. Does that reflect the usage of the class in the field?

Also also also, the timing test does not account for the time to
access the 'DataRecord' array in the resulting desired sort order, but
leaves off after sorting the copied array. This is not helpful for
the usual case where one needs the array of actual objects in a
desired order.

However, for the limited use case where an object is a thin wrapper
for primitives and one only needs the unwrapped result of the sort,
where the sort is shown by profiling to be an actual bottleneck in a
situation where every last nanometer per second of speed is needed,
and code complexity and maintainability are less important
considerations, this technique could be a useful micro-optimization if
used very judiciously.
 
L

Lew

Eric said:
     I hope you're not in the habit of comparing doubles this way,
as it is faulty in several respects.  For example, it declares
that 5.9 and 6.2 are equal, that 3.5E9 is less than 1.0E9, and so on.
I don't know what it would do with infinities and NaNs, but there's
a pretty good chance it would botch them.

Right you are. Oops.
 
T

Tom Anderson

So do CPUs not have instructions that compare doubles directly?
Right you are. Oops.

I assumed that by 'cast' you meant 'reinterpret', which is java is spelt
'doubleBitsToLong' (and then do a shift down to 32 bits), since the top
bit of an IEEE floating-point number is the sign bit, and when
reinterpreted as an int, will be the negatively-valued MSB, so if the
double is positive, and the sign bit 0, the integer'll be positive, and if
the double is negative, and the sign bit is 1, the integer'll be negative.

Since the floating-point zero is all zeroes, that will come out as zero,
as it should.

Negative zero will come out as a negative integer, which is also correct,
i think - the only way to get -0 from a subtraction is from (-0 - +0),
which you do want to come out negative, since -0 < +0.

The only comparisons for which this doesn't work are ones where the
difference is positive, and so small that it's denormal and enough leading
bits of the mantissa are zero that when you shift down to 32 bits, you
lose all the ones, and the comparison comes out as zero, even though the
inputs were different. This won't happen for floats. You *might* be able
to protect yourself from this happening by casting the doubles to floats
before subtracting - i don't know what happens to really small denormal
doubles when you cast to float. Do they go to zero, or do they become the
smallest float denormal?

Obviously all bets are off in the presence of NaNs, but the solution there
is not to have NaNs.

tom
 
T

Tom Anderson

So I did the actual test. The code is pasted below. For me it looks
like the version that iterates through the array of objects, extracts
double data, puts it into an array and sorts this array is faster than
just sorting the objects. About 4-5 times faster. It looks like it is
worthy to have my own implementation of sort(), that swaps integers in
a second array while sorting the double[]

For what it's worth, I ran your test on my system and got
similar results: Arrays.sort(double[]) took about 1/4 the time
of Arrays.sort(Whatever[],Comparator). I guess that's the price
you pay for the flexibility of a Comparator versus a hard-wired
comparison buried inside the sorting code.

That's part of it - Arrays.sort(Object[], Comparator) involves making one
interface method call per comparison, plus whatever work the comparison
actually does, whereas comparing two doubles is one machine instruction,
or at most a handful of them. There's also cache density to think about,
doubles being much smaller than DataRecords. And bear in mind that
Dominik's little test (which i know nobody is claiming is truly
definitive) sorts the doubles, but doesn't attempt to transfer that
sorting to any objects; adding the code to also swap around a paralle
pointer array would probably slow it down, albeit only fractionally.

But probably the biggest reason is that they're different algorithms: the
object sort is a mergesort, and the double sort is a quicksort. Java
chooses mergesort for arbitrary objects, because it's stable, and
quicksort for primitives, because it's fast, and since primitives which
compare equal are indistinguishable even in principle, stability doesn't
matter.

So, a fairer (in some sense) comparison would be between two custom sorts,
both using the quicksort code from OpenJDK, one of which sorts a double[]
and swaps a pointer array in parallel, and another which sorts an array of
pointers, and doesn't have to make virtual method calls to do it (being
hardwired to sort DataRecords). I'd suggest counting the time taken to set
up the double[] in the time for the former, unless the OP has some cunning
way to keep that array ready before the sort.

tom
 
T

Tom Anderson

Strictly speaking, that would be a map from Doubles to Objects (or
better, from Doubles to the type under sort), but autoboxing will mask
that somewhat.

Actually, i think Doubles to objects is the most precise, since as you
say, the values wouldn't necessarily be Objects, although they would have
to be objects!

tom
 
M

Mark Space

Eric said:
If you can assume no NaNs

If you look at the source for Arrays.sort( double ), one of the first
things it does is scan the array for NaN, push them to the end, and then
sort the array from 0 to length-#_of_NaNs. I just thought I'd point
that out.
 
A

Arne Vajhøj

Tom said:
So do CPUs not have instructions that compare doubles directly?

Some do. Examples:

IA-32 : FCOM instruction
VAX : CMPF/CMPD/CMPG instructions
Alpha : CMPGEQ/CMPTEQ instructions

My guess is that most do.

Arne
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,774
Messages
2,569,598
Members
45,152
Latest member
LorettaGur
Top