Return all points with x km of y

I

Ian Pattison

Hi,

I'm trying to put together what I think is a fairly complex database query.

I have a table of about 32,000 latitiude and longitude pairs and I need to
be able to query for two different things:

1. All points that are within X kilometers (or miles for that matter) of
coordinate Y (both are supplied from an HTML form).
2. The nearest X points to coordinate Y (again, supplied by a form).

The first should be simple enough... I already know how to measure the
distance between two points (once I translate all that trig into perl) so
all I need to do is run the query and discard any result that's outside the
circle. The second is tougher. Does anyone know how to sort a list like
that?

Thanks,

Ian
 
B

Bob Walton

Ian said:
Hi,

I'm trying to put together what I think is a fairly complex database query.


Question about your terminology: by the term "database query", do you
mean something written in SQL requesting data from a relational
database? That is what that term generally means to computer
programmers. I'm no SQL expert, but queries of the form you mention are
generally not SQL-ish things. So I'll assume your "database" is
something more like a flat file consisting of two numbers per record.

I have a table of about 32,000 latitiude and longitude pairs and I need to
be able to query for two different things:

1. All points that are within X kilometers (or miles for that matter) of
coordinate Y (both are supplied from an HTML form).


For such computations, one thing which may be done is to recognize that
squared distance is monotonically related to distance, and may be
interchangeably used with distance for comparison purposes. Thus, one
never needs to actually compute the distance, just the squared distance,
which saves one a repetitive square root computation. There may be
other monotonic relationships in the calculations involving longitude
and latitude which can be similarly exploited for efficiency.

2. The nearest X points to coordinate Y (again, supplied by a form).

The first should be simple enough... I already know how to measure the
distance between two points (once I translate all that trig into perl) so
all I need to do is run the query and discard any result that's outside the
circle. The second is tougher. Does anyone know how to sort a list like
that?


The conceptually simple thing would be to compute a list of all the
squared distances from coordinate Y to all your points (note that you
*will* have to consider each and every possible distance before you can
assure yourself you have the correct X elements), sort an index to the
list in ascending squared-distance order, and pick off the first X
indexed elements. The sort, however, will be O(n*log(n)), and it would
be possible to pick off the smallest X numbers of the list in an O(n)
computation -- if nothing else, by determining the minimum of the list,
removing that, determining the minimum of the remaining list, etc
repeated X times. Or one could start with the first X elements,
determine the maximum of them, then compare the rest of the elements
with that maximum. If a point is lower than that maximum, the old
maximum point is removed from the list of X points, the new element is
added, and a new maximum of the X values is determined. When one gets
to the end of the list, one will have the X smallest elements of the
list. Perl's sort(), however, is so efficient that you might find it
will be faster for a mere 32000-element list. Consider the "Schwartzian
Transform" for implementing the sort.


....


HTH.
 
G

Gregory Toomey

It was a dark and stormy night, and Ian Pattison managed to scribble:
Hi,

I'm trying to put together what I think is a fairly complex database
query.

I have a table of about 32,000 latitiude and longitude pairs and I need to
be able to query for two different things:

1. All points that are within X kilometers (or miles for that matter) of
coordinate Y (both are supplied from an HTML form).
2. The nearest X points to coordinate Y (again, supplied by a form).

The first should be simple enough... I already know how to measure the
distance between two points (once I translate all that trig into perl) so
all I need to do is run the query and discard any result that's outside
the circle. The second is tougher. Does anyone know how to sort a list
like that?

Thanks,

Ian

I've worked in the area of spatial inxexing in the past. There is a vast scientific literature on storing & retrieving spatial/cadastral information.

Here's a few notes on you question.

1. To do the exact calculation of distance given lat & long requires spherical trigonometry. Over small distances, you can get good local approximate results by knowing the numbeer of km for each unit of latitude & longiture. This varies over the earths surface.

2. To get the poiints within X kilometers, the simplest algorithm has two parts.
2a. First, get all points within a rectangle X km. A simple way to do this in sql is select * from table where lat between 999.999 and 999.999 and long between 999.999 and 999.999
2b. From this subset, use the algorithm in 1.
Unforfunately this is a linear algorithm.

3. There are data structures that solve these problems quickly, like R-trees
http://www.cs.ucr.edu/~marioh/spatialindex/

gtoomey
 
G

Gregory Toomey

It was a dark and stormy night, and Bob Walton managed to scribble:
Question about your terminology: by the term "database query", do you
mean something written in SQL requesting data from a relational
database? That is what that term generally means to computer
programmers. I'm no SQL expert, but queries of the form you mention are
generally not SQL-ish things. So I'll assume your "database" is
something more like a flat file consisting of two numbers per record.

You can do it in SQL, and its often done with quadtrees, which is a method of subdividiing down an image into smaller "quardrangles" or "quarters". You can then represent the hirearchy of these images in base 4 arithmetic.
Then each image can be represented by base 4 numbers, which as stored in a database.

You can then improve the time complexity of common operations like proximity, point-in-polygon by querying the database.

Also see the mysql spatial operations.
http://www.mysql.com/doc/en/Spatial_extensions_in_MySQL.html

gtoomey
 
I

Ilya Zakharevich

[A complimentary Cc of this posting was sent to
Gregory Toomey
1. To do the exact calculation of distance given lat & long requires
spherical trigonometry. Over small distances, you can get good
local approximate results by knowing the numbeer of km for each
unit of latitude & longiture. This varies over the earths
surface.

Bulls. First of all, the "distance along surface" can be easily
calculated given the "distance along straight tunnel". To find the
latter, one just needs to recalculate from spherical coordinates to
(x,y.z)-rectangular coordinates: 2 cosines and 2 sines per point.
2. To get the poiints within X kilometers

Points within the given distance are within a "spherical hat"; in
other words, they are given by inequality

Ax+By+Cz >= const

Here (A,B,C) are coordinates of the "center", (x,y.z) are coordinates
of the otehr point, and const is easily calculatable.

Hope this helps,
Ilya
 
D

David K. Wall

Ilya Zakharevich said:
[A complimentary Cc of this posting was sent to
Gregory Toomey
1. To do the exact calculation of distance given lat & long
requires
spherical trigonometry. Over small distances, you can get
good local approximate results by knowing the numbeer of km
for each unit of latitude & longiture. This varies over the
earths surface.

Bulls. First of all, the "distance along surface" can be easily
calculated given the "distance along straight tunnel". To find
the latter, one just needs to recalculate from spherical
coordinates to (x,y.z)-rectangular coordinates: 2 cosines and 2
sines per point.

Ok, none of this has anything in particular to do with Perl, so...

There once was a fellow named Lloyd
Whom geography made quite annoyed:
"I'd thought it was clear
That the earth was a sphere
Now I find it's an oblate spheroid!"

(Yes, I wrote it. Please don't hit me.)
 
B

Bob Walton

Gregory Toomey wrote:

....
You can do it in SQL, and its often done with quadtrees, which is a method of subdividiing down an image into smaller "quardrangles" or "quarters". You can then represent the hirearchy of these images in base 4 arithmetic.
Then each image can be represented by base 4 numbers, which as stored in a database.

You can then improve the time complexity of common operations like proximity, point-in-polygon by querying the database.

Also see the mysql spatial operations.
http://www.mysql.com/doc/en/Spatial_extensions_in_MySQL.html

gtoomey

Wow -- thanks for the info -- I had no idea.
 
C

ctcgag

Ian Pattison said:
Thanks for all of your assistance on this one. Both for the programming
ideas as well as the geography/math lessons.

My final result: I found a module on CPAN that would do all the
calculations I needed. (Geo-Distance-0.05) and from there it was simple.

In response to Bob's question... I'm using Win32:ODBC to pull data from
an access database. Currently there are about 30 tables, each with around
32,000 rows but in essence I'm using it as a flat file. I'm still working
exactly how to get the nearest, say, 500 to a given point but finding all
the ones with X km is done.

I'd start by doing something like:

my @data;
while (my($key,$dist)=next_point()) {
push @data, [$key,$dist];
};
@data = (sort {$a->[1]<=>$b->[1]})[0..$x-1];

And if that took too much memory I'd do:

my @data;
while (my($key,$dist)=next_point()) {
push @data, [$key,$dist];
@data = (sort {$a->[1]<=>$b->[1]} @data)[0..$x-1] if @data>$x*8;
};
@data = (sort {$a->[1]<=>$b->[1]} @data)[0..$x-1];


And if that didn't cut I'd seriously look into getting a real
GIS before I'd spend more time on rolling my own.

Xho
 
A

Anno Siegel

Ian Pattison said:
Thanks for all of your assistance on this one. Both for the programming
ideas as well as the geography/math lessons.

My final result: I found a module on CPAN that would do all the
calculations I needed. (Geo-Distance-0.05) and from there it was simple.

In response to Bob's question... I'm using Win32:ODBC to pull data from
an access database. Currently there are about 30 tables, each with around
32,000 rows but in essence I'm using it as a flat file. I'm still working
exactly how to get the nearest, say, 500 to a given point but finding all
the ones with X km is done.

I'd start by doing something like:

my @data;
while (my($key,$dist)=next_point()) {
push @data, [$key,$dist];
};
@data = (sort {$a->[1]<=>$b->[1]})[0..$x-1];

Hmm... Sorting is not an efficient way to find the top n of a population
if n is considerably smaller than the population. Sorting is slow and
(usually) needs all the data in memory. Both can be avoided using a heap.

For this particular problem, we'd want a maximum heap, that is, one that
allows the extraction of the largest element at all times. Assuming we
have a Heap class with methods new, add, size and extract_maximum, this
finds the $x closest points:

my $heap = Heap->new( sub { $a->[1] <=> $b->[1] });
while ( my( $key, $dist) = next_point() ) {
$heap->add( [ $key, $dist]);
# discard the most distant point if no longer needed
$heap->extract_maximum if $heap->size > $x;
}
my @data = reverse map $heap->extract_maximum, 1 .. $heap->size;

If $x is small compared to the overall number of locations, this runs
essentially in linear time and space proportional to $x. If $x approaches
the total, it seamlessly mutates into a heap sort.

There are two or three heap modules on CPAN, though none of them may
have the interface I assumed.
And if that took too much memory I'd do:

[snip]

The heap method does a better job of keeping intermediate storage small.
And if that didn't cut I'd seriously look into getting a real
GIS before I'd spend more time on rolling my own.

Indeed.

Anno
 
A

Anno Siegel

Ian Pattison said:
Thanks for all of your assistance on this one. Both for the programming
ideas as well as the geography/math lessons.

My final result: I found a module on CPAN that would do all the
calculations I needed. (Geo-Distance-0.05) and from there it was simple.

In response to Bob's question... I'm using Win32:ODBC to pull data from
an access database. Currently there are about 30 tables, each with around
32,000 rows but in essence I'm using it as a flat file. I'm still working
exactly how to get the nearest, say, 500 to a given point but finding all
the ones with X km is done.

I'd start by doing something like:

my @data;
while (my($key,$dist)=next_point()) {
push @data, [$key,$dist];
};
@data = (sort {$a->[1]<=>$b->[1]})[0..$x-1];

Hmm... Sorting is not an efficient way to find the top n of a population
if n is considerably smaller than the population. Sorting is slow and
(usually) needs all the data in memory. Both can be avoided using a heap.

For this particular problem, we'd want a maximum heap, that is, one that
allows the extraction of the largest element at all times. Assuming we
have a Heap class with methods new, add, size and extract_maximum, this
finds the $x closest points:

my $heap = Heap->new( sub { $a->[1] <=> $b->[1] });
while ( my( $key, $dist) = next_point() ) {
$heap->add( [ $key, $dist]);
# discard the most distant point if no longer needed
$heap->extract_maximum if $heap->size > $x;
}
my @data = reverse map $heap->extract_maximum, 1 .. $heap->size;

If $x is small compared to the overall number n of locations, this runs
essentially in time proportional to n and space proportional to $x. If
$x approaches the total, it seamlessly mutates into a heap sort.

There are two or three heap modules on CPAN, though none of them may
have the interface I assumed.
And if that took too much memory I'd do:

[snip]

The heap method does a better job of keeping intermediate storage small.
And if that didn't cut I'd seriously look into getting a real
GIS before I'd spend more time on rolling my own.

Indeed.

Anno
 
C

ctcgag

I'd start by doing something like:

my @data;
while (my($key,$dist)=next_point()) {
push @data, [$key,$dist];
};
@data = (sort {$a->[1]<=>$b->[1]})[0..$x-1];

Hmm... Sorting is not an efficient way to find the top n of a population
if n is considerably smaller than the population. Sorting is slow and
(usually) needs all the data in memory. Both can be avoided using a
heap.

Sorting is actually quite fast, and doesn't need all the data in memory
if you make the simple modification I gave for doing intermediate sorts.
(Now that I've implemented and tested it, I'd recommed 3 or 4 rather than 8
as a good multiplier for the itermediate sorts. If you are space paranoid,
even 1.5 gives good perfomance.)

It is only "not an efficient way" in the same sense that Perl is not
an efficient way to program much of anything. Now if heaps were a native
Perl data structure the way hashes are, I would have definitely gone with
that over sorts.

For this particular problem, we'd want a maximum heap, that is, one that
allows the extraction of the largest element at all times. Assuming we
have a Heap class with methods new, add, size and extract_maximum, this
finds the $x closest points:

my $heap = Heap->new( sub { $a->[1] <=> $b->[1] });
while ( my( $key, $dist) = next_point() ) {
$heap->add( [ $key, $dist]);
# discard the most distant point if no longer needed
$heap->extract_maximum if $heap->size > $x;
}
my @data = reverse map $heap->extract_maximum, 1 .. $heap->size;

If $x is small compared to the overall number n of locations, this runs
essentially in time proportional to n and space proportional to $x. If
$x approaches the total, it seamlessly mutates into a heap sort.

The multiple-sorting method has the same properties, only the constants of
proportionality and the overheads are different. And I think they favor
the sorting method. In the limiting case, one method becomes a single
huge sort implemented in C with all data in memory, while the other becomes
a single huge sort implemented in Perl (OO-Perl, at that) with all data in
memory.
There are two or three heap modules on CPAN, though none of them may
have the interface I assumed.

Another advantage of using sort. It has the interface I assumed :), and I
didn't have to install (I couldn't even get Heap::Simple to install with
CPAN) and learn a new module.

(Not that learning a new module is a bad thing, but when it decreases my
programming speed by a factor of 2 or so, decreases the program speed by a
factor of 4 or so, descreases portability by requiring yet another non-core
module to be installed, and makes the program longer to boot, it isn't such
a good thing, except perhaps as an intellectual challenge.)

And if that took too much memory I'd do:

[snip]

The heap method does a better job of keeping intermediate storage small.


Not necessarily. The memory overhead of the heap data structure, as
implemented in Heap::Binomial for example, is substantial. Unless the key
size is quite large (>0.5kb, in which case they should probably be
decomposed into a true key and a lazy-loaded data portion), or x is quite
large, the multiple sorting method is likely to be both faster (3 to 6
times, in the tests I did) and smaller.

Xho
 
S

Sarita Moodie

Hi Ian,

I know this is an old thread, but I just found it and want to add what
I know.

Its not a big deal to do great circle distances using SQL. Access
should be able to handle it fine. The query looks big, but thats not
important if you dont let it bother you.

The trick is to encode the great circle formula using SQL syntax. Look
here for the formulas:

http://williams.best.vwh.net/avform.htm

You may have to change the syntax to make it work with whatever
database you use.

Now that you can compute the distance, you can sort on this distance
using ORDER BY, and then take the top X values. Or you can easily put
it in a WHERE clause to show only the records within Z kilometers.

Now the problem with this approach is that the query will start to slow
down when you get into large record sets (your 32,000 is already fairly
large). It will be a lot faster if you use a database that allows
spatial indexing. When using such databases your queries can scale
much larger. New versions of MySQL, PostgreSQL, and Oracle all support
spatial indexes. Good luck!

-Robin Chauhan
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,766
Messages
2,569,569
Members
45,042
Latest member
icassiem

Latest Threads

Top