V
vincent64
Here is some elementary code to detect the presence of a clustering
structure in a 2-dimensional dataset. It's more heuristic than
scientific, so take it with a grain of salt, as even the concept of
cluster is highly fuzzy.
The seed routine creates a cluster of 1000 points, saved in
cluster.txt: each row corresponds to a point; the first column is the
cluster number, and the next two columns are the x and y coordinates.
The cluster number is automatically incremented each time a new call
to seed is made, resulting in the creation of a new cluster. The
distance routine computes the distance between two points, for 100
points randomly selected in the data set previously created
(cluster.txt). The output is a file dist.txt, with one row per pair
of
points, with two fields: the first column is an indicator and is
equal
to 1 if both points belong to the same cluster; the second column is
the distance between the two points. This script illustrates that it
is possible to check whether a data set contains one or two clusters
by looking at the distribution of distances: a gap in the
distribution
means the presence of distinct clusters. It also suggests that the
computational complexity of computing whether a data set contains one
of more clusters is well below O(n), possibly O(n0.5), if one uses
sampling techniques.
Source code: http://datashaping.com/regress.shtml
structure in a 2-dimensional dataset. It's more heuristic than
scientific, so take it with a grain of salt, as even the concept of
cluster is highly fuzzy.
The seed routine creates a cluster of 1000 points, saved in
cluster.txt: each row corresponds to a point; the first column is the
cluster number, and the next two columns are the x and y coordinates.
The cluster number is automatically incremented each time a new call
to seed is made, resulting in the creation of a new cluster. The
distance routine computes the distance between two points, for 100
points randomly selected in the data set previously created
(cluster.txt). The output is a file dist.txt, with one row per pair
of
points, with two fields: the first column is an indicator and is
equal
to 1 if both points belong to the same cluster; the second column is
the distance between the two points. This script illustrates that it
is possible to check whether a data set contains one or two clusters
by looking at the distribution of distances: a gap in the
distribution
means the presence of distinct clusters. It also suggests that the
computational complexity of computing whether a data set contains one
of more clusters is well below O(n), possibly O(n0.5), if one uses
sampling techniques.
Source code: http://datashaping.com/regress.shtml