Perl routine for cluster detection

Discussion in 'Perl Misc' started by vincent64@yahoo.com, Apr 28, 2007.

  1. Guest

    Here is some elementary code to detect the presence of a clustering
    structure in a 2-dimensional dataset. It's more heuristic than
    scientific, so take it with a grain of salt, as even the concept of
    cluster is highly fuzzy.

    The seed routine creates a cluster of 1000 points, saved in
    cluster.txt: each row corresponds to a point; the first column is the
    cluster number, and the next two columns are the x and y coordinates.
    The cluster number is automatically incremented each time a new call
    to seed is made, resulting in the creation of a new cluster. The
    distance routine computes the distance between two points, for 100
    points randomly selected in the data set previously created
    (cluster.txt). The output is a file dist.txt, with one row per pair
    of
    points, with two fields: the first column is an indicator and is
    equal
    to 1 if both points belong to the same cluster; the second column is
    the distance between the two points. This script illustrates that it
    is possible to check whether a data set contains one or two clusters
    by looking at the distribution of distances: a gap in the
    distribution
    means the presence of distinct clusters. It also suggests that the
    computational complexity of computing whether a data set contains one
    of more clusters is well below O(n), possibly O(n0.5), if one uses
    sampling techniques.


    Source code: http://datashaping.com/regress.shtml
     
    , Apr 28, 2007
    #1
    1. Advertising

  2. Mirco Wahab Guest

    wrote:
    > Here is some elementary code to detect the presence of a clustering
    > structure in a 2-dimensional dataset. It's more heuristic than
    > scientific, so take it with a grain of salt, as even the concept of
    > cluster is highly fuzzy.


    Before going into details, I'd like to ask what
    you think what the following part of your program
    does:

    ...
    sub seed {
    local($x,$y)=@_;
    $kmax=1000;

    $x=rand($x)-0.5;
    $y=rand($y)-0.5;

    for ($k=0; $k<$kmax; $k++) {
    print "\t$cluster [$1]\n";
    $x=$x+rand($1)-0.5;
    $y=$y+rand($1)-0.5;
    $px[$k]=$x;
    $py[$k]=$y;
    }
    ...


    Aside from beeing not able to run under 'strict',
    what's meant with

    $x=$x+rand($1)-0.5;
    $y=$y+rand($1)-0.5;

    because '$1' is, at this point, not set.

    > The seed routine creates a cluster of 1000 points, saved in
    > cluster.txt: each row corresponds to a point; the first column is the
    > cluster number, and the next two columns are the x and y coordinates.


    Don't do that. The convention in this business is.
    First comes x, then y, then z. Because your 'cluster number'
    is somehow 'a plane' in your problem space, you should make
    it that ('z', third column).

    > The cluster number is automatically incremented each time a new call
    > to seed is made, resulting in the creation of a new cluster. The
    > distance routine computes the distance between two points, for 100
    > points randomly selected in the data set previously created
    > (cluster.txt). The output is a file dist.txt, with one row per pair
    > of points, with two fields: the first column is an indicator and is
    > equal to 1 if both points belong to the same cluster; the second column is
    > the distance between the two points. This script illustrates that it
    > is possible to check whether a data set contains one or two clusters
    > by looking at the distribution of distances: a gap in the
    > distribution means the presence of distinct clusters. It also suggests
    > that the computational complexity of computing whether a data set contains
    > one of more clusters is well below O(n), possibly O(n0.5), if one uses
    > sampling techniques.


    Whats the point of that? You have, say 10^7 2D-points, then you
    select 100 pair-samples from them, compute their distance and
    claim you have 'complexity well below O(n), possibly O(n0.5)'?

    I don't get that ...

    ==>your code was: datashaping.com/cluster_pl.txt

    I'd recommend to translate the code from Perl3-style
    to Perl5, which is not really that difficult, because
    the code does basically almost nothing.

    Starting point: ==>

    use strict;
    use warnings;

    my $idclust = 0;

    dmp_seed([1.0,1.0], 1000, $idclust++, '>cluster.txt');
    dmp_seed([25.0,25.0], 1000, $idclust++, '>>cluster.txt');

    distance(100, 'cluster.txt', 'dist.txt'); # nsamp read write


    sub distance {
    my ($nsamp, $fn_clust, $fn_dist) = @_;

    open my $fc,'<', $fn_clust or die "no coord in: $!";
    my @pc = map [/(\S+)/g], <$fc>;
    close $fc;

    open my $fd, '>', $fn_dist or die "no dist out: $!";
    for (1 .. $nsamp) {
    my ($pm, $pn) = ( $pc[int rand @pc], $pc[int rand @pc] );
    printf $fd "%d\t%.8f\n", 1-($pm->[2] == $pn->[2]),
    sqrt(($pm->[0]-$pn->[0])**2 + ($pm->[1]-$pn->[1])**2)
    }
    close $fd;
    }

    sub dmp_seed {
    my ($rseed, $nmax, $cluid, $fname_mod) = @_;
    my ($x, $y) = map $_+rand(1)-0.5, @$rseed;

    open my $fh, $fname_mod or die "no way out: $!";
    for(1 .. $nmax) {
    printf $fh "%.8f\t%.8f\t%d\n", $x+=rand(1)-0.5, $y+=rand(1)-0.5, $cluid
    }
    close $fh;
    }
    <==

    Regards

    M.
     
    Mirco Wahab, Apr 29, 2007
    #2
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. Max
    Replies:
    0
    Views:
    403
  2. /usr/ceo
    Replies:
    32
    Views:
    1,956
    Mark Space
    Sep 15, 2008
  3. Bazil
    Replies:
    2
    Views:
    126
    Tad McClellan
    Dec 6, 2003
  4. Arthur

    Perl Sort Routine

    Arthur, Feb 12, 2004, in forum: Perl Misc
    Replies:
    2
    Views:
    107
    Paul Lalli
    Feb 12, 2004
  5. Rahul
    Replies:
    8
    Views:
    463
    Rahul
    Feb 11, 2009
Loading...

Share This Page