convenient module to take statistics for hashed structures?

George Mpouras · Mar 16, 2011

# This one meets your requirements
# It can handle even more data than the previous versions

my @col;
my %data;
ReadData();

$_ = query('01',75);
print "Field=$_->[0],Value=@{$_->[1]}\n";

sub ReadData
{
while (<DATA>) { chomp;
my @a = split /\s*\|\s*/, $_, -1;
if (-1 == $#col){push @col, @a[1..$#a] ;next}
unless (1+$#col==$#a) {warn "Skip line number $. \"$_\" because it has ".(1+$#a)." fields, while it should have ".(2+$#col)."\n";next}
$data{$a[0]}->[0]++;
for(my $i=1;$i<=$#a;$i++){$data{$a[0]}->[1]->[$i-1]->[0]->{$a[$i]}++}}

foreach my $id (keys %data)
{
foreach my $f ( @{$data{$id}->[1]} )
{
foreach my $v ( keys %{$f->[0]} )
{
push @{ $f->[1]->{int 100*( $f->[0]->{$v}/$data{$id}->[0])} }, $v
}

# remove unnecessary structures
$f = $f->[1]
}

# remove unnecessary structures
$data{$id} = $data{$id}->[1]
}

#use Data::Dumper; print Dumper(\%data);exit;
}

sub query
{
for(my $i=$#{$data{$_[0]}}; $i>=0; $i--)
{
foreach my $RANK (keys %{$data{$_[0]}->[$i]})
{
return [$col[$i], $data{$_[0]}->[$i]->{$RANK}] if $RANK >= $_[1]
}
}

['',[]]
}

__DATA__
ID|B|C|D|E|F|G|H
01|3|7|9|3|4|2|3
01|3|7|9|3|4|2|2
01|3|7|9|5|8|6|6
01|3|7|9|3|4|2|3
02|4|7|9|3|4|2|1
02|4|7|9|3|4|2|2
02|4|7|9|3|4|2|3
02|4|7|9|3|4|2|3

ela · Mar 16, 2011

George Mpouras said:
# The following version is much faster than the previous

Yes, your latest implementation works very fast, even for a million records!

sub query {
for (my $i=$#{$data{$_[0]}->[1]}; $i>=0; $i--) {
return [$col[$i], $data{$_[0]}->[1]->[$i]->[1]->{$_[1]}] if exists
$data{$_[0]}->[1]->[$i]->[1]->{$_[1]} }
['',[]]
}

I want to change your implementation from "discrete" checking to "continuous"
checking. The logic is to first sort the rank keys numerically (*** expected
range: (0-100] ***), then test whether the largest key (e.g. 100, 75, etc.) is
larger than the threshold specified. My problem is that I don't know how to
refer to the keys under

$data{$_[0]}->[1]->[$i]->[1]

Writing something like "foreach my $field (sort {$b<=>$a} keys
%data{$_[0]}->[1]->[$i]->[1])" (thanks to McClellan for teaching me how to use
sort properly here) does not work. Moreover, there is no need for "foreach"
here: if the largest key cannot surpass the threshold, neither can the smaller
ones. So how can I avoid "foreach" here?
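
A minimal sketch of one way to reach those keys without a loop. The `%` sigil cannot be put directly in front of a chained expression; the whole reference has to be wrapped in `%{ ... }`. Once the keys are available, `max` from the core `List::Util` module gives the largest one directly. The sample `%data` below and the helper name `best_rank` are assumptions made for illustration; they only mimic the `$data{ID}->[1]->[$i]->[1]` shape used in the code quoted above.

```perl
use strict;
use warnings;
use List::Util qw(max);

# Hypothetical sample in the same shape as the thread's structure:
# $data{ID}->[1]->[$i]->[1] is a hash ref mapping rank (percent) => [values].
my %data = (
    '01' => [ 4, [
        [ {}, { 100 => [9] } ],                # column index 0
        [ {}, { 50 => [3], 25 => [2, 6] } ],   # column index 1
    ] ],
);

# Return the largest rank key of column $i for $id (undef for an empty hash).
sub best_rank {
    my ($id, $i) = @_;
    my $ranks = $data{$id}->[1]->[$i]->[1];    # a hash reference
    return max keys %{$ranks};                 # %{ } dereferences the whole expression
}

my $threshold = 40;
my $top = best_rank('01', 1);
if (defined $top && $top >= $threshold) {
    my $values = $data{'01'}->[1]->[1]->[1]->{$top};
    print "rank=$top values=@$values\n";       # prints "rank=50 values=3"
}
```

Because only the largest key is compared with the threshold, no sort and no "foreach" over the ranks are needed.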

ela · Mar 17, 2011

George Mpouras said:

For the data

ID|B|C|D|E|F|G|H
01|3|7|9|3|4|2|3
01|3|7|9|3|4|2|2
01|3|7|9|5|8|6|6
01|3|7|9|3|4|2|3
02|4|7|9|3|4|2|1
02|4|7|9|3|4|2|2
02|4|7|9|3|4|2|3
02|4|7|9|3|4|2|3

what would your input be, and what do you expect?
An example would make things clearer.

Dear George,

First of all, I really appreciate your accommodating character. Using your
example, this is what I expect: given a threshold of 100% for ID=01, the
lookup goes through the table from the last column backwards; H, then G, F
and E fail, and then D=9 should be returned. If a threshold of 70% is given,
H fails (50% for two 3's) but G=2 (three 2's, 75% > 70%) should be returned.
The same threshold will be used for all of the million rows, which may have
up to 100k unique IDs. So in this case, a 70% threshold will make the
analysis for ID=02 also return G=2.
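
The expected behaviour described above can be checked by hand with a small standalone sketch. It is not the thread's implementation: it recounts the ID=01 rows on every call instead of using a precomputed table, and the name `first_dominant` is hypothetical. It scans the columns from right to left and stops at the first column whose most frequent value reaches the threshold.

```perl
use strict;
use warnings;

# Rows for ID=01 from the sample data (columns B..H).
my @col  = qw(B C D E F G H);
my @rows = (
    [qw(3 7 9 3 4 2 3)],
    [qw(3 7 9 3 4 2 2)],
    [qw(3 7 9 5 8 6 6)],
    [qw(3 7 9 3 4 2 3)],
);

# Scan columns from right to left; return [column, value, percent] for the
# first column whose most frequent value reaches the threshold, else nothing.
sub first_dominant {
    my ($threshold) = @_;
    for my $i (reverse 0 .. $#col) {
        my %count;
        $count{ $_->[$i] }++ for @rows;
        # Most frequent value in this column (ties broken by smaller value).
        my ($best) = sort { $count{$b} <=> $count{$a} || $a <=> $b } keys %count;
        my $pct = 100 * $count{$best} / @rows;
        return [ $col[$i], $best, $pct ] if $pct >= $threshold;
    }
    return;
}

for my $t (100, 70) {
    my $r = first_dominant($t);
    print "threshold=$t -> $r->[0]=$r->[1] ($r->[2]%)\n";
}
```

With the four ID=01 rows this reports D=9 for a threshold of 100 and G=2 (75%) for a threshold of 70, matching the two cases described above.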

George Mpouras · Mar 17, 2011

ela said:
George Mpouras said:

# This one meets your requirements
# It can handle even more data than the previous versions


I'm sorry to report that when I set the threshold below 50, e.g. 10, the
result is

H=2,6;

Although this result passes the threshold, I only need the most abundant
value, i.e. 3 (50% abundant), which is not reported. And why can the
statement

foreach my $RANK (keys %{$data{$_[0]}->[$i]})
{
return [$col[$i], $data{$_[0]}->[$i]->{$RANK}] if $RANK >= $_[1]
}

return two values? As soon as one "$RANK" fulfils the requirement, I would
expect the subroutine to return....

# You only have to change one line for this behavior
#
# foreach my $RANK (sort {$b <=> $a} keys %{$data{$_[0]}->[$i]})
#
# All together again is

my @col;
my %data;
ReadData();

$_ = query('01',10);
print "Field=$_->[0],Value=@{$_->[1]}\n";

sub ReadData
{
while (<DATA>) { chomp;
my @a = split /\s*\|\s*/, $_, -1;
if (-1 == $#col){push @col, @a[1..$#a] ;next}
unless (1+$#col==$#a) {warn "Skip line number $. \"$_\" because it has ".(1+$#a)." fields, while it should have ".(2+$#col)."\n";next}
$data{$a[0]}->[0]++;
for(my $i=1;$i<=$#a;$i++){$data{$a[0]}->[1]->[$i-1]->[0]->{$a[$i]}++}}

foreach my $id (keys %data)
{
foreach my $f ( @{$data{$id}->[1]} )
{
foreach my $v ( keys %{$f->[0]} )
{
push @{ $f->[1]->{int 100*( $f->[0]->{$v}/$data{$id}->[0])} }, $v
}

# remove unnecessary structures
$f = $f->[1]
}

# remove unnecessary structures
$data{$id} = $data{$id}->[1]
}

#use Data::Dumper; print Dumper(\%data);exit;
}

sub query
{
for(my $i=$#{$data{$_[0]}}; $i>=0; $i--)
{
foreach my $RANK (sort {$b <=> $a} keys %{$data{$_[0]}->[$i]})
{
return [$col[$i], $data{$_[0]}->[$i]->{$RANK}] if $RANK >= $_[1]
}
}

['',[]]
}

__DATA__
ID|B|C|D|E|F|G|H
01|3|7|9|3|4|2|3
01|3|7|9|3|4|2|2
01|3|7|9|5|8|6|6
01|3|7|9|3|4|2|3
02|4|7|9|3|4|2|1
02|4|7|9|3|4|2|2
02|4|7|9|3|4|2|3
02|4|7|9|3|4|2|3

ela · Mar 18, 2011

George Mpouras said:
# This one meets your requirements
# It can handle even more data than the previous versions

I'm sorry to report that when I set the threshold below 50, e.g. 10, the
result is

H=2,6;

Although this result passes the threshold, I only need the most abundant
value, i.e. 3 (50% abundant), which is not reported. And why can the
statement

foreach my $RANK (keys %{$data{$_[0]}->[$i]})
{
return [$col[$i], $data{$_[0]}->[$i]->{$RANK}] if $RANK >= $_[1]
}

return two values? As soon as one "$RANK" fulfils the requirement, I would
expect the subroutine to return....
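
A short sketch of what is probably happening, assuming the rank table for column H of ID=01 looks the way ReadData builds it. The subroutine does return on the first matching key, but each key holds an array reference with every value that shares that percentage, and 2 and 6 both occur in 25% of the rows. Since `keys` returns hash keys in no particular order, 25 can be tested before 50. Sorting the keys in descending order makes the highest rank win. The name `top_values` is hypothetical.

```perl
use strict;
use warnings;

# Rank table for column H of ID=01: percent => [values with that percent].
my %rank = ( 50 => [3], 25 => [2, 6] );

# Return the values stored under the highest rank that meets the threshold.
sub top_values {
    my ($threshold) = @_;
    for my $r (sort { $b <=> $a } keys %rank) {
        return $rank{$r} if $r >= $threshold;
    }
    return [];
}

print "@{ top_values(10) }\n";   # highest rank is tried first, so this prints "3"
```

A single `return` therefore still hands back one array reference; it just happens to contain two elements when two values tie at the same percentage.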
