convenient module to take statistics for hashed structures?

George Mpouras

# This one meets your requirements
# It can handle even more data than the previous versions



use strict;
use warnings;

my @col;     # field names from the header line, without the ID column
my %data;    # after ReadData: id => [ one hash per field: percentage => [values] ]
ReadData();

$_ = query('01',75);
print "Field=$_->[0],Value=@{$_->[1]}\n";   # prints Field=G,Value=2 for the sample data



sub ReadData
{
    while (<DATA>) {
        chomp;
        my @a = split /\s*\|\s*/, $_, -1;
        if (-1 == $#col) { push @col, @a[1..$#a]; next }   # header line
        unless (1+$#col == $#a) {
            warn "Skip line number $. \"$_\" because it has ".(1+$#a)." fields, while it should have ".(2+$#col)."\n";
            next
        }
        $data{$a[0]}->[0]++;   # number of rows seen for this ID
        # count how often each value occurs in each field
        for (my $i=1; $i<=$#a; $i++) { $data{$a[0]}->[1]->[$i-1]->[0]->{$a[$i]}++ }
    }

    foreach my $id (keys %data)
    {
        foreach my $f ( @{$data{$id}->[1]} )
        {
            foreach my $v ( keys %{$f->[0]} )
            {
                # group the values by their integer percentage of this ID's rows
                push @{ $f->[1]->{int 100*( $f->[0]->{$v}/$data{$id}->[0])} }, $v
            }

            # remove unnecessary structures
            $f = $f->[1]
        }

        # remove unnecessary structures
        $data{$id} = $data{$id}->[1]
    }

    #use Data::Dumper; print Dumper(\%data);exit;
}


sub query
{
    # scan the fields from the last column back to the first
    for (my $i=$#{$data{$_[0]}}; $i>=0; $i--)
    {
        foreach my $RANK (keys %{$data{$_[0]}->[$i]})
        {
            return [$col[$i], $data{$_[0]}->[$i]->{$RANK}] if $RANK >= $_[1]
        }
    }

    ['',[]]   # no field reached the threshold
}


__DATA__
ID|B|C|D|E|F|G|H
01|3|7|9|3|4|2|3
01|3|7|9|3|4|2|2
01|3|7|9|5|8|6|6
01|3|7|9|3|4|2|3
02|4|7|9|3|4|2|1
02|4|7|9|3|4|2|2
02|4|7|9|3|4|2|3
02|4|7|9|3|4|2|3
 
ela

George Mpouras said:
# The following version is much faster than the previous
Yes, your latest implementation works very fast, even for a million records!
sub query {
    for (my $i=$#{$data{$_[0]}->[1]}; $i>=0; $i--) {
        return [$col[$i], $data{$_[0]}->[1]->[$i]->[1]->{$_[1]}]
            if exists $data{$_[0]}->[1]->[$i]->[1]->{$_[1]}
    }
    ['',[]]
}

I want to change your implementation from "discrete" checking to "continuous"
checking. The logic is to first sort the rank keys (expected range: (0-100])
numerically, then test whether the largest key (e.g. 100, 75 etc.) is larger
than the threshold specified. My problem is that I don't know how to refer
to the keys under

$data{$_[0]}->[1]->[$i]->[1]

Writing something like "foreach my $field (sort {$b<=>$a} keys
%data{$_[0]}->[1]->[$i]->[1])" (thanks to McClellan for teaching me how to
use sort appropriately here) does not work. Moreover, there is no need to
"foreach" here: if the largest key cannot pass the threshold, neither can
the smaller ones. So how can I avoid "foreach" here?
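
A note on the syntax question above: a hash reference has to be wrapped in `%{ ... }` before `keys` can use it, and picking only the largest key does not need a loop. The following is a minimal sketch, not George's code. It assumes the per-field hash maps an integer percentage to the list of values at that percentage; the `$rank` data and the `best_at_least` helper are made up for illustration.

```perl
use strict;
use warnings;
use List::Util qw(max);

# Hypothetical stand-in for the hash reference stored at
# $data{$id}->[1]->[$i]->[1]: percentage => [values with that percentage]
my $rank = { 50 => [3], 25 => [2, 6] };

# Return the values at the largest percentage if it reaches the threshold,
# otherwise an empty list. No foreach is needed: if the largest key fails,
# every smaller key fails too.
sub best_at_least {
    my ($rank_ref, $threshold) = @_;
    my @keys = keys %{ $rank_ref };      # dereference with %{ ... }
    return () unless @keys;
    my $largest = max @keys;             # numeric maximum
    return $largest >= $threshold ? @{ $rank_ref->{$largest} } : ();
}

print join(',', best_at_least($rank, 40)), "\n";   # prints 3
print "none\n" unless best_at_least($rank, 75);    # prints none
```

With the real structure, the argument would presumably be `$data{$_[0]}->[1]->[$i]->[1]` itself, since that expression already is a hash reference. The comparison here uses `>=`; change it to `>` if the threshold must be strictly exceeded.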
 
ela

George Mpouras said:

For the data

ID|B|C|D|E|F|G|H
01|3|7|9|3|4|2|3
01|3|7|9|3|4|2|2
01|3|7|9|5|8|6|6
01|3|7|9|3|4|2|3
02|4|7|9|3|4|2|1
02|4|7|9|3|4|2|2
02|4|7|9|3|4|2|3
02|4|7|9|3|4|2|3

What would be your input, and what do you expect?
An example would make things clearer.

Dear George,

First of all, I really appreciate your accommodating character. From your
example, what I expect is this: given a threshold of 100% for ID=01, looking
up the table, H, then G, F and E fail, and then D=9 should be returned. If a
threshold of 70% is given, then H fails (50% for the two 3's) but G=2 (three
2's, 75% > 70%) should be returned. The same threshold will be used for all
of the million rows, which may have up to 100k unique IDs. So in this case, a
70% threshold will make the analysis for ID=02 also return G=2.
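
The expected behaviour described above can be checked with a small standalone sketch. It is an illustration only, not the script from this thread: the rows for ID=01 are hard-coded, and the `first_passing` helper is a made-up name. If two values tie for most abundant in a column, this sketch picks one of them arbitrarily.

```perl
use strict;
use warnings;

# Sample rows for ID=01 from the thread; the columns are B..H
my @names = qw(B C D E F G H);
my @rows  = ([3,7,9,3,4,2,3], [3,7,9,3,4,2,2], [3,7,9,5,8,6,6], [3,7,9,3,4,2,3]);

# Scan the columns from right to left and return the first column whose
# most abundant value reaches the threshold (in percent).
sub first_passing {
    my ($threshold) = @_;
    for my $i (reverse 0 .. $#names) {
        my %count;
        $count{ $_->[$i] }++ for @rows;
        # most frequent value in this column
        my ($top) = sort { $count{$b} <=> $count{$a} } keys %count;
        my $pct = 100 * $count{$top} / @rows;
        return ($names[$i], $top, $pct) if $pct >= $threshold;
    }
    return;
}

printf "%s=%s (%g%%)\n", first_passing(100);   # prints D=9 (100%)
printf "%s=%s (%g%%)\n", first_passing(70);    # prints G=2 (75%)
```

Recounting every column on each call is fine for a demonstration, but for a million rows the counts would have to be built once while reading the data, as ReadData does.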
 
George Mpouras

ela said:
George Mpouras said:
# This one meets your requirements
# It can handle even more data than the previous versions

I'm sorry to report that when I set the threshold below 50,
e.g. 10, the result is

H=2,6;

Although this result passes the threshold, I only need the most
abundant value, i.e. 3 (50% abundant), which is not reported. Also, why
can the statement

foreach my $RANK (keys %{$data{$_[0]}->[$i]})
{
    return [$col[$i], $data{$_[0]}->[$i]->{$RANK}] if $RANK >= $_[1]
}

return two values? When one of the "$RANK" keys fulfills the requirement, I
think the subroutine will return....



# You only have to change one line for this behavior
#
# foreach my $RANK (sort {$b <=> $a} keys %{$data{$_[0]}->[$i]})
#
# All together again is









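use strict;
use warnings;

my @col;     # field names from the header line, without the ID column
my %data;    # after ReadData: id => [ one hash per field: percentage => [values] ]
ReadData();

$_ = query('01',10);
print "Field=$_->[0],Value=@{$_->[1]}\n";   # prints Field=H,Value=3 for the sample data



sub ReadData
{
    while (<DATA>) {
        chomp;
        my @a = split /\s*\|\s*/, $_, -1;
        if (-1 == $#col) { push @col, @a[1..$#a]; next }   # header line
        unless (1+$#col == $#a) {
            warn "Skip line number $. \"$_\" because it has ".(1+$#a)." fields, while it should have ".(2+$#col)."\n";
            next
        }
        $data{$a[0]}->[0]++;   # number of rows seen for this ID
        # count how often each value occurs in each field
        for (my $i=1; $i<=$#a; $i++) { $data{$a[0]}->[1]->[$i-1]->[0]->{$a[$i]}++ }
    }

    foreach my $id (keys %data)
    {
        foreach my $f ( @{$data{$id}->[1]} )
        {
            foreach my $v ( keys %{$f->[0]} )
            {
                # group the values by their integer percentage of this ID's rows
                push @{ $f->[1]->{int 100*( $f->[0]->{$v}/$data{$id}->[0])} }, $v
            }

            # remove unnecessary structures
            $f = $f->[1]
        }

        # remove unnecessary structures
        $data{$id} = $data{$id}->[1]
    }

    #use Data::Dumper; print Dumper(\%data);exit;
}


sub query
{
    # scan the fields from the last column back to the first
    for (my $i=$#{$data{$_[0]}}; $i>=0; $i--)
    {
        # try the largest percentage first, so the most abundant value wins
        foreach my $RANK (sort {$b <=> $a} keys %{$data{$_[0]}->[$i]})
        {
            return [$col[$i], $data{$_[0]}->[$i]->{$RANK}] if $RANK >= $_[1]
        }
    }

    ['',[]]   # no field reached the threshold
}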


__DATA__
ID|B|C|D|E|F|G|H
01|3|7|9|3|4|2|3
01|3|7|9|3|4|2|2
01|3|7|9|5|8|6|6
01|3|7|9|3|4|2|3
02|4|7|9|3|4|2|1
02|4|7|9|3|4|2|2
02|4|7|9|3|4|2|3
02|4|7|9|3|4|2|3
 
ela

George Mpouras said:
# This one meets your requirements
# It can handle even more data than the previous versions

I'm sorry to report that when I set the threshold below 50,
e.g. 10, the result is

H=2,6;

Although this result passes the threshold, I only need the most abundant
value, i.e. 3 (50% abundant), which is not reported. Also, why can the
statement

foreach my $RANK (keys %{$data{$_[0]}->[$i]})
{
    return [$col[$i], $data{$_[0]}->[$i]->{$RANK}] if $RANK >= $_[1]
}

return two values? When one of the "$RANK" keys fulfills the requirement, I
think the subroutine will return....
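
On the "two values" question: the subroutine does return as soon as one key matches, but each key holds every value that shares that percentage. The sketch below rebuilds the structure for column H of ID=01 only. It mirrors the grouping step in ReadData but is a standalone illustration, and the variable names are assumptions.

```perl
use strict;
use warnings;

# Column H for ID=01 in the sample data
my @h = (3, 2, 6, 3);

my %count;
$count{$_}++ for @h;

# Group the values by integer percentage, as ReadData does
my %rank;
push @{ $rank{ int(100 * $count{$_} / @h) } }, $_ for sort { $a <=> $b } keys %count;

# %rank is now (50 => [3], 25 => [2, 6]): the values 2 and 6 share the
# 25% key, so one matching key yields both of them.
for my $pct (sort { $b <=> $a } keys %rank) {
    print "$pct% => @{ $rank{$pct} }\n";
}
# prints:
# 50% => 3
# 25% => 2 6
```

Because `keys` returns hash keys in no particular order, an unsorted loop with a threshold of 10 may reach the 25 key before the 50 key, which would explain the H=2,6 result.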
 
