convenient module to take statistics for hashed structures?


ela

__DATA__
ID B C D E F G H
1 3 7 9 3 4 2 3
1 3 7 9 3 4 2 2
1 3 7 9 5 8 6 6
1 3 7 9 3 4 2 3
2 4 7 9 3 4 2 1
2 4 7 9 3 4 2 2
2 4 7 9 3 4 2 3
2 4 7 9 3 4 2 3

For each ID (the example above has two, 1 and 2), I want to identify the
"last common ancestor" (LCA), with H taking precedence over B, based on some
defined threshold. If the threshold is set to 100%, then the LCA of ID 1 is
D=9; if it is set to 75%, then it is F=2. For ID 2: 100%, G=2; 75%, G=2;
50%, H=3.

While a hash makes it easy to count distinct values, I don't know
whether Perl offers a simple way to get the max/min. For the
following code,

@array

For each row
    for ($i=0; $i<$numcol; $i++)
        $array[$i]{$key}++;


if just for B, I know the following should be written in this way:

foreach $key (keys %B) { $Bpertpot{$key} = $B{$key}/$total; }

for ($i=0; $i<$numcol; $i++) {
    $maxcol[$i] = 0;
    foreach $key (keys %Bpertpot) {
        if ($Bpertpot{$key} > $maxcol[$i]) { $maxcol[$i] = $Bpertpot{$key}; }
    }
}

but then I don't know how to do that when traversing an array of hashes, i.e.
how to replace %B and %Bpertpot with something compatible with the array
structure. In fact, I wonder whether there are already well-established
modules that handle this kind of max/min statistics problem, which seems to
come up frequently in the business sector...
 

smallpond

Have a look at List::Util, which is a core module. It has max and first
functions that you will find useful.

Any book on Perl will explain how to create and use a hash of arrays or an
array of arrays.
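
As a rough sketch of how List::Util could be applied to the array-of-hashes layout from the question: the names @rows, @count and @maxfrac are made up for illustration, and the rows are the ID 1 sample data with the ID column dropped. It only computes the per-column maximum fraction, not the full LCA lookup.

```perl
use strict;
use warnings;
use List::Util qw(max);

# Rows for ID 1 from the sample data (columns B..H), hard-coded for illustration.
my @rows = (
    [3, 7, 9, 3, 4, 2, 3],
    [3, 7, 9, 3, 4, 2, 2],
    [3, 7, 9, 5, 8, 6, 6],
    [3, 7, 9, 3, 4, 2, 3],
);

# $count[$i]{$value} = how many rows have $value in column $i.
my @count;
for my $row (@rows) {
    $count[$_]{ $row->[$_] }++ for 0 .. $#$row;
}

# Largest fraction of rows sharing a single value, per column.
my $total   = @rows;
my @maxfrac = map { max(values %$_) / $total } @count;

print join(' ', @maxfrac), "\n";   # prints: 1 1 1 0.75 0.75 0.75 0.5
```

Here max(values %$_) replaces the hand-written inner loop over %Bpertpot, and the outer map walks the array of hashes one column at a time.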
 

George Mpouras

ela said:
set it to 75%, then it is F=2
I do not see any 2 in the F column. I have trouble understanding what you
mean and what you want.
 

George Mpouras

ela said:
Thanks for correcting the mistake. It is G=2 (2,2,6,2, so fulfilling the
75% requirement) and not F=2. Always check from H (i.e. the last column)
first. H's majority value is 3, which is only 50% abundant, so then look up
the columns one by one (G, F, E, ...). Each ID (without knowing beforehand
how many rows it has) has to go through the same process.


#!/usr/bin/perl
#
# ok here is your homework .
# next time try not cheat , because even if
# you pass the lesson, will not learn !

use strict;
use warnings;

my %col;    # column index => column name
my %data;   # per-ID line counts, value counts and ranks
ReadData();

$_ = query(1,100);
print "id=1, thr=100% -> Field=$_->[0],Value=@{$_->[1]}\n";

$_ = query(1,75);
print "id=1, thr=75% -> Field=$_->[0],Value=@{$_->[1]}\n";

$_ = query(2,100);
print "id=2, thr=100% -> Field=$_->[0],Value=@{$_->[1]}\n";

$_ = query(2,75);
print "id=2, thr=75% -> Field=$_->[0],Value=@{$_->[1]}\n";

$_ = query(2,50);
print "id=2, thr=50% -> Field=$_->[0],Value=@{$_->[1]}\n";

$_ = query(2,25);
print "id=2, thr=25% -> Field=$_->[0],Value=@{$_->[1]}\n";


sub ReadData {
    while (<DATA>) {
        chomp;
        my @a = split /\s+/;
        # the first line holds the column names
        unless (exists $col{1}) { @col{1..$#a} = @a[1..$#a]; next }
        ++$data{$a[0]}->{lines};
        for (my $i=1; $i<=$#a; $i++) {
            ++$data{$a[0]}->{field}->{$col{$i}}->{data}->{$a[$i]}
        }
    }
    # group the values of every field by their percentage of the ID's lines
    foreach my $id (keys %data) {
        foreach my $field (keys %{$data{$id}->{field}}) {
            foreach my $item (keys %{$data{$id}->{field}->{$field}->{data}}) {
                push @{ $data{$id}->{field}->{$field}->{rank}->{ 100*(
                    $data{$id}->{field}->{$field}->{data}->{$item} / $data{$id}->{lines} ) } },
                    $item
            }
        }
    }
    #use Data::Dumper; print Dumper(\%data);exit;
}


sub query {
    my ($id,$rank)=@_;
    # walk the columns from the last one back to the first
    foreach my $field (reverse sort keys %col) {
        if ( exists $data{$id}->{field}->{$col{$field}}->{rank}->{ $rank } ) {
            return [ $col{$field},
                     $data{$id}->{field}->{$col{$field}}->{rank}->{$rank} ]
        }
    }
    ['',[]]
}


__DATA__
ID B C D E F G H
1 3 7 9 3 4 2 3
1 3 7 9 3 4 2 2
1 3 7 9 5 8 6 6
1 3 7 9 3 4 2 3
2 4 7 9 3 4 2 1
2 4 7 9 3 4 2 2
2 4 7 9 3 4 2 3
2 4 7 9 3 4 2 3
 

George Mpouras

Sorry for the silly joke in the comment.

$col{1} #what does 1 refer to?
This is just a check to see whether we are reading the first line, the one
with the column names.

$#a
is the index of the last item of the array @a. The following are synonymous:
$array[ $#array ]
$array[ -1 + scalar @array ]
$array[-1]

@col{1..$#a} #array of hash?
This is called hash slice; used to create a hash from an array
my @array = qw/a b c/;
my %hash = ();
@hash{ @array } = (1, 2, 3);   # some values

@a[1..$#a] #array of what?
Oh, just some array elements (an array slice)
@array[2..4] -> $array[2], $array[3], $array[4]

++$data{$a[0]}->{lines}; #hash of hash? and an arbitrary name "line" is
given?
Let's keep the total number of lines of every ID in a hash reference, under
the key "lines".

++$data{$a[0]}->{field}->{$col{$i}}->{data}->{$a[$i]} } } #oh, this line
is really... hard to know why arrow can be used again and again....
The arrows are not necessary, but I find them beautiful.
We want to keep our data isolated in a separate sub-hash under the key
"data".

push @{ $data{$id}->{field}->{$field}->{rank}->{ 100*(
$data{$id}->{field}->{$field}->{data}->{$item} / $data{$id}->{lines} ) } } ,
$item}}} #what advantage of using push here?

Here we want to keep all the values that occur with the same frequency!
So if, for example, there are four different numbers, we can report all of
them back when the requested threshold is 25%.

['',[]] #what is this...?!
This is the default answer if no match for the threshold is found. It holds
two items: the empty string '' and an empty array reference representing
the (no) values found.



If you check what I have done, you will find that it can be rewritten
to be almost 10 times faster, but it is good enough for a start.
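
A small standalone sketch of the constructs discussed above (last index, array slice, hash slice, and push onto a nested reference). The variables @names and %rank are illustrative only and are not taken from the posted program.

```perl
use strict;
use warnings;

my @a = qw(ID B C D);          # a header line split into fields

# $#a is the last valid index of @a, so $a[$#a] and $a[-1] are the same element.
print "last index: $#a, last element: $a[-1]\n";

# Array slice: several elements at once.
my @names = @a[1 .. $#a];      # ('B', 'C', 'D')

# Hash slice: assign several keys at once.
my %col;
@col{1 .. $#a} = @names;       # %col = (1 => 'B', 2 => 'C', 3 => 'D')

# push onto a nested array reference; missing levels are created on demand
# (autovivification), so no explicit initialisation is needed.
my %rank;
push @{ $rank{H}{50} }, 3;
push @{ $rank{H}{25} }, 2, 6;

print "col 2 is $col{2}; values at 25%: @{ $rank{H}{25} }\n";
```

Writing $rank{H}{25} is the same as $rank{H}->{25}; the arrow between adjacent subscripts is optional.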


Peace.
 

George Mpouras

Uncomment the line
#use Data::Dumper; print Dumper(\%data);exit;
and you will understand the underlying logic on your own.
 

news.ntua.gr

"John W. Krahn" wrote in message:
This is called hash slice;
Correct.

used to create a hash

Only my() can create a hash.


I thought that local, our, state, could also do the job
 

Uri Guttman

nng> "John W. Krahn" wrote in message:
nng>
nng> Correct.

nng> Only my() can create a hash.

nng> I thought that local, our, state, could also do the job

our doesn't create a variable. it only creates a lexical alias to the
variable of the same name in the current package.

local doesn't create a variable. it pushes the value of a variable and
allows for a new value to be put in its place.

state variables are just like my but they don't get reinitialized when
the enclosing block is entered.

uri
 

ela

George Mpouras said:
set it to 75%, then it is F=2
I do not see any 2 in the F column. I have trouble understanding what you
mean and what you want.

Thanks for correcting the mistake. It is G=2 (2,2,6,2, so fulfilling the 75%
requirement) and not F=2. Always check from H (i.e. the last column) first.
H's majority value is 3, which is only 50% abundant, so then look up the
columns one by one (G, F, E, ...). Each ID (without knowing beforehand how
many rows it has) has to go through the same process.
 

George Mpouras

our doesn't create a variable. it only creates a lexical alias to the
variable of the same name in the current package.

Alias of what, if our is the only definition?
#!/usr/bin/perl
our $var=1;
print $var;
exit 0
 

George Mpouras

ela said:
While the suggested solution works perfectly for the example data inside
the perl script, it does not work when the data change to:

__DATA__
Identity of query sequence Superkingdom Kingdom Subkingdom Phylum
Class Order Family Genus Species group Species
NODE_124_length_77_cov_13.792208 Bacteria undef undef
Proteobacteria Gammaproteobacteria Enterobacteriales
Enterobacteriaceae Escherichia undef Escherichia coli
NODE_124_length_77_cov_13.792208 Bacteria undef undef
Proteobacteria Gammaproteobacteria Enterobacteriales
Enterobacteriaceae Escherichia undef Escherichia coli



Garbage in, garbage out.

Your data are completely inconsistent. You have to solve your input data
problem first.

Make sure that every line of your data contains the same number of columns,
and that each column is separated from the next by exactly the same string.

Are you sure the first line will always describe your column names?

For a start, separate your fields with | only: no commas, no tabs, no
spaces.

Make sure that you have the same number of | on every line, and
change the split to my @a = split /\|/, $_, -1;

So once your data look like the following, try again:

ID|C1|C2|C3
1|a1|b1|c1
2|a2|b2|c2
3|a3|b3|c3
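
One possible way to check the "same number of | on every line" rule before processing is sketched below. The helper name fields and the sample @lines are assumptions made for the example; the -1 limit is what keeps trailing empty fields, so a line such as 2|a2|| still counts as four columns.

```perl
use strict;
use warnings;

# Split one '|'-separated line into fields. The -1 limit keeps trailing
# empty fields, so every well-formed line yields the same number of columns.
sub fields { my @f = split /\|/, $_[0], -1; return @f }

my @lines  = ('ID|C1|C2|C3', '1|a1|b1|c1', '2|a2||', '3|a3|b3');
my @header = fields($lines[0]);

# Collect every line whose column count differs from the header's.
my @bad = grep { my @f = fields($_); @f != @header } @lines;

print "bad line: $_\n" for @bad;   # prints: bad line: 3|a3|b3
```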
 

ela

#!/usr/bin/perl
#
# ok here is your homework .
# next time try not cheat , because even if
# you pass the lesson, will not learn !

Thanks for your solution. I'd like to know more about what your code is
doing...

$col{1} #what does 1 refer to?
@col{1..$#a} #array of hash?
@a[1..$#a] #array of what?

++$data{$a[0]}->{lines}; #hash of hash? and an arbitrary name "line" is
given?
++$data{$a[0]}->{field}->{$col{$i}}->{data}->{$a[$i]} } } #oh, this line
is really... hard to know why arrow can be used again and again....

push @{ $data{$id}->{field}->{$field}->{rank}->{ 100*(
$data{$id}->{field}->{$field}->{data}->{$item} / $data{$id}->{lines} ) } } ,
$item}}} #what advantage of using push here?

['',[]] #what is this...?!
 

George Mpouras

ela said:
The problem arises from the number of fields. Since there are more than 9
fields, the sort has to be done numerically, like this:

foreach my $field (reverse sort {$a<=>$b} keys %col) {

so that perl treats 10 as bigger than 9.

Now the remaining problem is how to handle the threshold cleverly.
George's solution uses a hash for fast access, so a progressive test
(100, 99, 98, ..., $threshold) could in fact be performed. However, when
the data is large (>1 million rows with a lot of IDs), this may not be a
good idea......

If your columns are

Identity of query sequence | Superkingdom | Kingdom | Subkingdom |
Phylum | Class | Order | Family
| Genus | Species | group | Species

you will notice that you have two columns with the same name, "Species", so I
assumed one of them should be Species2 or something. Now you are ready. Run
the following code against the attached data.txt




use strict;
use warnings;

my %col;    # column index => column name
my %data;   # per-ID line counts, value counts and ranks
ReadData('./data.txt');

$_ = query('NODE_124_length_77_cov_13.792208',100);
print "Field=$_->[0],Value=@{$_->[1]}\n";


sub ReadData {
    open my $input, '<', $_[0] or die "Cannot open $_[0]: $!\n";
    while (<$input>) {
        chomp;
        # fields are separated by |, with optional surrounding spaces;
        # -1 keeps trailing empty fields
        my @a = split /\s*\|\s*/, $_, -1;
        unless (exists $col{1}) { @col{1..$#a} = @a[1..$#a]; next }
        ++$data{$a[0]}->{lines};
        for (my $i=1; $i<=$#a; $i++) {
            ++$data{$a[0]}->{field}->{$col{$i}}->{data}->{$a[$i]}
        }
    }
    foreach my $id (keys %data) {
        foreach my $field (keys %{$data{$id}->{field}}) {
            foreach my $item (keys %{$data{$id}->{field}->{$field}->{data}}) {
                push @{ $data{$id}->{field}->{$field}->{rank}->{ 100*(
                    $data{$id}->{field}->{$field}->{data}->{$item} / $data{$id}->{lines} ) } },
                    $item
            }
        }
    }
    close $input;
    #use Data::Dumper; print Dumper(\%data);exit;
}


sub query {
    my ($id,$rank)=@_;
    # numeric sort, so that column 10 comes after column 9
    foreach my $field (sort {$b<=>$a} keys %col) {
        if ( exists $data{$id}->{field}->{$col{$field}}->{rank}->{ $rank } ) {
            return [ $col{$field},
                     $data{$id}->{field}->{$col{$field}}->{rank}->{$rank} ]
        }
    }
    ['',[]]
}
 

George Mpouras

# The following version is much faster than the previous


my @col;    # column names, without the ID column
my %data;   # $data{$id} = [ number of lines, [ per-column [ \%counts, \%rank ] ] ]
ReadData();


$_ = query('NODE_124_length_77_cov_13.792208', 100);
print "Field=$_->[0],Value=@{$_->[1]}\n";

$_ = query('NODE_124_length_77_cov_13.792208', 50);
print "Field=$_->[0],Value=@{$_->[1]}\n";


sub ReadData
{
    while (<DATA>) {
        chomp;
        my @a = split /\s*\|\s*/, $_, -1;
        # the first line holds the column names
        if (-1 == $#col) { push @col, @a[1..$#a]; next }
        $data{$a[0]}->[0]++;
        for (my $i=1; $i<=$#a; $i++) {
            $data{$a[0]}->[1]->[$i-1]->[0]->{$a[$i]}++
        }
    }
    # group the values of every column by their integer percentage
    foreach my $id ( keys %data ) {
        foreach my $f ( @{$data{$id}->[1]} ) {
            foreach my $v ( keys %{$f->[0]} ) {
                push @{ $f->[1]->{ int(100*( $f->[0]->{$v}/$data{$id}->[0] )) } }, $v
            }
        }
    }
    #use Data::Dumper; print Dumper(\%data);exit;
}


sub query {
    # walk the columns from the last one back to the first
    for (my $i=$#{$data{$_[0]}->[1]}; $i>=0; $i--) {
        return [ $col[$i], $data{$_[0]}->[1]->[$i]->[1]->{$_[1]} ]
            if exists $data{$_[0]}->[1]->[$i]->[1]->{$_[1]}
    }
    ['',[]]
}


__DATA__
Identity of query sequence | Superkingdom | Kingdom | Subkingdom | Phylum | Class | Order | Family | Genus | Species2 | group | Species
NODE_124_length_77_cov_13.792208 | Bacteria | undef | undef | Proteobacteria | Gammaproteobacteria | Enterobacteriales | Enterobacteriaceae | Escherichia | undef | Escherichi1 | coli
NODE_124_length_77_cov_13.792208 | Bacteria | undef | undef | Proteobacteria | Gammaproteobacteria | Enterobacteriales | Enterobacteriaceae | Escherichia | undef | Escherichi2 | coli
 

Peter J. Holzer

foreach my $id ( keys %data ) {
foreach my $f ( @{$data{$id}->[1]} ) {
foreach my $v ( keys %{$f->[0]} ) {
push @{ $f->[1]->{int 100*( $f->[0]->{$v}/$data{$id}->[0])} }, $v}}} [...]
__DATA__
Identity of query sequence | Superkingdom | Kingdom | Subkingdom |
Phylum | Class | Order | Family
| Genus | Species2 | group | Species
NODE_124_length_77_cov_13.792208 | Bacteria | undef | undef |
Proteobacteria | Gammaproteobacteria | Enterobacteriales |
Enterobacteriaceae | Escherichia | undef | Escherichi1 | coli
[...]

When posting code to a newsgroup, please ensure that proper indentation
and line wraps are preserved. This is very hard to read because of the
missing indentation and doesn't work as posted because of the extra
newlines.

hp
 

Uri Guttman

GM> Alias of what, if our is the only definition?
GM> #!/usr/bin/perl
GM> our $var=1;
GM> print $var;
GM> exit 0

the current package as i said. what is the default current package?

uri
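
To make the distinction concrete, here is a small sketch of how our, local and state behave. It assumes Perl 5.10 or later (for state); the sub names show, with_local and counter are made up for the example.

```perl
use strict;
use warnings;
use feature 'state';

our $var = 1;                    # lexical alias for the package variable $main::var

sub show { return $main::var }   # reads the package variable directly

sub with_local {
    local $main::var = 2;        # old value is saved and restored when the sub returns
    return show();
}

sub counter {
    state $n = 0;                # initialised once, keeps its value between calls
    return ++$n;
}

print show(), "\n";              # 1
print with_local(), "\n";        # 2
print show(), "\n";              # 1 again
print counter(), counter(), "\n";   # 12
```

Since no package statement appears, the current package is main, so our $var and $main::var name the same variable.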
 

ela

While the suggested solution works perfectly for the example data inside the
perl script, it does not work when the data change to:

__DATA__
Identity of query sequence Superkingdom Kingdom Subkingdom
Phylum Class Order Family Genus Species group Species
NODE_124_length_77_cov_13.792208 Bacteria undef undef
Proteobacteria Gammaproteobacteria Enterobacteriales
Enterobacteriaceae Escherichia undef Escherichia coli
NODE_124_length_77_cov_13.792208 Bacteria undef undef
Proteobacteria Gammaproteobacteria Enterobacteriales
Enterobacteriaceae Escherichia undef Escherichia coli


even though I changed

my @a = split /\s+/; to my @a = split /\t/;

Moreover, the current implementation hard-codes an exact rank instead of
testing whether a threshold is surpassed, so if I input 10 (i.e. an abundance
exceeding 10% is acceptable) instead of 100, no result is returned at all.
Even if I use "100", the result returned still differs from my expectation.

I expect the program gives the result

id=NODE_124_length_77_cov_13.792208, thr=100% ->
Field=Species,Value=Escherichia coli

but it gives me:

id=NODE_124_length_77_cov_13.792208, thr=100% ->
Field=Order,Value=Enterobacteriales

this statement:
push @{ $data{$id}->{field}->{$field}->{rank}->{
100*($data{$id}->{field}->{$field}->{data}->{$item} /
$data{$id}->{lines} ) } } , $item}

looks very new to me; can anybody tell me more about what it is for?

usually I use push like:

push @array, $element;

and the above code does not even have the ";" symbol, yet there is no
runtime error ......
 

ela

The problem arises from the number of fields. Since there are more than 9
fields, the sort has to be done numerically, like this:

foreach my $field (reverse sort {$a<=>$b} keys %col) {

so that perl treats 10 as bigger than 9.

Now the remaining problem is how to handle the threshold cleverly.
George's solution uses a hash for fast access, so a progressive test
(100, 99, 98, ..., $threshold) could in fact be performed. However, when
the data is large (>1 million rows with a lot of IDs), this may not be a
good idea......
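
A short illustration of why the default sort goes wrong once there are more than 9 columns. The %col below is a made-up stand-in with 11 numbered keys; only the ordering matters.

```perl
use strict;
use warnings;

# Hypothetical column table with keys 1..11.
my %col = map { $_ => "field$_" } 1 .. 11;

my @as_strings = reverse sort keys %col;          # default sort compares as strings
my @as_numbers = sort { $b <=> $a } keys %col;    # explicit numeric comparison

print "@as_strings\n";   # 9 8 7 6 5 4 3 2 11 10 1
print "@as_numbers\n";   # 11 10 9 8 7 6 5 4 3 2 1
```

With the string sort, column 9 is visited first and columns 10 and 11 are only reached near the end, so the search no longer starts from the last column.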
 

George Mpouras

ela said:
I want to change your implementation from "discrete" checking to
"continuous" checking. The logic is to first sort the rank keys (expected
range: (0-100]) numerically, then test whether the largest key (e.g. 100,
75 etc.) is larger than the specified threshold. My problem is that I don't
know how to refer to the keys under

$data{$_[0]}->[1]->[$i]->[1]

Writing something like "foreach my $field (sort {$b<=>$a} keys
%data{$_[0]}->[1]->[$i]->[1])" (thanks to McClellan for teaching me how to
use sort appropriately here) does not work. Moreover, there is no need for
"foreach" here: if the largest key can't surpass the threshold, neither can
the smaller ones. So how can I avoid "foreach" here?


For the data

ID|B|C|D|E|F|G|H
01|3|7|9|3|4|2|3
01|3|7|9|3|4|2|2
01|3|7|9|5|8|6|6
01|3|7|9|3|4|2|3
02|4|7|9|3|4|2|1
02|4|7|9|3|4|2|2
02|4|7|9|3|4|2|3
02|4|7|9|3|4|2|3

What would your input be, and what do you expect?
An example makes things clearer.
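
One way the "continuous" check described in the quoted question might look is sketched below. It is an illustration under assumptions, not the posted solution: @fields is a hand-built stand-in for $data{$id}->[1] holding only the rank hashes for the ID 01 rows, and query_at_least is a made-up name. The keys of a hash reference are reached by dereferencing it, as in keys %{ $data{$_[0]}->[1]->[$i]->[1] }, and List::Util's max picks the largest key without an explicit foreach.

```perl
use strict;
use warnings;
use List::Util qw(max);

# Hypothetical stand-in for $data{$id}->[1]: one entry per column (B..H),
# each holding [ \%counts, \%rank ]. Only the rank hashes are filled in.
my @col    = qw(B C D E F G H);
my @fields = map { [ {}, $_ ] } (
    { 100 => [3] }, { 100 => [7] }, { 100 => [9] },
    { 75 => [3], 25 => [5] }, { 75 => [4], 25 => [8] },
    { 75 => [2], 25 => [6] }, { 50 => [3], 25 => [2, 6] },
);

# Walk from the last column backwards; return the first column whose most
# frequent value reaches at least $threshold percent.
sub query_at_least {
    my ($threshold) = @_;
    for (my $i = $#fields; $i >= 0; $i--) {
        my $rank = $fields[$i][1];
        my $best = max(keys %$rank);      # largest percentage in this column
        return [ $col[$i], $rank->{$best} ] if $best >= $threshold;
    }
    return [ '', [] ];
}

for my $thr (100, 75, 60, 50) {
    my $r = query_at_least($thr);
    print "thr=$thr% -> Field=$r->[0],Value=@{ $r->[1] }\n";
}
```

With this stand-in data, a threshold of 60 falls through H (best 50) and stops at G (best 75), which an exact-match lookup on the key 60 would not find.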
 
