# convenient module to take statistics for hashed structures?

Discussion in 'Perl Misc' started by ela, Mar 13, 2011.

1. ### elaGuest

__DATA__
ID B C D E F G H
1 3 7 9 3 4 2 3
1 3 7 9 3 4 2 2
1 3 7 9 5 8 6 6
1 3 7 9 3 4 2 3
2 4 7 9 3 4 2 1
2 4 7 9 3 4 2 2
2 4 7 9 3 4 2 3
2 4 7 9 3 4 2 3

For each ID (the above example has two (1 and 2)), I want to identify the
"last common
ancestor: LCA, H being higher preference than B " based on some defined
threshold. If the threshold is set to 100%, then the LCA of ID 1 is D=9; if
set it to 75%, then it is F=2. For ID 2 (100%, G=2; 75%, G=2; 50% H=3)

While a hash makes it easy to count distinct values, I don't know
whether Perl offers a simple way to get the max/min. Given the
following pseudocode,

@array

For each row
for (\$i=0; \$i<\$numcol; \$i++)
\$array[\$i]{\$key}++;

if just for B, I know the following should be written in this way:

foreach \$key (keys %B) { \$Bpertpot{\$key} = \$B{\$key}/\$total; }

for (\$i=0; \$i<\$numcol; \$i++) {
\$maxcol[\$i] = 0;
foreach \$key (keys %Bpertpot) { if (\$Bpertpot{\$key}> \$maxcol[\$i]) {
\$maxcol[\$i] = \$Bpertpot{\$key}; }
}

but then I don't know how to do that when traversing an array of hashes, i.e.
how to replace %B and %Bpertpot with something compatible with the array
structure. In fact, I wonder whether there are already well-established modules
that handle this kind of max/min statistics problem, which seems to come up
frequently in the business sector...

ela, Mar 13, 2011

2. ### smallpondGuest

On Mar 13, 11:32 am, "ela" <> wrote:
> [ snip ]

Have a look at List::Util, which is a core module. It has max and first
functions that you will find useful.

Any book on Perl will explain how to create and use a hash of arrays or an
array of arrays.
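
As a rough illustration of how max and first from List::Util could be applied to an array of hashes, here is a minimal sketch. The row data is copied from ID 1 in the original post; the variable names (@rows, @count, @maxcol, $threshold, $hit) are made up for this example, and a threshold is treated as "at least this share of the rows", which is an assumption.

```perl
use strict;
use warnings;
use List::Util qw(max first);

# Rows of ID 1 from the sample data (columns B..H).
my @rows = (
    [3, 7, 9, 3, 4, 2, 3],
    [3, 7, 9, 3, 4, 2, 2],
    [3, 7, 9, 5, 8, 6, 6],
    [3, 7, 9, 3, 4, 2, 3],
);

# Array of hashes: $count[$i]{$value} is how often $value occurs in column $i.
my @count;
for my $row (@rows) {
    $count[$_]{ $row->[$_] }++ for 0 .. $#$row;
}

# Share of the most frequent value, per column.
my @maxcol = map { max(values %$_) / @rows } @count;

# Scan from the last column (H) back to the first (B) and stop at the
# first column whose most frequent value reaches the threshold.
my $threshold = 0.75;
my $hit = first { $maxcol[$_] >= $threshold } reverse 0 .. $#maxcol;

print "maxcol: @maxcol\n";
print "column index reaching $threshold: $hit\n";
```

With this data the scan stops at index 5, which is column G.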

smallpond, Mar 14, 2011

3. ### George MpourasGuest

"ela" <> wrote in message
news:ilgvng\$7ec\$...
> [ snip ]

> set it to 75%, then it is F=2

I do not see any 2 in the F column. I have trouble understanding what you
mean and what you want.

George Mpouras, Mar 14, 2011
4. ### George MpourasGuest

"ela" <> wrote in message
news:ilkui6\$mhk\$...
>
> "George Mpouras" <> wrote in message
> news:ilklas\$qos\$...
>>
>> "ela" <> wrote in message
>> news:ilgvng\$7ec\$...
>>> [ snip ]

>>
>>
>> set it to 75%, then it is F=2
>> I do not see any 2 at F column. I have problem to undestand what you
>> mean/what you want.

> Thanks for correcting the mistake. It is G=2 (2,2,6,2; so fulfilling the
> 75% requirement) and not F=2. Always check from H (or the last column
> first). H's majority is 3, for only 50% abundant, and then look up one by
> one (F, E, D, ...). Each ID (without knowing how many incidents
> beforehand) has to repeat the same process again and again.
>

#!/usr/bin/perl
#
# ok here is your homework .
# next time try not cheat , because even if
# you pass the lesson, will not learn !

use strict;
use warnings;

my %col;     # column index => column name
my %data;    # per ID: line count, value counts and percentage ranks

$_ = query(1,100);
print "id=1, thr=100% -> Field=$_->[0],Value=@{$_->[1]}\n";

$_ = query(1,75);
print "id=1, thr=75% -> Field=$_->[0],Value=@{$_->[1]}\n";

$_ = query(2,100);
print "id=2, thr=100% -> Field=$_->[0],Value=@{$_->[1]}\n";

$_ = query(2,75);
print "id=2, thr=75% -> Field=$_->[0],Value=@{$_->[1]}\n";

$_ = query(2,50);
print "id=2, thr=50% -> Field=$_->[0],Value=@{$_->[1]}\n";

$_ = query(2,25);
print "id=2, thr=25% -> Field=$_->[0],Value=@{$_->[1]}\n";

# INIT runs after compilation but before the statements above, so the
# data is loaded and ranked by the time the first query is made.
INIT {
    while (<DATA>) {
        chomp;
        my @a = split /\s+/;
        # The first line holds the column names.
        unless (exists $col{1}) { @col{1..$#a} = @a[1..$#a]; next }
        ++$data{$a[0]}->{lines};
        for (my $i=1; $i<=$#a; $i++) {
            ++$data{$a[0]}->{field}->{$col{$i}}->{data}->{$a[$i]};
        }
    }
    # For every field, group its values by the percentage of lines they occur in.
    foreach my $id (keys %data) {
        foreach my $field (keys %{$data{$id}->{field}}) {
            foreach my $item (keys %{$data{$id}->{field}->{$field}->{data}}) {
                push @{ $data{$id}->{field}->{$field}->{rank}->{
                    100 * ( $data{$id}->{field}->{$field}->{data}->{$item} / $data{$id}->{lines} ) } },
                    $item;
            }
        }
    }
    #use Data::Dumper; print Dumper(\%data); exit;
}

# Return [field name, values] for the last column holding values at exactly $rank percent.
sub query {
    my ($id,$rank)=@_;
    foreach my $field (reverse sort keys %col) {
        if ( exists $data{$id}->{field}->{$col{$field}}->{rank}->{ $rank } ) {
            return [ $col{$field},
                     $data{$id}->{field}->{$col{$field}}->{rank}->{$rank} ];
        }
    }
    ['',[]]
}

__DATA__
ID B C D E F G H
1 3 7 9 3 4 2 3
1 3 7 9 3 4 2 2
1 3 7 9 5 8 6 6
1 3 7 9 3 4 2 3
2 4 7 9 3 4 2 1
2 4 7 9 3 4 2 2
2 4 7 9 3 4 2 3
2 4 7 9 3 4 2 3

George Mpouras, Mar 14, 2011
5. ### George MpourasGuest

Sorry for the silly joke at the comment.

> \$col{1} #what does 1 refer to?

This is just a check to see whether we are reading the first line, the one
with the column names.

> \$#a

is the index of the last item of an array. These all refer to the same element:
$array[ $#array ]
$array[ -1 + scalar @array ]
$array[-1]

> @col{1..\$#a} #array of hash?

This is called hash slice; used to create a hash from an array
my @array = qw/a b c/;
my %hash = ();
@hash{ @array } = (1, 2, 3);   # some values

> @a[1..\$#a] #array of what?

An array slice, i.e. some of the array's elements:
@array[2..4] is ( $array[2], $array[3], $array[4] )
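
The three answers above ($#a, the hash slice, and the array slice) can be tried together in a short sketch. The header line is a shortened, made-up example and the names @fields, $last, @names are hypothetical; the script itself uses @a and %col.

```perl
use strict;
use warnings;

my @fields = qw(ID B C D);     # a header line after split

# $#fields is the index of the last element.
my $last = $#fields;
print "same element\n"
    if $fields[$last] eq $fields[-1] and $fields[-1] eq $fields[ @fields - 1 ];

# Array slice: several elements at once, here everything except the ID.
my @names = @fields[1 .. $#fields];

# Hash slice: assign a list of values to a list of keys in one statement.
my %col;
@col{ 1 .. $#fields } = @names;

print "$_ => $col{$_}\n" for sort { $a <=> $b } keys %col;
```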

>
> ++\$data{\$a[0]}->{lines}; #hash of hash? and an arbitrary name "line" is
> given?

Let's keep the total number of lines for every ID in a hash reference under the key "lines".

> ++\$data{\$a[0]}->{field}->{\$col{\$i}}->{data}->{\$a[\$i]} } } #oh, this line
> is really... hard to know why arrow can be used again and again....

Arrows between subscripts are not necessary, but I find them beautiful.
We want to keep our data isolated in a separate sub-hash under the key "data".

> push @{ \$data{\$id}->{field}->{\$field}->{rank}->{ 100*(
> \$data{\$id}->{field}->{\$field}->{data}->{\$item} / \$data{\$id}->{lines} ) } } ,
> \$item}}} #what advantage of using push here?

Here we want to keep all the values that occur with the same percentage!
So if, for example, there are four different numbers, we can report
all of them back when the requested threshold is 25%.
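
To make the role of push more concrete, here is a small sketch that builds the same kind of percentage buckets for a single column. The values are column H of ID 2 from the sample data; %seen and %rank are names chosen for this example only.

```perl
use strict;
use warnings;

my @h = (1, 2, 3, 3);          # column H for ID 2 in the sample data

my %seen;
$seen{$_}++ for @h;            # value => number of occurrences

# $rank{$pct} is an array reference. push creates it on first use
# (autovivification) and appends to it afterwards, so values that share
# a percentage end up in the same list.
my %rank;
for my $value (sort { $a <=> $b } keys %seen) {
    my $pct = 100 * $seen{$value} / @h;
    push @{ $rank{$pct} }, $value;
}

print "$_% => @{ $rank{$_} }\n" for sort { $b <=> $a } keys %rank;
```

Here 1 and 2 both land in the 25 bucket, while 3 is alone in the 50 bucket.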

> ['',[]] #what is this...?!

This is the default answer if no match for the threshold is found. It holds
two items: the empty string '' and an empty array representing the (no) found values.

If you check what I've done you will find that it can be rewritten
to be almost 10 times faster, but it is good enough for a start.

Peace.

George Mpouras, Mar 14, 2011
6. ### George MpourasGuest

Uncomment the line
#use Data::Dumper; print Dumper(\%data);exit;
and you will understand the underlying logic on your own.

George Mpouras, Mar 14, 2011
7. ### John W. KrahnGuest

George Mpouras wrote:
> Sorry for the silly joke at the comment.
>
>
> [ snip ]
>
>
>> @col{1..\$#a} #array of hash?

> This is called hash slice;

Correct.

> used to create a hash

Only my() can create a hash.

Used to add keys and values to a hash.

> from an array

from a LIST of keys and a LIST of values.

John
--
Any intelligent fool can make things bigger and
more complex... It takes a touch of genius -
and a lot of courage to move in the opposite
direction. -- Albert Einstein

John W. Krahn, Mar 14, 2011
8. ### news.ntua.grGuest

"John W. Krahn" wrote in message
news:gqwfp.59406\$...

>>> @col{1..$#a} #array of hash?

>> This is called hash slice;

> Correct.

>> used to create a hash

> Only my() can create a hash.

I thought that local, our, and state could also do the job.

news.ntua.gr, Mar 14, 2011
9. ### Uri GuttmanGuest

>>>>> "nng" == news ntua gr <> writes:

nng> "John W. Krahn" wrote in message
nng> news:gqwfp.59406\$...

>>
>>> @col{1..\$#a} #array of hash?

>> This is called hash slice;

nng> Correct.

>> used to create a hash

nng> Only my() can create a hash.

nng> I thought that local, our, state, could also do the job

our doesn't create a variable. it only creates a lexical alias to the
variable of the same name in the current package.

local doesn't create a variable. it pushes the value of a variable and
allows for a new value to be put in its place.

state variables are just like my but they don't get reinitialized when
the enclosing block is entered.
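
The differences described above can be observed with a short script. This is only an illustrative sketch: the names $level, show, with_local and counter are invented for it, and state needs Perl 5.10 or later.

```perl
use strict;
use warnings;
use feature 'state';

our $level = 'global';             # lexical alias for the package variable $main::level

sub show { return $level }         # reads the package variable

sub with_local {
    local $level = 'temporary';    # saves the old value, restored when the sub returns
    return show();
}

sub counter {
    state $n = 0;                  # initialised once, keeps its value between calls
    my $m = 0;                     # a fresh variable on every call
    return join ',', ++$n, ++$m;
}

print with_local(), "\n";                # temporary
print show(), "\n";                      # global
print counter(), ' ', counter(), "\n";   # 1,1 2,1

$main::level = 'changed';                # same variable, reached by its full name
print show(), "\n";                      # changed
```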

uri

--
Uri Guttman ------ -------- http://www.sysarch.com --
----- Perl Code Review , Architecture, Development, Training, Support ------
--------- Gourmet Hot Cocoa Mix ---- http://bestfriendscocoa.com ---------

Uri Guttman, Mar 14, 2011
10. ### elaGuest

"George Mpouras" <> wrote in message
news:ilklas\$qos\$...
>
> "ela" <> wrote in message
> news:ilgvng\$7ec\$...
>> [ snip ]

>
>
> set it to 75%, then it is F=2
> I do not see any 2 at F column. I have problem to undestand what you
> mean/what you want.

Thanks for correcting the mistake. It is G=2 (2,2,6,2; so fulfilling the 75%
requirement) and not F=2. Always start checking from H (the last column).
H's majority value is 3, at only 50% abundance, so then look further up one
column at a time (G, F, E, ...). The same process has to be repeated for
every ID, without knowing beforehand how many rows each one has.

ela, Mar 15, 2011
11. ### George MpourasGuest

>
> our doesn't create a variable. it only creates a lexical alias to the
> variable of the same name in the current package.

An alias of what, if the our statement is the only definition?
#!/usr/bin/perl
our \$var=1;
print \$var;
exit 0

George Mpouras, Mar 15, 2011
12. ### George MpourasGuest

"ela" <> wrote in message
news:iln04v\$ev1\$...
>
> While the suggested solution works perfectly for the example data inside
> the perl script, it does not work when the data change to:
>
> __DATA__
> Identity of query sequence Superkingdom Kingdom Subkingdom Phylum
> Class Order Family Genus Species group Species
> NODE_124_length_77_cov_13.792208 Bacteria undef undef
> Proteobacteria Gammaproteobacteria Enterobacteriales
> Enterobacteriaceae Escherichia undef Escherichia coli
> NODE_124_length_77_cov_13.792208 Bacteria undef undef
> Proteobacteria Gammaproteobacteria Enterobacteriales
> Enterobacteriaceae Escherichia undef Escherichia coli

Garbage in, garbage out.

Your data are completely inconsistent. You have to solve your input data
problem first.

Make sure that every line contains the same number of columns, and that each
column is separated from the next by exactly the same string.

Are you sure the first line will always hold your column names?

For a start, separate your fields with | only: no commas, no tabs, no
spaces.

Make sure that you have the same number of | on every line.

Change the split to my @a = split /\|/, $_, -1;

Once your data looks like the following, try again:

ID|C1|C2|C3
1|a1|b1|c1
2|a2|b2|c2
3|a3|b3|c3
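
The -1 limit in that split matters when the last columns of a line are empty. A minimal sketch, using a made-up line in the same ID|C1|C2|C3 layout:

```perl
use strict;
use warnings;

my $line = '3|a3||';                     # the last two fields are empty

my @default = split /\|/, $line;         # trailing empty fields are dropped
my @keep    = split /\|/, $line, -1;     # a negative limit keeps them

print scalar(@default), " fields without the limit\n";
print scalar(@keep),    " fields with -1\n";
```

Without the limit the line yields 2 fields, with -1 it yields 4, so every row keeps the same number of columns as the header.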

George Mpouras, Mar 15, 2011
13. ### elaGuest

"George Mpouras" <> wrote in message
news:ill20h\$2ech\$...

>
> #!/usr/bin/perl
> #
> # ok here is your homework .
> # next time try not cheat , because even if
> # you pass the lesson, will not learn !
>

doing...

\$col{1} #what does 1 refer to?
@col{1..\$#a} #array of hash?
@a[1..\$#a] #array of what?

++\$data{\$a[0]}->{lines}; #hash of hash? and an arbitrary name "line" is
given?
++\$data{\$a[0]}->{field}->{\$col{\$i}}->{data}->{\$a[\$i]} } } #oh, this line
is really... hard to know why arrow can be used again and again....

push @{ \$data{\$id}->{field}->{\$field}->{rank}->{ 100*(
\$data{\$id}->{field}->{\$field}->{data}->{\$item} / \$data{\$id}->{lines} ) } } ,
\$item}}} #what advantage of using push here?

['',[]] #what is this...?!

ela, Mar 15, 2011
14. ### George MpourasGuest

"ela" <> wrote in message
news:iln9fk\$ibv\$...
> The problem arises due to the number of fields. Since there are more than
> 9
> fields, and the sort has to do in this way:
>
> foreach my \$field (reverse sort {\$a<=>\$b} keys %col) {
>
> so perl treats 10 as bigger than 9.
>
> Now the remaining problem is how to solve the threshold problem cleverly.
> Originally, George's solution makes use of hash for fast access and in
> fact
> progressive test, 100,99,98, ..., \$threshold can be performed. However,
> when
> the data is large (>1 million rows with a lot of ID's), this may not be a
> good idea......
>
>

Identity of query sequence | Superkingdom | Kingdom | Subkingdom |
Phylum | Class | Order | Family
| Genus | Species | group | Species

You will notice that you have two columns with the same name "Species", so I
assumed one of them should be named Species2 or something similar. Now you are
ready. Run the following code against the attached data.txt

my %col;
my %data;

\$_ = query('NODE_124_length_77_cov_13.792208',100);
print "Field=\$_->[0],Value=@{\$_->[1]}\n";

open INPUT, '<', $ARGV[0] or die "$^E\n";
while(<INPUT>){
chomp;
my @a = split /\s*\|\s*/, \$_, -1;
unless (exists \$col{1}){@col{1..\$#a}=@a[1..\$#a];next}
++\$data{\$a[0]}->{lines};
for(my \$i=1;\$i<=\$#a;\$i++){
++\$data{\$a[0]}->{field}->{\$col{\$i}}->{data}->{\$a[\$i]} } }
foreach my \$id (keys %data) {
foreach my \$field (keys %{\$data{\$id}->{field}} ) {
foreach my \$item (keys %{\$data{\$id}->{field}->{\$field}->{data}} ) {
push @{ \$data{\$id}->{field}->{\$field}->{rank}->{ 100*(
\$data{\$id}->{field}->{\$field}->{data}->{\$item} / \$data{\$id}->{lines} ) } } ,
\$item}}}
close INPUT;
#use Data::Dumper; print Dumper(\%data);exit;
}

sub query {
my (\$id,\$rank)=@_;
foreach my \$field (sort {\$b<=>\$a} keys %col)
{
if ( exists \$data{\$id}->{field}->{\$col{\$field}}->{rank}->{ \$rank } ) {
return [ \$col{\$field},
\$data{\$id}->{field}->{\$col{\$field}}->{rank}->{\$rank}]
}
}
['',[]]
}

George Mpouras, Mar 15, 2011
15. ### George MpourasGuest

# The following version is much faster than the previous

my @col;
my %data;

\$_ = query('NODE_124_length_77_cov_13.792208', 100);
print "Field=\$_->[0],Value=@{\$_->[1]}\n";

\$_ = query('NODE_124_length_77_cov_13.792208', 50);
print "Field=\$_->[0],Value=@{\$_->[1]}\n";

INIT {   # runs before the queries above, so the data is loaded first
while (<DATA>) {
chomp;
my @a = split /\s*\|\s*/, \$_, -1;
if (-1 == \$#col){ push @col, @a[1..\$#a] ;next}
\$data{\$a[0]}->[0]++;
for(my \$i=1;\$i<=\$#a;\$i++){\$data{\$a[0]}->[1]->[\$i-1]->[0]->{\$a[\$i]}++} }
foreach my \$id ( keys %data ) {
foreach my \$f ( @{\$data{\$id}->[1]} ) {
foreach my \$v ( keys %{\$f->[0]} ) {
push @{ \$f->[1]->{int 100*( \$f->[0]->{\$v}/\$data{\$id}->[0])} }, \$v}}}
#use Data::Dumper; print Dumper(\%data);exit;
}

sub query {
for (my \$i=\$#{\$data{\$_[0]}->[1]}; \$i>=0; \$i--) {
return [\$col[\$i], \$data{\$_[0]}->[1]->[\$i]->[1]->{\$_[1]}] if exists
\$data{\$_[0]}->[1]->[\$i]->[1]->{\$_[1]} }
['',[]]
}

__DATA__
Identity of query sequence | Superkingdom | Kingdom | Subkingdom | Phylum | Class | Order | Family | Genus | Species2 | group | Species
NODE_124_length_77_cov_13.792208 | Bacteria | undef | undef | Proteobacteria | Gammaproteobacteria | Enterobacteriales | Enterobacteriaceae | Escherichia | undef | Escherichi1 | coli
NODE_124_length_77_cov_13.792208 | Bacteria | undef | undef | Proteobacteria | Gammaproteobacteria | Enterobacteriales | Enterobacteriaceae | Escherichia | undef | Escherichi2 | coli

George Mpouras, Mar 15, 2011
16. ### Peter J. HolzerGuest

On 2011-03-15 12:39, George Mpouras <> wrote:
> foreach my \$id ( keys %data ) {
> foreach my \$f ( @{\$data{\$id}->[1]} ) {
> foreach my \$v ( keys %{\$f->[0]} ) {
> push @{ \$f->[1]->{int 100*( \$f->[0]->{\$v}/\$data{\$id}->[0])} }, \$v}}}

[...]
> __DATA__
> Identity of query sequence | Superkingdom | Kingdom | Subkingdom |
> Phylum | Class | Order | Family
>| Genus | Species2 | group | Species
> NODE_124_length_77_cov_13.792208 | Bacteria | undef | undef |
> Proteobacteria | Gammaproteobacteria | Enterobacteriales |
> Enterobacteriaceae | Escherichia | undef | Escherichi1 | coli

[...]

When posting code to a newsgroup, please ensure that proper indentation
and line wraps are preserved. This is very hard to read because of the
missing indentation and doesn't work as posted because of the extra
newlines.

hp

Peter J. Holzer, Mar 15, 2011
17. ### Uri GuttmanGuest

>>>>> "GM" == George Mpouras <> writes:

>>
>> our doesn't create a variable. it only creates a lexical alias to the
>> variable of the same name in the current package.

GM> Alias of what if our is only the definition ;
GM> #!/usr/bin/perl
GM> our \$var=1;
GM> print \$var;
GM> exit 0

the current package as i said. what is the default current package?

uri

--
Uri Guttman ------ -------- http://www.sysarch.com --
----- Perl Code Review , Architecture, Development, Training, Support ------
--------- Gourmet Hot Cocoa Mix ---- http://bestfriendscocoa.com ---------

Uri Guttman, Mar 15, 2011
18. ### elaGuest

While the suggested solution works perfectly for the example data inside the
perl script, it does not work when the data change to:

__DATA__
Identity of query sequence Superkingdom Kingdom Subkingdom
Phylum Class Order Family Genus Species group Species
NODE_124_length_77_cov_13.792208 Bacteria undef undef
Proteobacteria Gammaproteobacteria Enterobacteriales
Enterobacteriaceae Escherichia undef Escherichia coli
NODE_124_length_77_cov_13.792208 Bacteria undef undef
Proteobacteria Gammaproteobacteria Enterobacteriales
Enterobacteriaceae Escherichia undef Escherichia coli

even though I changed

my @a = split /\s+/; to my @a = split /\t/;

Moreover, the current implementation matches an exact rank instead of testing
whether a threshold is surpassed, so if I input 10 (i.e. any abundance
exceeding 10% is acceptable) instead of 100, no result is returned at all.
Even if I use "100", the result returned still differs from my expectation.

I expect the program gives the result

id=NODE_124_length_77_cov_13.792208, thr=100% ->
Field=Species,Value=Escherichia coli

but it gives me:

id=NODE_124_length_77_cov_13.792208, thr=100% ->
Field=Order,Value=Enterobacteriales

this statement:
push @{ \$data{\$id}->{field}->{\$field}->{rank}->{
100*(\$data{\$id}->{field}->{\$field}->{data}->{\$item} /
\$data{\$id}->{lines} ) } } , \$item}

looks very new to me, and can anybody further tell me what it is for?

usually I use push like:

push @array, \$element;

and the code above does not even end with the symbol ";", yet there is no
runtime error...
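
On those two points, a minimal sketch may help. In Perl the semicolon separates statements, so the last statement before a closing brace may omit it, and the first argument of push can be an array reached through a reference. The names @array and %bucket are hypothetical.

```perl
use strict;
use warnings;

my (@array, %bucket);

# ';' separates statements, so the last statement before '}' may omit it.
for my $element (1 .. 3) { push @array, $element }

# @{ ... } dereferences the array stored under $bucket{odd}; the array
# is created automatically the first time something is pushed onto it.
for my $element (@array) { push @{ $bucket{odd} }, $element if $element % 2 }

print "@array | @{ $bucket{odd} }\n";
```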

ela, Mar 15, 2011
19. ### elaGuest

The problem arises from the number of fields. Since there are more than 9
fields, the sort has to be numeric:

foreach my $field (reverse sort {$a<=>$b} keys %col) {

so that perl treats 10 as bigger than 9.
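
The effect is easy to reproduce with eleven column indexes. This is only a small sketch; @keys stands in for keys %col.

```perl
use strict;
use warnings;

my @keys = (1 .. 11);

# The default sort compares strings, so "10" and "11" sort before "2".
my @as_strings = reverse sort @keys;

# A numeric comparison puts 11 and 10 first, as intended.
my @as_numbers = sort { $b <=> $a } @keys;

print "string order:  @as_strings\n";
print "numeric order: @as_numbers\n";
```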

Now the remaining problem is how to solve the threshold problem cleverly.
George's solution uses a hash for fast access, and in fact a progressive
test (100, 99, 98, ..., $threshold) could be performed against it. However,
when the data is large (>1 million rows with a lot of IDs), this may not be
a good idea...
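
One possible way around the progressive test, sketched here under the assumption that "threshold" means "the most frequent value covers at least that percentage of the ID's rows": keep only the counts per column and compare the largest count against the threshold. This is not George's code; lca, %rows, %count and @names are names invented for the sketch, and the data is the sample from the first post.

```perl
use strict;
use warnings;
use List::Util qw(max);

my @lines = split /\n/, <<'END';
ID B C D E F G H
1 3 7 9 3 4 2 3
1 3 7 9 3 4 2 2
1 3 7 9 5 8 6 6
1 3 7 9 3 4 2 3
2 4 7 9 3 4 2 1
2 4 7 9 3 4 2 2
2 4 7 9 3 4 2 3
2 4 7 9 3 4 2 3
END

my (@names, %rows, %count);
for my $line (@lines) {
    my ($id, @values) = split ' ', $line;
    if (!@names) { @names = @values; next }       # header line
    $rows{$id}++;
    $count{$id}[$_]{ $values[$_] }++ for 0 .. $#values;
}

# Walk from the last column to the first and return the first column whose
# most frequent value covers at least $threshold percent of the ID's rows.
sub lca {
    my ($id, $threshold) = @_;
    my $columns = $count{$id} or return;
    for my $i (reverse 0 .. $#$columns) {
        my $seen = $columns->[$i];
        my $best = max(values %$seen);
        next if 100 * $best / $rows{$id} < $threshold;
        my @winners = grep { $seen->{$_} == $best } sort keys %$seen;
        return ($names[$i], @winners);
    }
    return;
}

for my $query ([1, 100], [1, 75], [2, 100], [2, 75], [2, 50]) {
    my ($field, @values) = lca(@$query);
    print "id=$query->[0], thr=$query->[1]% -> $field=@values\n";
}
```

With the sample data this reports D=9 and G=2 for ID 1, and G=2, G=2 and H=3 for ID 2, which matches the expectations stated in the first post once F=2 is read as G=2. Each query costs one pass over the columns of a single ID, independent of the threshold value.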

ela, Mar 16, 2011
20. ### George MpourasGuest

> I wanna change your implementation from "discrete" checking to
> "continuous" one, the logic is to first sort (rank keys: *** expected
> range: (0-100] ***) numerically, then test if the "largest" key (e.g. 100,
> 75 etc) is larger than the threshold specified. My problem is that I don't
> know how to refer to the keys under
>
> \$data{\$_[0]}->[1]->[\$i]->[1]
>
> Writing something like "foreach my \$field (sort {\$b<=>\$a} keys
> %data{\$_[0]}->[1]->[\$i]->[1])" (Thanks for McClellan's teaching on
> appropriately using sort here) does not work. Moreover, there's no need to
> "foreach" here as if the largest one also can't surpass the threshold,
> neither the smaller ones can. So how to avoid "foreach" here?
>

For the data

ID|B|C|D|E|F|G|H
01|3|7|9|3|4|2|3
01|3|7|9|3|4|2|2
01|3|7|9|5|8|6|6
01|3|7|9|3|4|2|3
02|4|7|9|3|4|2|1
02|4|7|9|3|4|2|2
02|4|7|9|3|4|2|3
02|4|7|9|3|4|2|3

What would your input be, and what do you expect?
An example would make things clearer.
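
As for the quoted question about reaching the keys: a hash reference has to be dereferenced with %{ ... } before keys can be applied to it, and List::Util's max then gives the largest key without an explicit foreach. The sketch below uses a hypothetical $rank hash shaped like the percentage buckets in the script (percentage => list of values); the thresholds 40 and 60 are arbitrary.

```perl
use strict;
use warnings;
use List::Util qw(max);

# Hypothetical rank hash for one column: percentage => values seen at it.
my $rank = { 50 => ['3'], 25 => ['1', '2'] };

# Dereference the hash reference with %{ ... } before calling keys.
my $top = max(keys %{ $rank });

for my $threshold (40, 60) {
    if ($top >= $threshold) {
        print "threshold $threshold: @{ $rank->{$top} } at $top%\n";
    }
    else {
        print "threshold $threshold: nothing reaches it\n";
    }
}
```

If the largest key does not reach the threshold, none of the smaller ones can, so a single comparison per column is enough.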

George Mpouras, Mar 16, 2011