HoA bug? Size limit?

PB0711

Hello all,

I have just finished a script with 3 for loops, at the end of which it
generates an HoA. I've printed out that it goes through all expected
8000 loops but there are only 1599 unique keys. However, if I print
the keys out before I put them into the HoA I can see that it does all
combinations that I expect. Is this a bug or a size limit on the number
of keys for an HoA?

Thanks,

Paul
-------------OUTPUT------------
1599 <- keys
total 8000
-------------CODE-------------

my @AoA = (
["Ala", "Alanine", "71.08", "3", "2", "1","7", "0"],
["Arg", "Arginine", "156.19", "6", "2", "4", "14", "0"],
["His", "Histidine", "137.14","6","2","3","9", "0"],
["Phe", "Phenylalanine", "147.18", "9", "2", "1", "11", "0"],
["Cys", "Cysteine", "103.14", "3", "2", "1", "7", "1"],
["Gly", "Glycine", "57.05", "2", "2", "1", "5", "0"],
["Gln", "Glutamine", "128.13", "5", "3", "2", "10", "0"],
["Glu", "Glutamate", "129.11", "5", "4", "1", "9", "0"],
["Asp", "Aspartate", "115.09", "4", "4", "1", "7", "0"],
["Lys", "Lysine", "128.17", "6", "2", "2", "14", "0"],
["Leu", "Leucine", "113.16", "6", "2", "1", "13", "0"],
["Met", "Methionine", "131.20", "5", "2","1","11", "1"],
["Asn", "Asparagine", "114.10", "4", "3", "2","8", "0"],
["Ser", "Serine", "87.08", "3", "3", "1", "7", "0"],
["Tyr", "Tyrosine", "163.17", "9", "3", "1","11", "0"],
["Thr", "Threonine", "101.10", "4", "3", "1", "9", "0"],
["Ile", "Isoleucine", "113.16", "6","2","1","13","0"],
["Trp", "Tryptophan","186.21","11","2","2","12", "0"],
["Pro", "Proline", "97.12", "5", "2", "1","9", "0"],
["Val", "Valine", "99.13", "5", "2", "1", "11", "0"],
);
#----------------------------------
#making dipeptides - a nasty way V2 will do a sub
for (my $i=0; $i < $#AoA+1; $i++){
for (my $j=0; $j < $#AoA+1; $j++){
for (my $n=0; $n < $#AoA+1; $n++) {
$total++;
my $formula=0;
my $c=($AoA[$i][3]+$AoA[$j][3])+$AoA[$n][3];
my $o=($AoA[$i][4]+$AoA[$j][4])+$AoA[$n][4];
my $h=($AoA[$i][6]+$AoA[$j][6])+$AoA[$n][6];
my $n=($AoA[$i][5]+$AoA[$j][5])+$AoA[$n][5];
my $s=($AoA[$i][7]+$AoA[$j][7])+$AoA[$n][7];
if ($s == 0){
$formula = "C$c" . "H$h" . "N$n" . "O$o";
} else {
$formula = "C$c" . "H$h" . "N$n" . "O$o" . "S$s";
}
my $mass = ($AoA[$i][2] + $AoA[$j][2]) + $AoA[$n][2];
my $name = "$AoA[$i][0] $AoA[$j][0] $AoA[$n][0]";
$HoA{$name} = [ "$mass", "$formula"];
print "$name\n";
}
}
}

#for my $comp (keys %HoA) {
# print "$comp: @{$HoA{$comp}}\n";
#}
my ($count, $added);
my @key = keys %HoA;
print "$#key <- keys\n";
#----------------------------------
print "total $total\n";
 
Bob Walton

PB0711 wrote:
....
I have just finished a script with 3 for loops, at the end of which it
generates an HoA. I've printed out that it goes through all expected
8000 loops but there are only 1599 unique keys. However, if I print
the keys out before I put them into the HoA I can see that it does all
combinations that I expect. Is this a bug or a size limit on the number
of keys for an HoA?

Yes, it is a bug -- in your program, not Perl. Perl will handle many
millions of hash keys in a hash with no problem. The problem is that
some of your keys appear once, some four times, and some 14 times. 'Phe
Asp Cys', for example, shows up 4 times; 'Met Trp Cys' shows up 14
times; 'Arg Ala Lys' shows up once. I found that out by adding the
statement:

$cnt{$name}++;

after your $HoA statement, then at the end adding

use Data::Dumper;
print Dumper(\%cnt);

to print it out.

As to why that happens, you clobbered your for loop index variable $n
with the statement

my $n=($AoA[$i][5]+$AoA[$j][5])+$AoA[$n][5];

In the next "my $s=" statement you probably want the index variable, and
in the if statement you probably want the new $n. Code like the
following seems to fare better (I just used $k instead of $n for the
third for loop index).
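
A stripped-down sketch of the same scoping effect may make this easier to see. The @rows data and the @seen array are made up for illustration; only the pattern (a `my $n` declared inside a loop whose index is also `$n`) comes from the posted code. On the right-hand side of the `my` statement `$n` is still the loop index, but every later line in the block sees the new value:

```perl
use strict;
use warnings;

# hypothetical data: a label and a small number per row
my @rows = ( [ 'a', 2 ], [ 'b', 0 ], [ 'c', 1 ] );
my @seen;

for ( my $n = 0; $n < @rows; $n++ ) {
    my $n = $rows[$n][1];        # right-hand side still uses the loop index
    push @seen, $rows[$n][0];    # this line indexes with the NEW $n
}

print "@seen\n";    # prints "c a b", not "a b c"
```

The loop itself still runs the expected number of times, which is why $total looks right while the keys built after the redeclaration collide.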

Also, your "keys" output line is off by one -- you printed the index
number of the last element of @key rather than the number of elements in
@key.
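
A small illustration of the difference, using a throwaway four-key hash (the %h and @key names are only for this sketch):

```perl
use strict;
use warnings;

my %h   = map { $_ => 1 } qw(a b c d);
my @key = keys %h;

print "$#key <- last index\n";                 # 3
print scalar(@key), " <- elements\n";          # 4
print scalar( keys %h ), " <- hash keys\n";    # 4
```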

And, you didn't

use strict;

since there were undeclared variables when it was added. Let Perl help
you all it can.

And BTW, thanks for making an example anyone could copy/paste/run.
....
Paul
-------------OUTPUT------------
....
8000 <- keys
total 8000
-------------CODE-------------
use warnings;
use strict;
my @AoA = (
["Ala", "Alanine", "71.08", "3", "2", "1","7", "0"],
["Arg", "Arginine", "156.19", "6", "2", "4", "14", "0"],
["His", "Histidine", "137.14","6","2","3","9", "0"],
["Phe", "Phenylalanine", "147.18", "9", "2", "1", "11", "0"],
["Cys", "Cysteine", "103.14", "3", "2", "1", "7", "1"],
["Gly", "Glycine", "57.05", "2", "2", "1", "5", "0"],
["Gln", "Glutamine", "128.13", "5", "3", "2", "10", "0"],
["Glu", "Glutamate", "129.11", "5", "4", "1", "9", "0"],
["Asp", "Aspartate", "115.09", "4", "4", "1", "7", "0"],
["Lys", "Lysine", "128.17", "6", "2", "2", "14", "0"],
["Leu", "Leucine", "113.16", "6", "2", "1", "13", "0"],
["Met", "Methionine", "131.20", "5", "2","1","11", "1"],
["Asn", "Asparagine", "114.10", "4", "3", "2","8", "0"],
["Ser", "Serine", "87.08", "3", "3", "1", "7", "0"],
["Tyr", "Tyrosine", "163.17", "9", "3", "1","11", "0"],
["Thr", "Threonine", "101.10", "4", "3", "1", "9", "0"],
["Ile", "Isoleucine", "113.16", "6","2","1","13","0"],
["Trp", "Tryptophan","186.21","11","2","2","12", "0"],
["Pro", "Proline", "97.12", "5", "2", "1","9", "0"],
["Val", "Valine", "99.13", "5", "2", "1", "11", "0"],
);
#----------------------------------
#making dipeptides - a nasty way V2 will do a sub
my(%HoA,%cnt,$total);
for (my $i=0; $i < $#AoA+1; $i++){
for (my $j=0; $j < $#AoA+1; $j++){
for (my $k=0; $k < $#AoA+1; $k++) {
$total++;
my $formula=0;
my $c=($AoA[$i][3]+$AoA[$j][3])+$AoA[$k][3];
my $o=($AoA[$i][4]+$AoA[$j][4])+$AoA[$k][4];
my $h=($AoA[$i][6]+$AoA[$j][6])+$AoA[$k][6];
my $n=($AoA[$i][5]+$AoA[$j][5])+$AoA[$k][5];
my $s=($AoA[$i][7]+$AoA[$j][7])+$AoA[$k][7];
if ($s == 0){
$formula = "C$c" . "H$h" . "N$n" . "O$o";
} else {
$formula = "C$c" . "H$h" . "N$n" . "O$o" . "S$s";
}
my $mass = ($AoA[$i][2] + $AoA[$j][2]) + $AoA[$k][2];
my $name = "$AoA[$i][0] $AoA[$j][0] $AoA[$k][0]";
$HoA{$name} = [ "$mass", "$formula"];
$cnt{$name}++;
print "$name\n";
}
}
}
#for my $comp (keys %HoA) {
# print "$comp: @{$HoA{$comp}}\n";
#}
my ($count, $added);
print scalar(keys(%HoA))." <- keys\n";
#----------------------------------
print "total $total\n";
use Data::Dumper;
#print Dumper(\%cnt);
__END__

HTH.
 
Uri Guttman

someone else figured out the bugs so i will do a little code
review. this can be cleaned up in many ways and that would lead to fewer
bugs here and in your future code.


P> my @AoA = (

i know this is an example but hopefully that is not a real variable
name. call it something to do with amino acids or whatever.

P> ["Ala", "Alanine", "71.08", "3", "2", "1","7", "0"],

[ qw( Ala Alanine 71.08 3 2 1 7 0 ) ],

a little bit easier to read.
P> #making dipeptides - a nasty way V2 will do a sub
P> for (my $i=0; $i < $#AoA+1; $i++){
P> for (my $j=0; $j < $#AoA+1; $j++){
P> for (my $n=0; $n < $#AoA+1; $n++) {


gack! i hate seeing useless and bug prone c style for loops.

foreach my $i ( 0 .. $#AoA ){
foreach my $j ( 0 .. $#AoA ){
foreach my $k ( 0 .. $#AoA ){

and k usually follows n.

P> $total++;
P> my $formula=0;
P> my $c=($AoA[$i][3]+$AoA[$j][3])+$AoA[$n][3];
P> my $o=($AoA[$i][4]+$AoA[$j][4])+$AoA[$n][4];
P> my $h=($AoA[$i][6]+$AoA[$j][6])+$AoA[$n][6];
P> my $n=($AoA[$i][5]+$AoA[$j][5])+$AoA[$n][5];
P> my $s=($AoA[$i][7]+$AoA[$j][7])+$AoA[$n][7];

why is there a paren around the first + ?

there is a massive amount of redundancy in there. it is hard to read and
see what the slight differences in each line are.

first off, you can grab $AoA[$i] (and j and k) as soon as you have the
index value. this makes it cleaner and faster as you lose all those
extra array lookups.

and use whitespace. it was created for you to use.

foreach my $i ( 0 .. $#AoA ){

my $AoA_i = $AoA[$i] ;
foreach my $j ( 0 .. $#AoA ){

my $AoA_j = $AoA[$j] ;
foreach my $k ( 0 .. $#AoA ){

my $AoA_k = $AoA[$k] ;

my $s = $AoA_i->[7] + $AoA_j->[7] + $AoA_k->[7] ;


now we also have the repeated execution of that expression with 3 .. 7
so we factor that out with a map:

my( $c, $o, $h, $n, $s ) = map {
$AoA_i->[$_] + $AoA_j->[$_] + $AoA_k->[$_] }
} 3 .. 7 ;

but even that bothers me. you don't USE $i except to index into AoA. so
why even have the index variables? we can then eliminate the indexing to
get each loop's reference.

this now reduced the loop to:

foreach my $AoA_i ( @AoA ){
foreach my $AoA_j ( @AoA ){
foreach my $AoA_k ( @AoA ){

my( $c, $o, $h, $n, $s ) = map {
$AoA_i->[$_] + $AoA_j->[$_] + $AoA_k->[$_] }
} 3 .. 7 ;

P> if ($s == 0){
P> $formula = "C$c" . "H$h" . "N$n" . "O$o";
P> } else {
P> $formula = "C$c" . "H$h" . "N$n" . "O$o" . "S$s";
P> }

use ?: for conditional assignments. use "" for interpolation of the
whole string. you just need {} around the variable names. and you can
declare $formula here at the same time.

my $formula = ($s == 0) ?
"C${c}H${h}N${n}O${o}" :
"C${c}H${h}N${n}O${o}S${s}" ;

but there is more redundancy there as both versions start the same. so
we factor out one more time.

my $formula = "C${c}H${h}N${n}O${o}" .
( ($s == 0) ? '' : "S${s}" ) ;


P> my $mass = ($AoA[$i][2] + $AoA[$j][2]) + $AoA[$n][2];

add that to the above map as it is the same expression.
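
Putting the pieces so far together, a minimal sketch might look like the following. It is untested against the full table: the two-row @amino array, the %tripeptide hash and the $aa_i/$aa_j/$aa_k names are stand-ins chosen for this example, and the column order (2 = mass, 3 = C, 4 = O, 5 = N, 6 = H, 7 = S) is read off the indices used in the posted code.

```perl
use strict;
use warnings;

# two rows borrowed from the posted table, just enough to show the idea
my @amino = (
    [ qw( Ala Alanine   71.08 3 2 1 7 0 ) ],
    [ qw( Cys Cysteine 103.14 3 2 1 7 1 ) ],
);

my %tripeptide;
foreach my $aa_i (@amino) {
    foreach my $aa_j (@amino) {
        foreach my $aa_k (@amino) {

            # columns: 2 = mass, 3 = C, 4 = O, 5 = N, 6 = H, 7 = S
            my ( $mass, $c, $o, $n, $h, $s ) =
                map { $aa_i->[$_] + $aa_j->[$_] + $aa_k->[$_] } 2 .. 7;

            # parens around ?: matter, since . binds tighter than ?:
            my $formula = "C${c}H${h}N${n}O${o}" . ( $s ? "S${s}" : '' );
            my $name    = "$aa_i->[0] $aa_j->[0] $aa_k->[0]";

            $tripeptide{$name} = [ $mass, $formula ];
        }
    }
}

print scalar( keys %tripeptide ), " <- keys\n";    # 2 * 2 * 2 = 8
```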

P> my $name = "$AoA[$i][0] $AoA[$j][0] $AoA[$n][0]";

my $name = "$AoA_i->[0] $AoA_j->[0] $AoA_k->[0]";

P> $HoA{$name} = [ "$mass", "$formula"];

don't quote scalar vars for no reason. see the FAQ for why.
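
The FAQ entry meant here is presumably the perlfaq4 question about always quoting "$vars". For plain numbers and strings like $mass and $formula the quotes only force a needless string copy, but the habit bites as soon as the scalar holds a reference. A short sketch with made-up values:

```perl
use strict;
use warnings;

my @pair = ( 213.24, 'C9H21N3O6' );    # illustrative values only
my $ref  = \@pair;

my $copy   = $ref;      # still an array reference
my $quoted = "$ref";    # now just a string like "ARRAY(0x...)"

print ref($copy)   ? "copy is a reference\n"   : "copy is a plain string\n";
print ref($quoted) ? "quoted is a reference\n" : "quoted is a plain string\n";
```

So `$HoA{$name} = [ $mass, $formula ];` does the same job without the quotes.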

uri
 
John W. Krahn

Uri said:
gack! i hate seeing useless and bug prone c style for loops.

foreach my $i ( 0 .. $#AoA ){
foreach my $j ( 0 .. $#AoA ){
foreach my $k ( 0 .. $#AoA ){

and k usually follows n.

I thought that k usually follows j?

P> $total++;
P> my $formula=0;
P> my $c=($AoA[$i][3]+$AoA[$j][3])+$AoA[$n][3];
P> my $o=($AoA[$i][4]+$AoA[$j][4])+$AoA[$n][4];
                   ^           ^            ^
P> my $h=($AoA[$i][6]+$AoA[$j][6])+$AoA[$n][6];
                   ^           ^            ^
P> my $n=($AoA[$i][5]+$AoA[$j][5])+$AoA[$n][5];
                   ^           ^            ^
P> my $s=($AoA[$i][7]+$AoA[$j][7])+$AoA[$n][7];
                   ^           ^            ^

why is there a paren around the first + ?

there is a massive amount of redundancy in there. it is hard to read and
see what the slight differences in each line are.

first off, you can grab $AoA[$i] (and j and k) as soon as you have the
index value. this makes it cleaner and faster as you lose all those
extra array lookups.

and use whitespace. it was created for you to use.

foreach my $i ( 0 .. $#AoA ){

my $AoA_i = $AoA[$i] ;
foreach my $j ( 0 .. $#AoA ){

my $AoA_j = $AoA[$j] ;
foreach my $k ( 0 .. $#AoA ){

my $AoA_k = $AoA[$k] ;

my $s = $AoA_i->[7] + $AoA_j->[7] + $AoA_k->[7] ;


now we also have the repeated execution of that expression with 3 .. 7
so we factor that out with a map:

my( $c, $o, $h, $n, $s ) = map {

The correct order should be:

my( $c, $o, $n, $h, $s ) = map {

$AoA_i->[$_] + $AoA_j->[$_] + $AoA_k->[$_] }
} 3 .. 7 ;


John
 
John W. Krahn

John said:
Uri said:
now we also have the repeated execution of that expression with 3 .. 7
so we factor that out with a map:

my( $c, $o, $h, $n, $s ) = map {

The correct order should be:

my( $c, $o, $n, $h, $s ) = map {

$AoA_i->[$_] + $AoA_j->[$_] + $AoA_k->[$_] }
} 3 .. 7 ;

Also, you've got one too many } in there.



John
 
Uri Guttman

JWK> I thought that k usually follows j?

not in MY alphabet!

P> my $h=($AoA[$i][6]+$AoA[$j][6])+$AoA[$n][6];
JWK> ^ ^ ^

why are you pointing out the integers there?

JWK> The correct order should be:

JWK> my( $c, $o, $n, $h, $s ) = map {


my comments and code were highly untested. they were more for
ejimicating the OP than actually writing working code. :)

uri
 
Tad McClellan

Uri Guttman said:
JWK> I thought that k usually follows j?

not in MY alphabet!
^^
^^

_your_ alphabet contains upper case? Horrors!

P> my $h=($AoA[$i][6]+$AoA[$j][6])+$AoA[$n][6];
JWK> ^ ^ ^

why are you pointing out the integers there?


Because ...

JWK> The correct order should be:

JWK> my( $c, $o, $n, $h, $s ) = map {


.... you had them in the wrong order.
 
PB0711

OK, wow. Yes, I'm still relatively new to Perl. I'm teaching myself as I go
and running into different problems. I was happy that I could get the AoA
working and surprised with myself that the HoA works. Thank you for
pointing out my mistake. I wrote a quick two-loop program and then
added the 3rd loop later.
I'll look over the code and make sure I understand it before I
implement it and, if it's OK, I'll ask you guys questions about it. I
have to say thank you to both Bob and Mirco.

Cheers,

PB

Mirco said:
Thus spoke PB0711 (on 2006-11-18 02:20):
I have just finished a script with 3 for loops, at the end of which it
generates an HoA. I've printed out that it goes through all expected
8000 loops but there are only 1599 unique keys. However, if I print
the keys out before I put them into the HoA I can see that it does all
combinations that I expect. Is this a bug or a size limit on the number
of keys for an HoA?

I see you are into protein stuff, which is nice. I did
these things in the past (when I didn't know Perl).
#making dipeptides - a nasty way V2 will do a sub
for (my $i=0; $i < $#AoA+1; $i++){
for (my $j=0; $j < $#AoA+1; $j++){
for (my $n=0; $n < $#AoA+1; $n++) {

Others have pointed out several problems already.

Because I guess you are new to Perl and new to
protein/amino acid stuff, I'd like to make a recommendation
regarding general program structure and amino acid handling
here.

Basically, I straightened up your code a bit and tried
to make it readable and extensible (or what I think that
would look like ;-).

Part one - Data:
--- 8< ---

use strict;
use warnings;

use constant TLC => 1; # use some readable indices
use constant MASS => 3; # into the data array below
use constant C_ID => 4; # TLC -> 'three letter code'
use constant O_ID => 5;
use constant N_ID => 6;
use constant H_ID => 7;
use constant S_ID => 8;
# define some element related data
my @elems;
$elems[C_ID] = 'C', $elems[O_ID] = 'O', $elems[N_ID] = 'N',
$elems[H_ID] = 'H', $elems[S_ID] = 'S';

# this is the whole story (always use single character codes!):
my @acids = qw'A C D E F G H I K L M N P Q R S T V W Y';

# now compile your data into the appropriate form,
# use a hash to organize the amino acid data records ...
my %table = (
A => [qw'A Ala Alanine 71.08 3 2 1 7 0'],
C => [qw'C Cys Cysteine 103.14 3 2 1 7 1'],
D => [qw'D Asp Aspartate 115.09 4 4 1 7 0'],
E => [qw'E Glu Glutamate 129.11 5 4 1 9 0'],
F => [qw'F Phe Phenylalanine 147.18 9 2 1 11 0'],
G => [qw'G Gly Glycine 57.05 2 2 1 5 0'],
H => [qw'H His Histidine 137.14 6 2 3 9 0'],
I => [qw'I Ile Isoleucine 113.16 6 2 1 13 0'],
K => [qw'K Lys Lysine 128.17 6 2 2 14 0'],
L => [qw'L Leu Leucine 113.16 6 2 1 13 0'],
M => [qw'M Met Methionine 131.20 5 2 1 11 1'],
N => [qw'N Asn Asparagine 114.10 4 3 2 8 0'],
P => [qw'P Pro Proline 97.12 5 2 1 9 0'],
Q => [qw'Q Gln Glutamine 128.13 5 3 2 10 0'],
R => [qw'R Arg Arginine 156.19 6 2 4 14 0'],
S => [qw'S Ser Serine 87.08 3 3 1 7 0'],
T => [qw'T Thr Threonine 101.10 4 3 1 9 0'],
V => [qw'V Val Valine 99.13 5 2 1 11 0'],
W => [qw'W Trp Tryptophan 186.21 11 2 2 12 0'],
Y => [qw'Y Tyr Tyrosine 163.17 9 3 1 11 0'] );

--- 8< ---

From the definitions and declarations above,
data handling gets relatively easy now:

Part two - workflow:

--- 8< ---

my @triplets;
for my $i (@acids) {
for my $j (@acids) {
for my $k (@acids) {
push @triplets, [ $i, $j, $k ];
}
}
}

--- 8< ---

After you generated the triplets, you can
do some operations on them:

--- 8< ---

for my $triplet (@triplets) { # handle tripeptides in any way you want
# 1) calculate masses
my $mass = 0;
$mass += $table{$_}[MASS] for @$triplet;

# 2) generate elementary composition
my %compos;
for my $acid (@$triplet) {
$compos{$_} += $table{$acid}[$_] for C_ID .. S_ID;
}

# 3) make some kind of a 'formula'
my $formula = join '',
map $elems[$_] . $compos{$_} x ($compos{$_} > 1),
grep $compos{$_}, C_ID .. S_ID;

# 4) print the results
print +(map "$_", @$triplet), "\t", # print one letter codes
(map "$table{$_}[TLC] ", @$triplet), # print three letter codes
"\t$formula\tM=$mass\n";
}
__END__

--- 8< ---

This is how I'd have tried to play with
amino acid sequences in the olden times,
maybe it's somehow instructive - if not,
whatever ... ;-)

Iteration over data contained in the records (C_ID .. S_ID)
implies that this specific order is retained, otherwise
you have to explicitly state the indices in question.
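
As a rough sketch of the "explicit indices" variant: the @order list below is a hypothetical choice (C, H, N, O, S instead of the stored C, O, N, H, S), and the two-record %table is cut down from the one above so the snippet runs on its own.

```perl
use strict;
use warnings;

use constant { C_ID => 4, O_ID => 5, N_ID => 6, H_ID => 7, S_ID => 8 };

my @elems;
@elems[ C_ID, O_ID, N_ID, H_ID, S_ID ] = qw(C O N H S);

my %table = (
    A => [ qw'A Ala Alanine 71.08 3 2 1 7 0' ],
    C => [ qw'C Cys Cysteine 103.14 3 2 1 7 1' ],
);

# hypothetical output order, stated explicitly instead of C_ID .. S_ID
my @order = ( C_ID, H_ID, N_ID, O_ID, S_ID );

my %compos;
for my $acid (qw(A C A)) {
    $compos{$_} += $table{$acid}[$_] for @order;
}

# same convention as above: skip absent elements, omit a count of 1
my $formula = join '',
    map  { $elems[$_] . ( $compos{$_} > 1 ? $compos{$_} : '' ) }
    grep { $compos{$_} } @order;

print "$formula\n";    # C9H21N3O6S
```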


Regards

Mirco
 
