convenient module to take statistics for hashed structures?

Discussion in 'Perl Misc' started by ela, Mar 13, 2011.

  1. ela

    ela Guest

    __DATA__
    ID B C D E F G H
    1 3 7 9 3 4 2 3
    1 3 7 9 3 4 2 2
    1 3 7 9 5 8 6 6
    1 3 7 9 3 4 2 3
    2 4 7 9 3 4 2 1
    2 4 7 9 3 4 2 2
    2 4 7 9 3 4 2 3
    2 4 7 9 3 4 2 3

    For each ID (the example above has two, 1 and 2), I want to identify the
    "last common ancestor" (LCA), with H taking preference over B, based on
    some defined threshold. If the threshold is set to 100%, then the LCA of
    ID 1 is D=9; if it is set to 75%, then it is F=2. For ID 2: 100%, G=2;
    75%, G=2; 50%, H=3.

    While a hash is good at allocating different instances easily, I don't know
    whether Perl offers a simple way to get the max/min. For the
    following code,

    @array

    For each row
    for ($i=0; $i<$numcol; $i++)
    $array[$i]{$key}++;


    if just for B, I know the following should be written in this way:

    foreach $key (keys %B) { $Bpertpot{$key} = $B{$key}/$total; }

    for ($i=0; $i<$numcol; $i++) {
        $maxcol[$i] = 0;
        foreach $key (keys %Bpertpot) {
            if ($Bpertpot{$key} > $maxcol[$i]) {
                $maxcol[$i] = $Bpertpot{$key};
            }
        }
    }

    But I don't know how to do that when traversing an array of hashes, i.e.
    how to replace %B and %Bpertpot with something that is compatible with the
    array structure... In fact, I wonder whether there are already
    well-established modules that handle this kind of max/min statistics
    problem, which seems to come up frequently in the business sector...
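
The traversal asked about here can be sketched as follows. This is only an illustration under assumptions: the rows are the ID 1 lines from the sample data with the ID column dropped, and the names @rows, @count and $share are made up for the example. It shows one way to walk an array of hashes and keep the largest share per column.

```perl
use strict;
use warnings;

# Rows for ID 1 from the post, without the ID column (columns B..H).
my @rows = (
    [3, 7, 9, 3, 4, 2, 3],
    [3, 7, 9, 3, 4, 2, 2],
    [3, 7, 9, 5, 8, 6, 6],
    [3, 7, 9, 3, 4, 2, 3],
);

# $count[$i]{$value} = how many rows have $value in column $i.
my @count;
for my $row (@rows) {
    for my $i (0 .. $#$row) {
        $count[$i]{ $row->[$i] }++;
    }
}

# For each column, find the largest share held by a single value.
my $total = @rows;
my @maxcol;
for my $i (0 .. $#count) {
    $maxcol[$i] = 0;
    for my $key (keys %{ $count[$i] }) {
        my $share = $count[$i]{$key} / $total;
        $maxcol[$i] = $share if $share > $maxcol[$i];
    }
}

print join(' ', @maxcol), "\n";
```

The key step is %{ $count[$i] }, which turns the hash reference stored in the array element back into a hash that keys can iterate over.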
     
    ela, Mar 13, 2011
    #1

  2. smallpond

    smallpond Guest

    On Mar 13, 11:32 am, "ela" <> wrote:
    > __DATA__
    > ID    B    C    D    E    F    G    H
    > 1    3    7    9    3    4    2    3
    > 1    3    7    9    3    4    2    2
    > 1    3    7    9    5    8    6    6
    > 1    3    7    9    3    4    2    3
    > 2    4    7    9    3    4    2    1
    > 2    4    7    9    3    4    2    2
    > 2    4    7    9    3    4    2    3
    > 2    4    7    9    3    4    2    3
    >
    > For each ID (the above example has two (1 and 2)), I want to identify the
    > "last common
    > ancestor: LCA, H being higher preference than B " based on some defined
    > threshold. If the threshold is set to 100%, then the LCA of ID 1 is D=9; if
    > set it to 75%, then it is F=2. For ID 2 (100%, G=2; 75%, G=2; 50% H=3)
    >
    > While hash is good at allocating different instance easily, I don't know
    > whether perl supports simple architecture to get the max/min. For the
    > following code,
    >
    > @array
    >
    > For each row
    >     for ($i=0; $i<$numcol; $i++)
    >         $array[$i]{$key}++;
    >
    > if just for B, I know the following should be written in this way:
    >
    > foreach $key (keys %B) {     $Bpertpot{$key} = $B{$key}/$total;   }
    >
    > for ($i=0; $i<$numcol; $i++) {
    > $maxcol[$i] = 0;
    > foreach $key (keys %Bpertpot) { if ($Bpertpot{$key}> $maxcol[$i]) {
    > $maxcol[$i] = $Bpertpot{$key};    }
    >
    > }
    >
    > but then I don't know how to do that for array of hash to traverse...  i.e.
    > replace the %B and %Bpertpot to something that is compatible with the array
    > structure... In fact, I wonder if there is already well-established modules
    > that may have handled this kind of max-min statistics problems that seem to
    > encounter frequently in the business sector...


    Have a look at List::Util, which is a core module. It has max and first
    functions that you will find useful.

    Any book on Perl will explain how to create and use a hash of arrays or an
    array of arrays.
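
As a small illustration of the two List::Util functions mentioned, here is a sketch that picks the most frequent value in one column. The %count hash and $total are assumptions, filled in by hand from column H of ID 2 in the sample data.

```perl
use strict;
use warnings;
use List::Util qw(max first);

# Column H for ID 2 in the sample data: values 1, 2, 3, 3.
my %count = (1 => 1, 2 => 1, 3 => 2);
my $total = 4;

# max returns the largest number in a list.
my $top = max(values %count);

# first returns the first element for which the block is true.
my $winner = first { $count{$_} == $top } sort keys %count;

printf "H=%s with %d%%\n", $winner, 100 * $top / $total;
```

Sorting the keys before calling first only makes the choice deterministic if two values tie for the top count.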
     
    smallpond, Mar 14, 2011
    #2

  3. "ela" <> wrote in message
    news:ilgvng$7ec$...
    > __DATA__
    > ID B C D E F G H
    > 1 3 7 9 3 4 2 3
    > 1 3 7 9 3 4 2 2
    > 1 3 7 9 5 8 6 6
    > 1 3 7 9 3 4 2 3
    > 2 4 7 9 3 4 2 1
    > 2 4 7 9 3 4 2 2
    > 2 4 7 9 3 4 2 3
    > 2 4 7 9 3 4 2 3
    >
    > For each ID (the above example has two (1 and 2)), I want to identify the
    > "last common
    > ancestor: LCA, H being higher preference than B " based on some defined
    > threshold. If the threshold is set to 100%, then the LCA of ID 1 is D=9;
    > if set it to 75%, then it is F=2. For ID 2 (100%, G=2; 75%, G=2; 50% H=3)
    >
    > While hash is good at allocating different instance easily, I don't know
    > whether perl supports simple architecture to get the max/min. For the
    > following code,
    >
    > @array
    >
    > For each row
    > for ($i=0; $i<$numcol; $i++)
    > $array[$i]{$key}++;
    >
    >
    > if just for B, I know the following should be written in this way:
    >
    > foreach $key (keys %B) { $Bpertpot{$key} = $B{$key}/$total; }
    >
    > for ($i=0; $i<$numcol; $i++) {
    > $maxcol[$i] = 0;
    > foreach $key (keys %Bpertpot) { if ($Bpertpot{$key}> $maxcol[$i]) {
    > $maxcol[$i] = $Bpertpot{$key}; }
    > }
    >
    > but then I don't know how to do that for array of hash to traverse...
    > i.e.
    > replace the %B and %Bpertpot to something that is compatible with the
    > array
    > structure... In fact, I wonder if there is already well-established
    > modules
    > that may have handled this kind of max-min statistics problems that seem
    > to
    > encounter frequently in the business sector...
    >
    >
    >



    > set it to 75%, then it is F=2

    I do not see any 2 in the F column. I have trouble understanding what you
    mean and what you want.
     
    George Mpouras, Mar 14, 2011
    #3
  4. "ela" <> wrote in message
    news:ilkui6$mhk$...
    >
    > "George Mpouras" <> wrote in message
    > news:ilklas$qos$...
    >>
    >> "ela" <> wrote in message
    >> news:ilgvng$7ec$...
    >>> __DATA__
    >>> ID B C D E F G H
    >>> 1 3 7 9 3 4 2 3
    >>> 1 3 7 9 3 4 2 2
    >>> 1 3 7 9 5 8 6 6
    >>> 1 3 7 9 3 4 2 3
    >>> 2 4 7 9 3 4 2 1
    >>> 2 4 7 9 3 4 2 2
    >>> 2 4 7 9 3 4 2 3
    >>> 2 4 7 9 3 4 2 3
    >>>
    >>> For each ID (the above example has two (1 and 2)), I want to identify
    >>> the "last common
    >>> ancestor: LCA, H being higher preference than B " based on some defined
    >>> threshold. If the threshold is set to 100%, then the LCA of ID 1 is D=9;
    >>> if set it to 75%, then it is F=2. For ID 2 (100%, G=2; 75%, G=2; 50%
    >>> H=3)
    >>>
    >>> While hash is good at allocating different instance easily, I don't know
    >>> whether perl supports simple architecture to get the max/min. For the
    >>> following code,
    >>>
    >>> @array
    >>>
    >>> For each row
    >>> for ($i=0; $i<$numcol; $i++)
    >>> $array[$i]{$key}++;
    >>>
    >>>
    >>> if just for B, I know the following should be written in this way:
    >>>
    >>> foreach $key (keys %B) { $Bpertpot{$key} = $B{$key}/$total; }
    >>>
    >>> for ($i=0; $i<$numcol; $i++) {
    >>> $maxcol[$i] = 0;
    >>> foreach $key (keys %Bpertpot) { if ($Bpertpot{$key}> $maxcol[$i]) {
    >>> $maxcol[$i] = $Bpertpot{$key}; }
    >>> }
    >>>
    >>> but then I don't know how to do that for array of hash to traverse...
    >>> i.e.
    >>> replace the %B and %Bpertpot to something that is compatible with the
    >>> array
    >>> structure... In fact, I wonder if there is already well-established
    >>> modules
    >>> that may have handled this kind of max-min statistics problems that seem
    >>> to
    >>> encounter frequently in the business sector...
    >>>
    >>>
    >>>

    >>
    >>
    >> set it to 75%, then it is F=2
    >> I do not see any 2 at F column. I have problem to undestand what you
    >> mean/what you want.

    > Thanks for correcting the mistake. It is G=2 (2,2,6,2; so fulfilling the
    > 75% requirement) and not F=2. Always check from H (or the last column
    > first). H's majority is 3, for only 50% abundant, and then look up one by
    > one (F, E, D, ...). Each ID (without knowing how many incidents
    > beforehand) has to repeat the same process again and again.
    >



    #!/usr/bin/perl
    #
    # ok here is your homework .
    # next time try not cheat , because even if
    # you pass the lesson, will not learn !



    my %col;
    my %data;
    ReadData();

    $_ = query(1,100);
    print "id=1, thr=100% -> Field=$_->[0],Value=@{$_->[1]}\n";

    $_ = query(1,75);
    print "id=1, thr=75% -> Field=$_->[0],Value=@{$_->[1]}\n";

    $_ = query(2,100);
    print "id=2, thr=100% -> Field=$_->[0],Value=@{$_->[1]}\n";

    $_ = query(2,75);
    print "id=2, thr=75% -> Field=$_->[0],Value=@{$_->[1]}\n";

    $_ = query(2,50);
    print "id=2, thr=50% -> Field=$_->[0],Value=@{$_->[1]}\n";

    $_ = query(2,25);
    print "id=2, thr=25% -> Field=$_->[0],Value=@{$_->[1]}\n";



    sub ReadData {
    while(<DATA>){
    chomp;
    my @a = split /\s+/;
    unless (exists $col{1}){@col{1..$#a}=@a[1..$#a];next}
    ++$data{$a[0]}->{lines};
    for(my $i=1;$i<=$#a;$i++){
    ++$data{$a[0]}->{field}->{$col{$i}}->{data}->{$a[$i]} } }
    foreach my $id (keys %data) {
    foreach my $field (keys %{$data{$id}->{field}} ) {
    foreach my $item (keys %{$data{$id}->{field}->{$field}->{data}} ) {
    push @{ $data{$id}->{field}->{$field}->{rank}->{ 100*(
    $data{$id}->{field}->{$field}->{data}->{$item} / $data{$id}->{lines} ) } } ,
    $item}}}
    #use Data::Dumper; print Dumper(\%data);exit;
    }


    sub query {
    my ($id,$rank)=@_;
    foreach my $field (reverse sort keys %col) {
    if ( exists $data{$id}->{field}->{$col{$field}}->{rank}->{ $rank } ) {
    return [ $col{$field},
    $data{$id}->{field}->{$col{$field}}->{rank}->{$rank}] }
    }
    ['',[]]
    }




    __DATA__
    ID B C D E F G H
    1 3 7 9 3 4 2 3
    1 3 7 9 3 4 2 2
    1 3 7 9 5 8 6 6
    1 3 7 9 3 4 2 3
    2 4 7 9 3 4 2 1
    2 4 7 9 3 4 2 2
    2 4 7 9 3 4 2 3
    2 4 7 9 3 4 2 3
     
    George Mpouras, Mar 14, 2011
    #4
  5. Sorry for the silly joke at the comment.


    > $col{1} #what does 1 refer to?

    This is just a check to see whether we are reading the first line, the one
    with the column names.


    > $#a

    is the index of the last element of the array @a, i.e. -1 + scalar @a.
    The last element itself can be written as either of
    $a[ -1 + scalar @a ]
    $a[-1]

    > @col{1..$#a} #array of hash?

    This is called hash slice; used to create a hash from an array
    my @array = qw/a b c/;
    my %hash = ();
    @hash{ @array } = (1, 2, 3);   # some values


    > @a[1..$#a] #array of what?

    An array slice, i.e. some array elements:
    @array[2..4] -> $array[2], $array[3], $array[4]
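
The three constructs explained so far ($#a, the hash slice and the array slice) can be tried together in a short sketch. The header string is taken from the sample data; @names is a name introduced only for this example.

```perl
use strict;
use warnings;

my @a = split /\s+/, 'ID B C D E F G H';

# $#a is the index of the last element; it equals scalar(@a) - 1.
print "last index: $#a\n";
print "last element: $a[-1]\n";     # same element as $a[$#a]

# Array slice: several elements at once (everything but the ID column).
my @names = @a[1 .. $#a];

# Hash slice: assign a list of values to a list of keys in one statement.
my %col;
@col{1 .. $#a} = @names;

print "column 1 is $col{1}, column 7 is $col{7}\n";
```

After the slice assignment, %col maps column numbers 1..7 to the names B..H, which is what the ReadData routine relies on.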


    >
    > ++$data{$a[0]}->{lines}; #hash of hash? and an arbitrary name "line" is
    > given?

    Let's keep the total number of lines for every ID in a hash reference under
    the key "lines".



    > ++$data{$a[0]}->{field}->{$col{$i}}->{data}->{$a[$i]} } } #oh, this line
    > is really... hard to know why arrow can be used again and again....

    The arrows between subscripts are not necessary, but I find them beautiful.
    We want to keep our data isolated in a separate sub-hash under the key "data".


    > push @{ $data{$id}->{field}->{$field}->{rank}->{ 100*(
    > $data{$id}->{field}->{$field}->{data}->{$item} / $data{$id}->{lines} ) } } ,
    > $item}}} #what advantage of using push here?


    Here we want to keep all the values that occur with the same percentage!
    So if, for example, there are four different numbers, we can report
    back all of them if the requested threshold is 25%.
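
The grouping that push performs here can be seen in isolation in the following sketch. It is a simplified stand-in, not the original routine: %seen, $lines and %rank are assumed names, and the counts are those of column H for ID 1 in the sample data.

```perl
use strict;
use warnings;

# Occurrence counts of each value in column H for ID 1 (4 rows).
my %seen  = (3 => 2, 2 => 1, 6 => 1);
my $lines = 4;

# Group values by the percentage of rows they occupy. Pushing onto
# @{ $rank{$pct} } creates the array on first use (autovivification).
my %rank;
for my $value (sort keys %seen) {
    my $pct = 100 * $seen{$value} / $lines;
    push @{ $rank{$pct} }, $value;
}

for my $pct (sort { $b <=> $a } keys %rank) {
    print "$pct% -> @{ $rank{$pct} }\n";
}
```

Because two values (2 and 6) share the 25% slot, a plain scalar assignment would keep only one of them; push keeps both.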


    > ['',[]] #what is this...?!

    This is the default answer if no threshold is found. It holds two items:
    the empty string '' and an empty array representing the (absent) found values.



    If you check what I have done you will find that it can be rewritten
    to be almost 10 times faster, but it is good enough for a start.


    Peace.
     
    George Mpouras, Mar 14, 2011
    #5
  6. Uncomment the line
    #use Data::Dumper; print Dumper(\%data);exit;
    and you will understand the underlying logic on your own.
     
    George Mpouras, Mar 14, 2011
    #6
  7. George Mpouras wrote:
    > Sorry for the silly joke at the comment.
    >
    >
    > [ snip ]
    >
    >
    >> @col{1..$#a} #array of hash?

    > This is called hash slice;


    Correct.

    > used to create a hash


    Only my() can create a hash.

    Used to add keys and values to a hash.

    > from an array


    from a LIST of keys and a LIST of values.




    John
    --
    Any intelligent fool can make things bigger and
    more complex... It takes a touch of genius -
    and a lot of courage to move in the opposite
    direction. -- Albert Einstein
     
    John W. Krahn, Mar 14, 2011
    #7
  8. news.ntua.gr

    news.ntua.gr Guest

    "John W. Krahn" wrote in message
    news:gqwfp.59406$...

    >
    >> @col{1..$#a} #array of hash?

    > This is called hash slice;


    Correct.

    > used to create a hash


    Only my() can create a hash.


    I thought that local, our, state, could also do the job
     
    news.ntua.gr, Mar 14, 2011
    #8
  9. Uri Guttman

    Uri Guttman Guest

    >>>>> "nng" == news ntua gr <> writes:

    nng> "John W. Krahn" wrote in message
    nng> news:gqwfp.59406$...

    >>
    >>> @col{1..$#a} #array of hash?

    >> This is called hash slice;


    nng> Correct.

    >> used to create a hash


    nng> Only my() can create a hash.

    nng> I thought that local, our, state, could also do the job

    our doesn't create a variable. it only creates a lexical alias to the
    variable of the same name in the current package.

    local doesn't create a variable. it pushes the value of a variable and
    allows for a new value to be put in its place.

    state variables are just like my but they don't get reinitialized when
    the enclosing block is entered.
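
The three behaviours described above can be observed with a short sketch. The variable and sub names ($count, show, next_id) are invented for the illustration, and state assumes Perl 5.10 or later.

```perl
use strict;
use warnings;
use feature 'state';   # state needs Perl 5.10 or later

# our: a lexically scoped name for the package variable $main::count.
our $count = 1;
print "same variable\n" if \$count == \$main::count;

# local: temporarily replaces the value of a package variable;
# the old value comes back when the enclosing block exits.
sub show { return $count }
my $inside;
{
    local $count = 99;
    $inside = show();     # show() sees the localized value
}
my $outside = show();     # the original value is back

# state: like my, but initialised only once and kept between calls.
sub next_id { state $n = 0; return ++$n }
my @ids = (next_id(), next_id(), next_id());

print "inside=$inside outside=$outside ids=@ids\n";
```

The reference comparison on the our line is true because both names refer to the same package variable in main, the default package of a script.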

    uri

    --
    Uri Guttman ------ -------- http://www.sysarch.com --
    ----- Perl Code Review , Architecture, Development, Training, Support ------
    --------- Gourmet Hot Cocoa Mix ---- http://bestfriendscocoa.com ---------
     
    Uri Guttman, Mar 14, 2011
    #9
  10. ela

    ela Guest

    "George Mpouras" <> wrote in message
    news:ilklas$qos$...
    >
    > "ela" <> wrote in message
    > news:ilgvng$7ec$...
    >> __DATA__
    >> ID B C D E F G H
    >> 1 3 7 9 3 4 2 3
    >> 1 3 7 9 3 4 2 2
    >> 1 3 7 9 5 8 6 6
    >> 1 3 7 9 3 4 2 3
    >> 2 4 7 9 3 4 2 1
    >> 2 4 7 9 3 4 2 2
    >> 2 4 7 9 3 4 2 3
    >> 2 4 7 9 3 4 2 3
    >>
    >> For each ID (the above example has two (1 and 2)), I want to identify the
    >> "last common
    >> ancestor: LCA, H being higher preference than B " based on some defined
    >> threshold. If the threshold is set to 100%, then the LCA of ID 1 is D=9;
    >> if set it to 75%, then it is F=2. For ID 2 (100%, G=2; 75%, G=2; 50% H=3)
    >>
    >> While hash is good at allocating different instance easily, I don't know
    >> whether perl supports simple architecture to get the max/min. For the
    >> following code,
    >>
    >> @array
    >>
    >> For each row
    >> for ($i=0; $i<$numcol; $i++)
    >> $array[$i]{$key}++;
    >>
    >>
    >> if just for B, I know the following should be written in this way:
    >>
    >> foreach $key (keys %B) { $Bpertpot{$key} = $B{$key}/$total; }
    >>
    >> for ($i=0; $i<$numcol; $i++) {
    >> $maxcol[$i] = 0;
    >> foreach $key (keys %Bpertpot) { if ($Bpertpot{$key}> $maxcol[$i]) {
    >> $maxcol[$i] = $Bpertpot{$key}; }
    >> }
    >>
    >> but then I don't know how to do that for array of hash to traverse...
    >> i.e.
    >> replace the %B and %Bpertpot to something that is compatible with the
    >> array
    >> structure... In fact, I wonder if there is already well-established
    >> modules
    >> that may have handled this kind of max-min statistics problems that seem
    >> to
    >> encounter frequently in the business sector...
    >>
    >>
    >>

    >
    >
    > set it to 75%, then it is F=2
    > I do not see any 2 at F column. I have problem to undestand what you
    > mean/what you want.

    Thanks for correcting the mistake. It is G=2 (2,2,6,2; so fulfilling the 75%
    requirement) and not F=2. Always check from H (or the last column first).
    H's majority is 3, for only 50% abundant, and then look up one by one (F, E,
    D, ...). Each ID (without knowing how many incidents beforehand) has to
    repeat the same process again and again.
     
    ela, Mar 15, 2011
    #10
  11. >
    > our doesn't create a variable. it only creates a lexical alias to the
    > variable of the same name in the current package.


    An alias of what, if our is the only definition?
    #!/usr/bin/perl
    our $var=1;
    print $var;
    exit 0
     
    George Mpouras, Mar 15, 2011
    #11
  12. "ela" <> wrote in message
    news:iln04v$ev1$...
    >
    > While the suggested solution works perfectly for the example data inside
    > the perl script, it does not work when the data change to:
    >
    > __DATA__
    > Identity of query sequence Superkingdom Kingdom Subkingdom Phylum
    > Class Order Family Genus Species group Species
    > NODE_124_length_77_cov_13.792208 Bacteria undef undef
    > Proteobacteria Gammaproteobacteria Enterobacteriales
    > Enterobacteriaceae Escherichia undef Escherichia coli
    > NODE_124_length_77_cov_13.792208 Bacteria undef undef
    > Proteobacteria Gammaproteobacteria Enterobacteriales
    > Enterobacteriaceae Escherichia undef Escherichia coli




    Garbage in, garbage out.



    Your data are completely inconsistent. You have to solve your input data
    problem first.



    Make sure that your data contain the same number of columns, and each column
    is separated from the other with exactly the same string.



    Are you sure the first line will always describe your column names?

    For a start, separate your fields with |: no commas, no tabs, no spaces.

    Make sure that you have the same number of | on every line.

    Change the split to my @a = split /\|/, $_, -1;



    So once your data looks like the following, try again:

    ID|C1|C2|C3
    1|a1|b1|c1
    2|a2|b2|c2
    3|a3|b3|c3
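
The reason for the -1 limit in the suggested split can be shown with a small sketch. The input line is hypothetical: a row whose last two fields are empty.

```perl
use strict;
use warnings;

my $line = '1|a1||';   # hypothetical row whose last two fields are empty

# Without a limit, split drops trailing empty fields.
my @short = split /\|/, $line;

# With a negative limit, trailing empty fields are kept.
my @full  = split /\|/, $line, -1;

printf "without limit: %d fields, with -1: %d fields\n",
    scalar @short, scalar @full;
```

Keeping the empty trailing fields means every row yields the same number of columns, so the column index always lines up with the header.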
     
    George Mpouras, Mar 15, 2011
    #12
  13. ela

    ela Guest

    "George Mpouras" <> wrote in message
    news:ill20h$2ech$...

    >
    > #!/usr/bin/perl
    > #
    > # ok here is your homework .
    > # next time try not cheat , because even if
    > # you pass the lesson, will not learn !
    >


    Thanks for your solution. I'd like to know more about what your code is
    doing...

    $col{1} #what does 1 refer to?
    @col{1..$#a} #array of hash?
    @a[1..$#a] #array of what?

    ++$data{$a[0]}->{lines}; #hash of hash? and an arbitrary name "line" is
    given?
    ++$data{$a[0]}->{field}->{$col{$i}}->{data}->{$a[$i]} } } #oh, this line
    is really... hard to know why arrow can be used again and again....

    push @{ $data{$id}->{field}->{$field}->{rank}->{ 100*(
    $data{$id}->{field}->{$field}->{data}->{$item} / $data{$id}->{lines} ) } } ,
    $item}}} #what advantage of using push here?

    ['',[]] #what is this...?!
     
    ela, Mar 15, 2011
    #13
  14. "ela" <> wrote in message
    news:iln9fk$ibv$...
    > The problem arises due to the number of fields. Since there are more than
    > 9
    > fields, and the sort has to do in this way:
    >
    > foreach my $field (reverse sort {$a<=>$b} keys %col) {
    >
    > so perl treats 10 as bigger than 9.
    >
    > Now the remaining problem is how to solve the threshold problem cleverly.
    > Originally, George's solution makes use of hash for fast access and in
    > fact
    > progressive test, 100,99,98, ..., $threshold can be performed. However,
    > when
    > the data is large (>1 million rows with a lot of ID's), this may not be a
    > good idea......
    >
    >


    If your columns are

    Identity of query sequence | Superkingdom | Kingdom | Subkingdom |
    Phylum | Class | Order | Family
    | Genus | Species | group | Species

    you will notice that you have two columns with the same name, "Species", so I
    assumed one of them should be named Species2 or something similar. Now you are
    ready. Run the following code against the attached data.txt




    my %col;
    my %data;
    ReadData('./data.txt');

    $_ = query('NODE_124_length_77_cov_13.792208',100);
    print "Field=$_->[0],Value=@{$_->[1]}\n";


    sub ReadData {
    open INPUT, '<', $_[0] or die "$^E\n";
    while(<INPUT>){
    chomp;
    my @a = split /\s*\|\s*/, $_, -1;
    unless (exists $col{1}){@col{1..$#a}=@a[1..$#a];next}
    ++$data{$a[0]}->{lines};
    for(my $i=1;$i<=$#a;$i++){
    ++$data{$a[0]}->{field}->{$col{$i}}->{data}->{$a[$i]} } }
    foreach my $id (keys %data) {
    foreach my $field (keys %{$data{$id}->{field}} ) {
    foreach my $item (keys %{$data{$id}->{field}->{$field}->{data}} ) {
    push @{ $data{$id}->{field}->{$field}->{rank}->{ 100*(
    $data{$id}->{field}->{$field}->{data}->{$item} / $data{$id}->{lines} ) } } ,
    $item}}}
    close INPUT;
    #use Data::Dumper; print Dumper(\%data);exit;
    }


    sub query {
    my ($id,$rank)=@_;
    foreach my $field (sort {$b<=>$a} keys %col)
    {
    if ( exists $data{$id}->{field}->{$col{$field}}->{rank}->{ $rank } ) {
    return [ $col{$field},
    $data{$id}->{field}->{$col{$field}}->{rank}->{$rank}]
    }
    }
    ['',[]]
    }
     
    George Mpouras, Mar 15, 2011
    #14
  15. # The following version is much faster than the previous


    my @col;
    my %data;
    ReadData();


    $_ = query('NODE_124_length_77_cov_13.792208', 100);
    print "Field=$_->[0],Value=@{$_->[1]}\n";

    $_ = query('NODE_124_length_77_cov_13.792208', 50);
    print "Field=$_->[0],Value=@{$_->[1]}\n";



    sub ReadData
    {
    while (<DATA>) {
    chomp;
    my @a = split /\s*\|\s*/, $_, -1;
    if (-1 == $#col){ push @col, @a[1..$#a] ;next}
    $data{$a[0]}->[0]++;
    for(my $i=1;$i<=$#a;$i++){$data{$a[0]}->[1]->[$i-1]->[0]->{$a[$i]}++} }
    foreach my $id ( keys %data ) {
    foreach my $f ( @{$data{$id}->[1]} ) {
    foreach my $v ( keys %{$f->[0]} ) {
    push @{ $f->[1]->{int 100*( $f->[0]->{$v}/$data{$id}->[0])} }, $v}}}
    #use Data::Dumper; print Dumper(\%data);exit;
    }


    sub query {
    for (my $i=$#{$data{$_[0]}->[1]}; $i>=0; $i--) {
    return [$col[$i], $data{$_[0]}->[1]->[$i]->[1]->{$_[1]}] if exists
    $data{$_[0]}->[1]->[$i]->[1]->{$_[1]} }
    ['',[]]
    }


    __DATA__
    Identity of query sequence | Superkingdom | Kingdom | Subkingdom |
    Phylum | Class | Order | Family
    | Genus | Species2 | group | Species
    NODE_124_length_77_cov_13.792208 | Bacteria | undef | undef |
    Proteobacteria | Gammaproteobacteria | Enterobacteriales |
    Enterobacteriaceae | Escherichia | undef | Escherichi1 | coli
    NODE_124_length_77_cov_13.792208 | Bacteria | undef | undef |
    Proteobacteria | Gammaproteobacteria | Enterobacteriales |
    Enterobacteriaceae | Escherichia | undef | Escherichi2 | coli
     
    George Mpouras, Mar 15, 2011
    #15
  16. On 2011-03-15 12:39, George Mpouras <> wrote:
    > foreach my $id ( keys %data ) {
    > foreach my $f ( @{$data{$id}->[1]} ) {
    > foreach my $v ( keys %{$f->[0]} ) {
    > push @{ $f->[1]->{int 100*( $f->[0]->{$v}/$data{$id}->[0])} }, $v}}}

    [...]
    > __DATA__
    > Identity of query sequence | Superkingdom | Kingdom | Subkingdom |
    > Phylum | Class | Order | Family
    >| Genus | Species2 | group | Species
    > NODE_124_length_77_cov_13.792208 | Bacteria | undef | undef |
    > Proteobacteria | Gammaproteobacteria | Enterobacteriales |
    > Enterobacteriaceae | Escherichia | undef | Escherichi1 | coli

    [...]

    When posting code to a newsgroup, please ensure that proper indentation
    and line wraps are preserved. This is very hard to read because of the
    missing indentation and doesn't work as posted because of the extra
    newlines.

    hp
     
    Peter J. Holzer, Mar 15, 2011
    #16
  17. Uri Guttman

    Uri Guttman Guest

    >>>>> "GM" == George Mpouras <> writes:

    >>
    >> our doesn't create a variable. it only creates a lexical alias to the
    >> variable of the same name in the current package.


    GM> Alias of what if our is only the definition ;
    GM> #!/usr/bin/perl
    GM> our $var=1;
    GM> print $var;
    GM> exit 0

    the current package as i said. what is the default current package?

    uri

    --
    Uri Guttman ------ -------- http://www.sysarch.com --
    ----- Perl Code Review , Architecture, Development, Training, Support ------
    --------- Gourmet Hot Cocoa Mix ---- http://bestfriendscocoa.com ---------
     
    Uri Guttman, Mar 15, 2011
    #17
  18. ela

    ela Guest

    While the suggested solution works perfectly for the example data inside the
    perl script, it does not work when the data change to:

    __DATA__
    Identity of query sequence Superkingdom Kingdom Subkingdom
    Phylum Class Order Family Genus Species group Species
    NODE_124_length_77_cov_13.792208 Bacteria undef undef
    Proteobacteria Gammaproteobacteria Enterobacteriales
    Enterobacteriaceae Escherichia undef Escherichia coli
    NODE_124_length_77_cov_13.792208 Bacteria undef undef
    Proteobacteria Gammaproteobacteria Enterobacteriales
    Enterobacteriaceae Escherichia undef Escherichia coli


    even though I changed

    my @a = split /\s+/; to my @a = split /\t/;

    Moreover, the current implementation looks up one exact rank instead of
    testing whether a threshold is surpassed, so if I input 10 (i.e. an incident
    abundance exceeding 10% is acceptable) instead of 100, no result is returned
    at all. Even if I use "100", the result still differs from what I expect.

    I expect the program gives the result

    id=NODE_124_length_77_cov_13.792208, thr=100% ->
    Field=Species,Value=Escherichia coli

    but it gives me:

    id=NODE_124_length_77_cov_13.792208, thr=100% ->
    Field=Order,Value=Enterobacteriales

    this statement:
    push @{ $data{$id}->{field}->{$field}->{rank}->{
    100*($data{$id}->{field}->{$field}->{data}->{$item} /
    $data{$id}->{lines} ) } } , $item}

    looks very new to me, and can anybody further tell me what it is for?

    usually I use push like:

    push @array, $element;

    and the above code even does not have the symbol ";" but no runtime error
    ......
     
    ela, Mar 15, 2011
    #18
  19. ela

    ela Guest

    The problem arises from the number of fields. Since there are more than 9
    fields, the sort has to be done numerically, in this way:

    foreach my $field (reverse sort {$a<=>$b} keys %col) {

    so that perl treats 10 as bigger than 9.
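
The difference between the two sort orders is easy to reproduce. In this sketch the %col hash with 11 numbered columns is hypothetical, built only to show the ordering.

```perl
use strict;
use warnings;

my %col = map { $_ => "field$_" } 1 .. 11;   # hypothetical 11 columns

# Default sort compares as strings, so "9" sorts after "10" and "11".
my @as_strings = reverse sort keys %col;

# A numeric comparator gives the intended last-column-first order.
my @as_numbers = sort { $b <=> $a } keys %col;

print "string order:  @as_strings\n";
print "numeric order: @as_numbers\n";
```

With string comparison the scan would start at column 9 instead of column 11, so the last two columns would be examined in the wrong position.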

    Now the remaining problem is how to handle the threshold cleverly.
    George's solution uses a hash for fast access, so a progressive test
    (100, 99, 98, ..., $threshold) could in fact be performed. However, when
    the data is large (more than 1 million rows with a lot of IDs), this may
    not be a good idea......
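
One way to avoid probing every percentage is to compare the most frequent value of each column against the threshold directly. The following is a minimal sketch of that idea, not the code posted in this thread: the sub name lca, the %rows layout and the inline sample data are assumptions made for the example.

```perl
use strict;
use warnings;
use List::Util qw(first);

my @col  = qw(B C D E F G H);
my %rows = (
    1 => [ [3,7,9,3,4,2,3], [3,7,9,3,4,2,2], [3,7,9,5,8,6,6], [3,7,9,3,4,2,3] ],
    2 => [ [4,7,9,3,4,2,1], [4,7,9,3,4,2,2], [4,7,9,3,4,2,3], [4,7,9,3,4,2,3] ],
);

# Return (column name, value) for the last column in which one value
# covers at least $threshold percent of the rows of $id.
sub lca {
    my ($id, $threshold) = @_;
    my $set = $rows{$id} or return;
    my $n   = @$set;
    for my $i (reverse 0 .. $#col) {
        my %count;
        $count{ $_->[$i] }++ for @$set;
        # Try the most frequent value first; stop at the first that qualifies.
        my $hit = first { 100 * $count{$_} / $n >= $threshold }
                  sort  { $count{$b} <=> $count{$a} } keys %count;
        return ($col[$i], $hit) if defined $hit;
    }
    return;
}

for my $query ([1, 100], [1, 75], [2, 100], [2, 75], [2, 50], [2, 10]) {
    my ($field, $value) = lca(@$query);
    print "id=$query->[0], thr=$query->[1]% -> $field=$value\n";
}
```

On the sample data this gives D=9 and G=2 for ID 1 at 100% and 75%, and G=2, G=2, H=3 for ID 2 at 100%, 75% and 50%. A threshold of 10 also returns H=3, because the comparison is "at least" rather than "exactly". For very large inputs the per-column counts would probably be built once while reading rather than on every query.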
     
    ela, Mar 16, 2011
    #19
  20. > I wanna change your implementation from "discrete" checking to
    > "continuous" one, the logic is to first sort (rank keys: *** expected
    > range: (0-100] ***) numerically, then test if the "largest" key (e.g. 100,
    > 75 etc) is larger than the threshold specified. My problem is that I don't
    > know how to refer to the keys under
    >
    > $data{$_[0]}->[1]->[$i]->[1]
    >
    > Writing something like "foreach my $field (sort {$b<=>$a} keys
    > %data{$_[0]}->[1]->[$i]->[1])" (Thanks for McClellan's teaching on
    > appropriately using sort here) does not work. Moreover, there's no need to
    > "foreach" here as if the largest one also can't surpass the threshold,
    > neither the smaller ones can. So how to avoid "foreach" here?
    >



    For the data

    ID|B|C|D|E|F|G|H
    01|3|7|9|3|4|2|3
    01|3|7|9|3|4|2|2
    01|3|7|9|5|8|6|6
    01|3|7|9|3|4|2|3
    02|4|7|9|3|4|2|1
    02|4|7|9|3|4|2|2
    02|4|7|9|3|4|2|3
    02|4|7|9|3|4|2|3

    What would your input be, and what output do you expect?
    An example would make things clearer.
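
On the syntax question quoted above: keys expects a hash, so a hash reference has to be wrapped in %{ ... } first, and List::Util's max can pick the largest key without a sort or a foreach. The sketch below uses a hypothetical stand-in, $rank, for the nested element $data{$_[0]}->[1]->[$i]->[1]; its contents and the $threshold value are assumptions.

```perl
use strict;
use warnings;
use List::Util qw(max);

# Hypothetical stand-in for $data{$id}->[1]->[$i]->[1]:
# percentage => list of values with that share (column H of ID 1).
my $rank = { 50 => [3], 25 => [2, 6] };

# keys needs a hash, so the reference is wrapped in %{ ... }.
my @sorted = sort { $b <=> $a } keys %{ $rank };

# Only the largest key matters, and max finds it without a sort or loop.
my $best      = max(keys %{ $rank });
my $threshold = 40;
if ($best >= $threshold) {
    print "$best% >= $threshold%: @{ $rank->{$best} }\n";
}
```

With the real structure, the expression inside the braces would be the full chain, for example keys %{ $data{$id}->[1]->[$i]->[1] }.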
     
    George Mpouras, Mar 16, 2011
    #20