Difficulty manipulating hashes

Discussion in 'Perl Misc' started by limitz, Jul 11, 2007.

  1. limitz

    limitz Guest

    Here is my script, followed by my question at the end:

    print "\nThere are a total of 18990 bases in the entire sequence\n";
    print "\nWhat is the starting base number? Keep in mind that Perl
    begins the tally";
    print "\nwith 0. So if you wanted to start from base 30, input in
    29\n";


    #Here we can define where we want the sequence to begin and end
    print "Input Starting Number:\n";
    $beginning_of_sequence = <STDIN>;
    print "Input Ending Base Number:\n";
    $end_of_sequence = <STDIN>;
    $length_of_sequence = $end_of_sequence-$beginning_of_sequence+1;


    %dinucleotidepair = (
    AT => 0,
    AC => 0,
    AG => 0,
    AA => 0,


    TA => 0,
    TC => 0,
    TG => 0,
    TT => 0,


    CA => 0,
    CT => 0,
    CG => 0,
    CC => 0,


    GA => 0,
    GT => 0,
    GC => 0,
    GG => 0,
    );


    #$AC = 'AC';
    #$ACcountss = 0;
    #for my $i (0 .. $length_of_sequence-1) {
    # my $dinuc = substr($fastasequence, $i, 2);
    # if ($dinuc =~ $AC) {
    # $ACcountss++;
    # }
    #}


    for my $i (0 .. $length_of_sequence-1) {
    foreach (keys %dinucleotidepair) {
    $dinucleotidepair{$_}++ if substr($fastasequence, $i,
    2) =~ /$_/;
    }



    }


    while ( my($keys,$values) = each(%dinucleotidepair) ) {
    print "$keys $values\n";


    }


    #print "The Fasta sequence segment has $ACcountss AC's in
    $beginning_of_sequence to $end_of_sequence",
    #printf "for a relative frequency of %f\n", $ACcountss/
    $length_of_sequence;

    My new problem is this. I have to calculate the relative frequencies
    for everything. So, what that means, is that if one of the keys in my
    hash has an occurence, I have to divide that by $length_of_sequence
    to
    find the relative frequency. Then, that relative frequency will be
    used in another Perl script.


    My question is this:
    How do I manipulate individual elements in a hash
    given the value of the key is not zero?


    For example. In $fastasequence, the first 30 nucleotide bases contain
    4 occurences of the variable "AC" yet no occurences of the base
    combination "TG".


    Thus, the variable frequency of AC is 0.133.


    Secondly, how do I save 0.133 as a variable that can be carried over
    and used by another
    Perl script for calcuation purposes?


    Thanks!


    ~Frank
    limitz, Jul 11, 2007
    #1
    1. Advertising

  2. limitz

    Dr.Ruud Guest

    limitz schreef:

    > Here is my script, followed by my question at the end:


    Missing:

    #!/usr/bin/perl
    use strict;
    use warnings;

    Never leave those out!


    > print "\nThere are a total of 18990 bases in the entire sequence\n";
    > print "\nWhat is the starting base number? Keep in mind that Perl
    > begins the tally";
    > print "\nwith 0. So if you wanted to start from base 30, input in
    > 29\n";


    Why not just subtract 1 from the input?

    BTW, where does $fastasequence get filled?


    $seq =~ s/[^ACGT]+//g; # remove any garbage like whitespace or
    newlines

    my $seq_len = length $seq;

    print <<"EOT";

    There are a total of ${seq_len} bases in the entire sequence

    What is the starting base number?
    EOT


    > #Here we can define where we want the sequence to begin and end
    > print "Input Starting Number:\n";
    > $beginning_of_sequence = <STDIN>;


    my $seq_begin = <STDIN>;


    > print "Input Ending Base Number:\n";
    > $end_of_sequence = <STDIN>;


    my $seq_end = <STDIN>;


    > $length_of_sequence = $end_of_sequence-$beginning_of_sequence+1;


    $seq_len = $seq_end - $seq_begin + 1;


    > %dinucleotidepair = (
    > AT => 0,
    > [...]
    > GG => 0,
    > );


    my %dinucleotidepair;
    $dinucleotidepair{
    qw( AA AC AG AT
    CA CC CG CT
    GA GC GG GT
    TA TC TG TT ) } = (0) x 16;


    > for my $i (0 .. $length_of_sequence-1) {
    > foreach (keys %dinucleotidepair) {
    > $dinucleotidepair{$_}++ if substr($fastasequence, $i,
    > 2) =~ /$_/;
    > }


    Because you step through it by 1 character, a ATA will count an AT and a
    TA, is that what you want?

    You could then also replace your loops by


    my %pair;

    $pair{$_}++ for
    (substr($seq, $seq_begin - 1, $seq_len) =~ /(..)/g);

    $pair{$_}++ for
    (substr($seq, $seq_begin, $seq_len - 1) =~ /(..)/g);


    > while ( my($keys,$values) = each(%dinucleotidepair) ) {
    > print "$keys $values\n";
    > }


    print "$_ $pair{$_}\n" for keys %pair;

    There are in total ($seq_len - 1) pairs, so just divide by that.



    #!/usr/bin/perl
    use strict;
    use warnings;

    my $seq = do { local $/; <DATA> };

    # remove any garbage like whitespace or newlines
    $seq =~ s/[^ACGT]+//g;

    my $seq_len = length $seq;

    print qq(
    There are a total of ${seq_len} bases in the entire sequence.\n
    Input Starting Number: );
    my $seq_begin = <>;

    print qq(
    Input Ending Number : );
    my $seq_end = <>;

    $seq_len = $seq_end - $seq_begin + 1;
    print qq(
    Sequence length : $seq_len\n\n);

    my %pair;
    @pair{ qw( AA AC AG AT
    CA CC CG CT
    GA GC GG GT
    TA TC TG TT ) } = (0) x 16;

    $pair{$_}++ for
    (substr($seq, $seq_begin - 1, $seq_len) =~ /(..)/g);
    $pair{$_}++ for
    (substr($seq, $seq_begin, $seq_len - 1) =~ /(..)/g);

    printf "%d pairs:\n--\n", scalar keys %pair;

    for my $k (sort keys %pair) {
    my $v = $pair{$k};
    printf "%s %5d %5.3f\n", $k, $v, $v / ($seq_len - 1);
    }

    __DATA__
    ACGTGCATGCACGTAATGATCCATG
    ACGTGCATGCACGTAATGATCCATG
    ACGTGCATGCACGTAATGATCCATG
    ACGTGCATGCACGTAATGATCCATG
    ACGTGCATGCACGTAATGATCCATG
    ACGTGCATGCACGTAATGATCCATG
    ACGTGCATGCACGTAATGATCCATG
    ACGTGCATGCACGTAATGATCCATG
    ACGTGCATGCACGTAATGATCCATG
    ACGTGCATGCACGTAATGATCCATG

    --
    Affijn, Ruud

    "Gewoon is een tijger."
    Dr.Ruud, Jul 11, 2007
    #2
    1. Advertising

  3. Dr.Ruud <> wrote:

    > $seq =~ s/[^ACGT]+//g; # remove any garbage like whitespace or newlines



    $seq =~ tr/ACGT//cd; # same thing, only much faster


    --
    Tad McClellan
    email: perl -le "print scalar reverse qq/moc.noitatibaher\100cmdat/"
    Tad McClellan, Jul 12, 2007
    #3
  4. Dr.Ruud wrote:
    > limitz schreef:
    >
    >> Here is my script, followed by my question at the end:

    >
    > Missing:
    >
    > #!/usr/bin/perl
    > use strict;
    > use warnings;
    >
    > Never leave those out!
    >
    >
    >> print "\nThere are a total of 18990 bases in the entire sequence\n";
    >> print "\nWhat is the starting base number? Keep in mind that Perl
    >> begins the tally";
    >> print "\nwith 0. So if you wanted to start from base 30, input in
    >> 29\n";

    >
    > Why not just subtract 1 from the input?
    >
    > BTW, where does $fastasequence get filled?
    >
    >
    > $seq =~ s/[^ACGT]+//g; # remove any garbage like whitespace or
    > newlines


    Or do it more efficiently:

    $seq =~ tr/ACGT//cd; # remove any garbage like whitespace or newlines

    > my $seq_len = length $seq;
    >
    > print <<"EOT";
    >
    > There are a total of ${seq_len} bases in the entire sequence




    John
    --
    Perl isn't a toolbox, but a small machine shop where you
    can special-order certain sorts of tools at low cost and
    in short order. -- Larry Wall
    John W. Krahn, Jul 12, 2007
    #4
  5. merl the perl <> wrote:


    > So the arrays don't associate as in a *(b*c)=(a*b)*c, but the values and
    > keys associate,



    So far, so good.


    > as in, if you have one, you can get the other.



    Not so good.

    If you have the key you can get the value, but not the other
    way around (because there may be multiple keys with the same value).


    > I find the
    > semantics of perl quite challenging.



    The semantics of a hash data structure are what you find challenging.

    Hashes were around long before Perl, try googling "hash data structure".


    --
    Tad McClellan
    email: perl -le "print scalar reverse qq/moc.noitatibaher\100cmdat/"
    Tad McClellan, Jul 12, 2007
    #5
  6. limitz

    limitz Guest

    This is my complete code. I'm at the very last step. Basically, I need
    to multiply the variable $relafreq as found near the end of the code
    to the hash %dinucleotidescorepair and then sum the multiplied values
    of the hash all together to return the score.

    For example. If $relafreq = .5, .4, .09, .01 and the values of
    %dinucleotidescorepair is 5 8 3 6. I need to .5*5 + .4*8 + 3*.09 + .
    01*6 then print the value out.

    How would I go about doing this?

    Thanks guy!

    ~Frank


    #!/usr/bin/perl -w

    #The filename of the file containing the e-coli sequence data

    $fastadata = 'verified.fa';

    #This step is to open the file, if the file cannot be opened, an error
    message is printed and the program exits
    unless ( open(FASTAFILE,$fastadata) ) {
    print "Could not open file $fastadata!\n";
    exit;
    }

    #Read the protein sequence into an array. This array will be used to
    generate the Markov Transitional Matrices
    @fasta = <FASTAFILE>;

    #Close the FASTA data now as all the data has been read into the array
    close FASTAFILE;

    #Extracting the FASTA file from the array and putting it into a
    sequence data

    sub extract_sequence_from_fasta_data {
    my(@fasta) =@_;

    use strict;
    use warnings;

    #declaing the variables
    my $fastasequence = '';

    foreach my $line (@fasta) {

    #Discard blank line
    if ($line =~ /^\s*$/) {
    next;

    #Discard comment line
    } elsif($line =~ /^\s*#/) {
    next;

    #Discard fasta header line
    } elsif($line =~ /^>/) {
    next;
    #Keep line, add to sequence string
    } else {
    $fastasequence .= $line;
    }
    }

    #remove non-sequence data from the $sequence string
    $fastasequence =~ s/\s//g;

    return $fastasequence;
    }

    $fastasequence = extract_sequence_from_fasta_data (@fasta);

    #Split up the FASTA sequence into individual elements
    @fastasequence = split( '' , $fastasequence);

    #extract data and remove fasta headers from the file to be scored
    against the original fasta file
    print "What is the filename?\n";
    $scoredata = <STDIN>;
    unless ( open(SCOREFILE,$scoredata) ) {
    print "could not open file $scoredata!\n";
    exit;
    }

    @score = <SCOREFILE>;
    close SCOREFILE;

    #The sequence to be scored is now in identical format as our original
    fasta sequence
    $scoresequence = extract_sequence_from_fasta_data (@score);

    print "\nThere are a total of 18990 bases in the entire sequence\n";
    print "\nWhat is the starting base number? Keep in mind that Perl
    begins the tally";
    print "\nwith 0. So if you wanted to start from base 30, input in
    29\n";


    #Here we can define where we want the sequence to begin and end
    print "Input Starting Number:\n";
    $beginning_of_sequence = <STDIN>;
    print "Input Ending Base Number:\n";
    $end_of_sequence = <STDIN>;
    $length_of_sequence = $end_of_sequence-$beginning_of_sequence+1;

    #Counting the occurences for the file to be scored
    %dinucleotidescorepair = (
    AT => 0,
    AC => 0,
    AG => 0,
    AA => 0,

    TA => 0,
    TC => 0,
    TG => 0,
    TT => 0,

    CA => 0,
    CT => 0,
    CG => 0,
    CC => 0,

    GA => 0,
    GT => 0,
    GC => 0,
    GG => 0,
    );

    for my $i (0 .. $length_of_sequence-1) {
    foreach (keys %dinucleotidescorepair) {
    $dinucleotidescorepair{$_}++ if substr($scoresequence, $i, 2) =~ /
    $_/;
    }
    }
    print "These are the occurences for the file to be scored\n";
    while ( my($keys,$values) = each(%dinucleotidescorepair) ) {
    print "$keys $values\n";
    }

    #Counting and finding the frequencies for the original fasta file
    %dinucleotidepair = (
    AT => 0,
    AC => 0,
    AG => 0,
    AA => 0,

    TA => 0,
    TC => 0,
    TG => 0,
    TT => 0,

    CA => 0,
    CT => 0,
    CG => 0,
    CC => 0,

    GA => 0,
    GT => 0,
    GC => 0,
    GG => 0,
    );

    for my $i (0 .. $length_of_sequence-1) {
    foreach (keys %dinucleotidepair) {
    $dinucleotidepair{$_}++ if substr($fastasequence, $i, 2) =~ /$_/;
    }
    }

    print "\n";
    print "These are the occurences for the original file (verified.fa)
    \n";

    while ( my($keys,$values) = each(%dinucleotidepair) ) {
    print "$keys $values\n";
    }

    print "\n";
    print "These are the relative frequencies for the original file\n";

    while ( my($keys,$values) = each(%dinucleotidepair) ) {
    $relafreq = $values/$length_of_sequence;
    printf "$keys %f\n", $relafreq;
    }


    #The calculation of the final score
    limitz, Jul 13, 2007
    #6
  7. limitz

    Dr.Ruud Guest

    merl the perl schreef:
    > "Dr.Ruud":


    >> __DATA__
    >> ACGTGCATGCACGTAATGATCCATG
    >> ACGTGCATGCACGTAATGATCCATG

    >
    > Why do you want the same gene over and over again?


    Just to get a quick first run of test data, of course. Is that really
    hard for you to understand?
    Just replace it by your favourite sequence.

    My answer-in-code was not so much about the data, but about the messy
    code that limitz published.

    --
    Affijn, Ruud

    "Gewoon is een tijger."
    Dr.Ruud, Jul 13, 2007
    #7
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. Ben Holness

    Hashes of Hashes via subs

    Ben Holness, Oct 5, 2003, in forum: Perl
    Replies:
    8
    Views:
    547
    Ben Holness
    Oct 8, 2003
  2. Steven Arnold

    using hashes as keys in hashes

    Steven Arnold, Nov 23, 2005, in forum: Ruby
    Replies:
    3
    Views:
    155
    Mauricio Fernández
    Nov 23, 2005
  3. kazaam
    Replies:
    12
    Views:
    259
    Matthias Wächter
    Sep 13, 2007
  4. Aaron DeLoach

    manipulating a hash of hashes

    Aaron DeLoach, Jul 7, 2004, in forum: Perl Misc
    Replies:
    3
    Views:
    99
    Brad Baxter
    Jul 7, 2004
  5. Tim O'Donovan

    Hash of hashes, of hashes, of arrays of hashes

    Tim O'Donovan, Oct 27, 2005, in forum: Perl Misc
    Replies:
    5
    Views:
    201
Loading...

Share This Page