Difficulty manipulating hashes

L

limitz

Here is my script, followed by my question at the end:

print "\nThere are a total of 18990 bases in the entire sequence\n";
print "\nWhat is the starting base number? Keep in mind that Perl
begins the tally";
print "\nwith 0. So if you wanted to start from base 30, input in
29\n";


#Here we can define where we want the sequence to begin and end
print "Input Starting Number:\n";
$beginning_of_sequence = <STDIN>;
print "Input Ending Base Number:\n";
$end_of_sequence = <STDIN>;
$length_of_sequence = $end_of_sequence-$beginning_of_sequence+1;


%dinucleotidepair = (
AT => 0,
AC => 0,
AG => 0,
AA => 0,


TA => 0,
TC => 0,
TG => 0,
TT => 0,


CA => 0,
CT => 0,
CG => 0,
CC => 0,


GA => 0,
GT => 0,
GC => 0,
GG => 0,
);


#$AC = 'AC';
#$ACcountss = 0;
#for my $i (0 .. $length_of_sequence-1) {
# my $dinuc = substr($fastasequence, $i, 2);
# if ($dinuc =~ $AC) {
# $ACcountss++;
# }
#}


for my $i (0 .. $length_of_sequence-1) {
foreach (keys %dinucleotidepair) {
$dinucleotidepair{$_}++ if substr($fastasequence, $i,
2) =~ /$_/;
}



}


while ( my($keys,$values) = each(%dinucleotidepair) ) {
print "$keys $values\n";


}


#print "The Fasta sequence segment has $ACcountss AC's in
$beginning_of_sequence to $end_of_sequence",
#printf "for a relative frequency of %f\n", $ACcountss/
$length_of_sequence;

My new problem is this. I have to calculate the relative frequencies
for everything. So, what that means, is that if one of the keys in my
hash has an occurence, I have to divide that by $length_of_sequence
to
find the relative frequency. Then, that relative frequency will be
used in another Perl script.


My question is this:
How do I manipulate individual elements in a hash
given the value of the key is not zero?


For example. In $fastasequence, the first 30 nucleotide bases contain
4 occurences of the variable "AC" yet no occurences of the base
combination "TG".


Thus, the variable frequency of AC is 0.133.


Secondly, how do I save 0.133 as a variable that can be carried over
and used by another
Perl script for calcuation purposes?


Thanks!


~Frank
 
D

Dr.Ruud

limitz schreef:
Here is my script, followed by my question at the end:

Missing:

#!/usr/bin/perl
use strict;
use warnings;

Never leave those out!

print "\nThere are a total of 18990 bases in the entire sequence\n";
print "\nWhat is the starting base number? Keep in mind that Perl
begins the tally";
print "\nwith 0. So if you wanted to start from base 30, input in
29\n";

Why not just subtract 1 from the input?

BTW, where does $fastasequence get filled?


$seq =~ s/[^ACGT]+//g; # remove any garbage like whitespace or
newlines

my $seq_len = length $seq;

print <<"EOT";

There are a total of ${seq_len} bases in the entire sequence

What is the starting base number?
EOT

#Here we can define where we want the sequence to begin and end
print "Input Starting Number:\n";
$beginning_of_sequence = <STDIN>;

print "Input Ending Base Number:\n";
$end_of_sequence = <STDIN>;

$length_of_sequence = $end_of_sequence-$beginning_of_sequence+1;

$seq_len = $seq_end - $seq_begin + 1;

%dinucleotidepair = (
AT => 0,
[...]
GG => 0,
);

my %dinucleotidepair;
$dinucleotidepair{
qw( AA AC AG AT
CA CC CG CT
GA GC GG GT
TA TC TG TT ) } = (0) x 16;

for my $i (0 .. $length_of_sequence-1) {
foreach (keys %dinucleotidepair) {
$dinucleotidepair{$_}++ if substr($fastasequence, $i,
2) =~ /$_/;
}

Because you step through it by 1 character, a ATA will count an AT and a
TA, is that what you want?

You could then also replace your loops by


my %pair;

$pair{$_}++ for
(substr($seq, $seq_begin - 1, $seq_len) =~ /(..)/g);

$pair{$_}++ for
(substr($seq, $seq_begin, $seq_len - 1) =~ /(..)/g);

while ( my($keys,$values) = each(%dinucleotidepair) ) {
print "$keys $values\n";
}

print "$_ $pair{$_}\n" for keys %pair;

There are in total ($seq_len - 1) pairs, so just divide by that.



#!/usr/bin/perl
use strict;
use warnings;

my $seq = do { local $/; <DATA> };

# remove any garbage like whitespace or newlines
$seq =~ s/[^ACGT]+//g;

my $seq_len = length $seq;

print qq(
There are a total of ${seq_len} bases in the entire sequence.\n
Input Starting Number: );
my $seq_begin = <>;

print qq(
Input Ending Number : );
my $seq_end = <>;

$seq_len = $seq_end - $seq_begin + 1;
print qq(
Sequence length : $seq_len\n\n);

my %pair;
@pair{ qw( AA AC AG AT
CA CC CG CT
GA GC GG GT
TA TC TG TT ) } = (0) x 16;

$pair{$_}++ for
(substr($seq, $seq_begin - 1, $seq_len) =~ /(..)/g);
$pair{$_}++ for
(substr($seq, $seq_begin, $seq_len - 1) =~ /(..)/g);

printf "%d pairs:\n--\n", scalar keys %pair;

for my $k (sort keys %pair) {
my $v = $pair{$k};
printf "%s %5d %5.3f\n", $k, $v, $v / ($seq_len - 1);
}

__DATA__
ACGTGCATGCACGTAATGATCCATG
ACGTGCATGCACGTAATGATCCATG
ACGTGCATGCACGTAATGATCCATG
ACGTGCATGCACGTAATGATCCATG
ACGTGCATGCACGTAATGATCCATG
ACGTGCATGCACGTAATGATCCATG
ACGTGCATGCACGTAATGATCCATG
ACGTGCATGCACGTAATGATCCATG
ACGTGCATGCACGTAATGATCCATG
ACGTGCATGCACGTAATGATCCATG
 
J

John W. Krahn

Dr.Ruud said:
limitz schreef:
Here is my script, followed by my question at the end:

Missing:

#!/usr/bin/perl
use strict;
use warnings;

Never leave those out!

print "\nThere are a total of 18990 bases in the entire sequence\n";
print "\nWhat is the starting base number? Keep in mind that Perl
begins the tally";
print "\nwith 0. So if you wanted to start from base 30, input in
29\n";

Why not just subtract 1 from the input?

BTW, where does $fastasequence get filled?


$seq =~ s/[^ACGT]+//g; # remove any garbage like whitespace or
newlines

Or do it more efficiently:

$seq =~ tr/ACGT//cd; # remove any garbage like whitespace or newlines
my $seq_len = length $seq;

print <<"EOT";

There are a total of ${seq_len} bases in the entire sequence



John
 
T

Tad McClellan

So the arrays don't associate as in a *(b*c)=(a*b)*c, but the values and
keys associate,


So far, so good.

as in, if you have one, you can get the other.


Not so good.

If you have the key you can get the value, but not the other
way around (because there may be multiple keys with the same value).

I find the
semantics of perl quite challenging.


The semantics of a hash data structure are what you find challenging.

Hashes were around long before Perl, try googling "hash data structure".
 
L

limitz

This is my complete code. I'm at the very last step. Basically, I need
to multiply the variable $relafreq as found near the end of the code
to the hash %dinucleotidescorepair and then sum the multiplied values
of the hash all together to return the score.

For example. If $relafreq = .5, .4, .09, .01 and the values of
%dinucleotidescorepair is 5 8 3 6. I need to .5*5 + .4*8 + 3*.09 + .
01*6 then print the value out.

How would I go about doing this?

Thanks guy!

~Frank


#!/usr/bin/perl -w

#The filename of the file containing the e-coli sequence data

$fastadata = 'verified.fa';

#This step is to open the file, if the file cannot be opened, an error
message is printed and the program exits
unless ( open(FASTAFILE,$fastadata) ) {
print "Could not open file $fastadata!\n";
exit;
}

#Read the protein sequence into an array. This array will be used to
generate the Markov Transitional Matrices
@fasta = <FASTAFILE>;

#Close the FASTA data now as all the data has been read into the array
close FASTAFILE;

#Extracting the FASTA file from the array and putting it into a
sequence data

sub extract_sequence_from_fasta_data {
my(@fasta) =@_;

use strict;
use warnings;

#declaing the variables
my $fastasequence = '';

foreach my $line (@fasta) {

#Discard blank line
if ($line =~ /^\s*$/) {
next;

#Discard comment line
} elsif($line =~ /^\s*#/) {
next;

#Discard fasta header line
} elsif($line =~ /^>/) {
next;
#Keep line, add to sequence string
} else {
$fastasequence .= $line;
}
}

#remove non-sequence data from the $sequence string
$fastasequence =~ s/\s//g;

return $fastasequence;
}

$fastasequence = extract_sequence_from_fasta_data (@fasta);

#Split up the FASTA sequence into individual elements
@fastasequence = split( '' , $fastasequence);

#extract data and remove fasta headers from the file to be scored
against the original fasta file
print "What is the filename?\n";
$scoredata = <STDIN>;
unless ( open(SCOREFILE,$scoredata) ) {
print "could not open file $scoredata!\n";
exit;
}

@score = <SCOREFILE>;
close SCOREFILE;

#The sequence to be scored is now in identical format as our original
fasta sequence
$scoresequence = extract_sequence_from_fasta_data (@score);

print "\nThere are a total of 18990 bases in the entire sequence\n";
print "\nWhat is the starting base number? Keep in mind that Perl
begins the tally";
print "\nwith 0. So if you wanted to start from base 30, input in
29\n";


#Here we can define where we want the sequence to begin and end
print "Input Starting Number:\n";
$beginning_of_sequence = <STDIN>;
print "Input Ending Base Number:\n";
$end_of_sequence = <STDIN>;
$length_of_sequence = $end_of_sequence-$beginning_of_sequence+1;

#Counting the occurences for the file to be scored
%dinucleotidescorepair = (
AT => 0,
AC => 0,
AG => 0,
AA => 0,

TA => 0,
TC => 0,
TG => 0,
TT => 0,

CA => 0,
CT => 0,
CG => 0,
CC => 0,

GA => 0,
GT => 0,
GC => 0,
GG => 0,
);

for my $i (0 .. $length_of_sequence-1) {
foreach (keys %dinucleotidescorepair) {
$dinucleotidescorepair{$_}++ if substr($scoresequence, $i, 2) =~ /
$_/;
}
}
print "These are the occurences for the file to be scored\n";
while ( my($keys,$values) = each(%dinucleotidescorepair) ) {
print "$keys $values\n";
}

#Counting and finding the frequencies for the original fasta file
%dinucleotidepair = (
AT => 0,
AC => 0,
AG => 0,
AA => 0,

TA => 0,
TC => 0,
TG => 0,
TT => 0,

CA => 0,
CT => 0,
CG => 0,
CC => 0,

GA => 0,
GT => 0,
GC => 0,
GG => 0,
);

for my $i (0 .. $length_of_sequence-1) {
foreach (keys %dinucleotidepair) {
$dinucleotidepair{$_}++ if substr($fastasequence, $i, 2) =~ /$_/;
}
}

print "\n";
print "These are the occurences for the original file (verified.fa)
\n";

while ( my($keys,$values) = each(%dinucleotidepair) ) {
print "$keys $values\n";
}

print "\n";
print "These are the relative frequencies for the original file\n";

while ( my($keys,$values) = each(%dinucleotidepair) ) {
$relafreq = $values/$length_of_sequence;
printf "$keys %f\n", $relafreq;
}


#The calculation of the final score
 
D

Dr.Ruud

merl the perl schreef:
"Dr.Ruud":

Why do you want the same gene over and over again?

Just to get a quick first run of test data, of course. Is that really
hard for you to understand?
Just replace it by your favourite sequence.

My answer-in-code was not so much about the data, but about the messy
code that limitz published.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,768
Messages
2,569,574
Members
45,050
Latest member
AngelS122

Latest Threads

Top