Problem with PERL function

M

michaelzhao

Hey,

I am making a program to tally up the nucleic acid bases of E. Coli.
This data will be used in bioinformatics research to generate a Markov
Matrix.

However, I am just beginning PERL and ran into a slight problem.

Basically, I need the frequencies of the 4 bases, Adenosine (A),
Thymine (T), Cytosine (C), and Guanine (G). I made a PERL script to
tally up the total counts of the bases. In my case, there are 6254
Adenosine, 4957 Thymine, 4245 Cytosine, and 3534 Guanine.

In order to find the frequencies, all I would do is divide each base
count by the total.

However, here is my problem. Instead of doing it globally. I need to
be able to specify an arbitrary start and stop position to start
tallying the occurences of the bases and also to find the frequencies
of the bases in that particularly defined area.

Herein lies my problem. I have not a clue how to go about doing this
task. I've been looking online for a solution but haven't really found
one. If anyone can suggest an idea or function(s) I could use to go
about doing this task. I would be much obliged. Thanks!

~Michael
 
P

Paul Lalli

I am making a program to tally up the nucleic acid bases of E. Coli.
This data will be used in bioinformatics research to generate a Markov
Matrix.

However, I am just beginning PERL and ran into a slight problem.

The language's name is "Perl", not PERL
Basically, I need the frequencies of the 4 bases, Adenosine (A),
Thymine (T), Cytosine (C), and Guanine (G). I made a PERL script to
tally up the total counts of the bases. In my case, there are 6254
Adenosine, 4957 Thymine, 4245 Cytosine, and 3534 Guanine.

In order to find the frequencies, all I would do is divide each base
count by the total.

However, here is my problem. Instead of doing it globally. I need to
be able to specify an arbitrary start and stop position to start
tallying the occurences of the bases and also to find the frequencies
of the bases in that particularly defined area.

Herein lies my problem. I have not a clue how to go about doing this
task. I've been looking online for a solution but haven't really found
one. If anyone can suggest an idea or function(s) I could use to go
about doing this task. I would be much obliged. Thanks!

You've given no indication of any kind as to how you're reading these
values, from what sort of data they're coming, in what way you're
scanning that data to pull out the values you want. So how is anyone
to know how to modify your program to only get certain of those
values?

Please par your problem down to a SHORT but COMPLETE script that
demonstrates your issue, and post that.

In the meantime, my random shot in the dark is that you want the $.
variable. Read about it in `perldoc perlop`.

Paul Lalli
 
U

usenet

Basically, I need the frequencies of the 4 bases, Adenosine (A),
Thymine (T), Cytosine (C), and Guanine (G).
However, here is my problem. Instead of doing it globally. I need to
be able to specify an arbitrary start and stop position

I don't know bioperl (and there are LOTS of modules related to
bioperl; maybe one does exactly what/how you want) but if all you want
to do is basic frequency analysis on an array (assuming your bases are
in an array) then there are several statistics packages that will do
this for you. Consider:

#!/usr/bin/perl
use strict;
use warnings;
use Statistics::Frequency;

my @dna = qw{ T C G A T T T T C C C C G G G G A A A A A A A A};

my $freq = Statistics::Frequency->new( @dna );

printf ( "\nAnalyzing %s bases\n\n", $freq->frequencies_sum() );
map { printf("%s Frequency is %s\n",
$_,
$freq->frequency($_) || 'undefined',
)
} qw {T C A G}

__END__


If you want to specify an arbitrary range, just slice the array, such
as:

my $freq = Statistics::Frequency->new( @dna[4..9] );

(remember arrays start numbering at zero, so this will get the fifth-
tenth items)
 
J

John W. Krahn

michaelzhao said:
I am making a program to tally up the nucleic acid bases of E. Coli.
This data will be used in bioinformatics research to generate a Markov
Matrix.

However, I am just beginning PERL and ran into a slight problem.

Basically, I need the frequencies of the 4 bases, Adenosine (A),
Thymine (T), Cytosine (C), and Guanine (G). I made a PERL script to
tally up the total counts of the bases. In my case, there are 6254
Adenosine, 4957 Thymine, 4245 Cytosine, and 3534 Guanine.

In order to find the frequencies, all I would do is divide each base
count by the total.

However, here is my problem. Instead of doing it globally. I need to
be able to specify an arbitrary start and stop position to start
tallying the occurences of the bases and also to find the frequencies
of the bases in that particularly defined area.

Herein lies my problem. I have not a clue how to go about doing this
task. I've been looking online for a solution but haven't really found
one. If anyone can suggest an idea or function(s) I could use to go
about doing this task. I would be much obliged. Thanks!

You can use substr() to define a smaller area inside a string:

$ perl -le'
my $string =
q[ATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCG];
my $area = \substr $string, 23, 35;
my $Thymine_area = $$area =~ tr/A//;
my $Thymine_total = $string =~ tr/A//;

print for length( $string ), $Thymine_total, length( $$area ), $Thymine_area;
'
128
32
35
9



John
 
J

jgraber

I don't know bioperl (and there are LOTS of modules related to
bioperl; maybe one does exactly what/how you want) but if all you want
to do is basic frequency analysis on an array (assuming your bases are
in an array) then there are several statistics packages that will do
this for you. Consider:

snip example of use Statistics::Frequency;
If you want to specify an arbitrary range, just slice the array, such
as:
my $freq = Statistics::Frequency->new( @dna[4..9] );
(remember arrays start numbering at zero, so this will get the fifth-
tenth items)

I don't know bioperl either, but it seems like the DNA strings of ATCG
might be represented as long strings because dna people might like to
use regexp to search for subsets.

In that case, here are some idioms
for counting characters in a scalar using tr///,
and for accessing a subset of a string with substr; (pasted and tested)
# 0123456789.12345678
$dna = 'TCTCTCGGGAAGAGATTGA';
$Tcount = $dna =~ tr/T/T/; # tr returns number of T->T replacements
$start = 3; # inclusive, based at 0
$stop = 9; # inclusive
$len = $stop-$start+1
$Tcountss = substr($dna,$start,$len) =~ tr/T/T/; # access subset of $dna
print "dna = '$dna' has $Tcount T's,\n";
print " but only $Tcountss T's in $start to $stop, inclusive\n",
printf " for a relative frequency of %f\n", $Tcountss/$len;

-- output is

dna = 'TCTCTCGGGAAGAGATTGA' has 5 T's,
but only 1 T's in 3 to 9, inclusive
for a relative frequency of 0.142857

As mentioned before, posting short example datastructures from your code
shows you are paying attention, and enable more appropriate assistance.
 
M

michaelzhao

Here is my code. Keep in mind that this is like my 2nd week on Perl so
it probably is pretty crappy ;). Try not to give me too much flak
about it.

The DNA File is called "verified.fa". It is a FASTA file, a common
file format for storing DNA strings.

#!/usr/bin/perl -w

#The filename of the file containing the e-coli sequence data

$fastadata = 'verified.fa';

#This step is to open the file, if the file cannot be opened, an error
message is printed and the program exits
unless ( open(FASTAFILE,$fastadata) ) {
print "Could not open file $fastadata!\n";
exit;
}

#Read the protein sequence into an array. This array will be used to
generate the Markov Transitional Matrices
@fasta = <FASTAFILE>;

#Close the FASTA data now as all the data has been read into the array
close FASTAFILE;

#Extracting the FASTA file from the array and putting it into a
sequence data

sub extract_sequence_from_fasta_data {
my(@fasta) =@_;

use strict;
use warnings;

#declaing the variables
my $fastasequence = '';

foreach my $line (@fasta) {

#Discard blank line
if ($line =~ /^\s*$/) {
next;

#Discard comment line
} elsif($line =~ /^\s*#/) {
next;

#Discard fasta header line
} elsif($line =~ /^>/) {
next;
#Keep line, add to sequence string
} else {
$fastasequence .= $line;
}
}

#remove non-sequence data from the $sequence string
$fastasequence =~ s/\s//g;

return $fastasequence;
}

$fastasequence = extract_sequence_from_fasta_data (@fasta);

#Split up the FASTA sequence into individual elements
@fastasequence = split( '' , $fastasequence);

#Initialize the tally
$tally_of_Adenosine = 0;
$tally_of_Thymine = 0;
$tally_of_Guanine = 0;
$tally_of_Cytosine = 0;
$errors = 0;


#Initialize a loop so it can tally the nucleotides
foreach $base (@fastasequence) {

if ( $base eq 'A') {
++$tally_of_Adenosine;
} elsif ( $base eq 'T') {
++$tally_of_Thymine;
} elsif ( $base eq 'G') {
++$tally_of_Guanine;
} elsif ( $base eq 'C') {
++$tally_of_Cytosine;
} else {
print "!!!!!! Error - Base not recognized: $base\n";
++$errors;
}
}
#Now that we have a tally of the bases, we have have to calculate the
zero-order Markov Transitional Matrice
print "$tally_of_Adenosine\n";
print "$tally_of_Thymine\n";
print "$tally_of_Guanine\n";
print "$tally_of_Cytosine\n";
unless ($errors eq '0') {
print "$errors\n";
}
 
M

Mirco Wahab

michaelzhao said:
Here is my code. Keep in mind that this is like my 2nd week on Perl so
it probably is pretty crappy ;). Try not to give me too much flak
about it.

Thats really good! It works.
The DNA File is called "verified.fa". It is a FASTA file, a common
file format for storing DNA strings.
#!/usr/bin/perl -w
#The filename of the file containing the e-coli sequence data
$fastadata = 'verified.fa';


I'd guess all your program does or intends to do is the following:

...

my $fastasequence = '';
my %Tally;

open my $fh, '<', 'verified.fa' or die "can't open file!$!";
while (<$fh>) {
/([ATCG]+)(?{ $fastasequence .= $1 })/g unless /^[>#]/
}
close $fh;

1 while $fastasequence =~ /(.)(?{ ++$Tally{$1} })/g;


print
map "$_: $Tally{$_} " . (100. * $Tally{$_}/length $fastasequence) . "%\n",
keys %Tally;

...

(Please correct me if I'm wrong)


Please have also a look at:
http://www.perl.com/pub/a/2003/09/10/bioinformatics.html

Regards

M.
 
J

Jürgen Exner

michaelzhao said:
Here is my code.

I won't pretend to understand all of your code, but in this part below it
appears as if all you are doing is counting how often a specific character
appears in the given string $fastasequence.
#Split up the FASTA sequence into individual elements
@fastasequence = split( '' , $fastasequence);

#Initialize the tally
$tally_of_Adenosine = 0;
$tally_of_Thymine = 0;
$tally_of_Guanine = 0;
$tally_of_Cytosine = 0;
$errors = 0;


#Initialize a loop so it can tally the nucleotides
foreach $base (@fastasequence) {

if ( $base eq 'A') {
++$tally_of_Adenosine;
} elsif ( $base eq 'T') {
++$tally_of_Thymine;
} elsif ( $base eq 'G') {
++$tally_of_Guanine;
} elsif ( $base eq 'C') {
++$tally_of_Cytosine;
} else {
print "!!!!!! Error - Base not recognized: $base\n";
++$errors;
}

If my guess is correct, then this whole section can be replaces with a
simple

$tally_of_Adenosine = tr/A/A/;
$tally_of_Thymine = tr/T/T/;
$tally_of_Guanine = tr/G/G/;
$tally_of_Cytosine = tr/C/C/;

The tr// operator returns the number of replacements it has performed and
thus tells you how many A's it has replaced with A's.
From "perldoc perlop"
tr/SEARCHLIST/REPLACEMENTLIST/cds
Transliterates all occurrences of the characters found in the
search list with the corresponding character in the replacement
list. It returns the number of characters replaced or deleted.
[...]

jue
 
M

Mirco Wahab

Mirco said:
1 while $fastasequence =~ /(.)(?{ ++$Tally{$1} })/g;

Oh, I forgot: if you need defined ranges in the nucleotide
string, you could use the 'substr' function directly,
($st and $nn denote sequence start/seqence count)

...
my ($fastasequence, $st, $nn) = ('', 1, 444);

open my $fh, '<', 'verified.fa' or die "can't open file!$!";
while (<$fh>) { /([ATCG]+)(?{ $fastasequence .= $1 })/g unless /^[>#]/ }
close $fh;

my %Tally;
1 while substr($fastasequence, $st, $nn) =~ /(.)(?{ ++$Tally{$^N} })/g;

print
map "$_: $Tally{$_} ".(100*$Tally{$_}/length substr $fastasequence, $st, $nn)."%\n",
keys %Tally;
...

Regards

M.
 
M

michaelzhao

I don't know bioperl (and there are LOTS of modules related to
bioperl; maybe one does exactly what/how you want) but if all you want
to do is basic frequency analysis on an array (assuming your bases are
in an array) then there are several statistics packages that will do
this for you. Consider:

snip example of use Statistics::Frequency;
If you want to specify an arbitrary range, just slice the array, such
as:
my $freq = Statistics::Frequency->new( @dna[4..9] );
(remember arrays start numbering at zero, so this will get the fifth-
tenth items)

I don't know bioperl either, but it seems like the DNA strings of ATCG
might be represented as long strings because dna people might like to
use regexp to search for subsets.

In that case, here are some idioms
for counting characters in a scalar using tr///,
and for accessing a subset of a string with substr; (pasted and tested)
# 0123456789.12345678
$dna = 'TCTCTCGGGAAGAGATTGA';
$Tcount = $dna =~ tr/T/T/; # tr returns number of T->T replacements
$start = 3; # inclusive, based at 0
$stop = 9; # inclusive
$len = $stop-$start+1
$Tcountss = substr($dna,$start,$len) =~ tr/T/T/; # access subset of $dna
print "dna = '$dna' has $Tcount T's,\n";
print " but only $Tcountss T's in $start to $stop, inclusive\n",
printf " for a relative frequency of %f\n", $Tcountss/$len;

-- output is

dna = 'TCTCTCGGGAAGAGATTGA' has 5 T's,
but only 1 T's in 3 to 9, inclusive
for a relative frequency of 0.142857

As mentioned before, posting short example datastructures from your code
shows you are paying attention, and enable more appropriate assistance.

Wow Joel! Thanks a lot for your help. Your suggestion really hit the
spot. I was able to learn quite a bit from the code you put up and
succesfully modified it to fit my needs.

To everyone else: Thanks for the suggestions and the help! With the
exception of the jerkoff who wrote the first response.
 
T

Tad McClellan

michaelzhao said:
On Jun 26, 5:23 pm, (e-mail address removed) wrote:


Who mentioned it before?

To everyone else: Thanks for the suggestions and the help! With the
exception of the jerkoff who wrote the first response.


Oh. It was the "jerkoff" who mentioned how to get help with your problem.

Calling the regulars names will have a negative effect on your ability
to get quality help here in the future. I suggest you try to refrain
from doing that anymore.
 
M

Mirco Wahab

bugbear said:
bugbear said:
$A[i2] - $A[i1],

and so on for the other bases.

Dang - just thought of something else

For a 25% storage saving (since
each of my suggested arrays has as many entries as the sequence is
long), you only need 3 arrays, since the fourth
count is length-of-sequence - (sum of other 3 counts)

Hi bugbear, IMHO one shouldn't do to much with arrays in the
case of real genome files. A "good" fasta file may have dozens
of millions of nucleobases, one example here:
http://www.sanger.ac.uk/Software/analysis/projector/projector_test_set.fasta

Regards

M.
 
U

usenet


Don't it just annoy you when you take the time to read someone's post
and make a reply, and then this n00b goes off and flames a respected
group regular who was only trying to be helpful. It really makes you
sorry you ever read or responded to the post.

* Ploink *
 
M

michaelzhao

Don't it just annoy you when you take the time to read someone's post
and make a reply, and then this n00b goes off and flames a respected
group regular who was only trying to be helpful. It really makes you
sorry you ever read or responded to the post.

* Ploink *

Ha! Lets see... I was grateful for everyone elses suggestions. But
tell me, besides informing me that "PERL" was actually "Perl", he did
not help me at all with my problem besides giving me the response that
I needed the "$" variable.

Other people helped me a good deal giving relevant and educational
discussion and I am grateful to them. So, I don't care if he is a
regular or if hes a Perl god for that matter. The post he made was
neither helpful nor educational... besides me learning the "Perl" of
course. Oh, and next time I'll post a short script.
 
A

anno4000

Mirco Wahab said:
bugbear said:
bugbear said:
$A[i2] - $A[i1],

and so on for the other bases.

Dang - just thought of something else

For a 25% storage saving (since
each of my suggested arrays has as many entries as the sequence is
long), you only need 3 arrays, since the fourth
count is length-of-sequence - (sum of other 3 counts)

Hi bugbear, IMHO one shouldn't do to much with arrays in the
case of real genome files. A "good" fasta file may have dozens
of millions of nucleobases, one example here:
http://www.sanger.ac.uk/Software/analysis/projector/projector_test_set.fasta

You could still usefully store only the counts for substrings that end on
a multiple of 100 (or 1000). That would take only 1 (0.1) % of the full
storage, but you'd need to re-scan only relatively short strings for
an exact count.

Anno
 
P

Peter J. Holzer

bugbear said:
bugbear said:
$A[i2] - $A[i1],

and so on for the other bases.

Dang - just thought of something else

For a 25% storage saving (since
each of my suggested arrays has as many entries as the sequence is
long), you only need 3 arrays, since the fourth
count is length-of-sequence - (sum of other 3 counts)

Hi bugbear, IMHO one shouldn't do to much with arrays in the
case of real genome files. A "good" fasta file may have dozens
of millions of nucleobases, one example here:
http://www.sanger.ac.uk/Software/analysis/projector/projector_test_set.fasta

You could store the counts in a string, since a perl string is
essentially an array of 32 bit integers (but the utf-8 encoding may make
accesses slow and updates in the middle even slower).

hp
 
D

Dr.Ruud

Peter J. Holzer schreef:
a perl string is essentially an array of 32 bit integers
(but the utf-8 encoding may
make accesses slow and updates in the middle even slower).

Yes, that is another projection of Perl strings. Putting substrings
(like with length 256 characters) in a Perl array would probably be less
slow.

Or use pack/unpack with the 'N' or 'J' template.

my $iMax = 1234567890; # maximum index +1
my $N = 256; # chunk size
my $B = 4; # octets per value
my @buf = (chr(0) x ($B*$N)) x (int($iMax/$N) + ($iMax%$N)?1:0);

my $i = 54321;

for ( substr $buf[int($i/$N)], ($i%$N)*$B, $B ) {
my $v = unpack 'N', $_;
$v ++;
$_ = pack 'N', $v;
}

(untested)

For the 'J' template, you would have to set $B dynamically, like from
length(pack("J", 0)).
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,744
Messages
2,569,484
Members
44,903
Latest member
orderPeak8CBDGummies

Latest Threads

Top