Vishal G
Hi Guys,
I am trying to modify a bioinformatics package written in Perl that
was designed to handle a DNA sequence about 500,000 bases long (a
string containing 500,000 characters).
I have to extend it to handle DNA sequences 100 million bases long.
Each base in the DNA carries this information: base (A, C, G or T), qual
(0-99), position (1-length).
There is one main DNA sequence and, on average, 500,000 parts (each at most
2000 characters long, with the same set of information).
The program first creates an alignment like
                                     *
Main - .....ACCCTTTGTCTAGTCGTATCGTCGATCGTCGCTAGCTCTGCT....
Part -                  GTCGTATCGTCGAACGTCGCTAGCTC
Part -         CTTTGTCTAGTCGTATCGTCGATCGTCGCT
Part -                           TCGAACGTCGCTAGCTCTG
Now, let's say I have to go through each position and find how many
variations are present at that position (with their original
position and quality).
Look at the * position: there is a T-A variation.
Right now the package uses hashes to capture this:
%A, %C, %G, %T
Loop For Main DNA {
    $A{$pos} = $qual;    # there is an A base at this position of the main sequence, with this qual
}
Update the qual by adding the qual of the parts:
Loop For Parts {
    $A{$pos} += $qual;   # for A parts
    $T{$pos} += $qual;   # for T parts
}
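To make the pseudocode above concrete, here is a minimal runnable sketch of the same hash-based tally. It is not the package's actual code: the toy data, the `[position, base, qual]` record layout, the single `%sum` hash of hashes (standing in for the separate %A, %C, %G, %T hashes), and the `bases_at` helper are all assumptions made for illustration.

```perl
use strict;
use warnings;

# Hypothetical toy data: each base is a record [position, base, qual].
# The real package builds these from its own alignment step.
my @main  = ([1, 'A', 40], [2, 'T', 35], [3, 'C', 50]);
my @parts = (
    [[2, 'A', 20], [3, 'C', 30]],   # this part has an A where main has a T
    [[1, 'A', 25], [2, 'T', 10]],
);

# One inner hash per base, keyed by position, holding the summed quality.
# (Assumption: a hash of hashes replaces the four hashes %A, %C, %G, %T.)
my %sum = (A => {}, C => {}, G => {}, T => {});

# The main sequence sets the starting quality for its base at each position.
for my $rec (@main) {
    my ($pos, $base, $qual) = @$rec;
    $sum{$base}{$pos} = $qual;
}

# Each part adds its quality to the matching base/position entry.
for my $part (@parts) {
    for my $rec (@$part) {
        my ($pos, $base, $qual) = @$rec;
        $sum{$base}{$pos} += $qual;
    }
}

# Hypothetical helper: the bases seen at a position. More than one base
# means there is a variation at that position.
sub bases_at {
    my ($pos) = @_;
    return grep { exists $sum{$_}{$pos} } qw(A C G T);
}

print join(",", bases_at(2)), "\n";   # prints "A,T"
```

Every position/base pair that occurs gets its own hash entry, which is where the memory goes once the main sequence reaches 100 million bases.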
But because the dataset is huge, this consumes a lot of memory.
So basically I am trying to figure out a way to store this information
without using so much memory.
If you don't understand the above problem, don't worry;
just tell me how to handle huge data that needs to be accessed frequently
using the least possible memory.
Thanks in advance