Handling Huge Data

Vishal G

Hi Guys,

I am trying to edit a bioinformatics package written in Perl which
was written to handle a DNA sequence about 500,000 bases long (a
string containing 500,000 chars)..

I have to enhance it to handle DNA 100 million bases long...

Each base in the DNA has this information: base (A, C, G or T), qual
(0-99), position (1-length).

There is one main DNA sequence and on average 500,000 parts (max 2000
chars long, with the same set of information)...

The program first creates an alignment like
<code>
                                     *
Main - .....ACCCTTTGTCTAGTCGTATCGTCGATCGTCGCTAGCTCTGCT....
Part -                  GTCGTATCGTCGAACGTCGCTAGCTC
Part -         CTTTGTCTAGTCGTATCGTCGATCGTCGCT
Part -                           TCGAACGTCGCTAGC
</code>
Now, let's say I have to go through each position and find how many
variations are present at a certain position (with their original
position and quality).

Look at the * position: there is a T-A variation.

Right now they are using hashes to capture this:

%A, %C, %G, %T

Loop For Main DNA {
    $T{$pos} = $qual;
    # this tells me that there is a T base at a certain position with some qual
}

Update the qual by adding the qual of the parts:

Loop For Parts {
    $A{$pos} += $qual;  # for A parts
    $T{$pos} += $qual;  # for T parts
}

But because the dataset is huge, it consumes a lot of memory...

so basically I am trying to figure out a way to store this information
without using much memory.

If you don't understand the above problem, don't worry....

just tell me how to handle huge data which needs to be accessed
frequently using the least possible memory..

Thanks in advance
 

xhoster

Vishal G said:
Hi Guys,

I am trying to edit a bioinformatics package written in Perl which
was written to handle a DNA sequence about 500,000 bases long (a
string containing 500,000 chars)..

I have to enhance it to handle DNA 100 million bases long...

Each base in the DNA has this information: base (A, C, G or T), qual
(0-99), position (1-length).

There is one main DNA sequence and on average 500,000 parts (max 2000
chars long, with the same set of information)...

How is this data stored? Is it all in memory at once?
The program first creates an alignment like
<code>

*
Main - .....ACCCTTTGTCTAGTCGTATCGTCGATCGTCGCTAGCTCTGCT....
Part -
GTCGTATCGTCGAACGTCGCTAGCTC
Part - CTTTGTCTAGTCGTATCGTCGATCGTCGCT
Part
-
TCGAACGTCGCTAGC
</code>

It looks like your alignment was line-wrapped into oblivion. Anyway,
how was the alignment on such a large dataset done? Couldn't your quality
summarization be best implemented by pushing it into the aligner code?

Now, let's say I have to go through each position and find how many
variations are present at a certain position (with their original
position and quality).

Look at the * position: there is a T-A variation.

Right now they are using hashes to capture this:

%A, %C, %G, %T

Loop For Main DNA {
    $T{$pos} = $qual;
    # this tells me that there is a T base at a certain position with some qual

Since $pos is an integer and seems to be dense (every or almost every
position from 0 up to the length-1 will be occupied), then you should
consider using an array rather than a hash. That might save some memory.
On the other hand, it might take more memory if most positions are
unanimous, meaning that 3 of the 4 base-hashes would not have a value for
any given position.

Also, where is $qual coming from? Obviously it isn't a constant over the
life of the loop, like you have it shown. Doesn't it have to draw from
something in RAM to obtain its value?
}

Update the qual by adding the qual of parts

Loop For Parts {
    $A{$pos} += $qual;  # for A parts
    $T{$pos} += $qual;  # for T parts
}

Is there another loop over $pos? If so, is it inside the Loop for parts
or outside of it? Again, where does $qual come from?
But because the dataset is huge, it consumes a lot of memory...

so basically I am trying to figure out a way to store this information
without using much memory.

You could "pack" the numbers into strings and manipulate them with
"substr". I think there are even some Tie modules that do this for you, but
the speed decrease might be substantial.
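For instance, a minimal sketch of the pack-into-a-string idea (names here are illustrative, not from the original package): each position gets a fixed-width decimal slot in one long string, read and updated with substr.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sketch: one fixed-width decimal slot per position, all in a single
# string, instead of one hash entry per position.
my $unitlength = 3;
my $positions  = 10;                        # small for illustration
my $quals      = '0' x ($unitlength * $positions);

# Add $q to the slot for position $pos (0-based); add_qual is a
# made-up helper for this example.
sub add_qual {
    my ($str_ref, $pos, $q) = @_;
    my $off = $pos * $unitlength;
    my $new = substr($$str_ref, $off, $unitlength) + $q;
    substr($$str_ref, $off, $unitlength,
           sprintf('%0*d', $unitlength, $new));
}

add_qual(\$quals, 4, 37);
add_qual(\$quals, 4, 20);
print substr($quals, 4 * $unitlength, $unitlength), "\n";   # prints 057
```

At 3 bytes per position this is far smaller than a hash, at the cost of a sprintf/substr pair on every update.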

What I would probably do is use Inline::C and have the data be accumulated
in a C float or double array, rather than a perl structure.

Or maybe you can address one $pos at a time, and output the results of that
$pos to disk before moving on to the next one, rather than accumulating
into memory.
If you don't understand the above problem, don't worry....

just tell me how to handle huge data which needs to be accessed
frequently using the least possible memory..

Don't worry about what disease I actually have doc, just give me the cure.
I'm afraid that isn't likely to work well. The details of the solution
are likely to depend on the details of the problem.

Xho

--
-------------------- http://NewsReader.Com/ --------------------
The costs of publication of this article were defrayed in part by the
payment of page charges. This article must therefore be hereby marked
advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate
this fact.
 

Eric Pozharski

Vishal G said:
If you don't understand the above problem, don't worry....

You first...

just tell me how to handle huge data which needs to be accessed
frequently using the least possible memory..

Free your mind of slurping (quite impossible if you came from a world
where cycles are cheap, memory is cheap, disks are cheap, etc.). Then
use C<DBI> (I prefer B<DBD::SQLite>, it's fscking fast).
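As a hedged sketch of that approach (the table, column, and helper names are invented for this example; it assumes DBD::SQLite is installed), the per-position accumulation could go through SQLite instead of Perl hashes:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use DBI;

# Sketch: keep the per-position quality sums in SQLite (here an
# in-memory database; use a filename to put it on disk).
my $dbh = DBI->connect('dbi:SQLite:dbname=:memory:', '', '',
                       { RaiseError => 1, AutoCommit => 1 });

$dbh->do('CREATE TABLE qual (
              pos  INTEGER,
              base TEXT,
              qual INTEGER,
              PRIMARY KEY (pos, base))');

my $seed = $dbh->prepare('INSERT OR IGNORE INTO qual VALUES (?, ?, 0)');
my $bump = $dbh->prepare('UPDATE qual SET qual = qual + ?
                          WHERE pos = ? AND base = ?');

# add_qual is a made-up helper: ensure the row exists, then add to it.
sub add_qual {
    my ($pos, $base, $q) = @_;
    $seed->execute($pos, $base);
    $bump->execute($q, $pos, $base);
}

add_qual(30, 'A', 40);   # one part sees A at position 30, qual 40
add_qual(30, 'A', 25);   # another part agrees, qual 25

my ($sum) = $dbh->selectrow_array(
    'SELECT qual FROM qual WHERE pos = ? AND base = ?', undef, 30, 'A');
print "$sum\n";   # prints 65
```

Wrapping batches of updates in a transaction (AutoCommit off, periodic commit) matters a lot for speed with this many rows.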

P.S. And a piece of advice: if you're not going to show code that
"clearly exhibits your problem" -- don't expect much help here.
 

Vishal G

Hello Guys,

Thanks for your advice and sorry for being so vague...

In simple words if I have this code...

my $unitlength = 3;
my $dnaLength = 100000000;

my $A = sprintf("%3d", 0) x $dnaLength;
my $C = sprintf("%3d", 0) x $dnaLength;
my $G = sprintf("%3d", 0) x $dnaLength;
my $T = sprintf("%3d", 0) x $dnaLength;
my $I = sprintf("%3d", 0) x $dnaLength;

# Assign quality information of DNA
print "DNA Processing";
my ($num, $qual);
for (my $i = 0; $i < $dnaLength; $i++) {
    $num  = int(rand(5)) + 1;
    $qual = int(rand(99)) + 1;

    if ($num == 1) {
        # Base A at position $i with base quality $qual
        substr($A, $i * $unitlength, $unitlength, sprintf("%${unitlength}d", $qual));
    } elsif ($num == 2) {
        substr($C, $i * $unitlength, $unitlength, sprintf("%${unitlength}d", $qual));
    } elsif ($num == 3) {
        substr($G, $i * $unitlength, $unitlength, sprintf("%${unitlength}d", $qual));
    } elsif ($num == 4) {
        substr($T, $i * $unitlength, $unitlength, sprintf("%${unitlength}d", $qual));
    } elsif ($num == 5) {
        substr($I, $i * $unitlength, $unitlength, sprintf("%${unitlength}d", $qual));
    }
}

print "Member Processing\n";
my ($start, $stop);
for (my $j = 0; $j < 50000; $j++) {
    # Start and Stop of member with respect to DNA
    $start = int(rand($dnaLength - 2000)) + 1;  # Member start with respect to DNA
    $stop  = $dnaLength;                        # Finish at end

    for (my $i = $start; $i <= $stop; $i++) {
        $num  = int(rand(5)) + 1;
        $qual = int(rand(99)) + 1;
        if ($num == 1) {
            $qual += int( substr($A, $i * $unitlength, $unitlength) );
            substr($A, $i * $unitlength, $unitlength, sprintf("%${unitlength}d", $qual));
        } elsif ($num == 2) {
            $qual += int( substr($C, $i * $unitlength, $unitlength) );
            substr($C, $i * $unitlength, $unitlength, sprintf("%${unitlength}d", $qual));
        } elsif ($num == 3) {
            $qual += int( substr($G, $i * $unitlength, $unitlength) );
            substr($G, $i * $unitlength, $unitlength, sprintf("%${unitlength}d", $qual));
        } elsif ($num == 4) {
            $qual += int( substr($T, $i * $unitlength, $unitlength) );
            substr($T, $i * $unitlength, $unitlength, sprintf("%${unitlength}d", $qual));
        } elsif ($num == 5) {
            $qual += int( substr($I, $i * $unitlength, $unitlength) );
            substr($I, $i * $unitlength, $unitlength, sprintf("%${unitlength}d", $qual));
        }
    }
}

I ran this code and it consumes around 3.0 GB of memory...

I have also run this same code using hashes (%A, %C, ....) (8.0+ GB) and
with arrays (5.0+ GB).

Is there any other way to store the information using less memory?

Thanks
 

John W. Krahn

Vishal said:
Hello Guys,

Thanks for your advice and sorry for being so vague...

In simple words if I have this code...

my $unitlength = 3;
my $dnaLength = 100000000;

my $A = sprintf("%3d", 0) x $dnaLength;
my $C = sprintf("%3d", 0) x $dnaLength;
my $G = sprintf("%3d", 0) x $dnaLength;
my $T = sprintf("%3d", 0) x $dnaLength;
my $I = sprintf("%3d", 0) x $dnaLength;

Why not just:

my $A = '000' x $dnaLength;
my $C = '000' x $dnaLength;
my $G = '000' x $dnaLength;
my $T = '000' x $dnaLength;
my $I = '000' x $dnaLength;

Or even:

my $A = my $C = my $G = my $T = my $I = '000' x $dnaLength;

# Assign quality information of DNA
print "DNA Processing";
my ($num, $qual);
for (my $i = 0; $i < $dnaLength; $i++) {
    $num  = int(rand(5)) + 1;
    $qual = int(rand(99)) + 1;

    if ($num == 1) {
        # Base A at position $i with base quality $qual
        substr($A, $i * $unitlength, $unitlength, sprintf("%${unitlength}d", $qual));
    } elsif ($num == 2) {
        substr($C, $i * $unitlength, $unitlength, sprintf("%${unitlength}d", $qual));
    } elsif ($num == 3) {
        substr($G, $i * $unitlength, $unitlength, sprintf("%${unitlength}d", $qual));
    } elsif ($num == 4) {
        substr($T, $i * $unitlength, $unitlength, sprintf("%${unitlength}d", $qual));
    } elsif ($num == 5) {
        substr($I, $i * $unitlength, $unitlength, sprintf("%${unitlength}d", $qual));
    }
}

If you wanted, you *could* write that loop as:

for my $i ( 0 .. $dnaLength - 1 ) {
    substr ${ \( $A, $C, $G, $T, $I )[ rand 5 ] }, $i * $unitlength,
        $unitlength, sprintf '%*d', $unitlength, 1 + int rand 99;
}

print "Member Processing\n";
my ($start, $stop);
for (my $j = 0; $j < 50000; $j++) {
    # Start and Stop of member with respect to DNA
    $start = int(rand($dnaLength - 2000)) + 1;  # Member start with respect to DNA
    $stop  = $dnaLength;                        # Finish at end

    for (my $i = $start; $i <= $stop; $i++) {
        $num  = int(rand(5)) + 1;
        $qual = int(rand(99)) + 1;
        if ($num == 1) {
            $qual += int( substr($A, $i * $unitlength, $unitlength) );
            substr($A, $i * $unitlength, $unitlength, sprintf("%${unitlength}d", $qual));
        } elsif ($num == 2) {
            $qual += int( substr($C, $i * $unitlength, $unitlength) );
            substr($C, $i * $unitlength, $unitlength, sprintf("%${unitlength}d", $qual));
        } elsif ($num == 3) {
            $qual += int( substr($G, $i * $unitlength, $unitlength) );
            substr($G, $i * $unitlength, $unitlength, sprintf("%${unitlength}d", $qual));
        } elsif ($num == 4) {
            $qual += int( substr($T, $i * $unitlength, $unitlength) );
            substr($T, $i * $unitlength, $unitlength, sprintf("%${unitlength}d", $qual));
        } elsif ($num == 5) {
            $qual += int( substr($I, $i * $unitlength, $unitlength) );
            substr($I, $i * $unitlength, $unitlength, sprintf("%${unitlength}d", $qual));
        }
    }
}

I ran this code and it consumes around 3.0 GB of memory...

You are running out of memory because when you add the numbers together
they are sometimes longer than $unitlength which causes the strings to
expand.

$ perl -le'printf "%3d\n", 900 + 800'
1700

I have also ran this same code using Hash (%A, %C,....) (8.0+ GB) and
with Array (5.0+ GB)

Is there any other way to store the information using less memory.

If you want to keep the substrings at only $unitlength you could use
either modulus:

$ perl -le'printf "%3d\n", ( 900 + 800 ) % 1000'
700

Or a truncating sprintf format:

$ perl -le'printf "%3.3s\n", 900 + 800'
170



John
 

xhoster

John W. Krahn said:
Why not just:

my $A = '000' x $dnaLength;
my $C = '000' x $dnaLength;
my $G = '000' x $dnaLength;
my $T = '000' x $dnaLength;
my $I = '000' x $dnaLength;

Or even:

my $A = my $C = my $G = my $T = my $I = '000' x $dnaLength;

Or better yet:

my %h;
$h{$_}='000' x $dnaLength foreach qw/A C G T I/;

Or, since $num is numeric:

$h{$_}='000' x $dnaLength foreach 1..5;


This cuts the memory use almost in half, as each of the lexical instances
of '000' x $dnaLength takes up memory and doesn't seem to release it.


Replace the ugly switch statement with:

substr($h{$num}, $i * $unitlength, #....


Shouldn't it finish at its own end, $start+2000-1, not at the main sequence
end?


This too could be replaced by $h{$num} in the substr and getting rid of
the big if blocks.
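Putting those pieces together, one possible shape for the update loop (a sketch; add_qual is a made-up helper, and $unitlength is widened to 4 to leave headroom for sums):

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $unitlength = 4;                         # wider slots reduce overflow risk
my $dnaLength  = 100;                       # small for illustration

# One packed string per base number, built at run time so only one
# literal '0' x ... expression exists.
my %h;
$h{$_} = '0' x ($unitlength * $dnaLength) for 1 .. 5;

# The if/elsif ladder collapses to a single indexed update.
sub add_qual {
    my ($num, $pos, $qual) = @_;
    my $off = $pos * $unitlength;
    $qual += substr($h{$num}, $off, $unitlength);
    substr($h{$num}, $off, $unitlength, sprintf('%0*d', $unitlength, $qual));
}

add_qual(2, 10, 99);    # base C (num 2) at position 10, qual 99
add_qual(2, 10, 80);
print substr($h{2}, 10 * $unitlength, $unitlength), "\n";   # prints 0179
```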

....
You are running out of memory because when you add the numbers together
they are sometimes longer than $unitlength which causes the strings to
expand.

$ perl -le'printf "%3d\n", 900 + 800'
1700

This is truly a problem, but it is a correctness problem. In my hands
it leads to almost no size inflation. The way he stores data, the minimum
possible size would be 1.5e9 bytes, (5*3*1e8) and the way the x operator
works inflates that to 3e9 bytes if you have 5 literal instances of it.

I've shown how to cut it almost in half (but you will need to increase
$unitlength unless you want to get wrong answers or lose data, which will
cost you more space).
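Another way to widen the slots, not raised in the thread so far: vec() treats a string as a packed array of fixed-width binary integers, so a 16-bit slot costs 2 bytes per position (versus 3+ decimal characters) and holds sums up to 65535 before wrapping. A minimal sketch:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sketch: store each position's summed qual as a 16-bit integer with
# vec(), 2 bytes per position instead of 3 decimal characters.
my $dnaLength = 1000;                 # small for illustration
my $A = "\0" x (2 * $dnaLength);      # pre-extend so the string never grows

vec($A, 42, 16) += 37;                # add qual 37 at position 42
vec($A, 42, 16) += 20;
print vec($A, 42, 16), "\n";          # prints 57
```

The same pattern with 32-bit slots (vec($A, $i, 32), 4 bytes each) removes any realistic overflow risk and is still smaller than the hash version.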

But the real answer is not to store the entire set in RAM at all.

Xho

 

Ilya Zakharevich

[A complimentary Cc of this posting was sent to John W. Krahn]
my $A = my $C = my $G = my $T = my $I = '000' x $dnaLength;

For best results, use

my $I = '000';
$I x= $dnaLength;
my $A = my $C = my $G = my $T = $I;

(otherwise '000' x $dnaLength is computed at compile time, and remains
in the compiled tree).

And do not have anything "large" as a last statement of a subroutine -
unless you want it to be duplicated to create a return value of the
subroutine.

Hope this helps,
Ilya
 
