Handling Huge Data

Vishal G

Hi Guys,

I am trying to edit a bioinformatics package written in Perl which
was written to handle a DNA sequence about 500,000 bases long (a
string containing 500,000 chars)..

I have to enhance it to handle DNA 100 million bases long...

Each base in the DNA has this information: base (A, C, G or T), qual
(0-99), position (1-length).

There is one main DNA sequence and on average 500,000 parts (max 2000
chars long, with the same set of information)...

The program first creates an alignment like
<code>
                                     *
Main - .....ACCCTTTGTCTAGTCGTATCGTCGATCGTCGCTAGCTCTGCT....
Part -                  GTCGTATCGTCGAACGTCGCTAGCTC
Part -         CTTTGTCTAGTCGTATCGTCGATCGTCGCT
Part -                           TCGAACGTCGCTAGC
</code>
Now, let's say I have to go through each position and find how many
variations are present at a certain position (with their original
position and quality).

Look at the * position: there is a T-A variation.

Right now they are using hashes to capture this:

%A, %C, %G, %T

Loop For Main DNA {
    $T{$pos} = $qual;
    # this tells me that there is a T base at a certain position with some qual
}

Update the qual by adding the qual of the parts:

Loop For Parts {
    $A{$pos} += $qual;  # for A parts
    $T{$pos} += $qual;  # for T parts
}

But because the dataset is huge, it consumes a lot of memory...

so basically I am trying to figure out a way to store this information
without using much memory.

If you don't understand the above problem, don't worry....

just tell me how to handle huge data which needs to be accessed
frequently using the least possible memory..

Thanks in advance
 

xhoster

Vishal G said:
Hi Guys,

I am trying to edit a bioinformatics package written in Perl which
was written to handle a DNA sequence about 500,000 bases long (a
string containing 500,000 chars)..

I have to enhance it to handle DNA 100 million bases long...

Each base in the DNA has this information: base (A, C, G or T), qual
(0-99), position (1-length).

There is one main DNA sequence and on average 500,000 parts (max 2000
chars long, with the same set of information)...

How is this data stored? Is it all in memory at once?
The program first creates an alignment like
<code>

*
Main - .....ACCCTTTGTCTAGTCGTATCGTCGATCGTCGCTAGCTCTGCT....
Part -
GTCGTATCGTCGAACGTCGCTAGCTC
Part - CTTTGTCTAGTCGTATCGTCGATCGTCGCT
Part
-
TCGAACGTCGCTAGC
</code>

It looks like your alignment was line-wrapped into oblivion. Anyway,
how was the alignment on such a large dataset done? Couldn't your quality
summarization be best implemented by pushing it into the aligner code?

Now, let's say I have to go through each position and find how many
variations are present at a certain position (with their original
position and quality).

Look at the * position: there is a T-A variation.

Right now they are using hashes to capture this:

%A, %C, %G, %T

Loop For Main DNA {
    $T{$pos} = $qual;
    # this tells me that there is a T base at a certain position with some qual

Since $pos is an integer and seems to be dense (every or almost every
position from 0 up to the length-1 will be occupied), then you should
consider using an array rather than a hash. That might save some memory.
On the other hand, it might take more memory if most positions are
unanimous, meaning that 3 of the 4 base-hashes would not have a value for
any given position.

Also, where is $qual coming from? Obviously it isn't a constant over the
life of the loop, like you have it shown. Doesn't it have to draw from
something in RAM to obtain its value?
}

Update the qual by adding the qual of parts

Loop For Parts {
    $A{$pos} += $qual;  # for A parts
    $T{$pos} += $qual;  # for T parts
}

Is there another loop over $pos? If so, is it inside the Loop for parts
or outside of it? Again, where does $qual come from?
But because the dataset is huge, it consumes a lot of memory...

so basically I am trying to figure out a way to store this information
without using much memory.

You could "pack" the numbers into strings and manipulate them with
"substr". I think there are even some Tie modules that do this for you, but
the speed decrease might be substantial.
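For instance, a minimal sketch of the pack-into-a-string idea (names here are illustrative, not from the original package): each position gets a fixed-width decimal slot in one long string, read and updated with substr.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sketch: one fixed-width decimal slot per position, all in a single
# string, instead of one hash entry per position.
my $unitlength = 3;
my $positions  = 10;                        # small for illustration
my $quals      = '0' x ($unitlength * $positions);

# Add $q to the slot for position $pos (0-based); add_qual is a
# made-up helper for this example.
sub add_qual {
    my ($str_ref, $pos, $q) = @_;
    my $off = $pos * $unitlength;
    my $new = substr($$str_ref, $off, $unitlength) + $q;
    substr($$str_ref, $off, $unitlength,
           sprintf('%0*d', $unitlength, $new));
}

add_qual(\$quals, 4, 37);
add_qual(\$quals, 4, 20);
print substr($quals, 4 * $unitlength, $unitlength), "\n";   # prints 057
```

At 3 bytes per position this is far smaller than a hash, at the cost of a sprintf/substr pair on every update.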

What I would probably do is use Inline::C and have the data be accumulated
in a C float or double array, rather than a perl structure.

Or maybe you can address one $pos at a time, and output the results of that
$pos to disk before moving on to the next one, rather than accumulating
into memory.
If you don't understand the above problem, don't worry....

just tell me how to handle huge data which needs to be accessed
frequently using the least possible memory..

Don't worry about what disease I actually have doc, just give me the cure.
I'm afraid that isn't likely to work well. The details of the solution
are likely to depend on the details of the problem.

Xho

--
-------------------- http://NewsReader.Com/ --------------------
The costs of publication of this article were defrayed in part by the
payment of page charges. This article must therefore be hereby marked
advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate
this fact.
 

Eric Pozharski

Vishal G said:
If you don't understand the above problem, don't worry....

You first...

just tell me how to handle huge data which needs to be accessed
frequently using the least possible memory..

Free your mind of slurping (quite impossible if you came from a world
where cycles are cheap, memory is cheap, disks are cheap, etc.). Then
use C<DBI> (I prefer B<DBD::SQLite>, it's fscking fast).
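As a hedged sketch of that approach (the table, column, and helper names are invented for this example; it assumes DBD::SQLite is installed), the per-position accumulation could go through SQLite instead of Perl hashes:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use DBI;

# Sketch: keep the per-position quality sums in SQLite (here an
# in-memory database; use a filename to put it on disk).
my $dbh = DBI->connect('dbi:SQLite:dbname=:memory:', '', '',
                       { RaiseError => 1, AutoCommit => 1 });

$dbh->do('CREATE TABLE qual (
              pos  INTEGER,
              base TEXT,
              qual INTEGER,
              PRIMARY KEY (pos, base))');

my $seed = $dbh->prepare('INSERT OR IGNORE INTO qual VALUES (?, ?, 0)');
my $bump = $dbh->prepare('UPDATE qual SET qual = qual + ?
                          WHERE pos = ? AND base = ?');

# add_qual is a made-up helper: ensure the row exists, then add to it.
sub add_qual {
    my ($pos, $base, $q) = @_;
    $seed->execute($pos, $base);
    $bump->execute($q, $pos, $base);
}

add_qual(30, 'A', 40);   # one part sees A at position 30, qual 40
add_qual(30, 'A', 25);   # another part agrees, qual 25

my ($sum) = $dbh->selectrow_array(
    'SELECT qual FROM qual WHERE pos = ? AND base = ?', undef, 30, 'A');
print "$sum\n";   # prints 65
```

Wrapping batches of updates in a transaction (AutoCommit off, periodic commit) matters a lot for speed with this many rows.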

P.S. And a piece of advice: if you're not going to show code that
"clearly exhibits your problem" -- don't expect much help here.
 

Vishal G

Hello Guys,

Thanks for your advice and sorry for being so vague...

In simple words if I have this code...

my $unitlength = 3;
my $dnaLength = 100000000;

my $A = sprintf("%3d", 0) x $dnaLength;
my $C = sprintf("%3d", 0) x $dnaLength;
my $G = sprintf("%3d", 0) x $dnaLength;
my $T = sprintf("%3d", 0) x $dnaLength;
my $I = sprintf("%3d", 0) x $dnaLength;

# Assign quality information of DNA
print "DNA Processing";
my ($num, $qual);
for (my $i = 0; $i < $dnaLength; $i++) {
    $num  = int(rand(5)) + 1;
    $qual = int(rand(99)) + 1;

    if ($num == 1) {
        # Base A at position $i with base quality $qual
        substr($A, $i * $unitlength, $unitlength, sprintf("%${unitlength}d", $qual));
    } elsif ($num == 2) {
        substr($C, $i * $unitlength, $unitlength, sprintf("%${unitlength}d", $qual));
    } elsif ($num == 3) {
        substr($G, $i * $unitlength, $unitlength, sprintf("%${unitlength}d", $qual));
    } elsif ($num == 4) {
        substr($T, $i * $unitlength, $unitlength, sprintf("%${unitlength}d", $qual));
    } elsif ($num == 5) {
        substr($I, $i * $unitlength, $unitlength, sprintf("%${unitlength}d", $qual));
    }
}

print "Member Processing\n";
my ($start, $stop);
for (my $j = 0; $j < 50000; $j++) {
    # Start and Stop of member with respect to DNA
    $start = int(rand($dnaLength - 2000)) + 1;  # Member start with respect to DNA
    $stop  = $dnaLength;                        # Finish at end

    for (my $i = $start; $i <= $stop; $i++) {
        $num  = int(rand(5)) + 1;
        $qual = int(rand(99)) + 1;
        if ($num == 1) {
            $qual += int( substr($A, $i * $unitlength, $unitlength) );
            substr($A, $i * $unitlength, $unitlength, sprintf("%${unitlength}d", $qual));
        } elsif ($num == 2) {
            $qual += int( substr($C, $i * $unitlength, $unitlength) );
            substr($C, $i * $unitlength, $unitlength, sprintf("%${unitlength}d", $qual));
        } elsif ($num == 3) {
            $qual += int( substr($G, $i * $unitlength, $unitlength) );
            substr($G, $i * $unitlength, $unitlength, sprintf("%${unitlength}d", $qual));
        } elsif ($num == 4) {
            $qual += int( substr($T, $i * $unitlength, $unitlength) );
            substr($T, $i * $unitlength, $unitlength, sprintf("%${unitlength}d", $qual));
        } elsif ($num == 5) {
            $qual += int( substr($I, $i * $unitlength, $unitlength) );
            substr($I, $i * $unitlength, $unitlength, sprintf("%${unitlength}d", $qual));
        }
    }
}

I ran this code and it consumes around 3.0 GB of memory...

I have also run this same code using hashes (%A, %C, ....) (8.0+ GB) and
with arrays (5.0+ GB).

Is there any other way to store the information using less memory?

Thanks
 

John W. Krahn

Vishal said:
Hello Guys,

Thanks for your advice and sorry for being so vague...

In simple words if I have this code...

my $unitlength = 3;
my $dnaLength = 100000000;

my $A = sprintf("%3d", 0) x $dnaLength;
my $C = sprintf("%3d", 0) x $dnaLength;
my $G = sprintf("%3d", 0) x $dnaLength;
my $T = sprintf("%3d", 0) x $dnaLength;
my $I = sprintf("%3d", 0) x $dnaLength;

Why not just:

my $A = '000' x $dnaLength;
my $C = '000' x $dnaLength;
my $G = '000' x $dnaLength;
my $T = '000' x $dnaLength;
my $I = '000' x $dnaLength;

Or even:

my $A = my $C = my $G = my $T = my $I = '000' x $dnaLength;

# Assign quality information of DNA
print "DNA Processing";
my ($num, $qual);
for (my $i = 0; $i < $dnaLength; $i++) {
    $num  = int(rand(5)) + 1;
    $qual = int(rand(99)) + 1;

    if ($num == 1) {
        # Base A at position $i with base quality $qual
        substr($A, $i * $unitlength, $unitlength, sprintf("%${unitlength}d", $qual));
    } elsif ($num == 2) {
        substr($C, $i * $unitlength, $unitlength, sprintf("%${unitlength}d", $qual));
    } elsif ($num == 3) {
        substr($G, $i * $unitlength, $unitlength, sprintf("%${unitlength}d", $qual));
    } elsif ($num == 4) {
        substr($T, $i * $unitlength, $unitlength, sprintf("%${unitlength}d", $qual));
    } elsif ($num == 5) {
        substr($I, $i * $unitlength, $unitlength, sprintf("%${unitlength}d", $qual));
    }
}

If you wanted, you *could* write that loop as:

for my $i ( 0 .. $dnaLength - 1 ) {
    substr ${ \( $A, $C, $G, $T, $I )[ rand 5 ] }, $i * $unitlength,
        $unitlength, sprintf '%*d', $unitlength, 1 + int rand 99;
}

print "Member Processing\n";
my ($start, $stop);
for (my $j = 0; $j < 50000; $j++) {
    # Start and Stop of member with respect to DNA
    $start = int(rand($dnaLength - 2000)) + 1;  # Member start with respect to DNA
    $stop  = $dnaLength;                        # Finish at end

    for (my $i = $start; $i <= $stop; $i++) {
        $num  = int(rand(5)) + 1;
        $qual = int(rand(99)) + 1;
        if ($num == 1) {
            $qual += int( substr($A, $i * $unitlength, $unitlength) );
            substr($A, $i * $unitlength, $unitlength, sprintf("%${unitlength}d", $qual));
        } elsif ($num == 2) {
            $qual += int( substr($C, $i * $unitlength, $unitlength) );
            substr($C, $i * $unitlength, $unitlength, sprintf("%${unitlength}d", $qual));
        } elsif ($num == 3) {
            $qual += int( substr($G, $i * $unitlength, $unitlength) );
            substr($G, $i * $unitlength, $unitlength, sprintf("%${unitlength}d", $qual));
        } elsif ($num == 4) {
            $qual += int( substr($T, $i * $unitlength, $unitlength) );
            substr($T, $i * $unitlength, $unitlength, sprintf("%${unitlength}d", $qual));
        } elsif ($num == 5) {
            $qual += int( substr($I, $i * $unitlength, $unitlength) );
            substr($I, $i * $unitlength, $unitlength, sprintf("%${unitlength}d", $qual));
        }
    }
}

I ran this code and it consumes around 3.0 GB of memory...

You are running out of memory because when you add the numbers together
they are sometimes longer than $unitlength which causes the strings to
expand.

$ perl -le'printf "%3d\n", 900 + 800'
1700

I have also ran this same code using Hash (%A, %C,....) (8.0+ GB) and
with Array (5.0+ GB)

Is there any other way to store the information using less memory.

If you want to keep the substrings at only $unitlength you could use
either modulus:

$ perl -le'printf "%3d\n", ( 900 + 800 ) % 1000'
700

Or a truncating sprintf format:

$ perl -le'printf "%3.3s\n", 900 + 800'
170



John
 

xhoster

John W. Krahn said:
Why not just:

my $A = '000' x $dnaLength;
my $C = '000' x $dnaLength;
my $G = '000' x $dnaLength;
my $T = '000' x $dnaLength;
my $I = '000' x $dnaLength;

Or even:

my $A = my $C = my $G = my $T = my $I = '000' x $dnaLength;

Or better yet:

my %h;
$h{$_}='000' x $dnaLength foreach qw/A C G T I/;

Or, since $num is numeric:

$h{$_}='000' x $dnaLength foreach 1..5;


This cuts the memory use almost in half, as each of the lexical instances
of '000' x $dnaLength takes up memory and doesn't seem to release it.


Replace the ugly switch statement with:

substr($h{$num}, $i * $unitlength, #....


Shouldn't it finish at its own end, $start+2000-1, not at the main sequence
end?


This too could be replaced by $h{$num} in the substr and getting rid of
the big if blocks.
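Putting those pieces together, one possible shape for the update loop (a sketch; add_qual is a made-up helper, and $unitlength is widened to 4 to leave headroom for sums):

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $unitlength = 4;                         # wider slots reduce overflow risk
my $dnaLength  = 100;                       # small for illustration

# One packed string per base number, built at run time so only one
# literal '0' x ... expression exists.
my %h;
$h{$_} = '0' x ($unitlength * $dnaLength) for 1 .. 5;

# The if/elsif ladder collapses to a single indexed update.
sub add_qual {
    my ($num, $pos, $qual) = @_;
    my $off = $pos * $unitlength;
    $qual += substr($h{$num}, $off, $unitlength);
    substr($h{$num}, $off, $unitlength, sprintf('%0*d', $unitlength, $qual));
}

add_qual(2, 10, 99);    # base C (num 2) at position 10, qual 99
add_qual(2, 10, 80);
print substr($h{2}, 10 * $unitlength, $unitlength), "\n";   # prints 0179
```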

....
You are running out of memory because when you add the numbers together
they are sometimes longer than $unitlength which causes the strings to
expand.

$ perl -le'printf "%3d\n", 900 + 800'
1700

This is truly a problem, but it is a correctness problem. In my hands
it leads to almost no size inflation. The way he stores data, the minimum
possible size would be 1.5e9 bytes, (5*3*1e8) and the way the x operator
works inflates that to 3e9 bytes if you have 5 literal instances of it.

I've shown how to cut it almost in half (but you will need to increase
$unitlength unless you want to get wrong answers or lose data, which will
cost you more space).
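Another way to widen the slots, not raised in the thread so far: vec() treats a string as a packed array of fixed-width binary integers, so a 16-bit slot costs 2 bytes per position (versus 3+ decimal characters) and holds sums up to 65535 before wrapping. A minimal sketch:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sketch: store each position's summed qual as a 16-bit integer with
# vec(), 2 bytes per position instead of 3 decimal characters.
my $dnaLength = 1000;                 # small for illustration
my $A = "\0" x (2 * $dnaLength);      # pre-extend so the string never grows

vec($A, 42, 16) += 37;                # add qual 37 at position 42
vec($A, 42, 16) += 20;
print vec($A, 42, 16), "\n";          # prints 57
```

The same pattern with 32-bit slots (vec($A, $i, 32), 4 bytes each) removes any realistic overflow risk and is still smaller than the hash version.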

But the real answer is not to store the entire set in RAM at all.

Xho

 

Ilya Zakharevich

[A complimentary Cc of this posting was sent to John W. Krahn]
my $A = my $C = my $G = my $T = my $I = '000' x $dnaLength;

For best results, use

my $I = '000';
$I x= $dnaLength;
my $A = my $C = my $G = my $T = $I;

(otherwise '000' x $dnaLength is computed at compile time, and remains
in the compiled tree).

And do not have anything "large" as a last statement of a subroutine -
unless you want it to be duplicated to create a return value of the
subroutine.

Hope this helps,
Ilya
 
