Handling Huge Data

Discussion in 'Perl Misc' started by Vishal G, Sep 30, 2008.

  1. Vishal G

    Vishal G Guest

    Hi Guys,

    I am trying to modify a bioinformatics package written in Perl. It
    was written to handle a DNA sequence about 500,000 bases long (a
    string containing 500,000 characters).

    I have to enhance it to handle DNA 100 million bases long.

    Each base in the DNA carries this information: base (A, C, G or T),
    qual (0-99) and position (1 to length).

    There is one main DNA sequence and, on average, 500,000 parts (each
    at most 2,000 characters long, with the same set of information).

    The program first creates an alignment like
    <code>

    *
    Main - .....ACCCTTTGTCTAGTCGTATCGTCGATCGTCGCTAGCTCTGCT....
    Part -
    GTCGTATCGTCGAACGTCGCTAGCTC
    Part - CTTTGTCTAGTCGTATCGTCGATCGTCGCT
    Part
    -
    TCGAACGTCGCTAGC
    </code>
    Now, let's say I have to go through each position and find how many
    variations are present at a certain position (with their original
    position and quality).

    Look at the * position: there is a T-A variation.

    Right now they are using hashes to capture this:

    %A, %C, %G, %T

    Loop For Main DNA {
        $T{$pos} = $qual;
        # this tells me that there is a T base at a certain position
        # with some qual
    }

    Update the qual by adding the qual of the parts:

    Loop For Parts {
        $A{$pos} += $qual;   # for A parts
        $T{$pos} += $qual;   # for T parts
    }

    But because the dataset is huge, it consumes a lot of memory.

    So basically I am trying to figure out a way to store this information
    without using much memory.

    If you don't understand the above problem, don't worry.

    Just tell me how to handle huge data which needs to be accessed
    frequently using the least possible memory.

    Thanks in advance
    Vishal G, Sep 30, 2008
    #1

  2. Vishal G

    Guest

    Vishal G <> wrote:
    > Hi Guys,
    >
    > I am trying to modify a bioinformatics package written in Perl. It
    > was written to handle a DNA sequence about 500,000 bases long (a
    > string containing 500,000 characters).
    >
    > I have to enhance it to handle DNA 100 million bases long.
    >
    > Each base in the DNA carries this information: base (A, C, G or T),
    > qual (0-99) and position (1 to length).
    >
    > There is one main DNA sequence and, on average, 500,000 parts (each
    > at most 2,000 characters long, with the same set of information).


    How is this data stored? Is it all in memory at once?

    >
    > The program first creates an alignment like
    > <code>
    >
    > *
    > Main - .....ACCCTTTGTCTAGTCGTATCGTCGATCGTCGCTAGCTCTGCT....
    > Part -
    > GTCGTATCGTCGAACGTCGCTAGCTC
    > Part - CTTTGTCTAGTCGTATCGTCGATCGTCGCT
    > Part
    > -
    > TCGAACGTCGCTAGC
    > </code>


    It looks like your alignment was line-wrapped into oblivion. Anyway,
    how was the alignment on such a large dataset done? Couldn't your quality
    summarization best be implemented by pushing it into the aligner code?


    > Now, let's say I have to go through each position and find how many
    > variations are present at a certain position (with their original
    > position and quality).
    >
    > Look at the * position: there is a T-A variation.
    >
    > Right now they are using hashes to capture this:
    >
    > %A, %C, %G, %T
    >
    > Loop For Main DNA {
    > $T{$pos} = $qual;
    > # this tells me that there is a T base at a certain position
    > # with some qual


    Since $pos is an integer and seems to be dense (every or almost every
    position from 0 up to the length-1 will be occupied), then you should
    consider using an array rather than a hash. That might save some memory.
    On the other hand, it might take more memory if most positions are
    unanimous, meaning that 3 of the 4 base-hashes would not have a value for
    any given position.
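
    A rough illustration of that trade-off, as a sketch only: keep the
    main sequence's qualities in a dense array and record disagreeing
    bases in a hash, which stays small if variants are rare. The toy data
    and the names @main_qual and %variant are made up for the example,
    and positions are assumed to be 0-based.

    ```perl
    use strict;
    use warnings;

    # Hypothetical toy data: main sequence and its per-base qualities.
    my @main_base = split //, 'ACCTTG';
    my @main_qual = (30, 40, 35, 20, 50, 45);    # dense: one slot per position

    # Variants are assumed to be rare, so they live in a hash keyed by position.
    my %variant;                                  # $variant{$pos}{$base} = summed qual

    # One aligned part: start offset on the main sequence, bases, qualities.
    my ($start, $bases, @qual) = (2, 'CAT', 10, 20, 30);

    for my $k (0 .. length($bases) - 1) {
        my $pos  = $start + $k;
        my $base = substr $bases, $k, 1;
        if ($base eq $main_base[$pos]) {
            $main_qual[$pos] += $qual[$k];        # agreement: bump the dense array
        }
        else {
            $variant{$pos}{$base} += $qual[$k];   # disagreement: sparse hash entry
        }
    }

    for my $pos (sort { $a <=> $b } keys %variant) {
        for my $base (sort keys %{ $variant{$pos} }) {
            printf "variant at %d: %s to %s, qual %d\n",
                $pos, $main_base[$pos], $base, $variant{$pos}{$base};
        }
    }
    ```

    Whether this beats four parallel hashes depends on how many positions
    actually disagree, which only the real data can tell.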

    Also, where is $qual coming from? Obviously it isn't a constant over the
    life of the loop, like you have it shown. Doesn't it have to draw from
    something in RAM to obtain its value?

    >
    > }
    >
    > Update the qual by adding the qual of parts
    >
    > Loop For Parts {
    > $A{$pos} += $qual # for A parts
    >
    > $T{$pos} += $qual # for T parts
    > }


    Is there another loop over $pos? If so, is it inside the Loop for parts
    or outside of it? Again, where does $qual come from?

    >
    > But because the dataset is huge, it consumes a lot of memory.
    >
    > So basically I am trying to figure out a way to store this information
    > without using much memory.


    You could "pack" the numbers into strings and manipulate them with
    "substr". I think there are even some Tie modules that do this for you, but
    the speed decrease might be substantial.
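
    For the packed-string idea, Perl's built-in vec treats a string as an
    array of fixed-width unsigned integers, so it can stand in for
    hand-rolled pack/substr code. A minimal sketch, assuming 32-bit
    counters are wide enough and using a tiny demo length; %sum and
    add_qual are names invented here.

    ```perl
    use strict;
    use warnings;

    my $len  = 1_000;   # demo size; the real case in this thread is 100 million
    my $bits = 32;      # counter width, an assumption for this sketch

    # One packed buffer per base; "\0" x N pre-sizes it to all-zero counters.
    my %sum = map { $_ => "\0" x ($len * $bits / 8) } qw(A C G T);

    # Add $qual to the counter for $base at 0-based position $pos.
    sub add_qual {
        my ($base, $pos, $qual) = @_;
        vec($sum{$base}, $pos, $bits) = vec($sum{$base}, $pos, $bits) + $qual;
        return;
    }

    add_qual('T', 10, 40);
    add_qual('T', 10, 35);
    add_qual('A', 10, 20);

    printf "T at 10: %d, A at 10: %d, bytes per base: %d\n",
        vec($sum{T}, 10, $bits), vec($sum{A}, 10, $bits), length $sum{T};
    ```

    Each position costs 4 bytes per base here, against tens of bytes for
    a Perl scalar in an array or hash.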

    What I would probably do is use Inline::C and have the data be accumulated
    in a C float or double array, rather than a perl structure.

    Or maybe you can address one $pos at a time, and output the results of that
    $pos to disk before moving on to the next one, rather than accumulating
    into memory.
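
    A sketch of that streaming approach, under the assumption that the
    parts are sorted by start position: accumulate one window of positions
    at a time, emit it, and throw the counters away. The toy @parts data,
    the window size and the @report array (standing in for an output file)
    are all made up for the example.

    ```perl
    use strict;
    use warnings;

    # Hypothetical parts: [start, bases, qualities], sorted by start.
    my @parts = (
        [0, 'ACG', [10, 10, 10]],
        [2, 'GTT', [20, 20, 20]],
        [6, 'AC',  [30, 30]],
    );
    my $total  = 8;    # length of the main sequence in this toy example
    my $window = 4;    # number of positions held in memory at once

    my @report;        # stands in for lines written to disk
    for (my $lo = 0; $lo < $total; $lo += $window) {
        my $hi = $lo + $window - 1;
        my %sum;                                   # counters for this window only
        for my $p (@parts) {
            my ($start, $bases, $qual) = @$p;
            my $end = $start + length($bases) - 1;
            next if $end < $lo;                    # part lies before the window
            last if $start > $hi;                  # sorted: nothing later overlaps
            for my $k (0 .. length($bases) - 1) {
                my $pos = $start + $k;
                next if $pos < $lo || $pos > $hi;
                $sum{$pos}{ substr $bases, $k, 1 } += $qual->[$k];
            }
        }
        for my $pos (sort { $a <=> $b } keys %sum) {
            push @report, join "\t", $pos,
                map { sprintf '%s=%d', $_, $sum{$pos}{$_} } sort keys %{ $sum{$pos} };
        }
    }
    print "$_\n" for @report;
    ```

    This toy version rescans the part list for every window; real code
    would remember where the previous window left off.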

    >
    > If you don't understand the above problem, don't worry.
    >
    > Just tell me how to handle huge data which needs to be accessed
    > frequently using the least possible memory.


    "Don't worry about what disease I actually have, doc, just give me the
    cure." I'm afraid that isn't likely to work well. The details of the
    solution are likely to depend on the details of the problem.

    Xho

    --
    -------------------- http://NewsReader.Com/ --------------------
    The costs of publication of this article were defrayed in part by the
    payment of page charges. This article must therefore be hereby marked
    advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate
    this fact.
    , Sep 30, 2008
    #2

  3. Vishal G <> wrote:
    *SKIP*
    > If you don't understand the above problem, don't worry.


    You first...

    > Just tell me how to handle huge data which needs to be accessed
    > frequently using the least possible memory.


    Free your mind of slurping (quite impossible if you come from a world
    where cycles are cheap, memory is cheap, disks are cheap etc.). Then
    use C<use DBI> (I prefer B<DBD::SQLite>, it's fscking fast).

    p.s. And a piece of advice. If you're not going to show your code that
    "clearly exhibits your problem" -- don't wait for help here.

    --
    Torvalds' goal for Linux is very simple: World Domination
    Eric Pozharski, Oct 1, 2008
    #3
  4. Vishal G

    Vishal G Guest

    Hello Guys,

    Thanks for your advice and sorry for being so vague...

    In simple words if I have this code...

    my $unitlength = 3;
    my $dnaLength = 100000000;

    my $A = sprintf("%3d", 0) x $dnaLength;
    my $C = sprintf("%3d", 0) x $dnaLength;
    my $G = sprintf("%3d", 0) x $dnaLength;
    my $T = sprintf("%3d", 0) x $dnaLength;
    my $I = sprintf("%3d", 0) x $dnaLength;

    # Assign quality information of DNA
    print "DNA Processing";
    my ($num, $qual);
    for (my $i = 0; $i < $dnaLength; $i++) {
        $num = int(rand(5)) + 1;
        $qual = int(rand(99)) + 1;

        if ($num == 1) {
            # Base A at position $i with base quality $qual
            substr($A, $i * $unitlength, $unitlength,
                sprintf("%${unitlength}d", $qual));
        } elsif ($num == 2) {
            substr($C, $i * $unitlength, $unitlength,
                sprintf("%${unitlength}d", $qual));
        } elsif ($num == 3) {
            substr($G, $i * $unitlength, $unitlength,
                sprintf("%${unitlength}d", $qual));
        } elsif ($num == 4) {
            substr($T, $i * $unitlength, $unitlength,
                sprintf("%${unitlength}d", $qual));
        } elsif ($num == 5) {
            substr($I, $i * $unitlength, $unitlength,
                sprintf("%${unitlength}d", $qual));
        } else {
        }
    }

    print "Member Processing\n";
    my ($start, $stop);
    for (my $j = 0; $j < 50000; $j++) {
        # Start and stop of member with respect to DNA
        $start = int(rand($dnaLength - 2000)) + 1; # Member start with respect to DNA
        $stop = $dnaLength; # Finish at end

        for (my $i = $start; $i <= $stop; $i++) {
            $num = int(rand(5)) + 1;
            $qual = int(rand(99)) + 1;
            if ($num == 1) {
                $qual = $qual + int( substr($A, $i * $unitlength, $unitlength) );
                substr($A, $i * $unitlength, $unitlength,
                    sprintf("%${unitlength}d", $qual));
            } elsif ($num == 2) {
                $qual = $qual + int( substr($C, $i * $unitlength, $unitlength) );
                substr($C, $i * $unitlength, $unitlength,
                    sprintf("%${unitlength}d", $qual));
            } elsif ($num == 3) {
                $qual = $qual + int( substr($G, $i * $unitlength, $unitlength) );
                substr($G, $i * $unitlength, $unitlength,
                    sprintf("%${unitlength}d", $qual));
            } elsif ($num == 4) {
                $qual = $qual + int( substr($T, $i * $unitlength, $unitlength) );
                substr($T, $i * $unitlength, $unitlength,
                    sprintf("%${unitlength}d", $qual));
            } elsif ($num == 5) {
                $qual = $qual + int( substr($I, $i * $unitlength, $unitlength) );
                substr($I, $i * $unitlength, $unitlength,
                    sprintf("%${unitlength}d", $qual));
            } else {
            }
        }
    }

    I ran this code and it consumes around 3.0 GB of memory.

    I have also run this same code using hashes (%A, %C, ...) (8.0+ GB)
    and with arrays (5.0+ GB).

    Is there any other way to store the information using less memory?

    Thanks
    Vishal G, Oct 2, 2008
    #4
  5. Vishal G wrote:
    > Hello Guys,
    >
    > Thanks for your advice and sorry for being so vague...
    >
    > In simple words if I have this code...
    >
    > my $unitlength = 3;
    > my $dnaLength = 100000000;
    >
    > my $A = sprintf("%3d", 0) x $dnaLength;
    > my $C = sprintf("%3d", 0) x $dnaLength;
    > my $G = sprintf("%3d", 0) x $dnaLength;
    > my $T = sprintf("%3d", 0) x $dnaLength;
    > my $I = sprintf("%3d", 0) x $dnaLength;


    Why not just:

    my $A = '000' x $dnaLength;
    my $C = '000' x $dnaLength;
    my $G = '000' x $dnaLength;
    my $T = '000' x $dnaLength;
    my $I = '000' x $dnaLength;

    Or even:

    my $A = my $C = my $G = my $T = my $I = '000' x $dnaLength;


    > # Assign quality information of DNA
    > print "DNA Processing";
    > my ($num, $qual);
    > for (my $i = 0; $i < $dnaLength; $i++) {
    > $num = int(rand(5)) + 1;
    > $qual = int(rand(99)) + 1;
    >
    > if ($num == 1) {
    > # Base A at position $i with base quality $qual
    > substr($A, $i * $unitlength, $unitlength, sprintf("%$
    > {unitlength}d", $qual));
    > } elsif ($num == 2) {
    > substr($C, $i * $unitlength, $unitlength, sprintf("%$
    > {unitlength}d", $qual));
    > } elsif ($num == 3) {
    > substr($G, $i * $unitlength, $unitlength, sprintf("%$
    > {unitlength}d", $qual));
    > } elsif ($num == 4) {
    > substr($T, $i * $unitlength, $unitlength, sprintf("%$
    > {unitlength}d", $qual));
    > } elsif ($num == 5) {
    > substr($I, $i * $unitlength, $unitlength, sprintf("%$
    > {unitlength}d", $qual));
    > } else {
    > }
    > }


    If you wanted, you *could* write that loop as:

    for my $i ( 0 .. $dnaLength - 1 ) {
        substr ${\( $A, $C, $G, $T, $I )[ rand 5 ]}, $i * $unitlength,
            $unitlength, sprintf '%*d', $unitlength, 1 + int rand 99;
    }


    > print "Member Processing\n";
    > my ($start, $stop);
    > for (my $j = 0; $j < 50000; $j++) {
    > # Start and Stop of memeber with respect to DNA
    > $start = int(rand($dnaLength - 2000)) + 1; # Member start with
    > respect to DNA
    > $stop = $dnaLength; # Finish at end
    >
    > for (my $i = $start; $i <= $stop; $i++) {
    > $num = int(rand(5)) + 1;
    > $qual = int(rand(99)) + 1;
    > if ($num == 1) {
    > $qual = $qual + int( substr($A, $i * $unitlength,
    > $unitlength) );
    > substr($A, $i * $unitlength, $unitlength, sprintf("%$
    > {unitlength}d", $qual));
    > } elsif ($num == 2) {
    > $qual = $qual + int( substr($C, $i * $unitlength,
    > $unitlength) );
    > substr($C, $i * $unitlength, $unitlength, sprintf("%$
    > {unitlength}d", $qual));
    > } elsif ($num == 3) {
    > $qual = $qual + int( substr($G, $i * $unitlength,
    > $unitlength) );
    > substr($G, $i * $unitlength, $unitlength, sprintf("%$
    > {unitlength}d", $qual));
    > } elsif ($num == 4) {
    > $qual = $qual + int( substr($T, $i * $unitlength,
    > $unitlength) );
    > substr($T, $i * $unitlength, $unitlength, sprintf("%$
    > {unitlength}d", $qual));
    > } elsif ($num == 5) {
    > $qual = $qual + int( substr($I, $i * $unitlength,
    > $unitlength) );
    > substr($I, $i * $unitlength, $unitlength, sprintf("%$
    > {unitlength}d", $qual));
    > } else {
    > }
    > }
    > }
    >
    > I ran this code and it consumes around 3.0 GB of memory...


    You are running out of memory because when you add the numbers together
    they are sometimes longer than $unitlength which causes the strings to
    expand.

    $ perl -le'printf "%3d\n", 900 + 800'
    1700
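
    A small sketch of what such an overflow does to the fixed-width
    layout, assuming the same 3-character slots as the posted code (the
    4-slot buffer is made up for the demo):

    ```perl
    use strict;
    use warnings;

    my $width = 3;
    my $buf   = sprintf('%3d', 0) x 4;    # four 3-character slots, 12 characters

    # Slot 1 receives a sum that needs four digits.
    substr($buf, 1 * $width, $width, sprintf("%${width}d", 900 + 800));
    # Slot 2 then receives an ordinary two-digit value.
    substr($buf, 2 * $width, $width, sprintf("%${width}d", 42));

    printf "length %d, slot 1 reads back as [%s]\n",
        length $buf, substr($buf, 1 * $width, $width);
    ```

    The buffer grows to 13 characters and slot 1 reads back as 170, so
    the stored sum is lost and every later slot is shifted by one
    character.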


    > I have also ran this same code using Hash (%A, %C,....) (8.0+ GB) and
    > with Array (5.0+ GB)
    >
    > Is there any other way to store the information using less memory.


    If you want to keep the substrings at only $unitlength you could use
    either modulus:

    $ perl -le'printf "%3d\n", ( 900 + 800 ) % 1000'
    700

    Or a truncating sprintf format:

    $ perl -le'printf "%3.3s\n", 900 + 800'
    170



    John
    --
    Perl isn't a toolbox, but a small machine shop where you
    can special-order certain sorts of tools at low cost and
    in short order. -- Larry Wall
    John W. Krahn, Oct 2, 2008
    #5
  6. Vishal G

    Guest

    "John W. Krahn" <> wrote:
    > Vishal G wrote:
    > > Hello Guys,
    > >
    > > Thanks for your advice and sorry for being so vague...
    > >
    > > In simple words if I have this code...
    > >
    > > my $unitlength = 3;
    > > my $dnaLength = 100000000;
    > >
    > > my $A = sprintf("%3d", 0) x $dnaLength;
    > > my $C = sprintf("%3d", 0) x $dnaLength;
    > > my $G = sprintf("%3d", 0) x $dnaLength;
    > > my $T = sprintf("%3d", 0) x $dnaLength;
    > > my $I = sprintf("%3d", 0) x $dnaLength;

    >
    > Why not just:
    >
    > my $A = '000' x $dnaLength;
    > my $C = '000' x $dnaLength;
    > my $G = '000' x $dnaLength;
    > my $T = '000' x $dnaLength;
    > my $I = '000' x $dnaLength;
    >
    > Or even:
    >
    > my $A = my $C = my $G = my $T = my $I = '000' x $dnaLength;


    Or better yet:

    my %h;
    $h{$_}='000' x $dnaLength foreach qw/A C G T I/;

    Or, because $num is numbers:

    $h{$_}='000' x $dnaLength foreach 1..5;


    This cuts the memory use almost in half, as each literal instance of
    '000' x $dnaLength in the source takes up memory and doesn't seem to
    release it.


    > > # Assign quality information of DNA
    > > print "DNA Processing";
    > > my ($num, $qual);
    > > for (my $i = 0; $i < $dnaLength; $i++) {
    > > $num = int(rand(5)) + 1;
    > > $qual = int(rand(99)) + 1;
    > >
    > > if ($num == 1) {
    > > # Base A at position $i with base quality $qual
    > > substr($A, $i * $unitlength, $unitlength, sprintf("%$
    > > {unitlength}d", $qual));


    Replace the ugly switch statement with:

    substr($h{$num}, $i * $unitlength, #....


    > > print "Member Processing\n";
    > > my ($start, $stop);
    > > for (my $j = 0; $j < 50000; $j++) {
    > > # Start and Stop of memeber with respect to DNA
    > > $start = int(rand($dnaLength - 2000)) + 1; # Member start with
    > > respect to DNA
    > > $stop = $dnaLength; # Finish at end


    Shouldn't it finish at its own end, $start+2000-1, not at the main sequence
    end?


    > > if ($num == 1) {
    > > $qual = $qual + int( substr($A, $i * $unitlength,
    > > $unitlength) );


    This too could be replaced by $h{$num} in the substr and getting rid of
    the big if blocks.
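
    Put together, the read and the write can collapse into one small
    routine. This is only a sketch at a tiny demo length; add_qual is a
    name invented here, and it deliberately keeps the 3-character slots
    from the posted code, so it still overflows once a sum passes 999.

    ```perl
    use strict;
    use warnings;

    my $unitlength = 3;
    my $dnaLength  = 20;                       # tiny demo size, not 100 million

    my %h;
    $h{$_} = '000' x $dnaLength for 1 .. 5;    # 1..5 stand for A, C, G, T, I

    # Add $qual to the slot for base number $num at position $i.
    sub add_qual {
        my ($num, $i, $qual) = @_;
        my $off = $i * $unitlength;
        my $sum = $qual + substr($h{$num}, $off, $unitlength);
        substr($h{$num}, $off, $unitlength) = sprintf '%0*d', $unitlength, $sum;
        return;
    }

    add_qual(1, 4, 30);
    add_qual(1, 4, 25);
    add_qual(4, 4, 12);
    print substr($h{1}, 12, 3), ' ', substr($h{4}, 12, 3), "\n";
    ```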

    ....
    > >
    > > I ran this code and it consumes around 3.0 GB of memory...

    >
    > You are running out of memory because when you add the numbers together
    > they are sometimes longer than $unitlength which causes the strings to
    > expand.
    >
    > $ perl -le'printf "%3d\n", 900 + 800'
    > 1700


    This is truly a problem, but it is a correctness problem. In my hands
    it leads to almost no size inflation. The way he stores data, the minimum
    possible size would be 1.5e9 bytes, (5*3*1e8) and the way the x operator
    works inflates that to 3e9 bytes if you have 5 literal instances of it.

    > >
    > > Is there any other way to store the information using less memory.


    I've shown how to cut it almost in half (but you will need to increase
    $unitlength unless you want to get wrong answers or lose data, which will
    cost you more space.)
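
    To get a feel for how much more space, here is the arithmetic as a
    sketch. The coverage depth of 1,000 parts over a single position is
    purely an assumption for illustration; the thread never gives the
    real figure.

    ```perl
    use strict;
    use warnings;

    my $positions = 100_000_000;   # bases in the main sequence (from the thread)
    my $buffers   = 5;             # A, C, G, T, I
    my $max_qual  = 99;            # largest single quality value
    my $depth     = 1_000;         # ASSUMED worst-case parts covering one position

    # Largest sum one slot may have to hold: the main base plus $depth parts.
    my $max_sum = $max_qual * ($depth + 1);
    my $digits  = length $max_sum;                       # characters per text slot

    my $text_bytes  = $buffers * $digits * $positions;   # fixed-width text slots
    my $bin32_bytes = $buffers * 4 * $positions;         # 32-bit binary slots

    printf "max sum %d needs %d digits\n", $max_sum, $digits;
    printf "text slots: %.1f GB, 32-bit slots: %.1f GB\n",
        $text_bytes / 1e9, $bin32_bytes / 1e9;
    ```

    Under that assumption the text layout grows from 1.5 GB to 2.5 GB,
    while packed 32-bit counters would need 2.0 GB.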

    But the real answer is not to store the entire set in RAM at all.

    Xho

    --
    -------------------- http://NewsReader.Com/ --------------------
    The costs of publication of this article were defrayed in part by the
    payment of page charges. This article must therefore be hereby marked
    advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate
    this fact.
    , Oct 2, 2008
    #6
  7. Vishal G

    J. Gleixner Guest

    Vishal G wrote:
    > Hi Guys,
    >
    > I am trying to modify a bioinformatics package written in Perl. It
    > was written to handle a DNA sequence about 500,000 bases long (a
    > string containing 500,000 characters).

    [...]

    If you haven't read it yet, this might be useful:

    http://www.perl.com/pub/a/2003/09/10/bioinformatics.html
    J. Gleixner, Oct 2, 2008
    #7
  8. [A complimentary Cc of this posting was sent to
    John W. Krahn
    <>], who wrote in article <_pZEk.3778$>:

    > my $A = my $C = my $G = my $T = my $I = '000' x $dnaLength;


    For best results, use

    my $I = '000';
    $I x= $dnaLength;
    my $A = my $C = my $G = my $T = $I;

    (otherwise '000' x $dnaLength is computed at compile time, and remains
    in the compiled tree).

    And do not have anything "large" as a last statement of a subroutine -
    unless you want it to be duplicated to create a return value of the
    subroutine.
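
    One way to follow that advice, sketched with a hypothetical
    init_buffer helper: fill the caller's scalar through a reference and
    finish with a bare return, so the big string is never the value of
    the subroutine's last statement.

    ```perl
    use strict;
    use warnings;

    # Hypothetical helper: builds the buffer in the caller's own scalar.
    sub init_buffer {
        my ($buf_ref, $count) = @_;
        $$buf_ref = '000';
        $$buf_ref x= $count;   # repetition happens at run time, in place
        return;                # bare return: nothing large is handed back
    }

    my $dnaLength = 1_000;     # demo size only
    my $A;
    init_buffer(\$A, $dnaLength);
    print length($A), "\n";
    ```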

    Hope this helps,
    Ilya
    Ilya Zakharevich, Oct 3, 2008
    #8
