Out of memory in vec

M

Mumia W.

I have 512mb RAM and 800mb swap, Perl 5.8.4 (and 5.9.4) under i386 Linux.

I decided to try out Jie Huang's idea of transposing a large array of
"bioinfomatics" data. Since I don't know what bioinfomatics data is, I
just assumed that they were sequences of the amino acids A, C, G, and T.

And since Jie never gave a sample of his/her data, I created a program
to do so:

#!/usr/bin/perl
use strict;
use warnings;

my @bases = qw/A C T G/;
my $rows = shift() || 5;
my $cols = shift() || 10;

for my $n (1 .. ($rows * $cols)) {
my $base = $bases[rand @bases];
print $base;
print "\n" if (0 == ($n % $cols));
}
print "\n";

__END__

I then went to writing the transposition program. A gigabyte (1e6 rows ×
1000 columns) is a lot of memory, so I wanted to avoid forcing each item
to take an entire byte, and since A, C, T and G are only four distinct
values, an entire byte is not needed to encode them, so I opted to use
four bits for each item and the vec() function:

#!/usr/bin/perl
use strict;
use warnings;
use Fatal qw/open close/;
use constant WIDTH => 4;

my $infile = shift() || 'data/bases-small';
my $outfile = 'out';
my %bases = (A => 0, C => 1, G => 2, T => 3);
my %rbases = reverse %bases;

my $buffer = '';
open my $ifh, '<', $infile;
open my $ofh, '>', $outfile;

my $offset = 0;
my $maxset = 0;
my $reclen;
my $maxrow = -1;

while (<$ifh>) {
$reclen = length($_)-1 unless defined $reclen;
while (/([ACTG])/g) {
vec($buffer, $offset++, WIDTH) = $bases{$1};
}
$maxrow++;
}

$maxset = $offset;
system(ps => 'up', $$);

for my $col (0 .. $reclen-1) {
for my $row (0 .. $maxrow-1) {
my $base = vec($buffer, $col + ($row * $reclen), WIDTH);
print $ofh $rbases{$base};
}
print $ofh "\n";
}

print $ofh "\n";


close $ofh;
close $ifh;

__END__

This program works with a small file of 10MB, but it falls over with
"Out of memory!" for the one gigabyte file. I shouldn't be all too
surprised if vec() a gigabyte of data, but is there some documented
limit on vec?

I absolutely never had a chance to run out of swap while running this
program. It ran for about 20 minutes, consumed over 230MB of total
memory, then aborted.
 
M

Mumia W.

[...] I shouldn't be all too
surprised if vec() [can't handle] a gigabyte of data, but is there some documented
limit on vec?

That's what I meant to say.
 
A

anno4000

[...]
I then went to writing the transposition program. A gigabyte (1e6 rows ×
1000 columns) is a lot of memory, so I wanted to avoid forcing each item
to take an entire byte, and since A, C, T and G are only four distinct
values, an entire byte is not needed to encode them, so I opted to use
four bits for each item and the vec() function:

Why four? You only need two bits to encode four bases.

[...]
This program works with a small file of 10MB, but it falls over with
"Out of memory!" for the one gigabyte file. I shouldn't be all too
surprised if vec() a gigabyte of data, but is there some documented
limit on vec?

No. Like any Perl string a vec can be as big as your memory allows.

Anno
 
T

Ted Zlatanov

On 10 Aug 2007 07:35:48 GMT (e-mail address removed)-berlin.de wrote:


a> Why four? You only need two bits to encode four bases.

"--" was also allowed as data besides a pair of letters, so you have

[ACGT][ACGT] = 4 bits (which is what Mumia means by "item", I think)
plus "--" as a value = 5 bits

Ted
 
M

Mumia W.

On 10 Aug 2007 07:35:48 GMT (e-mail address removed)-berlin.de wrote:


a> Why four? You only need two bits to encode four bases.

"--" was also allowed as data besides a pair of letters, so you have

[ACGT][ACGT] = 4 bits (which is what Mumia means by "item", I think)
plus "--" as a value = 5 bits

Ted

I started programming before Jie submitted any sample data, so I thought
of [ACGT] as an item. I would've liked to use three bits, but vec() does
not seem to allow that.

This program
----------------------
my $buffer = '';
vec($buffer, 0, 3) = 7;
----------------------
produces "Illegal number of bits in vec at ...", so I'm stuck with using
four bits.

Now that Jie has give us some data, I can find a way to get [ACGT][ACGT]
into a single four-bit group.
 
M

Mumia W.

On 10 Aug 2007 07:35:48 GMT (e-mail address removed)-berlin.de wrote:

a> Why four? You only need two bits to encode four bases.

"--" was also allowed as data besides a pair of letters, so you have

[ACGT][ACGT] = 4 bits (which is what Mumia means by "item", I think)
plus "--" as a value = 5 bits

Ted
[...]

Now that Jie has give us some data, I can find a way to get [ACGT][ACGT]
into a single four-bit group.

Duh. Sorry Anno. Sorry Ted. Of course I only need two bits for [ACGT]

I guess it's obvious that I haven't needed to count in binary for a while
:-(

But the "_" characters do throw a monkey wrench into my plans.

I wrote another version of the program that used bytes. It took over 80
minutes to complete and consumed 930MB of memory--mostly swap.

However, I'll still try to get a vec() version working because keeping
the data out of the swap space will probably speed things up by 100 times.

One half of 930MB is 465MB. If I exit Gnome and don't use my PC for
browsing the web while the program is running, I'll probably have enough
RAM for the program to run without using any swap. But I've got to get
vec() working.

PS.
Thanks again Anno. I changed the program to use two bits, and it worked.
 
M

Mumia W.

"--" was also allowed as data besides a pair of letters, so you have

[ACGT][ACGT] = 4 bits (which is what Mumia means by "item", I think)
plus "--" as a value = 5 bits

Ted

Alright, I admit I cheated :)

Five bits were needed, so I used five bits. No, I didn't hack the vec()
function in the Perl core. I just used two buffers:

#!/usr/local/bin/perl5.9.4
use strict;
use warnings;
use Fatal qw/open close/;
use Data::Dumper;

my $infile = shift() || 'BIG.txt';
my $outfile = 'out';

# A C G T _
my %bases = (
0,'AA',1,'AC',2,'AG',3,'AT',4,'A-',5,'CA',6,'CC',
7,'CG',8,'CT',9,'C-',10,'GA',11,'GC',12,'GG',13,'GT',
14,'G-',15,'TA',16,'TC',17,'TG',18,'TT',19,'T-',
20,'-A',21,'-C',22,'-G',23,'-T',24,'--',
);
my %rbases = reverse %bases;

my $buffer1 = '';
my $buffer2 = '';
my $maxcols = 0;
my $maxrows = 0;
my $pos = 0;

my (%inp, %out);
open $inp{hand}, '<', $infile;
open $out{hand}, '>', $outfile;

scalar readline $inp{hand};
my $startpos = tell($inp{hand});

while (readline $inp{hand}) {
my @f = split; shift @f;

bpstore($_, $pos++) for @f;
# print "@f\n";

$maxcols = @f if @f > $maxcols;
$maxrows++;
# last if $inp{lines}++ > 4;
}

select $out{hand};

for my $col (0 .. $maxcols-1) {
for my $row (0 .. $maxrows-1) {
# print "($row, $col) ";
my $npos = $col + ($row * $maxcols);
my $bp = $bases{bpfetch($npos)};
print "$bp ";
}
print "\n";
}

close $inp{hand};
close $out{hand};

system(ps => 'up', $$);
# system(cat => $outfile);

##############################

sub bpstore {
my ($bp, $pos) = @_;
$bp = $rbases{$bp};
vec($buffer1, $pos, 4) = 0xF & $bp;;
vec($buffer2, $pos, 1) = 1 & ($bp >> 4);
}

sub bpfetch {
my $pos = shift;
my $bp = vec($buffer1, $pos, 4);
$bp = $bp | (vec($buffer2, $pos, 1) << 4);
}

__END__

This works on Jie's submitted data, and it takes less than two minutes
to run and requires only 10MB of memory. I'm happy about that, but I
suspect it may encounter problems when dealing with over a gigabyte of
input data.
 
T

Ted Zlatanov

MW> Alright, I admit I cheated :)

MW> Five bits were needed, so I used five bits. No, I didn't hack the
MW> vec() function in the Perl core. I just used two buffers:

....

MW> This works on Jie's submitted data, and it takes less than two minutes
MW> to run and requires only 10MB of memory. I'm happy about that, but I
MW> suspect it may encounter problems when dealing with over a gigabyte of
MW> input data.

Cool solution. I thought about using a separate list to store the "--"
instances, so the storage would be 4 bits per pair plus a sorted list of
offsets (or hash of offsets, for faster lookups). I think your solution
is better, though.

Ted
 
D

Dr.Ruud

Mumia W. schreef:
my %bases = (
0,'AA',1,'AC',2,'AG',3,'AT',4,'A-',5,'CA',6,'CC',
7,'CG',8,'CT',9,'C-',10,'GA',11,'GC',12,'GG',13,'GT',
14,'G-',15,'TA',16,'TC',17,'TG',18,'TT',19,'T-',
20,'-A',21,'-C',22,'-G',23,'-T',24,'--',
);
my %rbases = reverse %bases;

Maybe a base-5 approach will give somebody some ideas:

0 => 00 => --
1 => 01 => -A
2 => 02 => -C
3 => 03 => -G
4 => 04 => -T
5 => 10 => A-
6 => 11 => AA
7 => 12 => AC
8 => 13 => AG
9 => 14 => AT
A => 20 => B-
B => 21 => BA
C => 22 => BC
D => 23 => BG
E => 24 => BT
F => 30 => G-
G => 31 => GA
H => 32 => GC
I => 33 => GG
J => 34 => GT
K => 40 => T-
L => 41 => TA
M => 42 => TC
N => 43 => TG
O => 44 => TT

Maybe a combination of a sparse array, index being the result of
t/-ACGT/01234/, is faster than a hash lookup?
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,770
Messages
2,569,583
Members
45,075
Latest member
MakersCBDBloodSupport

Latest Threads

Top