Out of memory in vec

Mumia W. · Aug 10, 2007

I have 512mb RAM and 800mb swap, Perl 5.8.4 (and 5.9.4) under i386 Linux.

I decided to try out Jie Huang's idea of transposing a large array of
"bioinfomatics" data. Since I don't know what bioinfomatics data is, I
just assumed that they were sequences of the amino acids A, C, G, and T.

And since Jie never gave a sample of his/her data, I created a program
to do so:

#!/usr/bin/perl
use strict;
use warnings;

my @bases = qw/A C T G/;
my $rows = shift() || 5;
my $cols = shift() || 10;

for my $n (1 .. ($rows * $cols)) {
my $base = $bases[rand @bases];
print $base;
print "\n" if (0 == ($n % $cols));
}
print "\n";

__END__

I then went to writing the transposition program. A gigabyte (1e6 rows ×
1000 columns) is a lot of memory, so I wanted to avoid forcing each item
to take an entire byte, and since A, C, T and G are only four distinct
values, an entire byte is not needed to encode them, so I opted to use
four bits for each item and the vec() function:

#!/usr/bin/perl
use strict;
use warnings;
use Fatal qw/open close/;
use constant WIDTH => 4;

my $infile = shift() || 'data/bases-small';
my $outfile = 'out';
my %bases = (A => 0, C => 1, G => 2, T => 3);
my %rbases = reverse %bases;

my $buffer = '';
open my $ifh, '<', $infile;
open my $ofh, '>', $outfile;

my $offset = 0;
my $maxset = 0;
my $reclen;
my $maxrow = -1;

while (<$ifh>) {
$reclen = length($_)-1 unless defined $reclen;
while (/([ACTG])/g) {
vec($buffer, $offset++, WIDTH) = $bases{$1};
}
$maxrow++;
}

$maxset = $offset;
system(ps => 'up', $$);

for my $col (0 .. $reclen-1) {
for my $row (0 .. $maxrow-1) {
my $base = vec($buffer, $col + ($row * $reclen), WIDTH);
print $ofh $rbases{$base};
}
print $ofh "\n";
}

print $ofh "\n";

close $ofh;
close $ifh;

__END__

This program works with a small file of 10MB, but it falls over with
"Out of memory!" for the one gigabyte file. I shouldn't be all too
surprised if vec() a gigabyte of data, but is there some documented
limit on vec?

I absolutely never had a chance to run out of swap while running this
program. It ran for about 20 minutes, consumed over 230MB of total
memory, then aborted.

Mumia W. · Aug 10, 2007

[...] I shouldn't be all too
surprised if vec() [can't handle] a gigabyte of data, but is there some documented
limit on vec?

That's what I meant to say.

anno4000 · Aug 10, 2007

[...]

I then went to writing the transposition program. A gigabyte (1e6 rows ×
1000 columns) is a lot of memory, so I wanted to avoid forcing each item
to take an entire byte, and since A, C, T and G are only four distinct
values, an entire byte is not needed to encode them, so I opted to use
four bits for each item and the vec() function:

Why four? You only need two bits to encode four bases.

[...]

This program works with a small file of 10MB, but it falls over with
"Out of memory!" for the one gigabyte file. I shouldn't be all too
surprised if vec() a gigabyte of data, but is there some documented
limit on vec?

No. Like any Perl string a vec can be as big as your memory allows.

Anno

Ted Zlatanov · Aug 10, 2007

On 10 Aug 2007 07:35:48 GMT (e-mail address removed)-berlin.de wrote:

a> Why four? You only need two bits to encode four bases.

"--" was also allowed as data besides a pair of letters, so you have

[ACGT][ACGT] = 4 bits (which is what Mumia means by "item", I think)
plus "--" as a value = 5 bits

Ted

Mumia W. · Aug 10, 2007

On 10 Aug 2007 07:35:48 GMT (e-mail address removed)-berlin.de wrote:

a> Why four? You only need two bits to encode four bases.

"--" was also allowed as data besides a pair of letters, so you have

[ACGT][ACGT] = 4 bits (which is what Mumia means by "item", I think)
plus "--" as a value = 5 bits

Ted

I started programming before Jie submitted any sample data, so I thought
of [ACGT] as an item. I would've liked to use three bits, but vec() does
not seem to allow that.

This program
----------------------
my $buffer = '';
vec($buffer, 0, 3) = 7;
----------------------
produces "Illegal number of bits in vec at ...", so I'm stuck with using
four bits.

Now that Jie has give us some data, I can find a way to get [ACGT][ACGT]
into a single four-bit group.

Mumia W. · Aug 10, 2007

On 10 Aug 2007 07:35:48 GMT (e-mail address removed)-berlin.de wrote:

a> Why four? You only need two bits to encode four bases.

"--" was also allowed as data besides a pair of letters, so you have

[ACGT][ACGT] = 4 bits (which is what Mumia means by "item", I think)
plus "--" as a value = 5 bits

Ted

Click to expand...

[...]

Now that Jie has give us some data, I can find a way to get [ACGT][ACGT]
into a single four-bit group.

Duh. Sorry Anno. Sorry Ted. Of course I only need two bits for [ACGT]

I guess it's obvious that I haven't needed to count in binary for a while
:-(

But the "_" characters do throw a monkey wrench into my plans.

I wrote another version of the program that used bytes. It took over 80
minutes to complete and consumed 930MB of memory--mostly swap.

However, I'll still try to get a vec() version working because keeping
the data out of the swap space will probably speed things up by 100 times.

One half of 930MB is 465MB. If I exit Gnome and don't use my PC for
browsing the web while the program is running, I'll probably have enough
RAM for the program to run without using any swap. But I've got to get
vec() working.

PS.
Thanks again Anno. I changed the program to use two bits, and it worked.

Mumia W. · Aug 11, 2007

"--" was also allowed as data besides a pair of letters, so you have

[ACGT][ACGT] = 4 bits (which is what Mumia means by "item", I think)
plus "--" as a value = 5 bits

Ted

Alright, I admit I cheated

Five bits were needed, so I used five bits. No, I didn't hack the vec()
function in the Perl core. I just used two buffers:

#!/usr/local/bin/perl5.9.4
use strict;
use warnings;
use Fatal qw/open close/;
use Data:

umper;

my $infile = shift() || 'BIG.txt';
my $outfile = 'out';

# A C G T _
my %bases = (
0,'AA',1,'AC',2,'AG',3,'AT',4,'A-',5,'CA',6,'CC',
7,'CG',8,'CT',9,'C-',10,'GA',11,'GC',12,'GG',13,'GT',
14,'G-',15,'TA',16,'TC',17,'TG',18,'TT',19,'T-',
20,'-A',21,'-C',22,'-G',23,'-T',24,'--',
);
my %rbases = reverse %bases;

my $buffer1 = '';
my $buffer2 = '';
my $maxcols = 0;
my $maxrows = 0;
my $pos = 0;

my (%inp, %out);
open $inp{hand}, '<', $infile;
open $out{hand}, '>', $outfile;

scalar readline $inp{hand};
my $startpos = tell($inp{hand});

while (readline $inp{hand}) {
my @f = split; shift @f;

bpstore($_, $pos++) for @f;
# print "@f\n";

$maxcols = @f if @f > $maxcols;
$maxrows++;
# last if $inp{lines}++ > 4;
}

select $out{hand};

for my $col (0 .. $maxcols-1) {
for my $row (0 .. $maxrows-1) {
# print "($row, $col) ";
my $npos = $col + ($row * $maxcols);
my $bp = $bases{bpfetch($npos)};
print "$bp ";
}
print "\n";
}

close $inp{hand};
close $out{hand};

system(ps => 'up', $$);
# system(cat => $outfile);

##############################

sub bpstore {
my ($bp, $pos) = @_;
$bp = $rbases{$bp};
vec($buffer1, $pos, 4) = 0xF & $bp;;
vec($buffer2, $pos, 1) = 1 & ($bp >> 4);
}

sub bpfetch {
my $pos = shift;
my $bp = vec($buffer1, $pos, 4);
$bp = $bp | (vec($buffer2, $pos, 1) << 4);
}

__END__

This works on Jie's submitted data, and it takes less than two minutes
to run and requires only 10MB of memory. I'm happy about that, but I
suspect it may encounter problems when dealing with over a gigabyte of
input data.

Ted Zlatanov · Aug 11, 2007

MW> Alright, I admit I cheated

MW> Five bits were needed, so I used five bits. No, I didn't hack the
MW> vec() function in the Perl core. I just used two buffers:

....

MW> This works on Jie's submitted data, and it takes less than two minutes
MW> to run and requires only 10MB of memory. I'm happy about that, but I
MW> suspect it may encounter problems when dealing with over a gigabyte of
MW> input data.

Cool solution. I thought about using a separate list to store the "--"
instances, so the storage would be 4 bits per pair plus a sorted list of
offsets (or hash of offsets, for faster lookups). I think your solution
is better, though.

Ted

Dr.Ruud · Aug 11, 2007

Mumia W. schreef:

I have 512mb RAM and 800mb swap,

Yeah, millibits won't get you far.
;-)

Dr.Ruud · Aug 11, 2007

Mumia W. schreef:

my %bases = (
0,'AA',1,'AC',2,'AG',3,'AT',4,'A-',5,'CA',6,'CC',
7,'CG',8,'CT',9,'C-',10,'GA',11,'GC',12,'GG',13,'GT',
14,'G-',15,'TA',16,'TC',17,'TG',18,'TT',19,'T-',
20,'-A',21,'-C',22,'-G',23,'-T',24,'--',
);
my %rbases = reverse %bases;

Maybe a base-5 approach will give somebody some ideas:

0 => 00 => --
1 => 01 => -A
2 => 02 => -C
3 => 03 => -G
4 => 04 => -T
5 => 10 => A-
6 => 11 => AA
7 => 12 => AC
8 => 13 => AG
9 => 14 => AT
A => 20 => B-
B => 21 => BA
C => 22 => BC
D => 23 => BG
E => 24 => BT
F => 30 => G-
G => 31 => GA
H => 32 => GC
I => 33 => GG
J => 34 => GT
K => 40 => T-
L => 41 => TA
M => 42 => TC
N => 43 => TG
O => 44 => TT

Maybe a combination of a sparse array, index being the result of
t/-ACGT/01234/, is faster than a hash lookup?

Mumia W. · Aug 11, 2007

Mumia W. schreef:

Yeah, millibits won't get you far.
;-)

Yeah, I keep forgetting.

vec() in lvalue sub	7	Aug 13, 2008
FAQ 4.53 How do I manipulate arrays of bits?	0	Feb 10, 2011
vec - a misunderstanding	1	Mar 18, 2006
RSA implementation issues in public key pem loader function	0	May 21, 2025
"Out of memory!" error with PerlIO::via	0	Oct 16, 2008
Out Of Memory! - error	3	Aug 18, 2006
Weird Behavior with Rays in C and OpenGL	4	Feb 12, 2024
Cannot figure out line of code, also not understanding error	9	Feb 20, 2014

Out of memory in vec

Mumia W.

Mumia W.

anno4000

Ted Zlatanov

Mumia W.

Mumia W.

Mumia W.

Ted Zlatanov

Dr.Ruud

Dr.Ruud

Mumia W.

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads