Mumia W.
I have 512 MB of RAM and 800 MB of swap, running Perl 5.8.4 (and 5.9.4) under i386 Linux.
I decided to try out Jie Huang's idea of transposing a large array of
"bioinformatics" data. Since I don't know what bioinformatics data looks
like, I just assumed it was sequences of the bases A, C, G, and T. And
since Jie never gave a sample of his/her data, I wrote a program to
generate some:
#!/usr/bin/perl
# Generate a random grid of bases: $rows lines of $cols bases each.
use strict;
use warnings;

my @bases = qw/A C T G/;
my $rows  = shift() || 5;
my $cols  = shift() || 10;

for my $n (1 .. $rows * $cols) {
    print $bases[ rand @bases ];
    print "\n" if 0 == $n % $cols;    # newline at the end of each row
}
__END__
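Assuming the generator is saved as make-bases.pl (that name and
data/bases-big are mine; only data/bases-small appears in the
transposition program below), the test files can be produced like this:

perl make-bases.pl > data/bases-small            # default 5 x 10 grid
perl make-bases.pl 1000000 1000 > data/bases-big # roughly 1 GB of bases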
I then set about writing the transposition program. A gigabyte (1e6 rows ×
1000 columns) is a lot of memory, so I wanted to avoid spending an entire
byte per item. Since A, C, T and G are only four distinct values, a full
byte isn't needed to encode them, so I opted for four bits per item via
the vec() function:
#!/usr/bin/perl
# Transpose a grid of bases, packing the whole input into a bit
# vector so each base takes four bits instead of a full byte.
use strict;
use warnings;
use Fatal qw/open close/;    # die on failed open/close

use constant WIDTH => 4;     # bits per base in the packed buffer

my $infile  = shift() || 'data/bases-small';
my $outfile = 'out';

my %bases  = (A => 0, C => 1, G => 2, T => 3);
my %rbases = reverse %bases;    # digit back to base letter

my $buffer = '';
open my $ifh, '<', $infile;
open my $ofh, '>', $outfile;

my $offset = 0;     # next free cell in the packed buffer
my $reclen;         # bases per row, taken from the first line
my $maxrow = -1;    # index of the last row read

while (<$ifh>) {
    next unless /[ACTG]/;    # ignore blank lines
    $reclen = length($_) - 1 unless defined $reclen;
    while (/([ACTG])/g) {
        vec($buffer, $offset++, WIDTH) = $bases{$1};
    }
    $maxrow++;
}

system(ps => 'up', $$);    # report this process's memory usage

# Read the buffer column-first to write out the transposed grid.
for my $col (0 .. $reclen - 1) {
    for my $row (0 .. $maxrow) {
        print $ofh $rbases{ vec($buffer, $col + $row * $reclen, WIDTH) };
    }
    print $ofh "\n";
}

close $ofh;
close $ifh;
__END__
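In hindsight, two bits per base would have been enough for four distinct
values, halving the buffer. Here is a minimal sketch of the packing side,
reusing the names from the program above (vec() takes widths of 1, 2, 4,
8, 16 or 32 bits, so WIDTH => 2 is legal):

#!/usr/bin/perl
# Sketch: 2-bit cells instead of 4-bit ones.  A 1e9-base grid would
# then pack into about 250 MB instead of 500 MB.
use strict;
use warnings;
use constant WIDTH => 2;

my %bases  = (A => 0, C => 1, G => 2, T => 3);
my %rbases = reverse %bases;

my $buffer = '';
my $offset = 0;
vec($buffer, $offset++, WIDTH) = $bases{$_} for qw/A C T G/;
print $rbases{ vec($buffer, $_, WIDTH) } for 0 .. 3;    # prints ACTG
print "\n";
__END__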
This program works with a small file of 10 MB, but it falls over with
"Out of memory!" for the one-gigabyte file. I shouldn't be all too
surprised if vec() can't handle a gigabyte of data, but is there some
documented limit on vec()?
I never came close to running out of swap while running this program:
it ran for about 20 minutes, consumed over 230 MB of total memory, then
aborted.
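One thing I would try next (my own guess, not something tested above):
filling $buffer one cell at a time makes Perl grow the string
incrementally, and each reallocation can briefly need the old copy
alongside the new one. Touching the last cell first sizes the string in
a single step:

#!/usr/bin/perl
# Sketch: preallocate the packed buffer before filling it, so the
# string is extended once rather than on the fly as $offset grows.
use strict;
use warnings;
use constant WIDTH => 4;

my $cells  = 1_000_000 * 1_000;         # rows x cols for the 1 GB file
my $buffer = '';
vec($buffer, $cells - 1, WIDTH) = 0;    # extends $buffer to full size
printf "buffer is %d bytes\n", length $buffer;    # 500,000,000
__END__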