J> I have a huge text file with 1000 columns and about 1 million rows,
J> and I need to transpose this text file so that row become column and
J> column become row. (in case you are curious, this is a genotype
J> file).
J> Can someone recommend me an easy and efficient way to transpose such
J> a large dataset, hopefully with Perl ?
I think your file-based approach is inefficient. You need to do this
with a database. They are built to handle this kind of data; in fact
your data set is not that big (looks like 10GB at most). Once your data
is in the database, you can generate output any way you like, or do
operations directly on the contents, which may make your job much
easier.
You could try SQLite as a DB engine, but note the end of
http://www.sqlite.org/limits.html, which basically says it's not
designed for large data sets. Consider PostgreSQL instead, for example
(there are many others on the market, free and commercial).
To avoid the 1000-open-files solution, you can do the following:
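use strict;
use warnings;

my $size;                 # number of columns, taken from the first row
my $big = "big.txt";
my $brk = "break.txt";
open my $in,  '<', $big or die "Couldn't read from $big file: $!";
open my $out, '>', $brk or die "Couldn't write to $brk file: $!";
while (<$in>)
{
    chomp;
    my @data = split ' ';
    $size = @data unless defined $size;
    # every row must have the same number of columns, or the layout breaks
    die "Line $. of $big has " . @data . " fields, expected $size\n"
        if @data != $size;
    print $out join("\n", @data), "\n";   # one value per output line
}
close $in;
close $out or die "Couldn't finish writing $brk file: $!";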
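This basically converts an MxN matrix into an (MN)x1 one: the value at
row r, column c (both counted from 0) ends up on line r*N + c of
break.txt.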
Let's assume you have exactly 1000 columns for this example.
Now you can write each transposed output line by reading every line of
break.txt, chomp()ing it, and appending it (with a separator) to your
current output line if its line number, counting from 0, is divisible
by 1000 (so lines 0, 1000, 2000, etc. will match). When you reach the
end of break.txt, write "\n" to end the current output line.
Now you can do the next output line, which requires lines 1, 1001, 2001,
etc., that is, the ones whose line number leaves a remainder of 1 when
divided by 1000. You can reopen the break.txt file or just seek to the
beginning, and repeat until all 1000 output lines are written.
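For what it's worth, here is a rough sketch of that second pass, not a
tuned implementation. The sub name transpose_from_break and the demo
file names are made up for illustration, and I'm assuming a single
space as the output separator. It makes one full pass over the broken
file per output line, so with 1000 columns it reads the whole file 1000
times.

```perl
use strict;
use warnings;

# Hypothetical helper: rebuild the transposed file from the
# one-value-per-line file, one full pass per output line.
sub transpose_from_break {
    my ($brk, $out_name, $ncols) = @_;
    open my $in,  '<', $brk      or die "Couldn't read from $brk file: $!";
    open my $out, '>', $out_name or die "Couldn't write to $out_name file: $!";
    for my $col (0 .. $ncols - 1) {
        seek $in, 0, 0 or die "Couldn't rewind $brk file: $!";
        my $n   = 0;     # 0-based line index, counted by hand (seek does not reset $.)
        my $sep = '';    # nothing before the first value, a space before the rest
        while (my $value = <$in>) {
            if ($n % $ncols == $col) {
                chomp $value;
                print $out $sep, $value;
                $sep = ' ';
            }
            $n++;
        }
        print $out "\n";   # end the current output line
    }
    close $in;
    close $out or die "Couldn't finish writing $out_name file: $!";
}

# Demo on a 2x3 matrix (rows "a b c" and "d e f"), already broken
# into one value per line. File names are placeholders.
my ($demo_brk, $demo_out) = ('demo_break.txt', 'demo_transposed.txt');
open my $fh, '>', $demo_brk or die "Couldn't write to $demo_brk file: $!";
print $fh "$_\n" for qw(a b c d e f);
close $fh or die "Couldn't finish writing $demo_brk file: $!";

transpose_from_break($demo_brk, $demo_out, 3);
# demo_transposed.txt now holds three lines: "a d", "b e", "c f"
```

Values are printed as they are found rather than collected first, so
memory use stays flat no matter how many rows the original file had.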
I am not writing out the whole thing because it's tedious and I think
you should consider a database instead. It could be optimized, but
you're basically putting lipstick on a pig when you spend your time
optimizing the wrong solution for your needs.
Ted