how to tranpose a huge text file

J

Jie

I have a huge text file with 1000 columns and about 1 million rows,
and I need to transpose this text file so that row become column and
column become row. (in case you are curious, this is a genotype file).

Can someone recommend me an easy and efficient way to transpose such a
large dataset, hopefully with Perl ?

Thank you very much!

Jie
 
M

Mirco Wahab

Jie said:
I have a huge text file with 1000 columns and about 1 million rows,
and I need to transpose this text file so that row become column and
column become row. (in case you are curious, this is a genotype file).

Can someone recommend me an easy and efficient way to transpose such a
large dataset, hopefully with Perl ?

Is this a fixed format? Has each row exactly the same number of columns?
Does each row have a "line ending" character?

How's that beast looking? Normally, you'd read it in one stroke
(1,000c x 10^6r ==> 1Gig) and dump out the new rows via substr()
offsets (R times rowlength + C) subsequentially.

But without having more info, I could only guess ...

Regards

M.
 
X

xhoster

Jie said:
I have a huge text file with 1000 columns and about 1 million rows,
and I need to transpose this text file so that row become column and
column become row. (in case you are curious, this is a genotype file).

Can someone recommend me an easy and efficient way to transpose such a
large dataset, hopefully with Perl ?

Before doing that, I would spend a few minutes considering whether the time
would be better spent in making a more permanent solution, perhaps by
structuring the data so you can jump through it at will using seek or
sysseek, rather than always scanning it.

Anyway, I'd do something combining Perl and the system sort command
something like this:

perl -lne '@x=split/\t/; foreach (0..$#x) {
print "$_\t$.\t$x[$_]\n"}}' big_file.in | \
sort -k1n -k2n | \
perl -ne 'chomp; @x=split/\t/,$_,3; print $x[2];
print $.%1000 ?"\t":"\n"' > big_file.out



Xho
 
I

Ilya Zakharevich

[A complimentary Cc of this posting was sent to
Jie
I have a huge text file with 1000 columns and about 1 million rows,
and I need to transpose this text file so that row become column and
column become row. (in case you are curious, this is a genotype file).

Can someone recommend me an easy and efficient way to transpose such a
large dataset, hopefully with Perl ?

If your CRTL allows opening a 1000 output files, read a line, and
append the entries into corresponding files. Then cat the files
together.

If your CRTL allows opening only 32 output files, you need 3 passes,
not 2. First break into 32 files, 32 colums per file; then repeat
breaking for 32 generated files. Again, you get 1000 output files;
cat them together.

Hope this helps,
Ilya

P.S. If all your output data should fit into memory, use scalars
instead of files (preallocate scalars to be extra safe: $a =
'a'; $a x= 4e6; $a = '' preallocates 4MB of buffer for a
variable).

Read file line-by-line, appending to 1000 strings in memory.
Then write them out to a file.
 
M

Mumia W.

[A complimentary Cc of this posting was sent to
Jie
[...]
Can someone recommend me an easy and efficient way to transpose such a
large dataset, hopefully with Perl ?

If your CRTL allows opening a 1000 output files, read a line, and
append the entries into corresponding files. Then cat the files
together.

If your CRTL allows opening only 32 output files, you need 3 passes,
not 2. [...]

FileCache might also be useful here.
 
I

Ilya Zakharevich

[A complimentary Cc of this posting was sent to
Mumia W.
[A complimentary Cc of this posting was sent to
Jie
[...]
Can someone recommend me an easy and efficient way to transpose such a
large dataset, hopefully with Perl ?

If your CRTL allows opening a 1000 output files, read a line, and
append the entries into corresponding files. Then cat the files
together.

If your CRTL allows opening only 32 output files, you need 3 passes,
not 2. [...]

FileCache might also be useful here.

Do not think so. What gave you this idea? You want to open files 1e9
times?

Puzzled,
Ilya
 
P

patrick

I have a huge text file with 1000 columns and about 1 million rows,
and I need to transpose this text file so that row become column and
column become row. (in case you are curious, this is a genotype file).

Can someone recommend me an easy and efficient way to transpose such a
large dataset, hopefully with Perl ?

Thank you very much!

Jie

If you're on UNIX and the columns are fixed length or delimited
you may want to consider using the cut command inside a loop.

Loop from 1 to 1000 to process each column
cut -f<loop counter> <input file> | perl -e 'while (<>)
{chomp;print;print "|";}print "\n"' >> <output file>

Patrick
 
I

Ilya Zakharevich

[A complimentary Cc of this posting was sent to
Jim Gibson
Open files do consume a bit of memory: file control blocks,
input/output buffers, etc.

Peanuts of you do not open a million of files.
That is a fine approach, except for the practical matter that most
operating systems will not allow a normal user to have 1000 files open
at one time. The limit is 256 for my current system (Mac OS 10.4):

jgibson 34% ulimit -n
256

This is not the limit imposed by your operating system. Just the
limit suggested by one of the ancessors of your program. Try raising
ulimit -n; when if fails, it would indicate the limit given by the OS.

Hope this helps,
Ilya
 
J

Jie

Hi, Thank you so much for all the responses.

First, here is a sample dataset, but the real one is much bigger, with
1,000 columns instead of 14.
http://www.humanbee.com/BIG.txt

I could think of two ways to transpose this file.

Option1: write a line as a column and append, something like below

open IN, "<big_file.txt";
open OUT, ">transposed_file.txt";
while (IN) {
append ??????????????
}


Option 2: generate a huge 2-dimentional array and write it out in the
other way

$row=0;
while (IN) {
$big_ARRAY[$row][] = split(/ /);
}

foreach $row (0 ..$row) {
foreach $column (0 ..$#big_ARRAY) {
print OUT$big_ARRAY[$column][$row];
}
}



But I really doubt that either will work. So, can someone please throw
some idea and hopefully code here?!

Thank you!!

Jie
 
M

Mirco Wahab

Jie said:
But I really doubt that either will work. So, can someone please throw
some idea and hopefully code here?!

PLEASE give the slightest hint on how
this file *looks like*. If you could
bring up an example of .. say the first
10 rows => the left 20 and the right 20
columns in each, so everybody can guess
what you're talking about.

Regards

M.
 
T

Ted Zlatanov

J> I have a huge text file with 1000 columns and about 1 million rows,
J> and I need to transpose this text file so that row become column and
J> column become row. (in case you are curious, this is a genotype
J> file).

J> Can someone recommend me an easy and efficient way to transpose such
J> a large dataset, hopefully with Perl ?

I think your file-based approach is inefficient. You need to do this
with a database. They are built to handle this kind of data; in fact
your data set is not that big (looks like 10GB at most). Once your data
is in the database, you can generate output any way you like, or do
operations directly on the contents, which may make your job much
easier.

You could try SQLite as a DB engine, but note the end of
http://www.sqlite.org/limits.html which says basically it's not designed
for large data sets. Consider PostgreSQL, for example (there are many
others in the market, free and commercial).

To avoid the 1000-open-files solution, you can do the following:

my $size;
my $big = "big.txt";
my $brk = "break.txt";
open F, '<', $big or die "Couldn't read from $big file: $!";
open B, '>', $brk or die "Couldn't write to $brk file: $!";

while (<F>)
{
chomp;
my @data = split ' '; # you may want to ensure the size of @data is the same every time
$size = scalar @data; # but $size will have the LAST size of @data
print B join("\n", @data), "\n";
}

close F;
close B;

Basically converting a MxN matrix into (MN)x1

Let's assume you will just have 1000 columns for this example.

Now you can write each inverted output line by looking in break.txt,
reading every line, chomp() it, and append it to your current output
line if it's divisible by 1000 (so 0, 1000, 2000, etc. will match).
Write "\n" to end the current output line.

Now you can do the next output line, which requires lines 1, 1001, 2001,
etc. You can reopen the break.txt file or just seek to the beginning.

I am not writing out the whole thing because it's tedious and I think
you should consider a database instead. It could be optimized, but
you're basically putting lipstick on a pig when you spend your time
optimizing the wrong solution for your needs.

Ted
 
A

anno4000

Ted Zlatanov said:
J> I have a huge text file with 1000 columns and about 1 million rows,
J> and I need to transpose this text file so that row become column and
J> column become row. (in case you are curious, this is a genotype
J> file).

[Nice single-file solution snipped]
$size = scalar @data; # but $size will have the LAST size of @data

Useless use of scalar() here.

Anno
 
T

Ted Zlatanov

a> Useless use of scalar() here.

I like to make scalar context explicit. IMHO it makes code more
legible. It's my style, and AFAIK it doesn't cause problems (I also
like to say "shift @_" and "shift @ARGV" to be explicit as to what I'm
shifting).

I'll be the first to admit my style is peculiar, e.g. single-space
indents, but at least I'm consistent :)

Ted
 
X

xhoster

Jie said:
But I really doubt that either will work. So, can someone please throw
some idea and hopefully code here?!

Hi Jie,

We've already thrown out several ideas. Some take a lot of memory, some
take a lot of file-handles, some need to re-read the file once for each
column.

You haven't really commented on the suitability of any of these methods,
and the new information you provided is very minimal. So I wouldn't expect
to get many more ideas just be asking the same question again!

What did you think of the ideas we already gave you? Do they exceed your
system's memory? Do they exceed your file-handle limit? Do they just take
too long?

Xho
 
X

xhoster

Jie said:
I have a huge text file with 1000 columns and about 1 million rows,
and I need to transpose this text file so that row become column and
column become row. (in case you are curious, this is a genotype file).

Can someone recommend me an easy and efficient way to transpose such a
large dataset, hopefully with Perl ?

Thank you very much!

Out of curiosity, I made this fairly general purpose program for
transposing files (as long as they are tab separated, and rectangular). It
will revert to using multiple passes if it can't open enough temp files to
do it all in one pass. There are more efficient ways of doing it in that
case, but they are more complicated and I'm lazy. On my computer, it seems
to be even faster than reading all the data into memory and buidling
in-memory strings, and of course uses a lot less memory.

It doesn't work to C<seek> on ARGV, so I had to open the (single) input
file explicitly instead. Writes to STDOUT.

use strict;
use warnings;

open my $in, $ARGV[0] or die "$ARGV[0] $!";
my @cols=split /\t/, scalar <$in>;
my $cols=@cols;

my $i=0; ## the first unprocessed column
while ($i<$cols) {
my @fh;
my $j;
## open as many files as the fd limit will let us
foreach ($j=$i;$j<$cols; $j++) {
open $fh[@fh], "+>",undef or do {
die "$j $!" unless $!{EMFILE};
pop @fh;
last;
};
};
$j--;
##warn "working on columns $i..$j";
seek $in,0,0 or die $!;
while (<$in>) { chomp;
my $x=0;
print {$fh[$x++]} "\t$_" or die $! foreach (split/\t/)[$i..$j];
}
foreach my $x (@fh) {
seek $x,0,0 or die $!;
$_=<$x>;
s/^\t//; # chop the unneeded leading tab
print "$_\n"
}
$i=$j+1;
}


Xho
 
I

Ilya Zakharevich

[A complimentary Cc of this posting was sent to
Ted Zlatanov
To avoid the 1000-open-files solution, you can do the following:
while (<F>)
{
chomp;
my @data = split ' '; # you may want to ensure the size of @data is the same every time
$size = scalar @data; # but $size will have the LAST size of @data
print B join("\n", @data), "\n";
}

This is a NULL operation. You just converted " " to "\n".
Essentially, nothing changed. [And $size is not used.]
Now you can write each inverted output line by looking in break.txt,
reading every line, chomp() it, and append it to your current output
line if it's divisible by 1000 (so 0, 1000, 2000, etc. will match).
Write "\n" to end the current output line.

Good. So what you suggest, is 1000 passes over a 4GB file. Good luck!

Hope this helps,
Ilya
 
T

Ted Zlatanov

IZ> [A complimentary Cc of this posting was sent to
IZ> Ted Zlatanov

IZ> This is a NULL operation. You just converted " " to "\n".
IZ> Essentially, nothing changed.

I disagree, but it's somewhat irrelevant, see at end...

IZ> [And $size is not used.]

It's necessary later when you are jumping $size line forward (I used
1000 in the example later). It's also handy to check if @data is not
the right size compared to the last line. Sorry I didn't mention that.

IZ> Good. So what you suggest, is 1000 passes over a 4GB file. Good luck!

I suggested a database, actually. I specifically said I don't recommend
doing this with file operations. I agree my file-based approach isn't
better than what you and others have suggested, but it does avoid the
multiple open files, and it has low memory usage.

Ted
 
I

Ilya Zakharevich

[A complimentary Cc of this posting was sent to
Ted Zlatanov
IZ> Good. So what you suggest, is 1000 passes over a 4GB file. Good luck!

I suggested a database, actually.

And why do you think this would decrease the load on head seeks?
Either the data fits in memory (then database is not needed), or it is
read from disk (which would, IMO, imply the same amount of seeks with
database as with any other file-based operation).

One needs not a database, but a program with build-in caching
optimized for non-random access to 2-dimensional arrays. AFAIK,
imagemagick is mostly memory-based. On the other side of spectrum,
GIMP is based on tile-caching algorithms; if there were a way to
easily hook into this algorithm (with no screen display involved), one
could handle much larger datasets.

Yet another way might be compression; suppose that there are only
(e.g.) 130 "types" of entries; then one can compress the matrix into
1GB of data, which should be handled easily by almost any computer.

Hope this helps,
Ilya
 
T

Ted Zlatanov

IZ> [A complimentary Cc of this posting was sent to
IZ> Ted Zlatanov

IZ> And why do you think this would decrease the load on head seeks?
IZ> Either the data fits in memory (then database is not needed), or it is
IZ> read from disk (which would, IMO, imply the same amount of seeks with
IZ> database as with any other file-based operation).

Look, databases are optimized to store large amounts of data
efficiently. You can always create a hand-tuned program that will do
one task (e.g. transposing a huge text file) well, but you're missing
the big picture: future uses of the data. I really doubt the only thing
anyone will ever want with that data is to transpose it.

IZ> One needs not a database, but a program with build-in caching
IZ> optimized for non-random access to 2-dimensional arrays. AFAIK,
IZ> imagemagick is mostly memory-based. On the other side of spectrum,
IZ> GIMP is based on tile-caching algorithms; if there were a way to
IZ> easily hook into this algorithm (with no screen display involved), one
IZ> could handle much larger datasets.

You and everyone else are overcomplicating this.

Rewrite the original input file for fixed-length records. Then you just
need to seek to a particular offset to read a record, and the problem
becomes transposing a matrix piece by piece. This is fairly simple.

IZ> Yet another way might be compression; suppose that there are only
IZ> (e.g.) 130 "types" of entries; then one can compress the matrix into
IZ> 1GB of data, which should be handled easily by almost any computer.

You need 5 bits per item: it has 16 possible values ([ACTG]{2}), plus
"--".

A database table, to come back to my point, would store these items as
enums. Then you, the user, don't have to worry about the bits per item
in the storage, and you can just use the database.

Ted
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
474,436
Messages
2,571,696
Members
48,796
Latest member
Greg L.
Top