how to transpose a huge text file

Discussion in 'Perl Misc' started by Jie, Aug 8, 2007.

  1. Jie

    Jie Guest

    I have a huge text file with 1000 columns and about 1 million rows,
    and I need to transpose this text file so that rows become columns and
    columns become rows. (in case you are curious, this is a genotype file).

    Can someone recommend me an easy and efficient way to transpose such a
    large dataset, hopefully with Perl ?

    Thank you very much!

    Jie
     
    Jie, Aug 8, 2007
    #1

  2. Jie

    Mirco Wahab Guest

    Jie wrote:
    > I have a huge text file with 1000 columns and about 1 million rows,
    > and I need to transpose this text file so that rows become columns and
    > columns become rows. (in case you are curious, this is a genotype file).
    >
    > Can someone recommend me an easy and efficient way to transpose such a
    > large dataset, hopefully with Perl ?


    Is this a fixed format? Does each row have exactly the same number of
    columns? Does each row end with a "line ending" character?

    How's that beast looking? Normally, you'd read it in one stroke
    (1,000 cols x 10^6 rows ==> roughly 1 GB) and dump out the new rows
    sequentially via substr() offsets (R * rowlength + C).

    But without having more info, I could only guess ...
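
    For instance, assuming every cell has exactly the same width, the cells
    are not delimited, and every row ends in "\n", a rough (untested) sketch
    of that substr() idea could look like this; the cell width and file
    names are just placeholders:

    use strict;
    use warnings;

    my $cell = 2;                        # assumed fixed cell width in bytes
    open my $in, '<', 'big_file.txt' or die $!;
    my $data = do { local $/; <$in> };   # slurp the whole file in one stroke
    close $in;

    my $rowlen = 1 + index $data, "\n";  # bytes per row, newline included
    my $cols   = ($rowlen - 1) / $cell;
    my $rows   = length($data) / $rowlen;

    open my $out, '>', 'transposed.txt' or die $!;
    for my $c (0 .. $cols - 1) {
        my @new_row;
        for my $r (0 .. $rows - 1) {
            push @new_row, substr $data, $r * $rowlen + $c * $cell, $cell;
        }
        print {$out} join('', @new_row), "\n";
    }
    close $out;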

    Regards

    M.
     
    Mirco Wahab, Aug 8, 2007
    #2

  3. Jie

    Guest

    Jie <> wrote:
    > I have a huge text file with 1000 columns and about 1 million rows,
    > and I need to transpose this text file so that rows become columns and
    > columns become rows. (in case you are curious, this is a genotype file).
    >
    > Can someone recommend me an easy and efficient way to transpose such a
    > large dataset, hopefully with Perl ?


    Before doing that, I would spend a few minutes considering whether the time
    would be better spent in making a more permanent solution, perhaps by
    structuring the data so you can jump through it at will using seek or
    sysseek, rather than always scanning it.

    Anyway, I'd combine Perl and the system sort command, something like
    this (assuming exactly 1,000,000 input rows; adjust the modulus in the
    last step to the real row count):

    perl -lne '@x=split/\t/; foreach (0..$#x) {
        print "$_\t$.\t$x[$_]" }' big_file.in | \
    sort -k1n -k2n | \
    perl -ne 'chomp; @x=split/\t/,$_,3; print $x[2];
        print $. % 1_000_000 ? "\t" : "\n"' > big_file.out



    Xho

    --
    -------------------- http://NewsReader.Com/ --------------------
    Usenet Newsgroup Service $9.95/Month 30GB
     
    , Aug 9, 2007
    #3
  4. [A complimentary Cc of this posting was sent to
    Jie
    <>], who wrote in article <>:
    >
    > I have a huge text file with 1000 columns and about 1 million rows,
    > and I need to transpose this text file so that rows become columns and
    > columns become rows. (in case you are curious, this is a genotype file).
    >
    > Can someone recommend me an easy and efficient way to transpose such a
    > large dataset, hopefully with Perl ?


    If your CRTL allows opening a 1000 output files, read a line, and
    append the entries into corresponding files. Then cat the files
    together.

    If your CRTL allows opening only 32 output files, you need 3 passes,
    not 2. First break into 32 files, about 32 columns per file; then repeat
    breaking for 32 generated files. Again, you get 1000 output files;
    cat them together.
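
    A rough (untested) sketch of the one-pass variant, assuming the CRTL
    really does allow 1000 open handles and the input is tab-separated; the
    temp-file names are placeholders:

    use strict;
    use warnings;

    my $ncols = 1000;
    my @fh;
    for my $c (0 .. $ncols - 1) {
        # zero-padded names so `cat col.*.tmp` keeps the columns in order
        my $name = sprintf 'col.%04d.tmp', $c;
        open $fh[$c], '>', $name or die "$name: $!";
    }

    open my $in, '<', 'big_file.txt' or die $!;
    my $first = 1;
    while (<$in>) {
        chomp;
        my @cells = split /\t/;
        print { $fh[$_] } ($first ? '' : "\t") . $cells[$_] for 0 .. $#cells;
        $first = 0;
    }
    print {$_} "\n" for @fh;            # terminate each future output row
    close $_ for $in, @fh;
    # then:  cat col.*.tmp > transposed.txt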

    Hope this helps,
    Ilya

    P.S. If all your output data fits into memory, use scalars
    instead of files (preallocate the scalars to be extra safe: $a =
    'a'; $a x= 4e6; $a = '' preallocates a 4MB buffer for a
    variable).

    Read the file line by line, appending to 1000 strings in memory.
    Then write them out to a file.
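
    A minimal (untested) sketch of that, assuming tab-separated input with
    exactly 1000 columns; the file names are placeholders:

    use strict;
    use warnings;

    my @out = ('') x 1000;              # one output row per input column
    for my $s (@out) {                  # optional: preallocate as described above
        $s = 'x'; $s x= 4_000_000; $s = '';
    }

    open my $in, '<', 'big_file.txt' or die $!;
    while (<$in>) {
        chomp;
        my @cells = split /\t/;
        $out[$_] .= "\t" . $cells[$_] for 0 .. $#cells;
    }
    close $in;

    open my $fh, '>', 'transposed.txt' or die $!;
    for my $row (@out) {
        $row =~ s/^\t//;                # drop the leading tab added above
        print {$fh} $row, "\n";
    }
    close $fh;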
     
    Ilya Zakharevich, Aug 9, 2007
    #4
  5. Jie

    Mumia W. Guest

    On 08/08/2007 09:00 PM, Ilya Zakharevich wrote:
    > [A complimentary Cc of this posting was sent to
    > Jie
    > <>], who wrote in article <>:
    >> [...]
    >> Can someone recommend me an easy and efficient way to transpose such a
    >> large dataset, hopefully with Perl ?

    >
    > If your CRTL allows opening a 1000 output files, read a line, and
    > append the entries into corresponding files. Then cat the files
    > together.
    >
    > If your CRTL allows opening only 32 output files, you need 3 passes,
    > not 2. [...]


    FileCache might also be useful here.
     
    Mumia W., Aug 9, 2007
    #5
  6. [A complimentary Cc of this posting was sent to
    Mumia W.
    <>], who wrote in article <>:
    > On 08/08/2007 09:00 PM, Ilya Zakharevich wrote:
    > > [A complimentary Cc of this posting was sent to
    > > Jie
    > > <>], who wrote in article <>:
    > >> [...]
    > >> Can someone recommend me an easy and efficient way to transpose such a
    > >> large dataset, hopefully with Perl ?

    > >
    > > If your CRTL allows opening a 1000 output files, read a line, and
    > > append the entries into corresponding files. Then cat the files
    > > together.
    > >
    > > If your CRTL allows opening only 32 output files, you need 3 passes,
    > > not 2. [...]

    >
    > FileCache might also be useful here.


    I do not think so. What gave you this idea? You want to open files 1e9
    times?

    Puzzled,
    Ilya
     
    Ilya Zakharevich, Aug 9, 2007
    #6
  7. Jie

    patrick Guest

    On Aug 8, 1:56 pm, Jie <> wrote:
    > I have a huge text file with 1000 columns and about 1 million rows,
    > and I need to transpose this text file so that rows become columns and
    > columns become rows. (in case you are curious, this is a genotype file).
    >
    > Can someone recommend me an easy and efficient way to transpose such a
    > large dataset, hopefully with Perl ?
    >
    > Thank you very much!
    >
    > Jie


    If you're on UNIX and the columns are fixed length or delimited,
    you may want to consider using the cut command inside a loop,
    one pass per column:

    for i in `seq 1 1000`; do    # process each column in turn
        cut -f$i <input file> | \
            perl -e 'while (<>) { chomp; print; print "|" } print "\n"' \
            >> <output file>
    done

    Patrick
     
    patrick, Aug 9, 2007
    #7
  8. [A complimentary Cc of this posting was sent to
    Jim Gibson
    <>], who wrote in article <090820070901097436%>:
    > > This solution does not consume much memory and should be relatively quick.

    >
    > Open files do consume a bit of memory: file control blocks,
    > input/output buffers, etc.


    Peanuts, if you do not open a million files.

    > > BTW: I'm curious to reactions of this programmer community :)

    >
    > That is a fine approach, except for the practical matter that most
    > operating systems will not allow a normal user to have 1000 files open
    > at one time. The limit is 256 for my current system (Mac OS 10.4):
    >
    > jgibson 34% ulimit -n
    > 256


    This is not the limit imposed by your operating system, just the
    limit set by one of the ancestors of your program. Try raising
    ulimit -n; when that fails, it will indicate the limit given by the OS.

    Hope this helps,
    Ilya
     
    Ilya Zakharevich, Aug 9, 2007
    #8
  9. Jie

    Jie Guest

    Hi, Thank you so much for all the responses.

    First, here is a sample dataset, but the real one is much bigger, with
    1,000 columns instead of 14.
    http://www.humanbee.com/BIG.txt

    I could think of two ways to transpose this file.

    Option 1: write each line out as a column and append, something like below

    open IN, "<", "big_file.txt" or die $!;
    open OUT, ">", "transposed_file.txt" or die $!;
    while (<IN>) {
        # append ??????????????
    }


    Option 2: read everything into a huge 2-dimensional array and write it
    out the other way around

    $row = 0;
    while (<IN>) {
        chomp;
        @{ $big_ARRAY[$row++] } = split / /;
    }

    foreach $column (0 .. $#{ $big_ARRAY[0] }) {
        foreach $r (0 .. $#big_ARRAY) {
            print OUT $big_ARRAY[$r][$column], " ";
        }
        print OUT "\n";
    }



    But I really doubt that either will work. So, can someone please throw
    some idea and hopefully code here?!

    Thank you!!

    Jie
     
    Jie, Aug 9, 2007
    #9
  10. Jie

    Mirco Wahab Guest

    Jie wrote:
    > But I really doubt that either will work. So, can someone please throw
    > some idea and hopefully code here?!


    PLEASE give at least a slight hint on what
    this file *looks like*. If you could
    bring up an example of, say, the first
    10 rows => the left 20 and the right 20
    columns in each, then everybody could see
    what you're talking about.

    Regards

    M.
     
    Mirco Wahab, Aug 9, 2007
    #10
  11. Jie

    Jie Guest

    Jie, Aug 10, 2007
    #11
  12. Jie

    Ted Zlatanov Guest

    On Wed, 08 Aug 2007 20:56:24 -0000 Jie <> wrote:

    J> I have a huge text file with 1000 columns and about 1 million rows,
    J> and I need to transpose this text file so that rows become columns and
    J> columns become rows. (in case you are curious, this is a genotype
    J> file).

    J> Can someone recommend me an easy and efficient way to transpose such
    J> a large dataset, hopefully with Perl ?

    I think your file-based approach is inefficient. You need to do this
    with a database. They are built to handle this kind of data; in fact
    your data set is not that big (looks like 10GB at most). Once your data
    is in the database, you can generate output any way you like, or do
    operations directly on the contents, which may make your job much
    easier.

    You could try SQLite as a DB engine, but note the end of
    http://www.sqlite.org/limits.html which says basically it's not designed
    for large data sets. Consider PostgreSQL, for example (there are many
    others in the market, free and commercial).
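
    For instance, a rough (untested) sketch of the load step with DBI and
    DBD::SQLite; the table, column, and file names are made up, and
    PostgreSQL would be driven the same way through DBI:

    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect('dbi:SQLite:dbname=genotype.db', '', '',
                           { RaiseError => 1, AutoCommit => 0 });
    $dbh->do('CREATE TABLE cell (r INTEGER, c INTEGER, val TEXT)');

    my $ins = $dbh->prepare('INSERT INTO cell (r, c, val) VALUES (?, ?, ?)');
    open my $in, '<', 'big_file.txt' or die $!;
    while (my $line = <$in>) {
        chomp $line;
        my @cells = split ' ', $line;
        $ins->execute($., $_, $cells[$_]) for 0 .. $#cells;
    }
    close $in;
    $dbh->commit;                          # in real life, commit in batches

    # one original column (= one transposed row) at a time:
    my $sel = $dbh->prepare('SELECT val FROM cell WHERE c = ? ORDER BY r');
    $sel->execute(0);                      # e.g. column 0
    print join("\t", map { $_->[0] } @{ $sel->fetchall_arrayref }), "\n";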

    To avoid the 1000-open-files solution, you can do the following:

    my $size;
    my $big = "big.txt";
    my $brk = "break.txt";
    open F, '<', $big or die "Couldn't read from $big file: $!";
    open B, '>', $brk or die "Couldn't write to $brk file: $!";

    while (<F>)
    {
        chomp;
        my @data = split ' '; # you may want to ensure the size of @data is the same every time
        $size = scalar @data; # but $size will have the LAST size of @data
        print B join("\n", @data), "\n";
    }

    close F;
    close B;

    Basically, this converts an MxN matrix into an (M*N)x1 one.

    Let's assume you will just have 1000 columns for this example.

    Now you can write each inverted output line by looking in break.txt,
    reading every line, chomp()ing it, and appending it to your current output
    line if its line number is divisible by 1000 (so 0, 1000, 2000, etc. will
    match). Write "\n" to end the current output line.

    Now you can do the next output line, which requires lines 1, 1001, 2001,
    etc. You can reopen the break.txt file or just seek to the beginning.

    I am not writing out the whole thing because it's tedious and I think
    you should consider a database instead. It could be optimized, but
    you're basically putting lipstick on a pig when you spend your time
    optimizing the wrong solution for your needs.

    Ted
     
    Ted Zlatanov, Aug 10, 2007
    #12
  13. Jie

    -berlin.de Guest

    Ted Zlatanov <> wrote in comp.lang.perl.misc:
    > On Wed, 08 Aug 2007 20:56:24 -0000 Jie <> wrote:
    >
    > J> I have a huge text file with 1000 columns and about 1 million rows,
    > J> and I need to transpose this text file so that rows become columns and
    > J> columns become rows. (in case you are curious, this is a genotype
    > J> file).


    [Nice single-file solution snipped]

    > $size = scalar @data; # but $size will have the LAST size of @data


    Useless use of scalar() here.

    Anno
     
    -berlin.de, Aug 10, 2007
    #13
  14. Jie

    Ted Zlatanov Guest

    On 10 Aug 2007 15:52:09 GMT -berlin.de wrote:

    a> Ted Zlatanov <> wrote in comp.lang.perl.misc:
    >> $size = scalar @data; # but $size will have the LAST size of @data


    a> Useless use of scalar() here.

    I like to make scalar context explicit. IMHO it makes code more
    legible. It's my style, and AFAIK it doesn't cause problems (I also
    like to say "shift @_" and "shift @ARGV" to be explicit as to what I'm
    shifting).

    I'll be the first to admit my style is peculiar, e.g. single-space
    indents, but at least I'm consistent :)

    Ted
     
    Ted Zlatanov, Aug 10, 2007
    #14
  15. Jie

    Guest

    Jie <> wrote:

    > But I really doubt that either will work. So, can someone please throw
    > some idea and hopefully code here?!


    Hi Jie,

    We've already thrown out several ideas. Some take a lot of memory, some
    take a lot of file-handles, some need to re-read the file once for each
    column.

    You haven't really commented on the suitability of any of these methods,
    and the new information you provided is very minimal. So I wouldn't expect
    to get many more ideas just by asking the same question again!

    What did you think of the ideas we already gave you? Do they exceed your
    system's memory? Do they exceed your file-handle limit? Do they just take
    too long?

    Xho

    --
    -------------------- http://NewsReader.Com/ --------------------
    Usenet Newsgroup Service $9.95/Month 30GB
     
    , Aug 10, 2007
    #15
  16. Jie

    Guest

    Jie <> wrote:
    > I have a huge text file with 1000 columns and about 1 million rows,
    > and I need to transpose this text file so that rows become columns and
    > columns become rows. (in case you are curious, this is a genotype file).
    >
    > Can someone recommend me an easy and efficient way to transpose such a
    > large dataset, hopefully with Perl ?
    >
    > Thank you very much!


    Out of curiosity, I made this fairly general purpose program for
    transposing files (as long as they are tab separated, and rectangular). It
    will revert to using multiple passes if it can't open enough temp files to
    do it all in one pass. There are more efficient ways of doing it in that
    case, but they are more complicated and I'm lazy. On my computer, it seems
    to be even faster than reading all the data into memory and building
    in-memory strings, and of course uses a lot less memory.

    It doesn't work to C<seek> on ARGV, so I had to open the (single) input
    file explicitly instead. Writes to STDOUT.

    use strict;
    use warnings;

    open my $in, $ARGV[0] or die "$ARGV[0] $!";
    my @cols = split /\t/, scalar <$in>;
    my $cols = @cols;

    my $i = 0;                  ## the first unprocessed column
    while ($i < $cols) {
        my @fh;
        my $j;
        ## open as many files as the fd limit will let us
        foreach ($j = $i; $j < $cols; $j++) {
            open $fh[@fh], "+>", undef or do {
                die "$j $!" unless $!{EMFILE};
                pop @fh;
                last;
            };
        }
        $j--;
        ##warn "working on columns $i..$j";
        seek $in, 0, 0 or die $!;
        while (<$in>) {
            chomp;
            my $x = 0;
            print {$fh[$x++]} "\t$_" or die $! foreach (split /\t/)[$i..$j];
        }
        foreach my $x (@fh) {
            seek $x, 0, 0 or die $!;
            $_ = <$x>;
            s/^\t//;            # chop the unneeded leading tab
            print "$_\n";
        }
        $i = $j + 1;
    }


    Xho

    --
    -------------------- http://NewsReader.Com/ --------------------
    Usenet Newsgroup Service $9.95/Month 30GB
     
    , Aug 10, 2007
    #16
  17. [A complimentary Cc of this posting was sent to
    Ted Zlatanov
    <>], who wrote in article <>:
    > To avoid the 1000-open-files solution, you can do the following:


    > while (<F>)
    > {
    > chomp;
    > my @data = split ' '; # you may want to ensure the size of @data is the same every time
    > $size = scalar @data; # but $size will have the LAST size of @data
    > print B join("\n", @data), "\n";
    > }


    This is a NULL operation. You just converted " " to "\n".
    Essentially, nothing changed. [And $size is not used.]

    > Now you can write each inverted output line by looking in break.txt,
    > reading every line, chomp() it, and append it to your current output
    > line if it's divisible by 1000 (so 0, 1000, 2000, etc. will match).
    > Write "\n" to end the current output line.


    Good. So what you suggest, is 1000 passes over a 4GB file. Good luck!

    Hope this helps,
    Ilya
     
    Ilya Zakharevich, Aug 10, 2007
    #17
  18. Jie

    Ted Zlatanov Guest

    On Fri, 10 Aug 2007 21:11:57 +0000 (UTC) Ilya Zakharevich <> wrote:

    IZ> [A complimentary Cc of this posting was sent to
    IZ> Ted Zlatanov
    IZ> <>], who wrote in article <>:
    >> To avoid the 1000-open-files solution, you can do the following:


    >> while (<F>)
    >> {
    >> chomp;
    >> my @data = split ' '; # you may want to ensure the size of @data is the same every time
    >> $size = scalar @data; # but $size will have the LAST size of @data
    >> print B join("\n", @data), "\n";
    >> }


    IZ> This is a NULL operation. You just converted " " to "\n".
    IZ> Essentially, nothing changed.

    I disagree, but it's somewhat irrelevant, see at end...

    IZ> [And $size is not used.]

    It's necessary later when you are jumping $size lines forward (I used
    1000 in the example later). It's also handy for checking whether @data is
    the same size as it was on the last line. Sorry I didn't mention that.

    >> Now you can write each inverted output line by looking in break.txt,
    >> reading every line, chomp() it, and append it to your current output
    >> line if it's divisible by 1000 (so 0, 1000, 2000, etc. will match).
    >> Write "\n" to end the current output line.


    IZ> Good. So what you suggest, is 1000 passes over a 4GB file. Good luck!

    I suggested a database, actually. I specifically said I don't recommend
    doing this with file operations. I agree my file-based approach isn't
    better than what you and others have suggested, but it does avoid the
    multiple open files, and it has low memory usage.

    Ted
     
    Ted Zlatanov, Aug 10, 2007
    #18
  19. [A complimentary Cc of this posting was sent to
    Ted Zlatanov
    <>], who wrote in article <>:
    > >> Now you can write each inverted output line by looking in break.txt,
    > >> reading every line, chomp() it, and append it to your current output
    > >> line if it's divisible by 1000 (so 0, 1000, 2000, etc. will match).
    > >> Write "\n" to end the current output line.

    >
    > IZ> Good. So what you suggest, is 1000 passes over a 4GB file. Good luck!
    >
    > I suggested a database, actually.


    And why do you think this would decrease the load on head seeks?
    Either the data fits in memory (then a database is not needed), or it is
    read from disk (which would, IMO, imply the same number of seeks with a
    database as with any other file-based operation).

    What one needs is not a database, but a program with built-in caching
    optimized for non-random access to 2-dimensional arrays. AFAIK,
    ImageMagick is mostly memory-based. On the other side of the spectrum,
    GIMP is based on tile-caching algorithms; if there were a way to
    easily hook into that algorithm (with no screen display involved), one
    could handle much larger datasets.

    Yet another way might be compression; suppose that there are only
    (e.g.) 130 "types" of entries; then one can compress the matrix into
    1GB of data, which should be handled easily by almost any computer.
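
    For illustration, a rough (untested) sketch of packing each entry into
    one byte; the file name and the whitespace-separated format are
    assumptions:

    use strict;
    use warnings;

    my %code;                            # genotype string -> single byte
    my $next = 0;
    my $packed = '';                     # whole matrix, 1 byte per cell
    open my $in, '<', 'big_file.txt' or die $!;
    while (<$in>) {
        chomp;
        for my $g (split ' ') {
            $code{$g} = chr($next++) unless exists $code{$g};
            $packed .= $code{$g};
        }
    }
    close $in;

    # $packed can now be transposed in memory with substr() offsets, and the
    # bytes mapped back to genotype strings through a reversed %code.
    my %decode = reverse %code;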

    Hope this helps,
    Ilya
     
    Ilya Zakharevich, Aug 10, 2007
    #19
  20. Jie

    Ted Zlatanov Guest

    On Fri, 10 Aug 2007 22:28:34 +0000 (UTC) Ilya Zakharevich <> wrote:

    IZ> [A complimentary Cc of this posting was sent to
    IZ> Ted Zlatanov
    IZ> <>], who wrote in article <>:
    >> >> Now you can write each inverted output line by looking in break.txt,
    >> >> reading every line, chomp() it, and append it to your current output
    >> >> line if it's divisible by 1000 (so 0, 1000, 2000, etc. will match).
    >> >> Write "\n" to end the current output line.

    >>

    IZ> Good. So what you suggest, is 1000 passes over a 4GB file. Good luck!
    >>
    >> I suggested a database, actually.


    IZ> And why do you think this would decrease the load on head seeks?
    IZ> Either the data fits in memory (then database is not needed), or it is
    IZ> read from disk (which would, IMO, imply the same amount of seeks with
    IZ> database as with any other file-based operation).

    Look, databases are optimized to store large amounts of data
    efficiently. You can always create a hand-tuned program that will do
    one task (e.g. transposing a huge text file) well, but you're missing
    the big picture: future uses of the data. I really doubt the only thing
    anyone will ever want to do with that data is to transpose it.

    IZ> One needs not a database, but a program with build-in caching
    IZ> optimized for non-random access to 2-dimensional arrays. AFAIK,
    IZ> imagemagick is mostly memory-based. On the other side of spectrum,
    IZ> GIMP is based on tile-caching algorithms; if there were a way to
    IZ> easily hook into this algorithm (with no screen display involved), one
    IZ> could handle much larger datasets.

    You and everyone else are overcomplicating this.

    Rewrite the original input file to use fixed-length records. Then you just
    need to seek to a particular offset to read any record, and the problem
    becomes transposing the matrix piece by piece. This is fairly simple.
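
    A rough (untested) sketch of that addressing, assuming each cell has
    been padded to a fixed width; the cell width, file name, and the small
    block printed at the end are made up:

    use strict;
    use warnings;

    my $cell   = 3;                        # e.g. "AG" padded to 3 bytes
    my $cols   = 1000;
    my $rowlen = $cols * $cell + 1;        # +1 for the trailing "\n"

    open my $in, '<', 'fixed.txt' or die $!;

    sub cell_at {
        my ($row, $col) = @_;              # both 0-based
        seek $in, $row * $rowlen + $col * $cell, 0 or die $!;
        read($in, my $buf, $cell) == $cell or die "short read";
        return $buf;
    }

    # e.g. emit a small transposed block (columns 0..1 of rows 0..4):
    for my $col (0 .. 1) {
        print join(' ', map { cell_at($_, $col) } 0 .. 4), "\n";
    }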

    IZ> Yet another way might be compression; suppose that there are only
    IZ> (e.g.) 130 "types" of entries; then one can compress the matrix into
    IZ> 1GB of data, which should be handled easily by almost any computer.

    You need 5 bits per item: each item has 16 possible values ([ACTG]{2})
    plus "--", i.e. 17 values in all.

    A database table, to come back to my point, would store these items as
    enums. Then you, the user, don't have to worry about the bits per item
    in the storage, and you can just use the database.

    Ted
     
    Ted Zlatanov, Aug 11, 2007
    #20
