how to transpose a huge text file


xhoster

Ted Zlatanov said:
On Fri, 10 Aug 2007 22:28:34 +0000 (UTC) Ilya Zakharevich wrote:

IZ> [A complimentary Cc of this posting was sent to
IZ> Ted Zlatanov
IZ> Good. So what you suggest, is 1000 passes over a 4GB file. Good luck!
IZ> And why do you think this would decrease the load on head seeks?
IZ> Either the data fits in memory (then database is not needed), or it is
IZ> read from disk (which would, IMO, imply the same amount of seeks with
IZ> database as with any other file-based operation).

Look, databases are optimized to store large amounts of data
efficiently.

For some not very general meanings of "efficiently", sure. They generally
expand the data quite a bit upon storage; they aren't very good at straight
retrieval unless you have just the right index structures in place and your
queries have a high selectivity; most of them put a huge amount of effort
into transactionality and concurrency which may not be needed here but
imposes a high overhead whether you use it or not. One of the major
gene-chip companies was very proud that in one of their upgrades, they
started using a database instead of plain files for storing the data. And
then their customers were very pleased when in a following upgrade they
abandoned that, and went back to using plain files for the bulk data and
using the database just for the small DoE metadata.

You can always create a hand-tuned program that will do
one task (e.g. transposing a huge text file) well, but you're missing
the big picture: future uses of the data. I really doubt the only thing
anyone will ever want with that data is to transpose it.

And I really doubt that any single database design is going to support
everything that anyone may ever want to do with the data, either.
IZ> One needs not a database, but a program with built-in caching
IZ> optimized for non-random access to 2-dimensional arrays. AFAIK,
IZ> imagemagick is mostly memory-based. On the other side of the spectrum,
IZ> GIMP is based on tile-caching algorithms; if there were a way to
IZ> easily hook into this algorithm (with no screen display involved), one
IZ> could handle much larger datasets.

You and everyone else are overcomplicating this.

Rewrite the original input file for fixed-length records.

Actually, that is just what I initially recommended.
Then you just
need to seek to a particular offset to read a record, and the problem
becomes transposing a matrix piece by piece. This is fairly simple.

I think you are missing the big picture. Once you make a seekable file
format, that probably does away with the need to transpose the data in the
first place--whatever operation you wanted to do with the transposition can
probably be done on the seekable file instead.
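
To make that concrete, a minimal Perl sketch of pulling one column straight
out of such a fixed-length-record file (the file name and the 3-byte field
width, two characters plus a pad byte, are illustrative assumptions):

  #!/usr/bin/perl
  use strict;
  use warnings;

  # Sketch only: the file name and the 3-byte fixed-width fields
  # ("AA", "AC", ..., "--", each padded to 3 bytes) are assumptions.
  # If each record ends in a newline, add 1 to $rec_len.
  my $file      = 'fixed.dat';
  my $field_len = 3;                      # 2 characters + 1 pad byte
  my $nfields   = 1000;                   # fields per record
  my $rec_len   = $field_len * $nfields;  # fixed record length

  # Read the $col-th field of every record, i.e. one line of the
  # transposed output, without transposing anything.
  sub read_column {
      my ($col) = @_;
      open my $fh, '<', $file or die "open $file: $!";
      binmode $fh;
      my $size = -s $fh;
      my @vals;
      for (my $off = $col * $field_len; $off < $size; $off += $rec_len) {
          seek $fh, $off, 0             or die "seek: $!";
          read $fh, my $buf, $field_len or die "read: $!";
          push @vals, substr($buf, 0, 2);
      }
      close $fh;
      return \@vals;
  }

  my $col41 = read_column(41);   # e.g. the 42nd field of every record
  print "@$col41\n";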

Xho
 

Ilya Zakharevich

[A complimentary Cc of this posting was sent to
Ted Zlatanov
IZ> And why do you think this would decrease the load on head seeks?
IZ> Either the data fits in memory (then database is not needed), or it is
IZ> read from disk (which would, IMO, imply the same amount of seeks with
IZ> database as with any other file-based operation).
Look, databases are optimized to store large amounts of data
efficiently.

Words words words. You can't do *all the things* efficiently.
Databases are optimized for some particular access patterns. I doubt
that even "good databases" are optimized for *this particular* access
pattern. And, AFAIK, MySQL is famous for its lousiness...
You can always create a hand-tuned program that will do
one task (e.g. transposing a huge text file) well, but you're missing
the big picture: future uses of the data. I really doubt the only thing
anyone will ever want with that data is to transpose it.

If the transposed form is well tuned for further manipulation (of
which we were not informed), then a database looks like overkill.
If not, then indeed it might be worth it.
You and everyone else are overcomplicating this.

Rewrite the original input file for fixed-length records. Then you just
need to seek to a particular offset to read a record, and the problem
becomes transposing a matrix piece by piece. This is fairly simple.

Sure. Do 1e9 seeks in your spare time...
IZ> Yet another way might be compression; suppose that there are only
IZ> (e.g.) 130 "types" of entries; then one can compress the matrix into
IZ> 1GB of data, which should be handled easily by almost any computer.

You need 5 bits per item: it has 16 possible values ([ACTG]{2}), plus
"--".
A database table, to come back to my point, would store these items as
enums. Then you, the user, don't have to worry about the bits per item
in the storage, and you can just use the database.

Of course one does not need to care about any of that - IF the solution
using the database is going to give an answer within the following month.
Which I doubt...

Hope this helps,
Ilya
 

Jie

I really appreciate all the responses here!!

I did not jump in because I wanted to test each proposed solution first
before providing any feedback.

I will definitely try to understand and test all the solutions and let
everybody know what I got.

Thank you guys so much!

Jie

IZ> [A complimentary Cc of this posting was sent to
IZ> Ted Zlatanov

IZ> Good. So what you suggest, is 1000 passes over a 4GB file. Good luck!



IZ> And why do you think this would decrease the load on head seeks?
IZ> Either the data fits in memory (then database is not needed), or it is
IZ> read from disk (which would, IMO, imply the same amount of seeks with
IZ> database as with any other file-based operation).

Look, databases are optimized to store large amounts of data
efficiently. You can always create a hand-tuned program that will do
one task (e.g. transposing a huge text file) well, but you're missing
the big picture: future uses of the data. I really doubt the only thing
anyone will ever want with that data is to transpose it.

IZ> One needs not a database, but a program with built-in caching
IZ> optimized for non-random access to 2-dimensional arrays. AFAIK,
IZ> imagemagick is mostly memory-based. On the other side of the spectrum,
IZ> GIMP is based on tile-caching algorithms; if there were a way to
IZ> easily hook into this algorithm (with no screen display involved), one
IZ> could handle much larger datasets.

You and everyone else are overcomplicating this.

Rewrite the original input file for fixed-length records. Then you just
need to seek to a particular offset to read a record, and the problem
becomes transposing a matrix piece by piece. This is fairly simple.

IZ> Yet another way might be compression; suppose that there are only
IZ> (e.g.) 130 "types" of entries; then one can compress the matrix into
IZ> 1GB of data, which should be handled easily by almost any computer.

You need 5 bits per item: it has 16 possible values ([ACTG]{2}), plus
"--".

A database table, to come back to my point, would store these items as
enums. Then you, the user, don't have to worry about the bits per item
in the storage, and you can just use the database.

Ted
 

Ted Zlatanov

IZ> Words words words. You can't do *all the things* efficiently.
IZ> Databases are optimized for some particular access patterns. I doubt
IZ> that even "good databases" are optimized for *this particular* access
IZ> pattern.

You mean "SELECT column FROM table" (which is what you need in order to
write each line of the transposed file)? I think that's a pretty common
access pattern, and would perform well in most databases.
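
For concreteness, that access pattern with DBI might look something like the
sketch below; the DSN, the table name genotypes, the columns c1..c1000, and
the id ordering column are illustrative assumptions rather than anything the
OP has described:

  #!/usr/bin/perl
  use strict;
  use warnings;
  use DBI;

  # Assumed schema: genotypes(id, c1, ..., c1000); each SELECT of one
  # column produces one line of the transposed file.
  my $dbh = DBI->connect('dbi:mysql:database=snp', 'user', 'password',
                         { RaiseError => 1 });

  open my $out, '>', 'transposed.txt' or die "open: $!";
  for my $i (1 .. 1000) {
      my $sth = $dbh->prepare("SELECT c$i FROM genotypes ORDER BY id");
      $sth->execute;
      my @line;
      while (my ($val) = $sth->fetchrow_array) {
          push @line, $val;
      }
      print {$out} "@line\n";
  }
  close $out;
  $dbh->disconnect;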

Ted
 

Ted Zlatanov

x> For some not very general meanings of "efficiently", sure.

Actually, it's the general meaning that I had in mind. For specific
efficiency and optimization, databases are not always the right tool.
For instance, a database will never perform as quickly as a fixed-offset
file for individual seeks to retrieve a record. Based on the OP's data,
I think a database is the right solution.

x> One of the major gene-chip companies was very proud that in one of
x> their upgrades, they started using a database instead of plain files
x> for storing the data. And then their customers were very pleased
x> when in a following upgrade they abandoned that, and went back to
x> using plain files for the bulk data and using the database just for
x> the small DoE metadata.

I can give you many examples where databases improved a business; I've
worked with large data sets for analysis, and storing them in a
database was ridiculously better than the equivalent work on a flat-file
database. For the OP's data set, I think database storage is a
reasonable, cost-efficient, and maintainable solution.

x> And I really doubt that any single database design is going to support
x> everything that anyone may ever want to do with the data, either.

My point is that the single-task program to transpose a file will be
much less useful than the database setup for general use. You're taking
the argument to an absurd extreme ("everything that anyone may ever
want").

x> I think you are missing the big picture. Once you make a seekable
x> file format, that probably does away with the need to transpose the
x> data in the first place--whatever operation you wanted to do with the
x> transposition can be probably be done on the seekable file instead.

Usually the rewrite is for another program that wants that data as
input, so no, you can't do "whatever" operation on the seekable file
without rewriting the consumer program's input routine.

Ted
 

xhoster

Ted Zlatanov said:
On 11 Aug 2007 01:17:08 GMT (e-mail address removed) wrote:


x> One of the major gene-chip companies was very proud that in one of
x> their upgrades, they started using a database instead of plain files
x> for storing the data. And then their customers were very pleased
x> when in a following upgrade they abandoned that, and went back to
x> using plain files for the bulk data and using the database just for
x> the small DoE metadata.

I can give you many examples where databases improved a business;

Well I should hope you could, and so can I. I just don't think from what
we have seen that this is particularly likely to be one of those cases.

x> And I really doubt that any single database design is going to support
x> everything that anyone may ever want to do with the data, either.

My point is that the single-task program to transpose a file will be
much less useful than the database setup for general use. You're taking
the argument to an absurd extreme ("everything that anyone may ever
want").

While it may or may not be useful for purposes invisible to us, I don't
think it will be very useful for the one desired task which we *do* know
about.

x> I think you are missing the big picture. Once you make a seekable
x> file format, that probably does away with the need to transpose the
x> data in the first place--whatever operation you wanted to do with the
x> transposition can be probably be done on the seekable file instead.

Usually the rewrite is for another program that wants that data as
input, so no, you can't do "whatever" operation on the seekable file
without rewriting the consumer program's input routine.

If the program you are trying to use only accepts one input format and you
are unwilling to change it, then what is putting the data it needs into a
database rather than into the necessary format going to get you? If I have
to make a gigantic transposed file, I'd far rather start with the given text
file than with a database containing that data.

Xho
 

Ilya Zakharevich

[A complimentary Cc of this posting was sent to
Ted Zlatanov
IZ> Words words words. You can't do *all the things* efficiently.
IZ> Databases are optimized for some particular access patterns. I doubt
IZ> that even "good databases" are optimized for *this particular* access
IZ> pattern.

You mean "SELECT column FROM table" (which is what you need in order to
write each line of the transposed file)? I think that's a pretty common
access pattern, and would perform well in most databases.

I have very little experience with databases. However, my hunch
(based on my experience with other types of software) is that they are
not as optimized as an optimistic point of view may imply.

Doing one "SELECT column FROM table" (which results in a megaarray)
may not be *that* slow - given quick enough hardware. Doing it 1000
times would unravel all the missing optimizations in the
implementation of the database.

Hope this helps,
Ilya
 

Ted Zlatanov

IZ> [A complimentary Cc of this posting was sent to
IZ> Ted Zlatanov

IZ> I have very little experience with databases. However, my hunch
IZ> (based on my experience with other types of software) is that they are
IZ> not as optimized as an optimistic point of view may imply.

I respect your experience and the skills you have demonstrated repeatedly in
this newsgroup and in Perl's source code base. I think you may want to
investigate database technology further; software like PostgreSQL and
SQLite has gathered significant traction and respect, and is adaptable
to many technical tasks.

IZ> Doing one "SELECT column FROM table" (which results in a megaarray)
IZ> may not be *that* slow - given quick enough hardware. Doing it 1000
IZ> times would unravel all the missing optimizations in the
IZ> implementation of the database.

You don't have to get all the results of a SELECT at once. You can
fetch each row of the result set, which does not generate a large array.
I don't know the internal mechanics of result set management in the
various databases on the market, nor do I want to; from practical
experience with large result sets under Oracle I can tell you that
performance is quite good there. Do you want benchmarks to be convinced?
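
A sketch of what row-at-a-time fetching looks like with DBI (the DSN and the
table and column names are made up for illustration):

  #!/usr/bin/perl
  use strict;
  use warnings;
  use DBI;

  # Illustrative only: DSN, table and column names are assumptions.
  my $dbh = DBI->connect('dbi:Oracle:orcl', 'user', 'password',
                         { RaiseError => 1 });

  my $sth = $dbh->prepare('SELECT c42 FROM genotypes ORDER BY id');
  $sth->execute;

  # One row per call, rather than one big array in the Perl process.
  while (my $row = $sth->fetchrow_arrayref) {
      print $row->[0], ' ';
  }
  print "\n";

  # The all-at-once alternative, which does build the big array:
  #   my $all = $sth->fetchall_arrayref;

  $dbh->disconnect;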

Ted
 

Ilya Zakharevich

[A complimentary Cc of this posting was sent to
Ted Zlatanov
IZ> Doing one "SELECT column FROM table" (which results in a megaarray)
IZ> may not be *that* slow - given quick enough hardware. Doing it 1000
IZ> times would unravel all the missing optimizations in the
IZ> implementation of the database.

You don't have to get all the results of a SELECT at once. You can
fetch each row of the result set, which does not generate a large array.

Fetching by rows is trivial (with the data at hand). It is fetching by
columns which leads to a very hard-to-optimize pattern of disk access.
I don't know the internal mechanics of result set management in the
various databases on the market, nor do I want to; from practical
experience with large result sets under Oracle I can tell you that
performance is quite good there. Do you want benchmarks to be convinced?

Sure, I would be very much interested.

E.g., on a machine with 512MB memory, create a 1e3 x 1e6 array with 5-bit
entries. Then access all its columns one-by-one, then all its rows
one-by-one. This would mimic the problem at hand.
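
A rough DBI harness for such a test might look like the sketch below; SQLite,
the table layout, the random [ACGT]{2} data, and the scaled-down sizes (so the
script can actually be run while experimenting) are all assumptions, with the
proposed full sizes noted in the comments:

  #!/usr/bin/perl
  use strict;
  use warnings;
  use DBI;
  use Time::HiRes qw(gettimeofday tv_interval);

  # Scaled-down harness; the proposed test uses 1_000 x 1_000_000.
  my $ncols = 100;        # 1_000 in the proposed test
  my $nrows = 10_000;     # 1_000_000 in the proposed test

  my $dbh = DBI->connect('dbi:SQLite:dbname=bench.db', '', '',
                         { RaiseError => 1 });

  # Build and fill the table with random [ACGT]{2} / "--" entries.
  $dbh->do('DROP TABLE IF EXISTS t');
  $dbh->do('CREATE TABLE t (id INTEGER PRIMARY KEY, '
           . join(', ', map { "c$_ TEXT" } 1 .. $ncols) . ')');
  my $ins = $dbh->prepare(
      'INSERT INTO t VALUES (' . join(',', ('?') x ($ncols + 1)) . ')');
  my @alphabet = ('--',
      map { my $a = $_; map { "$a$_" } qw(A C G T) } qw(A C G T));
  $dbh->begin_work;               # one big transaction for the bulk insert
  for my $r (1 .. $nrows) {
      $ins->execute($r, map { $alphabet[rand @alphabet] } 1 .. $ncols);
  }
  $dbh->commit;

  # All columns, one by one (what a transposition would need).
  my $t0 = [gettimeofday];
  for my $c (1 .. $ncols) {
      my $sth = $dbh->prepare("SELECT c$c FROM t ORDER BY id");
      $sth->execute;
      1 while $sth->fetchrow_arrayref;
  }
  printf "columns: %.2fs\n", tv_interval($t0);

  # All rows, one by one.
  $t0 = [gettimeofday];
  my $row = $dbh->prepare('SELECT * FROM t WHERE id = ?');
  for my $r (1 .. $nrows) {
      $row->execute($r);
      $row->fetchrow_arrayref;
  }
  printf "rows:    %.2fs\n", tv_interval($t0);

  $dbh->disconnect;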

Thanks,
Ilya
 

Ilya Zakharevich

[A complimentary Cc of this posting was sent to
Jie
I really appreciate all the responses here!!

I did not jump in because I wanted to test each proposed solution first
before providing any feedback.

*If* you have 512 MB of memory, and *if* each of your 1e6 x 1e3
entries is compressible to 1 byte, then the following solution should
be the best of what has been proposed:

a) Preallocate an array of 334 strings of length 1e6 bytes;

b1) Assign '' to each string;

c1) Read file line-by-line, for each line: split; ignore all results
but the first 333; compress each of 333 results to 1 byte; append
each byte to the corresponding string;

d1) When the file has been read: for each of the corresponding 333 strings:
uncompress each byte, print space-separated.

b2,c2,d2): Likewise with results 334..666;
b3,c3,d3): Likewise with results 667..1000.

This is 3 passes through the original file. If you skip compression,
you can do the same with 10 passes - which should also not be slow with
contemporary hard drives...
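
A minimal Perl sketch of those passes (the file names, the whitespace-separated
[ACGT]{2}/"--" field format, and the one-byte code table are illustrative
assumptions; buffer preallocation is skipped for brevity):

  #!/usr/bin/perl
  use strict;
  use warnings;

  # Sketch only: file names, field format and the code table are assumptions.
  my $infile  = 'matrix.txt';    # ~1e6 lines x 1000 whitespace-separated fields
  my $outfile = 'transposed.txt';
  my $nfields = 1000;
  my $chunk   = 334;             # columns handled per pass (3 passes)

  # One-byte code for each possible entry: "--" plus the 16 two-letter pairs.
  my @codes = ('--', map { my $a = $_; map { "$a$_" } qw(A C G T) } qw(A C G T));
  my %enc; @enc{@codes} = map { chr } 0 .. $#codes;
  my %dec = reverse %enc;

  open my $out, '>', $outfile or die "open $outfile: $!";

  for (my $lo = 0; $lo < $nfields; $lo += $chunk) {
      my $hi = $lo + $chunk - 1;
      $hi = $nfields - 1 if $hi > $nfields - 1;
      my @col = ('') x ($hi - $lo + 1);     # one packed string per column

      open my $in, '<', $infile or die "open $infile: $!";
      while (<$in>) {
          my @f = split;
          # append this line's entry, compressed to one byte, to each
          # column string handled in this pass
          $col[$_ - $lo] .= $enc{ $f[$_] } for $lo .. $hi;
      }
      close $in;

      # each packed column becomes one line of the transposed output
      print {$out} join(' ', map { $dec{$_} } split //, $_), "\n" for @col;
  }
  close $out;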

Hope this helps,
Ilya
 
