Transpose a large file (>2GB)

Ams Lo

Hi -

What is the fast and Ruby way to transpose a large file (>2GB)?

I can't read the whole file into memory due to the file size.
 
Xavier Noria

Ams Lo said:
What is the fast and Ruby way to transpose a large file (>2GB)?

I can't read the whole file into memory due to the file size.

Interesting.

If the file is

12
3
456

You want

134
2 5
6

?

-- fxn
 
Xavier Noria

Interesting.

If the file is

12
3
456

You want

134
2 5
6

Sorry, I don't know why Mail.app deletes a single leading whitespace
when there is one; the last line should be

space space six

-- fxn
 
Pascal Bourguignon

Ams Lo said:
Exactly.. Sorry, I should have explained it better..

Wrong, the reply should go below the quoted text, like this:

[...]
Exactly.. Sorry, I should have explained it better..

Either have a 64-bit machine and do it naively,

or build a vector of file positions pointing to the beginning of each
line (one O(file size) pass), then write the character pointed to by
each file position (or a space if that line has ended) onto a single
output line, advancing each position unless it has reached its end of
line. Repeat until all positions have reached their end of line.
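
An untested Ruby sketch of that file-position approach (file names are
placeholders; it reads one character per seek, so it shows the
bookkeeping rather than optimal I/O):

File.open("in.txt", "rb") do |input|
  # Pass 1: record the offset and length of every line, O(file size).
  offsets = []
  lengths = []
  pos = 0
  input.each_line do |line|
    offsets << pos
    lengths << line.chomp.length
    pos = input.pos
  end

  # Pass 2: emit one output line per input column, padding lines that
  # have already ended with a space, as described above.
  File.open("out.txt", "w") do |out|
    (lengths.max || 0).times do |column|
      row = offsets.each_index.map do |i|
        if column < lengths[i]
          input.seek(offsets[i] + column)
          input.read(1)
        else
          " "
        end
      end
      out.puts row.join
    end
  end
end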

Note: this may be hard on the disk if the input file has a lot of
lines. In that case, you may want to implement a recursive
transposition algorithm:

to transpose a matrix of dimension (c r) do
  if (c = 1) and (r = 1)
  then return the matrix itself
  else return the matrix made of:
    | transpose of submatrix [0..(c/2)[, [0..(r/2)[   transpose of submatrix [0..(c/2)[, [(r/2)..r[ |
    | transpose of submatrix [(c/2)..c[, [0..(r/2)[   transpose of submatrix [(c/2)..c[, [(r/2)..r[ |

that is, you cut the matrix in four, transpose each part
independently, and combine the four parts, exchanging the parts on the
anti-diagonal.

With this algorithm disk accesses should be more localized, so you may
get better speed on very big files with very big lines.
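
For illustration, an untested Ruby version of the divide-in-four
recursion on an in-memory square matrix whose size is a power of two
(a disk-based version would apply the same recursion to file regions
instead of array slices):

def transpose!(m, row = 0, col = 0, size = m.length)
  return if size == 1
  half = size / 2
  # Transpose the four quadrants independently.
  transpose!(m, row,        col,        half)
  transpose!(m, row,        col + half, half)
  transpose!(m, row + half, col,        half)
  transpose!(m, row + half, col + half, half)
  # Exchange the two quadrants on the anti-diagonal.
  half.times do |i|
    half.times do |j|
      m[row + i][col + half + j], m[row + half + i][col + j] =
        m[row + half + i][col + j], m[row + i][col + half + j]
    end
  end
end

m = (1..4).map { |i| (1..4).map { |j| 10 * i + j } }
transpose!(m)
m.each { |r| p r }   # prints the transposed rows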

--
A: Because it messes up the order in which people normally read text.
Q: Why is top-posting such a bad thing?
A: Top-posting.
Q: What is the most annoying thing on usenet and in e-mail?
----------> http://www.netmeister.org/news/learn2quote.html <-----------
---> http://homepage.ntlworld.com/g.mccaughan/g/remarks/uquote.html <---

__Pascal Bourguignon__ http://www.informatimago.com/
 
Xavier Noria

Ams Lo said:
I can't read the whole file into memory due to the file size.

Save pieces of it to disk.

Since seeking and inserting naively in text files is expensive, I
wondered whether it could pay to use intermediate specialized software
like DBM to build some structure from a line-oriented reading of the
input file, and then use that structure to output the transposition in
a line-oriented way.

Just off the top of my head, without measuring; perhaps it is just
much worse.
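
For what it's worth, an untested sketch of that idea with Ruby's DBM
library (in the stdlib in older Rubies, a separate gem since 3.1),
ignoring padding for ragged lines. Each DBM record accumulates one
output line and the input is only ever read line by line; note that
appending rewrites the whole record each time, which may well confirm
the "much worse" suspicion:

require 'dbm'

DBM.open("transpose_scratch") do |db|
  columns = 0
  File.foreach("in.txt") do |line|
    line.chomp.each_char.with_index do |ch, i|
      # Record i accumulates output line i, one char per input line.
      db[i.to_s] = (db[i.to_s] || "") + ch
      columns = i + 1 if i + 1 > columns
    end
  end

  File.open("out.txt", "w") do |out|
    columns.times { |i| out.puts db[i.to_s] }
  end
end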
 
Avdi Grimm

Ams Lo said:
Hi -

What is the fast and Ruby way to transpose a large file (>2GB)?

fseek to the end of the file. fseek backwards by a nice round number
like 4k. Read in that exact number of bytes. Transpose it. Write
the transposed block to your output file. Then fseek back 2*4k.
Read, transpose, write. fseek back 3*4k. Etc. Rinse and repeat to
the beginning of the file.
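
An untested Ruby rendering of that loop, with the per-block transform
left as a stand-in (a byte reversal here), since the real work depends
on what the blocks contain:

BLOCK = 4096  # "a nice round number like 4k"

File.open("in.dat", "rb") do |input|
  File.open("out.dat", "wb") do |out|
    pos = input.size
    while pos > 0
      read_size = [BLOCK, pos].min  # the first block may be partial
      pos -= read_size
      input.seek(pos)
      block = input.read(read_size)
      out.write(block.reverse)  # stand-in for the per-block transform
    end
  end
end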

--
Avdi

Home: http://avdi.org
Developer Blog: http://avdi.org/devblog/
Twitter: http://twitter.com/avdi
Journal: http://avdi.livejournal.com
 
Avdi Grimm

fseek to the end of the file. fseek backwards by a nice round number
like 4k. [...] Rinse and repeat to the beginning of the file.

Of course, if you're transposing lines, that's trickier.

--
Avdi

Home: http://avdi.org
Developer Blog: http://avdi.org/devblog/
Twitter: http://twitter.com/avdi
Journal: http://avdi.livejournal.com
 
Lionel Bouton

Pascal said:
[...] to transpose a matrix of dimension (c r) do
if (c=1) and (r=1)
then return the matrix itself
else
return the matrix made of:
| transpose of submatrix [0..(c/2)[,[0..(r/2)[ transpose of submatrix [0..(c/2)[,[(r/2)..r[ |
| transpose of submatrix [(c/2)..c[,[0..(r/2)[ transpose of submatrix [(c/2)..c[,[(r/2)..r[ |

that is, you cut the matrix in four, transpose each part
independently, and combine the four parts, exchanging the parts on the
anti-diagonal.

With this algorithm disk accesses should be more localized, so you may
get better speed on very big files with very big lines.

Here's an algorithm that should minimize the number of seeks.

Given that the file doesn't fit in memory, you'll have to fight the
disk cache, as it won't be effective (it would cache data that won't
ever be reused), so you want to load as much useful data into memory
at once as you can, and only useful data.

If your program can allocate an amount <M> of memory, compute the
number <n> of elements that fit in <M> / matrix_line_count. You
probably want to test that with actual Ruby code, as it's difficult to
guess correctly.

Be creative about how you store data in memory to save space: you
should try to store it pre-transposed and already compacted, though
that might be difficult/slow to write (disk seeks are slow enough that
you might still benefit from relatively heavy computations).

On each line, read the first <n> elements (effectively reading the
first <n> columns), then write the first <n> lines of the transposed
matrix. Continue, <n> columns by <n> columns.

With this method you'll always read/write sequential data on disk
where possible, in chunks as large as possible without hitting swap.

You can test the opposite method (reading <n> lines, writing <n>
columns); it may be faster on your hardware, but my gut feeling is
that you should avoid seeking on writes.
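
An untested sketch of the first method for a whitespace-separated
matrix, assuming every line has the same number of fields (N_COLS
plays the role of <n> and would really be derived from the memory
budget; file names are placeholders):

N_COLS = 1000

File.open("out.txt", "w") do |out|
  pass = 0
  loop do
    first = pass * N_COLS
    chunk = nil
    # One sequential read of the whole input per pass.
    File.foreach("in.txt") do |line|
      fields = line.split
      break if first >= fields.length  # past the last column: done
      chunk ||= Array.new([N_COLS, fields.length - first].min) { [] }
      chunk.each_index { |j| chunk[j] << fields[first + j] }
    end
    break if chunk.nil?
    # One sequential write of <n> transposed lines.
    chunk.each { |row| out.puts row.join(" ") }
    pass += 1
  end
end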

Lionel
 
Xavier Noria

Lionel said:
With this method you'll always read/write sequential data on disk
where possible, in chunks as large as possible without hitting swap.

In what sense is it sequential? Don't these approaches require
prepending/appending *rectangular* chunks of text in the output file?
 
Lionel Bouton

Xavier said:
In what sense is it sequential?

It doesn't read or write one byte at a time :) That would be the
result of a naïve transpose algorithm using direct disk access.

Xavier said:
Don't these approaches require prepending/appending *rectangular*
chunks of text in the output file?

There's no way around that if you can't fit the whole matrix in
memory; the only thing you can do is optimize the rectangles to
minimize the number of chunks to read...

Lionel
 
Ken Bloom

Ams Lo said:
Hi -

What is the fast and Ruby way to transpose a large file (>2GB)?

I can't read the whole file into memory due to the file size.

I assume all of the lines in the file are the same length, and the line
length is a multiple of the disk block size. Thus the file is laid out
like

000011112222
333344445555
666677778888
9999aaaabbbb
ccccddddeeee
ffffgggghhhh
iiiijjjjkkkk
llllmmmmnnnn
ooooppppqqqq
rrrrsssstttt
uuuuvvvvwwww
xxxxyyyyzzzz

(where the letters and numbers indicate the block number where the data
is stored). We're going to transpose in chunks, indicated by separators,
so that each chunk has one full disk block per row:


0000|1111|2222
3333|4444|5555
6666|7777|8888
9999|aaaa|bbbb
----+----+----
cccc|dddd|eeee
ffff|gggg|hhhh
iiii|jjjj|kkkk
llll|mmmm|nnnn
----+----+----
oooo|pppp|qqqq
rrrr|ssss|tttt
uuuu|vvvv|wwww
xxxx|yyyy|zzzz

Basically, for each chunk x,y where (x > y), as sketched in Ruby
below:
* read in chunks x,y and y,x (here, that's 8 disk blocks)
* transpose the data in chunk x,y in place in memory
* transpose the data in chunk y,x in place in memory
* write the transposed data from chunk x,y to chunk y,x
* write the transposed data from chunk y,x to chunk x,y
Then, for each chunk x,x (on the diagonal):
* read in chunk x,x
* transpose the data in chunk x,x in memory
* write out the transposed data to chunk x,x

Thus, each disk block is read exactly once and written exactly once.
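
An untested Ruby sketch of that loop, for the simplified case of a
square matrix stored as N*N raw bytes with no newlines (N and CHUNK
are illustrative; CHUNK would really be the disk block size):

N     = 4096
CHUNK = 512   # must divide N

def read_chunk(f, bx, by)
  # Read the CHUNK x CHUNK block at block coordinates (bx, by).
  (0...CHUNK).map do |i|
    f.seek((bx * CHUNK + i) * N + by * CHUNK)
    f.read(CHUNK)
  end
end

def write_chunk(f, bx, by, rows)
  rows.each_with_index do |row, i|
    f.seek((bx * CHUNK + i) * N + by * CHUNK)
    f.write(row)
  end
end

def transpose_block(rows)
  # Transpose a CHUNK x CHUNK block held as an array of strings.
  rows.map(&:bytes).transpose.map { |col| col.pack("C*") }
end

File.open("matrix.dat", "r+b") do |f|
  (N / CHUNK).times do |x|
    # Diagonal chunk: transpose in place.
    write_chunk(f, x, x, transpose_block(read_chunk(f, x, x)))
    (0...x).each do |y|
      # Off-diagonal pair: transpose both, then swap their positions.
      a = read_chunk(f, x, y)
      b = read_chunk(f, y, x)
      write_chunk(f, y, x, transpose_block(a))
      write_chunk(f, x, y, transpose_block(b))
    end
  end
end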

If your lines don't align precisely to disk blocks then you get:

0000|0111|1122
2223|3333|4444
4555|5566|6667
7777|8888|8999
----+----+----
99aa|aaab|bbbb
cccc|cddd|ddee
eeef|ffff|gggg
ghhh|hhii|iiij
----+----+----
jjjj|kkkk|klll
llmm|mmmn|nnnn
oooo|oppp|ppqq
qqqr|rrrr|ssss

You could make a pass through the file to pad things first, but it looks
like you could choose appropriate chunk sizes so that you'd touch each
disk block at most twice anyway, and that would have the same
performance characteristics (ignoring disk seek time) as making a pass
through the file to pad it.

--Ken
 
Ken Bloom

I assume all of the lines in the file are the same length, and the
line length is a multiple of the disk block size. [...] You could make
a pass through the file to pad things first, but it looks like you
could choose appropriate chunk sizes so that you'd touch each disk
block at most twice anyway.

Clarification: the appropriate chunk size would be
(#rows = #columns) > 0.5 * disk block size.

--Ken
 
Lionel Bouton

Ken said:
I assume all of the lines in the file are the same length, and the line
length is a multiple of the disk block size.

From the rest of the algorithm, you assume that the OP wants to
transpose in place, and indeed your algorithm is probably one of the
best in that case (it is more or less Pascal Bourguignon's algorithm
tuned for disk access). I'm not sure this is what the OP wants (it
wasn't specified). He seemed memory constrained but didn't report any
disk space constraint, so he may benefit from an algorithm writing to
another file.

In short, I wonder if it's the best in the general case.

For a data point: a 20000 * 20000 64-bit integer matrix is >2GB but
each of its lines is only 160kB wide. Splitting it into narrower
chunks will take a huge toll on the reading/writing speed compared to
the huge sequential reads (or writes, depending on the choices I left
open) that I proposed. The slow writes (or reads) due to numerous
seeks would even be faster than in your algorithm, as you advise
disk-block-sized accesses of only 512 bytes; with 2GB of RAM you can
write/read ~ 2 * 10^9 / (20000 * 8) = 12500 bytes at a time (12288
bytes is the nearest multiple of 512), so it should be around 24 times
faster (assuming bandwidth inversely proportional to the number of
seeks).

The problem with splitting the matrix into submatrices (though an
elegant solution) is that it doesn't try to minimize disk seeks.
Rectangular portions as high or as wide as the matrix are better
suited to disk access, as they minimize the number of seeks in at
least one of the read/write phases.

Lionel
 
Ken Bloom

Lionel said:
[...] The slow writes (or reads) due to numerous seeks would even be
faster than in your algorithm, as you advise disk-block-sized accesses
of only 512 bytes; with 2GB of RAM you can write/read ~ 2 * 10^9 /
(20000 * 8) = 12500 bytes at a time (12288 bytes is the nearest
multiple of 512), so it should be around 24 times faster (assuming
bandwidth inversely proportional to the number of seeks).

The block size I used was just for illustration. I chose the block
size and line size that I did so that my diagrams wouldn't be
overwhelmingly tall. I also made no assumption about the data size.
You should use your disk's block size -- the size that is fetched
anyway when you do a read(). I don't know of a portable way to get
this number. On my ext3 filesystem, I can get it by running dumpe2fs,
and it's 4KB.

Beyond the block size, you have no guarantee of contiguous blocks in the
file being contiguous on disk anyway, though ext2 and ext3 are pretty
good at avoiding fragmentation.

Your algorithm poses other interesting questions about seeks, reads and
writes. Let's assume that whenever we allocate a new block it's
contiguous with the last one we wrote in the file. Let's read in the
original file contiguously, and write noncontiguously.

We need to read in enough lines to fill all of the blocks we are
outputting. With a 512k block size and 64-bit integers, that's 4
lines. Not too bad memory-wise, but keep this in mind, because if we
read in fewer lines, we'll be writing to the same disk block multiple
times.

Now let's start writing. We write a block at the beginning of line 0,
then we seek to the beginning of line 1 and write another block. Since
Linux allows sparse files it doesn't write anything in the blocks between
these two blocks. They'll get filled in later when we write more blocks
there. The two blocks we just wrote are now next to each other on disk.
Keep going, and you'll find that both your reads and writes are
contiguous. On the other hand, if you try to transpose back, the file
that you just finished writing is the pathological worst case for seek
time.
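
A toy Ruby demonstration of that write pattern (sizes are
illustrative):

LINE_BYTES = 160_000  # one output row of 20000 8-byte integers
BLOCK      = 4096

File.open("sparse.dat", "wb") do |f|
  4.times do |line|
    f.seek(line * LINE_BYTES)  # skip over the not-yet-written hole
    f.write("x" * BLOCK)       # write only this line's first block
  end
end
# On a filesystem with sparse files the holes occupy no disk space
# until later passes fill them in.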

Does this sound about right? In practice it may not be worth optimizing
seek time, since that's very much the realm of the operating system.

--Ken
 
Lionel Bouton

Ken said:
Now let's start writing. We write a block at the beginning of line 0,
then we seek to the beginning of line 1 and write another block. Since
Linux allows sparse files it doesn't write anything in the blocks
between these two blocks.

Yes.

Ken said:
They'll get filled in later when we write more blocks there. The two
blocks we just wrote are now next to each other on disk.

Not necessarily. Filesystems might protect themselves against these
access patterns to avoid fragmentation (ext2/3 already prevent several
processes writing to the disk at the same time from writing to the
same cylinder groups, to avoid fragmentation); in fact, if I
understand the source code of ext2 correctly, this is one of the
things indirect block allocation partially prevents (i.e.
non-contiguous parts are allocated in different parts of the disk, and
the fs tries to put each part next to a contiguous one). I guess you'd
probably end up with the matrix split into large contiguous blocks in
various parts of one or perhaps more cylinder groups on disk.
Ken said:
Keep going, and you'll find that both your reads and writes are
contiguous.

This probably isn't true, and if you want to get down to the real
meat, you'll have to evaluate the seek times and the number of seeks.
On a modern drive, a cylinder probably stores something on the order
of megabytes or tens of megabytes. This means that more than one line
of the matrix fits in a single cylinder: consecutive lines can be
accessed with no seek or a minimal (track-to-track) seek. The only
penalty is that you access a low percentage of what the disk could
read in one revolution.

On the other hand, if you access lines in non-contiguous parts of the
matrix with smaller segments (typical when you transpose by
submatrix), you decrease the amount of data used on each revolution
and incur more seeks. If you are doing tens or hundreds of thousands
of them, this begins to take a huge toll. Indeed, reading 2GB on a
modern SATA disk takes at most 40s (50+ MB/s), so seeking can become
the performance bottleneck as soon as you get into the tens of
thousands of seeks, even quick ones, and a 2GB 64-bit integer matrix
already has a number of lines in that ballpark.
Ken said:
On the other hand, if you try to transpose back, the file that you
just finished writing is the pathological worst case for seek time.

This is not obvious, and it's at least not the case for ext2/3 (unless
your filesystem is already heavily fragmented). I believe the same
allocation heuristics are used in various modern filesystems to avoid
file fragmentation.

Ken said:
Does this sound about right? In practice it may not be worth
optimizing seek time, since that's very much the realm of the
operating system.

Unfortunately, when you deal with data that can't fit in cache, you
pretty much defeat any mechanism the OS uses to optimize disk access,
so you can't rely on it. From the number of lines in the matrix you
can guess the order of magnitude of the number of seeks needed, and
you can guess that algorithms that don't optimize them will get far
worse performance than those that do.

Now what would be real fun is the OP telling us that he uses an SSD
with access times on the order of 0.1ms for storing these matrices :)
Then I'd eat my hat and advise going with the easiest algorithm to
code :)

Lionel
 
Ken Bloom

Lionel said:
Not necessarily. Filesystems might protect themselves against these
access patterns to avoid fragmentation [...] I guess you'd probably
end up with the matrix split into large contiguous blocks in various
parts of one or perhaps more cylinder groups on disk.

In that case, none of the rest of what I said is relevant anymore.
Without reading into the guts of your filesystem, all you can do is guess
and benchmark.
Lionel said:
Unfortunately, when you deal with data that can't fit in cache, you
pretty much defeat any mechanism the OS uses to optimize disk access,
so you can't rely on it.

There's still an I/O scheduler. My suggestion now might actually be to
use your algorithm, making sure to read in several lines in a single
read(2) call, so that the OS's I/O scheduler can figure out the best
way to get the data. I'm not entirely sure about writes: does the
write system call queue up the writes for the I/O scheduler and let
the user process keep going, or does it stop the process until the
write completes (for some definition of completes)?

My point was that you can't rely on the data having any particular
organization with respect to seek times.
Lionel said:
Now what would be real fun is the OP telling us that he uses an SSD
with access times on the order of 0.1ms for storing these matrices :)
Then I'd eat my hat and advise going with the easiest algorithm to
code :)

Yep. More unstated assumptions about the best way to do things.
I hope he finds a solution that's good enough.

--Ken
 
