Optimizing writes of large files

Yoann M.

Hello,
I have data to process and write into files progressively. The data
files end up very large, but I append to them in small strings. I
suppose buffering the strings before appending them to the file would
be faster. I don't need the files to be written before the end of the
whole process (i.e. I don't use their content).

I've searched for info about how File buffers its data, but it seems
none of this can be configured; did I miss something?
My first idea was to buffer everything myself, appending lines to a
string or an array of strings, and writing once I've accumulated a big
enough amount of data. But if File uses a buffer anyway, I suppose that
would be a waste of time?
Do you have any advice for optimizing the writing of large files?
Thanks!
 
Markus Schirp

Hi,

Ruby, glibc, the kernel, etc. are all doing buffering already.
There is usually no need for explicit buffering from Ruby.

You can test this for yourself: write the same string a million
times in a loop. Not every write triggers a disk transaction.
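
For example, a minimal sketch of that test (the path and line content
are arbitrary placeholders):

require 'benchmark'

# A million small writes through one handle. Ruby's internal IO buffer
# and the kernel's page cache batch these into far fewer actual disk
# operations, which is why this finishes quickly.
elapsed = Benchmark.realtime do
  File.open("/tmp/buffer_test.out", "w") do |f|
    1_000_000.times { f.write("a small line of data\n") }
  end
end
puts "wrote 1,000,000 lines in #{elapsed.round(2)}s"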

Regards,

Markus
 
Jeremy Bopp

As mentioned, the file writes are already being buffered by lower
layers; however, if you are closing and reopening the files throughout
your processing, those buffers aren't helping you much. Try to ensure
that you open each file only once and keep that file reference around
until you know you're permanently done writing to it. Unless you have a
large number of files to open, you shouldn't have to worry about
hitting the limit on concurrently open files.
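
For illustration, a sketch of the two patterns (file names and workload
are made up):

lines = Array.new(100_000) { "some data\n" }  # placeholder workload

# Slow: reopening per append pays an open/close syscall pair and a
# buffer flush for every single write.
lines.each do |line|
  File.open("slow.log", "a") { |f| f.write(line) }
end

# Better: open once and keep the handle; the IO buffer can then batch
# many small writes into fewer system calls.
File.open("fast.log", "a") do |f|
  lines.each { |line| f.write(line) }
end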

-Jeremy
 
Yoann M.

You're right, doing the buffering myself does not make it faster. For
writing 10 million lines, with an array of strings, one string, and no
homemade buffer (code is attached):
Buffer array : 11.141s
Buffer string : 9.748s
No buffer : 10.344s

Don't you think using more RAM before writing to disk could make the
process faster? I thought so, and then I'd like to tell File how much
RAM it can use to speed things up, because I can spare a lot of RAM.

Regards

Attachments:
http://www.ruby-forum.com/attachment/6191/test_write.rb
 
Robert Klemme

No, more does not help more. With modern operating systems you never
write directly through to the disk.* The OS is buffering your writes
anyway. Even worse: using up a lot of memory in the process to hold the
whole file can make your program slower because of the overhead of
memory allocation. In the worst case your program gets paged out to
disk. Don't worry too much about this.

* Note that there are circumstances where you do write directly to disk
(or rather, where the write operation returns only after the disk has
acknowledged the data). This is sometimes called "direct IO". It only
makes sense in special circumstances (some RDBMSs can do it).
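
For contrast, a sketch of forcing write-through with IO#fsync, which is
exactly what you would *not* want for this workload:

File.open("synced.out", "w") do |f|
  f.write("important record\n")
  f.fsync  # returns only after the kernel reports the data is on the device
end

Databases pay this cost per transaction for durability; for a bulk
write like yours, the buffered default is the right choice.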

You can make your life easier by using Benchmark for this:

require 'benchmark'

Benchmark.bm 20 do |x|
  x.report "a test" do
    # ...
  end

  x.report "another test" do
    # ...
  end
end
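
Applied to the test from this thread, a sketch might look like this
(the line count, flush threshold, and file names are placeholders):

require 'benchmark'

N = 1_000_000
LINE = "a small string to append\n"

Benchmark.bm 15 do |x|
  x.report "no buffer" do
    File.open("no_buffer.out", "w") do |f|
      N.times { f.write(LINE) }
    end
  end

  x.report "string buffer" do
    File.open("string_buffer.out", "w") do |f|
      buf = String.new
      N.times do
        buf << LINE
        next unless buf.size >= 64 * 1024  # flush every 64 KiB (arbitrary)
        f.write(buf)
        buf.clear
      end
      f.write(buf)  # write out whatever is left in the buffer
    end
  end
end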

Kind regards

robert
 
Markus Schirp

IMHO the primary speed bottleneck is the disk drive itself and possible
file-system fragmentation.

More RAM just lets the operating system schedule the writes as
optimally as possible. The effect of drastically more RAM won't be more
than 1-5%.

With a ramdisk this changes a lot ;) (a rough sketch follows below)

But if you are worried about file persistence you should not do that *g*

I don't know the details of your use case, but there are other
possibilities:
* writing directly to the block device, bypassing the file system
* mirroring ramdisk writes to other machines for persistence
* ?
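
A rough, Linux-specific sketch of the ramdisk comparison (on most
distributions /dev/shm is a tmpfs mount; /var/tmp is usually on disk,
so adjust the paths for your system):

require 'benchmark'

LINE = "a small string to append\n"

# Writes under tmpfs never touch the disk, so the gap between the two
# timings is roughly the cost of the drive itself.
Benchmark.bm 10 do |x|
  { "ramdisk" => "/dev/shm/test.out", "disk" => "/var/tmp/test.out" }.each do |label, path|
    x.report label do
      File.open(path, "w") { |f| 1_000_000.times { f.write(LINE) } }
    end
  end
end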

--
Markus Schirp
Phone: 049 201 / 647 59 63
Mobile: 049 178 / 529 91 42
Web: www.seonic.net
Email: (e-mail address removed)
Seonic IT-Systems GbR
Anton Shatalov & Markus Schirp
Walterhohmannstraße 1
D-45141 Essen
 
