Optimizing writes of large files

Yoann M.

Hello,
I have data to process and write into files progressively. The data
files end up very large, but I append to them in small strings. I
suppose buffering the strings before appending them to the file would
be faster. I don't need the files to be written before the end of the
whole process (i.e. I don't use their content).

I've searched for info about how File buffers its data, but it seems
none of this can be configured; did I miss something?
My first idea was to buffer everything myself, appending lines to a
string or an array of strings, and writing once I've accumulated a big
enough amount of data. But if File uses a buffer anyway, I suppose that
would be a waste of time?
Do you have any advice for optimizing the writing of large files?
Thanks!
 
Markus Schirp

Hi,

Ruby, glibc, the kernel, etc. are all doing buffering already.
There is usually no need for explicit buffering from Ruby.

You can test this for yourself: write the same string a million
times in a loop. Not every write triggers a disk transaction.
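
For example, a minimal sketch of that test (the path and line content
are arbitrary placeholders):

require 'benchmark'

# A million small writes through one handle. Ruby's internal IO buffer
# and the kernel's page cache batch these into far fewer actual disk
# operations, which is why this finishes quickly.
elapsed = Benchmark.realtime do
  File.open("/tmp/buffer_test.out", "w") do |f|
    1_000_000.times { f.write("a small line of data\n") }
  end
end
puts "wrote 1,000,000 lines in #{elapsed.round(2)}s"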

Regards,

Markus
 
Jeremy Bopp

As mentioned, the file writes are already being buffered by lower
layers; however, if you are closing and reopening the files throughout
your processing, those buffers aren't helping you much. Try to ensure
that you open each file only once and keep that file reference around
until you know you're permanently done writing to it. Unless you have a
large number of files to open, you shouldn't have to worry about
hitting the limit on concurrently open files.
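
For illustration, a sketch of the two patterns (file names and workload
are made up):

lines = Array.new(100_000) { "some data\n" }  # placeholder workload

# Slow: reopening per append pays an open/close syscall pair and a
# buffer flush for every single write.
lines.each do |line|
  File.open("slow.log", "a") { |f| f.write(line) }
end

# Better: open once and keep the handle; the IO buffer can then batch
# many small writes into fewer system calls.
File.open("fast.log", "a") do |f|
  lines.each { |line| f.write(line) }
end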

-Jeremy
 
Yoann M.

You're right, doing the buffering myself does not make it faster. For
writing 10 million lines, with an array of strings, one string, and no
homemade buffer (code is attached):
Buffer array : 11.141s
Buffer string : 9.748s
No buffer : 10.344s

Don't you think using more RAM before writing to disk could make the
process faster? I thought so, and then I'd like to tell File how much
RAM it can use to speed things up, because I can spare a lot of RAM.

Regards

Attachments:
http://www.ruby-forum.com/attachment/6191/test_write.rb
 
Robert Klemme

No, more does not help more. With modern operating systems you never
write directly through to the disk.* The OS is buffering your writes
anyway. Even worse: using up a lot of memory in the process to hold the
whole file can make your program slower because of the overhead of
memory allocation. In the worst case your program gets paged out to
disk. Don't worry too much about this.

* Note that there are circumstances where you do write directly to disk
(or rather, where the write operation returns only after the disk has
acknowledged the data). This is sometimes called "direct IO". It only
makes sense in special circumstances (some RDBMSs can do it).
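
For contrast, a sketch of forcing write-through with IO#fsync, which is
exactly what you would *not* want for this workload:

File.open("synced.out", "w") do |f|
  f.write("important record\n")
  f.fsync  # returns only after the kernel reports the data is on the device
end

Databases pay this cost per transaction for durability; for a bulk
write like yours, the buffered default is the right choice.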

You can make your life easier by using Benchmark for this:

require 'benchmark'

Benchmark.bm 20 do |x|
  x.report "a test" do
    # ...
  end

  x.report "another test" do
    # ...
  end
end
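
Applied to the test from this thread, a sketch might look like this
(the line count, flush threshold, and file names are placeholders):

require 'benchmark'

N = 1_000_000
LINE = "a small string to append\n"

Benchmark.bm 15 do |x|
  x.report "no buffer" do
    File.open("no_buffer.out", "w") do |f|
      N.times { f.write(LINE) }
    end
  end

  x.report "string buffer" do
    File.open("string_buffer.out", "w") do |f|
      buf = String.new
      N.times do
        buf << LINE
        next unless buf.size >= 64 * 1024  # flush every 64 KiB (arbitrary)
        f.write(buf)
        buf.clear
      end
      f.write(buf)  # write out whatever is left in the buffer
    end
  end
end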

Kind regards

robert
 
Markus Schirp

IMHO the primary speed bottleneck is the disk drive itself and possible
file-system fragmentation.

More RAM just lets the operating system schedule the writes as
optimally as possible. The effect of drastically more RAM won't be more
than 1-5%.

With a ramdisk this changes a lot ;) (a rough sketch follows below)

But if you are worried about file persistence you should not do that *g*

I don't know the details of your use case, but there are other
possibilities:
* writing directly to the block device, bypassing the file system
* mirroring ramdisk writes to other machines for persistence
* ?
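
A rough, Linux-specific sketch of the ramdisk comparison (on most
distributions /dev/shm is a tmpfs mount; /var/tmp is usually on disk,
so adjust the paths for your system):

require 'benchmark'

LINE = "a small string to append\n"

# Writes under tmpfs never touch the disk, so the gap between the two
# timings is roughly the cost of the drive itself.
Benchmark.bm 10 do |x|
  { "ramdisk" => "/dev/shm/test.out", "disk" => "/var/tmp/test.out" }.each do |label, path|
    x.report label do
      File.open(path, "w") { |f| 1_000_000.times { f.write(LINE) } }
    end
  end
end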

--
Markus Schirp
Phone: 049 201 / 647 59 63
Mobile: 049 178 / 529 91 42
Web: www.seonic.net
Email: (e-mail address removed)
Seonic IT-Systems GbR
Anton Shatalov & Markus Schirp
Walterhohmannstraße 1
D-45141 Essen
 
