File.read(fname) vs. File.read(fname,File.size(fname))

Alex Dowad · Apr 30, 2010

Hi, this is my first post on ruby-forum. Hope this is useful to someone!

I have learned from experience to avoid reading files using
File.read(filename)... it gives terrible performance on even moderately
large files. Reading large files line by line is much faster, and uses
much less memory. However, there are cases when you do want the entire
file in a single string. I just discovered that you can do this MUCH
faster with File.read(filename,File.size(filename))... check this out:

require 'benchmark'

File.size(bigfile) # not really that big... just 10 MB
# => 10531519
Benchmark.bm do |bm|
  bm.report("straight read") { File.read(bigfile) }
  bm.report("read w/ size") { File.read(bigfile, File.size(bigfile)) }
end
                    user     system      total        real
straight read  28.875000  18.032000  46.907000 ( 47.812500)
read w/ size    0.000000   0.031000   0.031000 (  0.031250)

...for just a *moderate* 1500x boost in performance.

I believe that these are the offending lines, in io.c:

1622: static VALUE
1623: read_all(rb_io_t *fptr, long siz, VALUE str)
... intervening lines omitted...
1668: siz += BUFSIZ;
1669: rb_str_resize(str, siz);

It appears that the buffer is being grown linearly, giving O(n^2)
performance. If this is the case, switching to an exponential growth
strategy should give O(n) performance instead; a BIG improvement. Is
there a good reason why the code is written this way? Could it really be
an oversight? Seems hard to believe.
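A rough way to see the difference between the two growth strategies is to count how many bytes get copied while a buffer grows to hold the file. This is only a sketch: it assumes the worst case where every resize copies the whole buffer (a real realloc can often avoid that), and it uses 8192 as a stand-in for BUFSIZ, whose actual value depends on the platform. The helper name bytes_copied is made up for this example.

```ruby
# Count the bytes copied while growing a buffer until it can hold n bytes,
# assuming (worst case) that every resize copies the current contents.
# The block decides the next capacity from the current one.
def bytes_copied(n, chunk)
  capacity = 0
  copied = 0
  while capacity < n
    copied += capacity                 # a resize copies what is already stored
    capacity = yield(capacity, chunk)  # pick the next capacity
  end
  copied
end

CHUNK = 8192          # assumed stand-in for BUFSIZ
N     = 10_531_519    # size of the 10 MB test file from the benchmark

linear   = bytes_copied(N, CHUNK) { |cap, chunk| cap + chunk }
doubling = bytes_copied(N, CHUNK) { |cap, chunk| cap.zero? ? chunk : cap * 2 }

puts "linear growth:   #{linear} bytes copied"
puts "doubling growth: #{doubling} bytes copied"
```

With linear growth the copied total grows with the square of the file size, while with doubling it stays below twice the file size.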

Comments please!

Alex Dowad

Roger Pack · Apr 30, 2010

                    user     system      total        real
straight read  28.875000  18.032000  46.907000 ( 47.812500)
read w/ size    0.000000   0.031000   0.031000 (  0.031250)

...for just a *moderate* 1500x boost in performance. ...
It appears that the buffer is being grown linearly, giving O(n^2)
performance.

Yeah, this is true. I think it has been fixed in Ruby trunk, though (try
it out there). Also, if you're on Windows, try binread.
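For reference, a minimal sketch of what a binread call looks like. The sample file and its contents are made up for the example; File.binread reads the bytes as they are on disk, with no newline translation, and returns a binary (ASCII-8BIT) string.

```ruby
require "tempfile"

# Hypothetical sample file with Windows-style line endings.
sample = Tempfile.new("binread-demo")
sample.binmode
sample.write("line one\r\nline two\r\n")
sample.close

# Read the whole file in binary mode: no text-mode translation is applied.
data = File.binread(sample.path)

puts data.bytesize
puts data.encoding
sample.unlink
```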
-rp

Roger Pack · Apr 30, 2010

I'm unable to reproduce this except on windows, so that's where you are,
I assume?

Yeah, unfortunately on Windows it does exactly what you described. It
has been fixed in trunk, although there is still a bit of slowdown when
you read files in ASCII + translation mode (see the last post of this
thread: http://www.ruby-forum.com/topic/182875#new).

It does seem that the idea results in some speedup, though:

linux 1.9.2, 500MB file

                user     system      total        real
normal      0.130000   0.810000   0.940000 (  0.938607)
optimized   0.000000   0.740000   0.740000 (  0.749340)

windows 1.9.2, 500MB file

                user     system      total        real
normal      0.250000   0.671000   0.921000 (  0.921829)
optimized   0.000000   0.764000   0.764000 (  0.774697)

Plus, it results in a huge increase in speed for ASCII mode on Windows:

ruby 1.9.2dev (2010-05-01) [i386-mingw32]
                user     system      total        real
normal     11.342000   0.718000  12.060000 ( 12.735092)
optimized   0.000000   0.437000   0.437000 (  0.446179)

(maybe there is still some N^2 action going on?)

I'll file a feature request for it.

-rp

Alex Dowad · May 1, 2010

Thanks for your reply, Roger!

I'm unable to reproduce this except on windows, so that's where you are,
I assume?

Yes. Sorry, I'll make that clear next time I post.

...
I'll file a feature request for it.

Thanks! There's no point in growing a buffer dynamically, when you know
from the beginning how many bytes you need to store in it.

Rather than chaining directly to IO.read, File.read could easily pass
the file size along if no "length" argument is passed in.
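As a user-level sketch of the same idea (not the change to File.read itself), a small helper can open the file and ask for exactly its size in one read. The name read_whole is hypothetical, and opening in "rb" mode is an assumption made here so that the byte count from the file system matches what the read returns.

```ruby
# Hypothetical helper: read a whole file with a single sized read,
# so the buffer can be allocated once instead of being grown.
def read_whole(path)
  File.open(path, "rb") do |f|
    # f.size is the file's size in bytes; read(0) on an empty file
    # returns "", so no nil check is needed here.
    f.read(f.size)
  end
end
```

Usage would be along the lines of data = read_whole(bigfile). Note that the result is a binary string, so text may need its encoding set afterwards.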

Alex Dowad

Michel Demazure · May 1, 2010

Alex said:
Thanks for your reply, Roger!

Yes. Sorry, I'll make that clear next time I post.

In my case, on Windows with 1.9.1 and a UTF-8 file, the two commands
are not equivalent.

Replacing File.read(file) with File.read(file, File.size(file)) raises
encoding errors. The two calls behave differently with respect to
encodings.
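A small sketch of the difference, using a made-up sample file. It relies on the fact that a read with an explicit length returns a binary (ASCII-8BIT) string, while a plain read tags the string with the external encoding. Calling force_encoding on the result is one possible workaround, assuming the file really is valid UTF-8.

```ruby
require "tempfile"

# Hypothetical UTF-8 sample file containing one non-ASCII character.
sample = Tempfile.new("utf8-sample")
sample.binmode
sample.write("h\u00E9llo")
sample.close

text = File.read(sample.path, encoding: "UTF-8")      # plain read, tagged UTF-8
raw  = File.read(sample.path, File.size(sample.path)) # sized read, binary string

puts text.encoding
puts raw.encoding

# Relabel the bytes as UTF-8 (no transcoding happens) so the two compare equal.
fixed = raw.dup.force_encoding("UTF-8")
puts fixed == text
sample.unlink
```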
