Alex Dowad
Hi, this is my first post on ruby-forum. Hope this is useful to someone!
I have learned from experience to avoid reading files using
File.read(filename)... it gives terrible performance on even moderately
large files. Reading large files line by line is much faster, and uses
much less memory. However, there are cases when you do want the entire
file in a single string. I just discovered that you can do this MUCH
faster with File.read(filename,File.size(filename))... check this out:
require 'benchmark'

File.size(bigfile) # not really that big... just 10 MB
=> 10531519
Benchmark.bm do |bm|
  bm.report("straight read") { File.read(bigfile) }
  bm.report("read w/ size") { File.read(bigfile, File.size(bigfile)) }
end
                    user     system      total        real
straight read  28.875000  18.032000  46.907000 ( 47.812500)
read w/ size    0.000000   0.031000   0.031000 (  0.031250)
...for just a *moderate* 1500x boost in performance.
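For anyone who wants to try this on their own machine, here is a minimal self-contained sketch of the same comparison. The helper name compare_reads and the generated 10 MB filler file are assumptions made for illustration, not part of the original test. Timings will vary with Ruby version and platform, so the gap reported above may not show up everywhere.

```ruby
require 'benchmark'
require 'tempfile'

# Hypothetical helper: reads the file both ways, prints the timings,
# and returns both strings so the contents can be compared.
def compare_reads(path)
  whole = sized = nil
  Benchmark.bm(14) do |bm|
    bm.report('straight read') { whole = File.read(path) }
    bm.report('read w/ size')  { sized = File.read(path, File.size(path)) }
  end
  [whole, sized]
end

# Assumed test data: 10 MB of filler, roughly the size used above.
Tempfile.create('bigfile') do |f|
  f.write('x' * (10 * 1024 * 1024))
  f.flush
  whole, sized = compare_reads(f.path)
  # String#b returns a binary-encoded copy, so the comparison
  # looks only at the bytes and ignores encoding tags.
  puts "same contents: #{whole.b == sized.b}"
end
```

Both calls should return the same bytes; only the time taken to get them differs.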
I believe that these are the offending lines, in io.c:
1622: static VALUE
1623: read_all(rb_io_t *fptr, long siz, VALUE str)
... intervening lines omitted...
1668: siz += BUFSIZ;
1669: rb_str_resize(str, siz);
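It appears that the buffer is being grown linearly, by a fixed BUFSIZ on
each pass. If every rb_str_resize call has to copy the string into a new
allocation, reading n bytes costs O(n^2) in total. If this is the case,
switching to an exponential growth strategy (for example, doubling the
buffer each time) should give O(n) performance instead; a BIG
improvement. Is there a good reason why the code is written this way?
Could it really be an oversight? Seems hard to believe.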
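To make the O(n^2) versus O(n) argument concrete, here is a rough sketch that counts how many bytes would be copied under each growth strategy. It is a model, not a measurement: it assumes every resize copies the whole existing buffer (a real allocator can sometimes extend in place), and CHUNK = 8192 is only a stand-in for BUFSIZ, whose real value depends on the platform. The method names are made up for this example.

```ruby
# Worst-case model: each resize reallocates and copies the bytes
# already in the buffer. CHUNK is an assumed stand-in for BUFSIZ.
CHUNK = 8192

# Total bytes copied when capacity grows by a fixed CHUNK per resize.
def copied_linear(total_bytes, chunk = CHUNK)
  capacity = chunk
  copied = 0
  while capacity < total_bytes
    copied += capacity # old contents move to the new allocation
    capacity += chunk
  end
  copied
end

# Total bytes copied when capacity doubles on each resize.
def copied_doubling(total_bytes, chunk = CHUNK)
  capacity = chunk
  copied = 0
  while capacity < total_bytes
    copied += capacity
    capacity *= 2
  end
  copied
end

size = 10_531_519 # the file size from the benchmark in this post
puts "linear:   #{copied_linear(size)} bytes copied"
puts "doubling: #{copied_doubling(size)} bytes copied"
```

Under this model the doubling strategy never copies more than about twice the file size in total, while the fixed-increment strategy copies an amount that grows with the square of the file size.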
Comments please!
Alex Dowad