File.read(fname) vs. File.read(fname,File.size(fname))

Discussion in 'Ruby' started by Alex Dowad, Apr 30, 2010.

  1. Alex Dowad

    Alex Dowad Guest

    Hi, this is my first post on ruby-forum. Hope this is useful to someone!

    I have learned from experience to avoid reading files using
    File.read(filename)... it gives terrible performance on even moderately
    large files. Reading large files line by line is much faster, and uses
    much less memory. However, there are cases when you do want the entire
    file in a single string. I just discovered that you can do this MUCH
    faster with File.read(filename,File.size(filename))... check this out:

    > File.size(bigfile) # not really that big... just 10 MB
    => 10531519
    > Benchmark.bm do |bm|
    *   bm.report("straight read") { File.read(bigfile) }
    *   bm.report("read w/ size") { File.read(bigfile,File.size(bigfile)) }
    * end
                        user     system      total        real
    straight read  28.875000  18.032000  46.907000 ( 47.812500)
    read w/ size    0.000000   0.031000   0.031000 (  0.031250)

    ...for just a *moderate* 1500x boost in performance.

    I believe that these are the offending lines, in io.c:

    1622: static VALUE
    1623: read_all(rb_io_t *fptr, long siz, VALUE str)
    ... intervening lines omitted...
    1668: siz += BUFSIZ;
    1669: rb_str_resize(str, siz);

    It appears that the buffer is being grown linearly, giving O(n^2)
    performance. If this is the case, switching to an exponential growth
    strategy should give O(n) performance instead; a BIG improvement. Is
    there a good reason why the code is written this way? Could it really be
    an oversight? Seems hard to believe.
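
To make the growth argument concrete, here is a rough Ruby sketch (not the code in io.c) that counts how many bytes would be copied while growing a buffer up to the file size above. It assumes the worst case, where every resize copies the existing contents; a real realloc can sometimes extend in place. The 8192-byte CHUNK is an assumed stand-in for BUFSIZ, and the method names are made up for illustration.

```ruby
# Count the bytes copied while growing a buffer to hold `total` bytes,
# assuming every resize copies the existing contents (worst case).
CHUNK = 8192 # assumed stand-in for BUFSIZ; the real value is platform-dependent

def bytes_copied_linear(total, chunk = CHUNK)
  capacity = chunk
  copied = 0
  while capacity < total
    copied += capacity # existing contents move to the new buffer
    capacity += chunk  # fixed increment, like `siz += BUFSIZ`
  end
  copied
end

def bytes_copied_doubling(total, chunk = CHUNK)
  capacity = chunk
  copied = 0
  while capacity < total
    copied += capacity
    capacity *= 2      # geometric growth
  end
  copied
end

size = 10_531_519 # the file size from the benchmark above
linear = bytes_copied_linear(size)
doubling = bytes_copied_doubling(size)
puts "linear growth:   #{linear} bytes copied"
puts "doubling growth: #{doubling} bytes copied"
```

With a fixed increment the copy count grows with the square of the file size; with doubling it stays below twice the final size.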

    Comments please!

    Alex Dowad
    --
    Posted via http://www.ruby-forum.com/.
    Alex Dowad, Apr 30, 2010
    #1

  2. Roger Pack

    Roger Pack Guest

    >                     user     system      total        real
    > straight read  28.875000  18.032000  46.907000 ( 47.812500)
    > read w/ size    0.000000   0.031000   0.031000 (  0.031250)
    >
    > ...for just a *moderate* 1500x boost in performance.

    ...
    > It appears that the buffer is being grown linearly, giving O(n^2)
    > performance.


    Yeah, this is true. I think it has been fixed in Ruby trunk, though (try
    it out there). Also, if you're on Windows, try binread.
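
For reference, a minimal sketch of what binread does, assuming Ruby 2.0 or later (for String#b). The temp file and its contents are made up for the demonstration. File.binread opens the file in binary mode, so no newline translation is applied and the result is tagged as binary.

```ruby
require "tempfile"

# Write a small file containing Windows-style line endings.
file = Tempfile.new("binread_demo")
file.binmode
file.write("line one\r\nline two\r\n")
file.close

# Read it back in binary mode: bytes come back exactly as stored.
raw = File.binread(file.path)
puts raw.encoding
puts raw.bytesize == File.size(file.path)

file.unlink
```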
    -rp
    Roger Pack, Apr 30, 2010
    #2

  3. Roger Pack

    Roger Pack Guest

    Roger Pack wrote:
    >>                     user     system      total        real
    >> straight read  28.875000  18.032000  46.907000 ( 47.812500)
    >> read w/ size    0.000000   0.031000   0.031000 (  0.031250)
    >>
    >> ...for just a *moderate* 1500x boost in performance.

    > ...
    >> It appears that the buffer is being grown linearly, giving O(n^2)
    >> performance.


    I'm unable to reproduce this except on Windows, so that's where you are,
    I assume?

    Yeah, unfortunately on Windows it does exactly what you described
    (except in trunk, where it has been fixed, though there is still a bit of
    slowdown when you read files in ascii+translation mode; see the last post
    of this thread: http://www.ruby-forum.com/topic/182875#new).

    It does seem that the idea results in some speedup, though:

    linux 1.9.2, 500MB file

                    user     system      total        real
    normal      0.130000   0.810000   0.940000 (  0.938607)
    optimized   0.000000   0.740000   0.740000 (  0.749340)

    windows 1.9.2, 500MB file

                    user     system      total        real
    normal      0.250000   0.671000   0.921000 (  0.921829)
    optimized   0.000000   0.764000   0.764000 (  0.774697)

    Plus, it results in a huge increase in speed for ASCII mode on Windows:

    ruby 1.9.2dev (2010-05-01) [i386-mingw32]
                    user     system      total        real
    normal     11.342000   0.718000  12.060000 ( 12.735092)
    optimized   0.000000   0.437000   0.437000 (  0.446179)


    (maybe there is still some N^2 action going on?)

    I'll file a feature request for it.

    -rp
    Roger Pack, Apr 30, 2010
    #3
  4. Alex Dowad

    Alex Dowad Guest

    Thanks for your reply, Roger!

    > I'm unable to reproduce this except on windows, so that's where you are,
    > I assume?


    Yes. Sorry, I'll make that clear next time I post.

    > ...
    > I'll file a feature request for it.


    Thanks! There's no point in growing a buffer dynamically, when you know
    from the beginning how many bytes you need to store in it.

    Rather than chaining directly to IO.read, File.read could easily pass
    the file size along if no "length" argument is passed in.
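
A rough sketch of that idea in plain Ruby, assuming Ruby 2.1 or later for Tempfile.create. The helper name read_with_size is hypothetical and not part of Ruby; it only wraps the File.read(path, length) call from the first post.

```ruby
require "tempfile"

# Hypothetical helper (not part of Ruby): read a whole file, passing its size
# as the length so the result buffer can be sized once up front.
def read_with_size(path)
  size = File.size(path)
  # For an empty file this is File.read(path, 0), which returns "".
  File.read(path, size)
end

# Demo on a throwaway file.
Tempfile.create("read_with_size") do |f|
  f.binmode
  f.write("x" * 100_000)
  f.flush
  data = read_with_size(f.path)
  puts data.bytesize
end
```

Note that a read with an explicit length returns a binary-encoded string, so this is not a drop-in replacement for File.read(path) when the encoding matters.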

    Alex Dowad
    Alex Dowad, May 1, 2010
    #4
  5. Michel Demazure

    Alex Dowad wrote:
    > Thanks for your reply, Roger!
    >
    >> I'm unable to reproduce this except on windows, so that's where you are,
    >> I assume?

    >
    > Yes. Sorry, I'll make that clear next time I post.
    >

    In my case, on Windows with 1.9.1 and a UTF-8 file, the two commands
    are not equivalent.

    Replacing File.read(file) with File.read(file, File.size(file)) raises
    encoding errors. They behave differently with respect to encodings.
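
A small sketch of the difference, assuming a UTF-8 file; the temp file and its contents are made up for the demonstration. A read with an explicit length returns a string tagged ASCII-8BIT (binary), whereas a plain read tags it with the external encoding. Retagging with force_encoding is one possible workaround, on the assumption that the bytes really are valid UTF-8.

```ruby
require "tempfile"

# Write a small UTF-8 file containing a non-ASCII character.
file = Tempfile.new("encoding_demo")
file.binmode
file.write("caf\u00e9")
file.close

whole = File.read(file.path, encoding: "UTF-8")    # tagged UTF-8
sized = File.read(file.path, File.size(file.path)) # tagged ASCII-8BIT

puts whole.encoding
puts sized.encoding

# Same bytes, different tag: retag a copy so it mixes with other UTF-8 strings.
retagged = sized.dup.force_encoding("UTF-8")
puts retagged == whole

file.unlink
```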
    Michel Demazure, May 1, 2010
    #5