Fast alternatives to "File" and "IO" for large numbers of files?

Discussion in 'Ruby' started by Philip Rhoades, Feb 24, 2011.

  1. People,

    I have a script that does:

    - statistical processing of data in 50x32x20 (32,000) large input files

    - writes a small text file (22 lines with one or more columns of numbers)
    for each input file

    - reads all the small files back in again for final processing.

    Profiling shows that IO is taking up more than 60% of the time - short of
    making fewer, larger files for the data (which is inconvenient for random
    viewing/processing of individual results), are there alternatives to
    using the "File" and "IO" classes that would be faster?

    Thanks,

    Phil.
     
    Philip Rhoades, Feb 24, 2011
    #1

  2. Re: Fast alternatives to "File" and "IO" for large numbers of files?


    Hi, could you be more specific about what you do with the small files -
    do you read/write them line by line or as whole files? Rapid file
    operations can be slow anyway because of filesystem overhead, so try to
    do fewer of them; for example, writing a file as one single string may
    make good use of the IO cache. Or you could do all the file reads/writes
    in a separate thread, so that IO does not hold up your non-IO
    calculations, if you have some.
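    As a rough illustration of the single-write idea (the file name and
    data shape below are invented, not taken from Phil's script):

    # Hypothetical example data: 22 rows of numbers per output file
    rows = Array.new(22) { Array.new(3) { rand(100) } }

    # Build the whole file contents in memory and write it with one call,
    # instead of issuing one write per line
    content = rows.map { |r| r.join(" ") }.join("\n") << "\n"
    File.open("stats_0001.txt", "w") { |f| f.write(content) }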

     
    pp, Feb 24, 2011
    #2

  3. Re: Fast alternatives to "File" and "IO" for large numbers of files?


    I can think of two approaches here.

    First, you can write one large file (perhaps creating it in memory
    first) and then split it afterwards.

    Second, if you're on *nix, you can write your output files to a tmpfs.

    Both should reduce the number of seeks and improve performance.
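    A minimal sketch of the first approach (the marker format and file
    names are hypothetical):

    require "stringio"

    # Accumulate everything in memory first...
    buffer = StringIO.new
    3.times { |i| buffer << "=== result #{i} ===\n" << "data for result #{i}\n" }

    # ...then hit the disk once
    File.open("all_results.txt", "w") { |f| f.write(buffer.string) }

    # Splitting it back into per-result chunks later
    chunks = File.read("all_results.txt").split(/^=== result \d+ ===\n/).reject(&:empty?)

    For the second approach nothing changes in the Ruby code; you would
    just point the output paths at a tmpfs mount such as /dev/shm.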

    --
    WBR, Peter Zotov.
     
    Peter Zotov, Feb 24, 2011
    #3
  4. Re: Fast alternatives to "File" and "IO" for large numbers of files?


    I think whatever you do, as long as you do not get rid of the IO or
    improve the IO access patterns, your performance gains will only be
    marginal. Even a C extension would not help you if you stick with
    the same IO patterns.

    We should probably learn more about the nature of your processing but
    considering that you only write 32,000 * 22 * 80 (estimated line
    length) = 56,320,000 bytes (~ 54MB) NOT writing those small files is
    probably an option. Burning 54MB of memory in a structure suitable
    for later processing (i.e. you do not need to parse all those small
    files) is a small price compared to the large amount of IO you need to
    do to read that data back again (plus the CPU cycles for parsing).
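    A minimal sketch of keeping everything in memory instead (the keys
    and values here are invented for illustration):

    # One in-memory hash instead of 32,000 small files
    results = {}

    # (processing loop; input names invented)
    ["run_01", "run_02"].each do |name|
      results[name] = [1.23, 4.56, 7.89]  # whatever the 22 lines would hold
    end

    # The final pass reads straight from the hash -- nothing to re-parse
    results.each { |name, values| puts "#{name}: #{values.inject(:+)}" }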

    The second best option would be to keep the data in memory as before
    but still write those small files if you really need them (for example
    for later processing). In this case you could put this in a separate
    thread so your main processing can continue on the state in memory.
    That way you'll gain another improvement.
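    Something like this queue-based writer would be one way to do it
    (all names here are invented):

    require "thread"  # provides Queue on older Rubies

    queue  = Queue.new
    writer = Thread.new do
      # Pop jobs until the nil sentinel arrives
      while (job = queue.pop)
        path, content = job
        File.open(path, "w") { |f| f.write(content) }
      end
    end

    # The main thread keeps computing and just hands files to the writer
    queue << ["result_01.txt", "1 2 3\n"]
    queue << ["result_02.txt", "4 5 6\n"]

    queue << nil  # no more files coming
    writer.join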

    For reading the large files I would use at most two threads because
    I assume they all reside on the same filesystem. With two threads one
    can do calculations (e.g. parsing, aggregating) while the other thread
    is doing IO. If you have more threads you'll likely see a slowdown
    because you may introduce too many seeks etc.
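    A sketch of that two-thread reading pattern (the glob is
    hypothetical):

    require "thread"  # provides Queue on older Rubies

    raw = Queue.new

    # One thread does nothing but IO...
    reader = Thread.new do
      Dir["data/*.dat"].each { |path| raw << File.read(path) }
      raw << nil  # sentinel: no more files
    end

    # ...while this thread parses/aggregates whatever has arrived
    total_lines = 0
    while (text = raw.pop)
      total_lines += text.lines.count
    end
    reader.join
    puts total_lines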

    Kind regards

    robert

    --
    remember.guy do |as, often| as.you_can - without end
    http://blog.rubybestpractices.com/
     
    Robert Klemme, Feb 24, 2011
    #4
  5. Re: Fast alternatives to "File" and "IO" for large numbers of files?

    If you read in all the data files and build a single Ruby data structure
    which contains all the data you're interested in, you can dump it out
    like this:

    File.open("foo.msh","wb") { |f| Marshal.dump(myobj, f) }

    And you can reload it in another program like this:

    myobj = File.open("foo.msh","rb") { |f| Marshal.load(f) }

    This is *very* fast.

    --
    Posted via http://www.ruby-forum.com/.
     
    Brian Candler, Feb 24, 2011
    #5
  6. Re: Fast alternatives to "File" and "IO" for large numbers of files?

    People,

    Thanks to all who responded - I have concatenated the replies for ease
    of response:


    On 2011-02-24 19:15, pp wrote:

    > Hi, could you be more specific about what you do with the small
    > files - do you read/write them line by line or as whole files?
    > Rapid file operations can be slow anyway because of filesystem
    > overhead, so try to do fewer of them; for example, writing a file
    > as one single string may make good use of the IO cache. Or you
    > could do all the file reads/writes in a separate thread, so that
    > IO does not hold up your non-IO calculations, if you have some.



    Each individual small file is written in one go, i.e. the file is
    opened, written to and closed - there is no re-opening and further
    writing. See below for the current approach.


    On 2011-02-24 19:19, Peter Zotov wrote:
    >
    > I can think of two approaches here.
    >
    > First, you can write one large file (perhaps creating it in memory
    > first) and then split it afterwards.
    >
    > Second, if you're on *nix, you can write your output files to a
    > tmpfs.
    >
    > Both should reduce the number of seeks and improve performance.



    After staying up all night, I eventually settled on a hash table
    written out via YAML to ONE very large file. I need a human-friendly
    form for spot-checking the statistical calculations, so I used a hash
    whose keys let me find a particular calculation in the big file the
    same way I would have found it in the similarly named subdirectories.
    I haven't actually implemented this on the full system yet, so it will
    be interesting to see whether Vim can handle opening a 32,000 x 23
    line file (bigger, actually, if an individual small file is bigger
    than a 23x1 array).
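    A rough sketch of that layout (the key naming is invented for
    illustration):

    require "yaml"

    # Keys mirror the old subdirectory/file naming, so a result is as easy
    # to find in the one big file as it was on disk
    stats = {
      "run_01/gen_05/rep_03" => [0.12, 0.34, 0.56],
      "run_01/gen_05/rep_04" => [0.23, 0.45, 0.67],
    }

    File.open("all_stats.yml", "w") { |f| YAML.dump(stats, f) }

    # Human-readable on disk, and loadable again for the final pass
    stats = YAML.load_file("all_stats.yml")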


    On 2011-02-24 19:52, Robert Klemme wrote:
    >
    > I think whatever you do, as long as you do not get rid of the IO or
    > improve the IO access patterns, your performance gains will only be
    > marginal. Even a C extension would not help you if you stick with
    > the same IO patterns.



    Right.


    > We should probably learn more about the nature of your processing
    > but considering that you only write 32,000 * 22 * 80 (estimated line
    > length) = 56,320,000 bytes (~ 54MB) NOT writing those small files is
    > probably an option. Burning 54MB of memory in a structure suitable
    > for later processing (i.e. you do not need to parse all those small
    > files) is a small price compared to the large amount of IO you need
    > to do to read that data back again (plus the CPU cycles for
    > parsing).



    Yep - I came to that conclusion too and went for one big hash table and
    one file.


    > The second best option would be to keep the data in memory as before
    > but still write those small files if you really need them (for
    > example for later processing). In this case you could put this in a
    > separate thread so your main processing can continue on the state in
    > memory. That way you'll gain another improvement.



    Interesting idea - I'm not sure how to actually implement that, but I
    will see how the hash table/one file approach goes first.


    > For reading the large files I would use at most two threads
    > because I assume they all reside on the same filesystem. With two
    > threads one can do calculations (e.g. parsing, aggregating) while the
    > other thread is doing IO. If you have more threads you'll likely see
    > a slowdown because you may introduce too many seeks etc.



    OK, this idea might help for the next stage.


    On 2011-02-24 20:02, Brian Candler wrote:
    > If you read in all the data files and build a single Ruby data
    > structure which contains all the data you're interested in, you can
    > dump it out like this:
    >
    > File.open("foo.msh","wb") {|f| Marshal.dump(myobj, f) }



    I did read up about this stuff but I have to have human readable files.


    > And you can reload it in another program like this:
    >
    > myobj = File.open("foo.msh","rb") {|f| Marshal.load(f) }
    >
    > This is *very* fast.



    I might check this out as an exercise!

    Thanks to all again!

    Phil.
    --
    Philip Rhoades

    GPO Box 3411
    Sydney NSW 2001
    Australia
    E-mail:
     
    Philip Rhoades, Feb 26, 2011
    #6
  7. Re: Fast alternatives to "File" and "IO" for large numbers of files?

    Philip Rhoades wrote in post #984112:
    >> If you read in all the data files and build a single Ruby data
    >> structure which contains all the data you're interested in, you can
    >> dump it out like this:
    >>
    >> File.open("foo.msh","wb") {|f| Marshal.dump(myobj, f) }

    >
    >
    > I did read up about this stuff but I have to have human readable files.


    You can use YAML.dump and .load too. Not as fast, and rather buggy(*),
    but it would do the job.

    (*) There are various strings which ruby's default YAML implementation
    (syck) cannot serialize and deserialize back to the same string. These
    might have been fixed, or you could use a different YAML implementation.
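    A minimal round-trip in the same shape as the Marshal example above
    (the object contents are arbitrary):

    require "yaml"

    myobj = { "a" => [1, 2, 3], "b" => "some text" }

    File.open("foo.yml", "w") { |f| YAML.dump(myobj, f) }
    reloaded = YAML.load_file("foo.yml")

    # Cheap guard against the round-trip bugs mentioned above
    raise "YAML round-trip mismatch" unless reloaded == myobj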

    --
    Posted via http://www.ruby-forum.com/.
     
    Brian Candler, Feb 27, 2011
    #7
