Removing parts from a file

Discussion in 'Ruby' started by Thomas Dutch, Dec 18, 2005.

  1. Thomas Dutch

    Thomas Dutch Guest

    Hello,

    I'm relatively new to Ruby and I have a question:

    Is it possible to remove one or more lines from a file, without reading
    the whole file and writing it away again? This because I'll have to do
    this with files of 1 gigabyte and larger... Is there a high performance
    solution for this?

    Thank you!

    --
    Posted via http://www.ruby-forum.com/.
    Thomas Dutch, Dec 18, 2005
    #1

  2. Harpo

    Harpo Guest

    Thomas Dutch wrote:

    > Hello,
    >
    > I'm relatively new to Ruby and I have a question:
    >
    > Is it possible to remove one or more lines from a file, without
    > reading the whole file and writing it away again? This because I'll
    > have to do this with files of 1 gigabyte and larger... Is there a high
    > performance solution for this?
    >
    > Thank you!


    Put like that, I don't think it is possible: the limitation comes from
    the file structure, not from the language.
    It depends on the programs that read the file. Can they be changed to
    accept lines that begin with something that says 'skip me', such as a
    '#'?
    Harpo, Dec 18, 2005
    #2

  3. Thomas Dutch

    Guest

    On Dec 18, 2005, at 12:32 PM, Thomas Dutch wrote:
    > Is it possible to remove one or more lines from a file, without
    > reading the whole file and writing it away again? This because I'll
    > have to do this with files of 1 gigabyte and larger... Is there a high
    > performance solution for this?


    In general, no. I'm answering from the perspective of a typical
    Unix/POSIX file system. You can truncate a file to discard some
    number of trailing bytes using the truncate or ftruncate system call.
    In Ruby these are available as File.truncate (given a path) and
    File#truncate (on an open file).
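
    A minimal sketch of that truncation, assuming a throwaway demo file;
    drop_trailing_bytes is a helper name made up for this example, not
    part of Ruby:

```ruby
require "tmpdir"

# Drop the last `count` bytes of the file at `path`.
# `drop_trailing_bytes` is a hypothetical helper, not a Ruby built-in.
def drop_trailing_bytes(path, count)
  new_size = [File.size(path) - count, 0].max
  # Only the file length changes; the bytes that remain are not rewritten.
  File.truncate(path, new_size)
  new_size
end

Dir.mktmpdir do |dir|
  path = File.join(dir, "demo.txt") # throwaway file for the demo
  File.write(path, "line one\nline two\nline three\n")
  drop_trailing_bytes(path, "line three\n".bytesize)
  print File.read(path)
end
```

    This only helps when the unwanted data sits at the very end of the file.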

    There is no analogous function for removing bytes at the start
    of a file. You can, of course, seek to any position in a file before
    doing I/O. In Ruby, look at IO#seek.
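
    As a rough illustration of seeking, the sketch below jumps to a byte
    offset and reads from there without touching the data before it.
    read_at and the demo file are assumptions made for this example:

```ruby
require "tmpdir"

# Read `length` bytes starting at byte `offset`, skipping everything before it.
# `read_at` is a hypothetical helper name used only in this sketch.
def read_at(path, offset, length)
  File.open(path, "rb") do |f|
    f.seek(offset, IO::SEEK_SET) # jump straight to the offset
    f.read(length)
  end
end

Dir.mktmpdir do |dir|
  path = File.join(dir, "demo.txt") # throwaway file for the demo
  File.write(path, "first\nsecond\nthird\n")
  puts read_at(path, 6, 6) # the six bytes that follow "first\n"
end
```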

    Hope this helps.


    Gary Wright
    Gary Wright, Dec 18, 2005
    #3
  4. Thomas Dutch wrote:
    > Hello,
    >
    > I'm relatively new to Ruby and I have a question:
    >
    > Is it possible to remove one or more lines from a file, without reading
    > the whole file and writing it away again? This because I'll have to do
    > this with files of 1 gigabyte and larger... Is there a high performance
    > solution for this?


    You could read the files in small blocks or line by
    line and only process that small part at a time. I am
    not sure how much Ruby buffers internally (it should
    not load the whole file), so maybe try:

    File.open(fromfile, 'r') {|from|
      File.open(tofile, 'w') {|to|
        # Read one line at a time and copy only the lines to keep
        while (line = from.gets)
          to.puts line if keep_line?(line)  # keep_line? stands in for your own test
        end
      }
    }

    > Thank you!



    E

    --
    Posted via http://www.ruby-forum.com/.
    Eero Saynatkari, Dec 18, 2005
    #4
  5. Thomas Dutch

    Guest

    On Dec 18, 2005, at 7:23 PM, Charles Ballowe wrote:
    > Swap record to be deleted with valid record in file (don't even need
    > to do a full swap, just overwrite the bad record) - repeat until all
    > records to be deleted are at the end of the file and truncate the file
    > before them.


    Reasonable idea.
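
    The overwrite-and-truncate idea could look roughly like this, assuming
    fixed-size records whose order in the file does not matter.
    delete_record and record_size are names invented for the sketch:

```ruby
require "tmpdir"

# Delete fixed-size record number `index` by overwriting it with the last
# record, then truncating the file by one record. Record order is NOT kept.
# `delete_record` is a hypothetical helper, not a Ruby built-in.
def delete_record(path, index, record_size)
  File.open(path, "r+b") do |f|
    count = f.size / record_size
    raise IndexError, "no record #{index}" unless index >= 0 && index < count
    last = count - 1
    if index != last
      f.seek(last * record_size)
      tail = f.read(record_size)   # the valid record at the end of the file
      f.seek(index * record_size)
      f.write(tail)                # overwrite the record being deleted
    end
    f.truncate(last * record_size) # drop the final, now redundant, record
  end
end

Dir.mktmpdir do |dir|
  path = File.join(dir, "records") # throwaway file with three 3-byte records
  File.write(path, "abc012xyz")
  delete_record(path, 1, 3)
  puts File.read(path)
end
```

    Only one record is read and written per deletion, regardless of the
    size of the file.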

    > (if all deletions were block aligned, i'd start looking into direct
    > filesystem manipulation for pure performance, but I don't know how
    > that would work -- and i don't think it would have nice effects on
    > fragmentation)


    I hope you don't mean reading/writing to the raw device. I would
    do a whole lot of other things before I would drop down to
    raw device access. Premature optimization is not a good thing.

    Gary Wright
    Gary Wright, Dec 19, 2005
    #5
  6. On 12/18/05, <> wrote:
    >
    > I hope you don't mean reading/writing to the raw device. I would
    > do a whole lot of other things before I would drop down to
    > raw device access. Premature optimization is not a good thing.
    >

    nah... it would require filesystem interfaces to the block mapping in
    the inode - I don't know if such things exist and they certainly
    aren't portable, but it seems like it could be a very efficient way to
    drop data out of the middle of the file. Probably over-optimizing
    though.

    Of course if the records were fixed sized and block aligned, the
    shuffling would be pretty efficient and the extra level of
    optimization would likely be overkill.

    -Charlie
    Charles Ballowe, Dec 19, 2005
    #6
  7. Thomas Dutch

    Guest

    On Dec 18, 2005, at 8:43 PM, Charles Ballowe wrote:
    > On 12/18/05, <> wrote:
    >>
    >> I hope you don't mean reading/writing to the raw device. I would
    >> do a whole lot of other things before I would drop down to
    >> raw device access. Premature optimization is not a good thing.
    >>

    > nah... it would require filesystem interfaces to the block mapping in
    > the inode - I don't know if such things exist and they certainly
    > aren't portable, but it seems like it could be a very efficient way to
    > drop data out of the middle of the file. Probably over-optimizing
    > though.


    uh, yep.

    Just to be clear, when I say reading/writing to the raw device I mean
    something like /dev/rdisk0, which presents the underlying media as a
    single large file. This bypasses the standard filesystem so that
    the media just looks like a huge array of blocks. I was not suggesting
    actually writing a disk device driver to interface with the hardware
    directly.

    It is possible to query the filesystem to get the block size of the
    device. You could then arrange for your I/O to be in multiples of
    the native block size to improve performance. In Ruby:

    File.stat("testfile").blksize
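
    A sketch of I/O in multiples of the block size, assuming a plain copy
    is what you need. copy_in_blocks, blocks_per_chunk and the 4096-byte
    fallback are assumptions for this example, not anything Ruby defines:

```ruby
require "tmpdir"

# Copy `src` to `dst` in chunks that are a multiple of the block size.
# `copy_in_blocks` is a hypothetical helper, not a Ruby built-in.
def copy_in_blocks(src, dst, blocks_per_chunk = 16)
  # blksize may be nil on some platforms; 4096 is an assumed fallback.
  block_size = File.stat(src).blksize || 4096
  chunk_size = block_size * blocks_per_chunk
  copied = 0
  File.open(src, "rb") do |from|
    File.open(dst, "wb") do |to|
      # IO#read returns nil at end of file, which ends the loop.
      while (chunk = from.read(chunk_size))
        to.write(chunk)
        copied += chunk.bytesize
      end
    end
  end
  copied
end

Dir.mktmpdir do |dir|
  src = File.join(dir, "testfile") # throwaway files for the demo
  dst = File.join(dir, "copy")
  File.write(src, "x" * 100_000)
  puts copy_in_blocks(src, dst)
end
```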

    Note: this is all system programming stuff and has little if
    anything to do with Ruby except for how Ruby abstracts the underlying
    filesystem calls.

    Disclaimer: I have no idea what the situation is on the Windows side of
    the house.
    , Dec 19, 2005
    #7
  8. Thomas Dutch

    Guest

    On Mon, 19 Dec 2005, Charles Ballowe wrote:

    > On 12/18/05, <> wrote:
    >>
    >> I hope you don't mean reading/writing to the raw device. I would
    >> do a whole lot of other things before I would drop down to
    >> raw device access. Premature optimization is not a good thing.
    >>

    > nah... it would require filesystem interfaces to the block mapping in
    > the inode - I don't know if such things exist and they certainly
    > aren't portable, but it seems like it could be a very efficient way to
    > drop data out of the middle of the file. Probably over-optimizing
    > though.
    >
    > Of course if the records were fixed sized and block aligned, the
    > shuffling would be pretty efficient and the extra level of
    > optimization would likely be overkill.
    >
    > -Charlie


    if your records are fixed size you'd be mad to tackle this application without
    considering bdb (berkeley db) and using its record database file format. this
    interface would make modifying the data extremely quick. also, if the records
    are in fact fixed size, using mmap is the cheapest way:

    [ahoward@jib ahoward]$ cat a.rb
    require "yaml"
    require "mmap"

    records =
    %w( a b c ),
    %w( 0 1 2 ),
    %w( x y z )
    open("records", "w"){|f| f.write records.join}

    y "records" => IO::read("records")

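    # map the file read/write; MAP_SHARED makes changes reach the file itself
    mmap = Mmap::new "records", "rw", Mmap::MAP_SHARED

    record_0 = mmap[0,3]
    record_1 = mmap[3,3]
    record_2 = mmap[6,3]
    mmap[3,3] = record_2 # move record down (overwrites record_1)
    mmap[6 .. -1] = "" # truncate the stale copy of record_2

    # flush to disk and release the mapping
    mmap.msync
    mmap.munmap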


    y "records" => IO::read("records")


    [ahoward@jib ahoward]$ ruby a.rb
    ---
    records: abc012xyz
    ---
    records: abcxyz


    it's tough to do io better than the kernel...

    regards.

    -a
    --
    ===============================================================================
    | ara [dot] t [dot] howard [at] noaa [dot] gov
    | all happiness comes from the desire for others to be happy. all misery
    | comes from the desire for oneself to be happy.
    | -- bodhicaryavatara
    ===============================================================================
    , Dec 19, 2005
    #8
  9. Windows has block sizes too, but File.stat(...).blksize returns nil.
    (With Win XP Pro, Ruby 1.8.2)

    But you can hardcode block sizes: Find or create a small file (a few
    hundred bytes or less) and select 'properties' in Windows Explorer. It
    says something like "size: 112 bytes. size on disk: 4096 bytes". 'Size
    on disk' is the block size.

    Sysread/write on my system seems to benefit from using a block-sized
    buffer; it is slower with a buffer twice or half the size of the
    block. Thanks for the tip :)
    Plain buffered read/write appears to be less sensitive to buffer size,
    but performs best with a buffer about twice the block size. There
    seems to be little to distinguish buffered standard read/write from
    buffered sysread/write.

    Timings on a laptop for read/write of a 1.2 GB file:
    - 3.5 min: buffered plain read/write (buffer 8192), buffered
    sysread/write (buffer 4096)
    - 17 min: File#each

    Just in case the records are not fixed size.

    jf

    BTW: There are two more scenarios:

    - If there is only one record to remove each time the file is opened,
    it may be possible to open the file in read/write mode (r+, not a+,
    because append mode sends every write to the end of the file) and
    update it in place: Use IO#seek to go to the entry, move all blocks
    following the deleted entry toward the start of the file, and then
    truncate the file. On average you save the writing of half the
    file. If there are dozens of records, there is little gain. (Because
    the first one is likely to be relatively close to the start of the
    file.)

    - Use 'lazy delete': Merely overwrite the record(s) with blanks,
    nulls, newlines, whatever, or mark them as deleted in some other
    fashion. The file keeps the same size, and all entries have the same
    file position as before. Repackage the file once in a while, removing
    the blank entries, when they start to take up a significant proportion
    of the file size.
    This is clearly the best-performing solution by far, but other
    programs using the file may need to be updated to recognize the 'this
    entry is deleted' marking.
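
    Both scenarios can be sketched as follows, assuming you already know
    the byte offset and length of the entry. remove_range_in_place and
    blank_out are names made up for this example, and the 64 KB chunk
    size and the blank filler are arbitrary choices:

```ruby
require "tmpdir"

# In-place removal: shift everything after the removed range toward the
# start of the file, then truncate. Hypothetical helper, not a built-in.
def remove_range_in_place(path, offset, length, chunk_size = 64 * 1024)
  File.open(path, "r+b") do |f|
    read_pos  = offset + length # first byte to keep after the removed range
    write_pos = offset          # where that byte should end up
    loop do
      f.seek(read_pos)
      chunk = f.read(chunk_size)
      break if chunk.nil? || chunk.empty?
      f.seek(write_pos)
      f.write(chunk)
      read_pos  += chunk.bytesize
      write_pos += chunk.bytesize
    end
    f.truncate(write_pos)       # cut off the leftover tail
  end
end

# Lazy delete: overwrite the entry with filler bytes; file size and the
# positions of all other entries stay the same. Hypothetical helper.
def blank_out(path, offset, length, filler = " ")
  File.open(path, "r+b") do |f|
    f.seek(offset)
    f.write(filler * length)
  end
end

Dir.mktmpdir do |dir|
  path = File.join(dir, "demo.txt") # throwaway file for the demo
  File.write(path, "keep 1\ndrop me\nkeep 2\n")
  blank_out(path, 7, 7)             # blank the line but keep its newline
  p File.read(path)
  remove_range_in_place(path, 7, 8) # now remove the blanked line entirely
  p File.read(path)
end
```

    The first helper rewrites only the part of the file that follows the
    removed entry; the second writes only the entry itself.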

    On 12/18/05, <> wrote:

    > It is possible to query the filesystem to get the block size of the
    > device. You could then arrange for your I/O to be in multiples of
    > the native block size to improve performance. In Ruby:
    >
    > File.stat("testfile").blksize
    >
    > Disclaimer: I have no idea what the situation is on the Windows side of
    > the house.
    >
    >
    Johannes Friestad, Dec 19, 2005
    #9