Removing parts from a file


Thomas Dutch

Hello,

I'm relatively new to Ruby and I have a question:

Is it possible to remove one or more lines from a file, without reading
the whole file and writing it away again? This because I'll have to do
this with files of 1 gigabyte and larger... Is there a high performance
solution for this?

Thank you!
 

Harpo

Thomas said:
Hello,

I'm relatively new to Ruby and I have a question:

Is it possible to remove one or more lines from a file, without
reading the whole file and writing it away again? This because I'll
have to do this with files of 1 gigabyte and larger... Is there a high
performance solution for this?

Thank you!

Said like this, I don't think it is possible, as this is not a matter of the
language but of the file structure.
It depends on the programs which read the file: can they be fixed to
accept lines which begin with something that says 'skip me', such as a
'#'?
 

gwtmp01

Is it possible to remove one or more lines from a file, without reading
the whole file and writing it away again? This because I'll have to do
this with files of 1 gigabyte and larger... Is there a high performance
solution for this?

In general, no. I'm answering from the perspective of a typical
Unix/Posix file system. You can truncate a file to discard some
number of trailing bytes using the ftruncate system call.
In Ruby that system call is accessed via File.truncate.

There is no analogous function for removing bytes at the start
of a file. You can, of course, seek to any position in a file before
doing I/O. In Ruby you want to look at IO#seek.
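A minimal sketch of both calls; the demo file, its contents, and the offsets are made up for illustration:

```ruby
require "tempfile"

# Hypothetical demo file; the path, contents, and offsets are invented.
t = Tempfile.new("truncate-demo")
t.write("keep me|drop me")
t.close

File.truncate(t.path, 7)    # discard everything after the first 7 bytes

File.open(t.path) do |f|
  f.seek(5)                 # jump to byte offset 5 before reading
  puts f.read               # prints "me"
end
```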

Hope this helps.


Gary Wright
 

Eero Saynatkari

Thomas said:
Hello,

I'm relatively new to Ruby and I have a question:

Is it possible to remove one or more lines from a file, without reading
the whole file and writing it away again? This because I'll have to do
this with files of 1 gigabyte and larger... Is there a high performance
solution for this?

You could read the file in small blocks or line by
line and only process that small part at a time. I must
say I am not quite sure whether Ruby internally buffers the
file (though I assume not), so maybe try:

File.open(fromfile, 'r') {|from|
  File.open(tofile, 'w') {|to|
    while (line = from.gets)
      to.puts line unless line == some_condition  # skip the lines to remove
    end
  }
}


E
 

gwtmp01

Swap record to be deleted with valid record in file (don't even need
to do a full swap, just overwrite the bad record) - repeat until all
records to be deleted are at the end of the file and truncate the file
before them.
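A sketch of that swap-and-truncate idea, assuming fixed-size records; the 16-byte record size and the helper name are my own inventions:

```ruby
REC = 16  # assumed fixed record size in bytes

# Overwrite the doomed record with the last record, then chop the
# last record off the end of the file.
def delete_record(path, doomed)
  new_size = File.size(path) - REC
  File.open(path, "r+") do |f|
    if doomed * REC != new_size      # nothing to copy if it's already last
      f.seek(new_size)
      last = f.read(REC)
      f.seek(doomed * REC)
      f.write(last)
    end
  end
  File.truncate(path, new_size)
end
```

Note that record order is not preserved, which is exactly what makes this cheap.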

Reasonable idea.
(If all deletions were block aligned, I'd start looking into direct
filesystem manipulation for pure performance, but I don't know how
that would work -- and I don't think it would have nice effects on
fragmentation.)

I hope you don't mean reading/writing to the raw device. I would
do a whole lot of other things before I would drop down to
raw device access. Premature optimization is not a good thing.

Gary Wright
 

Charles Ballowe

I hope you don't mean reading/writing to the raw device. I would
do a whole lot of other things before I would drop down to
raw device access. Premature optimization is not a good thing.
nah... it would require filesystem interfaces to the block mapping in
the inode - I don't know if such things exist and they certainly
aren't portable, but it seems like it could be a very efficient way to
drop data out of the middle of the file. Probably over-optimizing
though.

Of course if the records were fixed sized and block aligned, the
shuffling would be pretty efficient and the extra level of
optimization would likely be overkill.

-Charlie
 

gwtmp01

nah... it would require filesystem interfaces to the block mapping in
the inode - I don't know if such things exist and they certainly
aren't portable, but it seems like it could be a very efficient way to
drop data out of the middle of the file. Probably over-optimizing
though.

uh, yep.

Just to be clear, when I say reading/writing to the raw device I mean
something like /dev/rdisk0, which presents the underlying media as a
single large file. This bypasses the standard filesystem so that
the media just looks like a huge array of blocks. I was not suggesting
actually writing a disk device driver to interface with the hardware
directly.

It is possible to query the filesystem to get the block size of the
device. You could then arrange for your I/O to be in multiples of
the native block size to improve performance. In Ruby:

File.stat("testfile").blksize

Note: this is all system programming stuff and has little if
anything to do with Ruby except for how Ruby abstracts the underlying
filesystem calls.
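As a sketch, a copy loop that sizes its reads from the reported block size, falling back to 4096 where blksize is nil (the file names and data here are invented):

```ruby
require "tempfile"

# Invented demo files standing in for real input/output paths.
src = Tempfile.new("blk-src"); src.write("x" * 10_000); src.close
dst = Tempfile.new("blk-dst"); dst.close

blk = File.stat(src.path).blksize || 4096   # fall back when blksize is nil

File.open(src.path, "rb") do |from|
  File.open(dst.path, "wb") do |to|
    # read a multiple of the native block size on each pass
    while (chunk = from.read(blk * 16))
      to.write(chunk)
    end
  end
end
```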

Disclaimer: I have no idea what the situation is on the Windows side of
the house.
 

ara.t.howard

nah... it would require filesystem interfaces to the block mapping in
the inode - I don't know if such things exist and they certainly
aren't portable, but it seems like it could be a very efficient way to
drop data out of the middle of the file. Probably over-optimizing
though.

Of course if the records were fixed sized and block aligned, the
shuffling would be pretty efficient and the extra level of
optimization would likely be overkill.

-Charlie

if your records are fixed size you'd be mad to tackle this application without
considering bdb (berkeley db) and using its record database file format. this
interface would make modifying the data extremely quick. also, if the records
are in fact fixed size, using mmap is the cheapest way:

[ahoward@jib ahoward]$ cat a.rb
require "yaml"
require "mmap"

records =
  %w( a b c ),
  %w( 0 1 2 ),
  %w( x y z )

open("records", "w"){|f| f.write records.join}

y "records" => IO::read("records")

mmap = Mmap::new "records", "rw", Mmap::MAP_SHARED

record_0 = mmap[0,3]
record_1 = mmap[3,3]
record_2 = mmap[6,3]

mmap[3,3] = record_2   # move record down
mmap[6 .. -1] = ""     # truncate

mmap.msync
mmap.munmap

y "records" => IO::read("records")


[ahoward@jib ahoward]$ ruby a.rb
---
records: abc012xyz
---
records: abcxyz


it's tough to do io better than the kernel...

regards.

-a
--
===============================================================================
| ara [dot] t [dot] howard [at] noaa [dot] gov
| all happiness comes from the desire for others to be happy. all misery
| comes from the desire for oneself to be happy.
| -- bodhicaryavatara
===============================================================================
 

Johannes Friestad

Windows has block sizes too, but File.stat(...).blksize returns nil.
(With Win XP Pro, Ruby 1.8.2)

But you can hardcode block sizes: Find or create a small file (a few
hundred bytes or less) and select 'properties' in Windows Explorer. It
says something like "size: 112 bytes. size on disk: 4096 bytes". 'Size
on disk' is the block size.

Sysread/write on my system seems to benefit from using a block-sized
buffer; it is slower with a buffer twice or half the size of the
block. Thanks for the tip :)
Plain buffered read/write appears to be less sensitive to buffer size,
but performs best with about twice the buffer size. There seems to be
little to distinguish buffered standard read/write from buffered
sysread/write.
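For reference, the unbuffered variant being timed above looks roughly like this; the 4096-byte buffer and the demo files are assumptions, so measure on your own system:

```ruby
require "tempfile"

BUF = 4096   # assumed block-sized buffer; measure on your own system

# Invented demo files standing in for the 1.2 GB input.
src = Tempfile.new("sys-src"); src.write("data " * 3_000); src.close
dst = Tempfile.new("sys-dst"); dst.close

File.open(src.path, "rb") do |from|
  File.open(dst.path, "wb") do |to|
    begin
      loop { to.syswrite(from.sysread(BUF)) }
    rescue EOFError
      # sysread raises EOFError once the input is exhausted
    end
  end
end
```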

Timings on a laptop for read/write of a 1.2 GB file:
- 3.5 min: buffered plain read/write (buffer 8192), buffered
sysread/write (buffer 4096)
- 17 min: File#each

Just in case the records are not fixed size.

jf

BTW: There are two more scenarios:

- If there is only one record to remove each time the file is opened,
it may be possible to use read/write mode (r+) and update the file in
place: use IO#seek to go to the entry, and move all blocks following
the deleted entry forward. On average you save the writing of half the
file. If there are dozens of records, there is little gain, because
the first one is likely to be relatively close to the start of the
file.

- Use 'lazy delete': Merely overwrite the record(s) with blanks,
nulls, newlines, whatever, or mark them as deleted in some other
fashion. The file keeps the same size, and all entries have the same
file position as before. Repackage the file once in a while, removing
the blank entries, when they start to take up a significant proportion
of the file size.
This is clearly the best-performing solution by far, but other
programs using the file may need to be updated to recognize the 'this
entry is deleted' marking.
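A sketch of the lazy-delete idea for fixed-size records; the 16-byte record size and the blank marker are assumptions:

```ruby
REC = 16   # assumed fixed record size in bytes

# Overwrite one record with blanks; the file size and the positions of
# all the other records are unchanged.
def lazy_delete(path, index)
  File.open(path, "r+") do |f|
    f.seek(index * REC)
    f.write(" " * REC)
  end
end
```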
 
