Removing parts from a file


Thomas Dutch

Hello,

I'm relatively new to Ruby and I have a question:

Is it possible to remove one or more lines from a file, without reading
the whole file and writing it away again? This because I'll have to do
this with files of 1 gigabyte and larger... Is there a high performance
solution for this?

Thank you!
 

Harpo

Thomas said:
Hello,

I'm relatively new to Ruby and I have a question:

Is it possible to remove one or more lines from a file, without
reading the whole file and writing it away again? This because I'll
have to do this with files of 1 gigabyte and larger... Is there a high
performance solution for this?

Thank you!

Said like this, I don't think it is possible, as this is not a matter of the
language but of the file structure.
It depends on the programs which read the file: can they be fixed to
accept lines which begin with something that says 'skip me', such as a
'#'?
 

gwtmp01

Is it possible to remove one or more lines from a file, without reading
the whole file and writing it away again? This because I'll have to do
this with files of 1 gigabyte and larger... Is there a high performance
solution for this?

In general, no. I'm answering from the perspective of a typical
Unix/Posix file system. You can truncate a file to discard some
number of trailing bytes using the ftruncate system call.
In Ruby that system call is accessed via File.truncate.

There is no analogous function for removing bytes at the start
of a file. You can, of course, seek to any position in a file before
doing I/O. In Ruby you want to look at IO#seek.
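A minimal sketch of both calls; the demo file, its contents, and the offsets are made up for illustration:

```ruby
require "tempfile"

# Hypothetical demo file; the path, contents, and offsets are invented.
t = Tempfile.new("truncate-demo")
t.write("keep me|drop me")
t.close

File.truncate(t.path, 7)    # discard everything after the first 7 bytes

File.open(t.path) do |f|
  f.seek(5)                 # jump to byte offset 5 before reading
  puts f.read               # prints "me"
end
```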

Hope this helps.


Gary Wright
 

Eero Saynatkari

Thomas said:
Hello,

I'm relatively new to Ruby and I have a question:

Is it possible to remove one or more lines from a file, without reading
the whole file and writing it away again? This because I'll have to do
this with files of 1 gigabyte and larger... Is there a high performance
solution for this?

You could read the file in small blocks or line by
line and only process that small part at a time. I must
say I am not quite sure whether Ruby internally buffers the
file (though I assume not), so maybe try:

File.open(fromfile, 'r') {|from|
  File.open(tofile, 'w') {|to|
    while (line = from.gets)
      to.puts line unless line == some_condition  # skip the lines to remove
    end
  }
}


E
 

gwtmp01

Swap record to be deleted with valid record in file (don't even need
to do a full swap, just overwrite the bad record) - repeat until all
records to be deleted are at the end of the file and truncate the file
before them.
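A sketch of that swap-and-truncate idea, assuming fixed-size records; the 16-byte record size and the helper name are my own inventions:

```ruby
REC = 16  # assumed fixed record size in bytes

# Overwrite the doomed record with the last record, then chop the
# last record off the end of the file.
def delete_record(path, doomed)
  new_size = File.size(path) - REC
  File.open(path, "r+") do |f|
    if doomed * REC != new_size      # nothing to copy if it's already last
      f.seek(new_size)
      last = f.read(REC)
      f.seek(doomed * REC)
      f.write(last)
    end
  end
  File.truncate(path, new_size)
end
```

Note that record order is not preserved, which is exactly what makes this cheap.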

Reasonable idea.
(If all deletions were block aligned, I'd start looking into direct
filesystem manipulation for pure performance, but I don't know how
that would work -- and I don't think it would have nice effects on
fragmentation.)

I hope you don't mean reading/writing to the raw device. I would
do a whole lot of other things before I would drop down to
raw device access. Premature optimization is not a good thing.

Gary Wright
 

Charles Ballowe

I hope you don't mean reading/writing to the raw device. I would
do a whole lot of other things before I would drop down to
raw device access. Premature optimization is not a good thing.
nah... it would require filesystem interfaces to the block mapping in
the inode - I don't know if such things exist and they certainly
aren't portable, but it seems like it could be a very efficient way to
drop data out of the middle of the file. Probably over-optimizing
though.

Of course if the records were fixed sized and block aligned, the
shuffling would be pretty efficient and the extra level of
optimization would likely be overkill.

-Charlie
 

gwtmp01

nah... it would require filesystem interfaces to the block mapping in
the inode - I don't know if such things exist and they certainly
aren't portable, but it seems like it could be a very efficient way to
drop data out of the middle of the file. Probably over-optimizing
though.

uh, yep.

Just to be clear, when I say reading/writing to the raw device I mean
something like /dev/rdisk0, which presents the underlying media as a
single large file. This bypasses the standard filesystem so that
the media just looks like a huge array of blocks. I was not suggesting
actually writing a disk device driver to interface with the hardware
directly.

It is possible to query the filesystem to get the block size of the
device. You could then arrange for your I/O to be in multiples of
the native block size to improve performance. In Ruby:

File.stat("testfile").blksize

Note: this is all system programming stuff and has little if
anything to do with Ruby except for how Ruby abstracts the underlying
filesystem calls.
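As a sketch, a copy loop that sizes its reads from the reported block size, falling back to 4096 where blksize is nil (the file names and data here are invented):

```ruby
require "tempfile"

# Invented demo files standing in for real input/output paths.
src = Tempfile.new("blk-src"); src.write("x" * 10_000); src.close
dst = Tempfile.new("blk-dst"); dst.close

blk = File.stat(src.path).blksize || 4096   # fall back when blksize is nil

File.open(src.path, "rb") do |from|
  File.open(dst.path, "wb") do |to|
    # read a multiple of the native block size on each pass
    while (chunk = from.read(blk * 16))
      to.write(chunk)
    end
  end
end
```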

Disclaimer: I have no idea what the situation is on the Windows side of
the house.
 

ara.t.howard

nah... it would require filesystem interfaces to the block mapping in
the inode - I don't know if such things exist and they certainly
aren't portable, but it seems like it could be a very efficient way to
drop data out of the middle of the file. Probably over-optimizing
though.

Of course if the records were fixed sized and block aligned, the
shuffling would be pretty efficient and the extra level of
optimization would likely be overkill.

-Charlie

if your records are fixed size you'd be mad to tackle this application without
considering bdb (berkeley db) and using its record database file format. this
interface would make modifying the data extremely quick. also, if the records
are in fact fixed size, using mmap is the cheapest way:

[ahoward@jib ahoward]$ cat a.rb
require "yaml"
require "mmap"

records =
  %w( a b c ),
  %w( 0 1 2 ),
  %w( x y z )

open("records", "w"){|f| f.write records.join}

y "records" => IO::read("records")

mmap = Mmap::new "records", "rw", Mmap::MAP_SHARED

record_0 = mmap[0,3]
record_1 = mmap[3,3]
record_2 = mmap[6,3]

mmap[3,3] = record_2   # move record down
mmap[6 .. -1] = ""     # truncate

mmap.msync
mmap.munmap

y "records" => IO::read("records")


[ahoward@jib ahoward]$ ruby a.rb
---
records: abc012xyz
---
records: abcxyz


it's tough to do io better than the kernel...

regards.

-a
--
===============================================================================
| ara [dot] t [dot] howard [at] noaa [dot] gov
| all happiness comes from the desire for others to be happy. all misery
| comes from the desire for oneself to be happy.
| -- bodhicaryavatara
===============================================================================
 

Johannes Friestad

Windows has block sizes too, but File.stat(...).blksize returns nil.
(With Win XP Pro, Ruby 1.8.2)

But you can hardcode block sizes: Find or create a small file (a few
hundred bytes or less) and select 'properties' in Windows Explorer. It
says something like "size: 112 bytes. size on disk: 4096 bytes". 'Size
on disk' is the block size.

Sysread/write on my system seems to benefit from using a block-sized
buffer; it is slower with a buffer twice or half the size of the
block. Thanks for the tip :)
Plain buffered read/write appears to be less sensitive to buffer size,
but performs best with about twice the buffer size. There seems to be
little to distinguish buffered standard read/write from buffered
sysread/write.
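For reference, the unbuffered variant being timed above looks roughly like this; the 4096-byte buffer and the demo files are assumptions, so measure on your own system:

```ruby
require "tempfile"

BUF = 4096   # assumed block-sized buffer; measure on your own system

# Invented demo files standing in for the 1.2 GB input.
src = Tempfile.new("sys-src"); src.write("data " * 3_000); src.close
dst = Tempfile.new("sys-dst"); dst.close

File.open(src.path, "rb") do |from|
  File.open(dst.path, "wb") do |to|
    begin
      loop { to.syswrite(from.sysread(BUF)) }
    rescue EOFError
      # sysread raises EOFError once the input is exhausted
    end
  end
end
```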

Timings on a laptop for read/write of a 1.2 GB file:
- 3.5 min: buffered plain read/write (buffer 8192), buffered
sysread/write (buffer 4096)
- 17 min: File#each

Just in case the records are not fixed size.

jf

BTW: There are two more scenarios:

- If there is only one record to remove each time the file is opened,
it may be possible to use read/write mode (r+) and update the file in
place: use IO#seek to go to the entry, and move all blocks following
the deleted entry forward. On average you save the writing of half the
file. If there are dozens of records, there is little gain, because
the first one is likely to be relatively close to the start of the
file.

- Use 'lazy delete': Merely overwrite the record(s) with blanks,
nulls, newlines, whatever, or mark them as deleted in some other
fashion. The file keeps the same size, and all entries have the same
file position as before. Repackage the file once in a while, removing
the blank entries, when they start to take up a significant proportion
of the file size.
This is clearly the best-performing solution by far, but other
programs using the file may need to be updated to recognize the 'this
entry is deleted' marking.
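A sketch of the lazy-delete idea for fixed-size records; the 16-byte record size and the blank marker are assumptions:

```ruby
REC = 16   # assumed fixed record size in bytes

# Overwrite one record with blanks; the file size and the positions of
# all the other records are unchanged.
def lazy_delete(path, index)
  File.open(path, "r+") do |f|
    f.seek(index * REC)
    f.write(" " * REC)
  end
end
```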
 
