FileString - request for comments

apeiros

Hi there

I just put FileString on github: http://github.com/apeiros/filestring
FileString is a class that wraps a path on the filesystem (a file) and provides an exact copy of the String API. This means you can code as if you had a String and your file on the disk gets manipulated just "magically".

The library is very young (just a bit more than 24h), so please use with care.

I'd appreciate any kind of comment.

Regards
Stefan
 
James Edward Gray II

I just put FileString on github: http://github.com/apeiros/filestring
FileString is a class that wraps a path on the filesystem (a file)
and provides an exact copy of the String API. This means you can
code as if you had a String and your file on the disk gets
manipulated just "magically".

Interesting choice to use a String. I used Tie::File a couple of
times in Perl code. It works as an Array instead:

http://search.cpan.org/~mjd/Tie-File-0.96/lib/Tie/File.pm

James Edward Gray II
 
Joel VanderWerf

James said:
Interesting choice to use a String. I used Tie::File a couple of times
in Perl code. It works as an Array instead:

http://search.cpan.org/~mjd/Tie-File-0.96/lib/Tie/File.pm

James Edward Gray II

What would the advantage over mmap[1] be? FileString is pure ruby
(right?) and hence more portable, but probably mmap is much more
efficient? Any other tradeoffs?

[1] http://moulon.inra.fr/ruby/mmap.html; looks like this project of Guy
Decoux's has been recently adopted by knu: http://github.com/knu/ruby-mmap.
 
apeiros

Hi Joel
What would the advantage over mmap[1] be? FileString is pure ruby
(right?) and hence more portable, but probably mmap is much more
efficient? Any other tradeoffs?

Interesting. I looked for an existing solution and didn't find mmap. Yes, FileString is pure Ruby and should therefore run on all Ruby implementations. And yes, I'd expect mmap to be more efficient. It'd be interesting to combine the two (if that's at all possible).
In a quick test it seems FileString is more complete too, e.g. Mmap doesn't have #replace (which should be trivial to add). But Mmap has the ability to tie only a part of the file.
Thanks for the link

Regards
Stefan
 
apeiros

-------- Original Message --------
Date: Mon, 9 Nov 2009 12:37:17 +0900
From: James Edward Gray II (e-mail address removed)
To: (e-mail address removed)
Subject: Re: FileString - request for comments

Interesting choice to use a String. I used Tie::File a couple of
times in Perl code. It works as an Array instead:

http://search.cpan.org/~mjd/Tie-File-0.96/lib/Tie/File.pm

James Edward Gray II

Somebody I know already implemented a TieFile in ruby, the repository is at http://killerfox.protection-fault.ch/gitrepo/tie_file.git

Personally, I don't tend to think of a file as an array. I'd use Tie::File if I needed a persistent array, so there the problem comes "the other way round". With FileString I explicitly want to deal with a file, just not through an IO-like API (of course you could also approach it as "I need a persistent String", but that wasn't and isn't the case for me).

Regards
Stefan
 
Eleanor McHugh

Hi Joel
What would the advantage over mmap[1] be? FileString is pure ruby
(right?) and hence more portable, but probably mmap is much more
efficient? Any other tradeoffs?

Interesting, I was looking if a solution existed already and didn't
find mmap. Yes, FileString is pure ruby and should therefore run on
all ruby implementations. And yes, I'd expect mmap to be more
efficient on the other hand. It'd be interesting to combine the two
(if that's at all possible).
In a quick test it seems FileString is more complete too, e.g. Mmap
doesn't have #replace (should be trivial to add). But Mmap has the
feature to only tie a part of the file.

It would probably be fairly trivial for you to directly support mmap
at the OS level using Ruby/DL, Ruby-FFI or even syscall (although
that's ugly and fragile). Take a look at some of my Plumber's Guide
presentations at the link in my signature and also at http://kenai.com/projects/ruby-ffi
for details of how to wrap these kinds of system calls such that
they'll run identically on JRuby, Rubinius and MRI.


Ellie

Eleanor McHugh
Games With Brains
http://slides.games-with-brains.net
 
Robert Klemme

2009/11/9 Eleanor McHugh said:
On 9 Nov 2009, at 13:54, (e-mail address removed) wrote:
It would probably be fairly trivial for you to directly support mmap at the
OS level using Ruby/DL, Ruby-FFI or even syscall (although that's ugly and
fragile).

I am still trying to wrap my head around the question whether hiding
file IO behind a String API is a good idea. Basically the reason to
create something like this is to be able to use a file in places which
expect to be given a String instance. However, code that uses String
assumes fast access to arbitrary portions of the string. When those
accesses are translated into random accesses to a file performance
_might_ suffer dramatically. Put differently: hiding the fact that we
are dealing with a file is convenient but may actually break your
neck. And although at a certain level of abstraction a file and a
String are pretty much the same (sequence of chars / bytes) it may
actually be a good thing to keep the API separate in order to treat
both appropriately. Stefan, what's your experience?

Kind regards

robert
 
Ralph Shnelvar

Robert,

RK> I am still trying to wrap my head around the question whether hiding
RK> file IO behind a String API is a good idea.

As the PickAxe book points out, by having file i/o represented by a
String ... that is, making it irrelevant whether one is talking to a
String or a File ... makes for some nice unit testing.
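A minimal sketch of that testing benefit, using Ruby's standard StringIO rather than FileString itself. The helper name count_words is made up for illustration; the only assumption is that the code under test calls nothing but #each_line on whatever it is given:

```ruby
require "stringio"

# Hypothetical helper: counts whitespace-separated words on any object
# that responds to #each_line (File, StringIO, ...).
def count_words(io)
  io.each_line.sum { |line| line.split.size }
end

# In a unit test, an in-memory StringIO stands in for a real file.
fake_file = StringIO.new("one two\nthree\n")
puts count_words(fake_file) # prints 3
```

The same method works unchanged with File.open(path) { |fh| count_words(fh) }, so the test never has to touch the disk.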
 
apeiros

-------- Original Message --------
Date: Tue, 10 Nov 2009 00:28:56 +0900
From: Robert Klemme (e-mail address removed)
To: (e-mail address removed)
Subject: Re: FileString - request for comments

I am still trying to wrap my head around the question whether hiding
file IO behind a String API is a good idea. Basically the reason to
create something like this is to be able to use a file in places which
expect to be given a String instance.

No. At least that was not the idea (though, you could).
The reason is that e.g. replacing a part of a file is cumbersome.
Compare:

# IO API:
File.open(path, "r+b") do |fh|
  fh.seek(offset + length)
  rest = fh.read
  fh.seek(offset)
  fh.write(replacement)
  fh.write(rest)
  fh.truncate(fh.pos) # needed when replacement is shorter than length
end

# String API:
fs = FileString.new(path)
fs[offset, length] = replacement # done!

Imagine how much more inconvenient it becomes when it's not an offset and a length but a Range, or when you have to accommodate negative offsets etc.
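To make the negative-offset case concrete, here is a rough sketch of how such an assignment could be built on top of the IO API. TinyFileString is a made-up stand-in and is not FileString's actual implementation; it also assumes byte offsets rather than character offsets:

```ruby
require "tmpdir"

# Hypothetical stand-in for illustration; NOT FileString's actual code.
class TinyFileString
  def initialize(path)
    @path = path
  end

  # Returns the whole file content as a binary String.
  def to_s
    File.binread(@path)
  end

  # fs[offset, length] = replacement, with negative offsets counted
  # from the end of the file, as String#[]= does.
  def []=(offset, length, replacement)
    File.open(@path, "r+b") do |fh|
      size = fh.size
      offset += size if offset < 0
      raise IndexError, "offset out of file" if offset < 0 || offset > size
      raise IndexError, "negative length" if length < 0
      length = [length, size - offset].min # clamp to the end of the file
      fh.seek(offset + length)
      rest = fh.read                       # everything after the replaced span
      fh.seek(offset)
      fh.write(replacement)
      fh.write(rest)
      fh.flush
      fh.truncate(fh.pos)                  # drop leftovers if the file shrank
    end
  end
end

Dir.mktmpdir do |dir|
  path = File.join(dir, "demo.txt")
  File.binwrite(path, "hello world")
  fs = TinyFileString.new(path)
  fs[6, 5] = "Ruby"     # file now holds "hello Ruby"
  fs[-4, 4] = "there!"  # negative offset counts from the end
  puts fs.to_s          # prints "hello there!"
end
```

Even this small version needs bounds checks, clamping and truncation, which is exactly the bookkeeping the String-style API hides.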

And there are other examples, just dive a bit in FileString's source :)

The String API is *far* more convenient.
However, code that uses String
assumes fast access to arbitrary portions of the string. When those
accesses are translated into random accesses to a file performance
_might_ suffer dramatically.

Yes. If you get that kind of problem - you can always use File.read instead of FileString#to_s (or to_str).
Put differently: hiding the fact that we
are dealing with a file is convenient but may actually break your
neck.

As with all high-level things. If you don't know the things you're dealing with, you can easily kill performance. Consider e.g. ary.any? { |obj| other.include?(obj) } - there, you have just accidentally created an O(n^2) algorithm. It can happen everywhere and it can look totally innocent.
That's not a problem that's specific to FileString but to everything that's abstract.
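For reference, a small sketch of that innocent-looking quadratic pattern next to a cheaper equivalent. The sample data is invented; the only change is converting other into a Set so each lookup is a hash lookup instead of a linear scan:

```ruby
require "set"

ary   = (1..1_000).to_a
other = (500..1_500).to_a

# Original form: Array#include? scans `other` for every element of `ary`.
slow = ary.any? { |obj| other.include?(obj) }

# Same result with a Set, whose include? is a hash lookup.
lookup = other.to_set
fast = ary.any? { |obj| lookup.include?(obj) }

puts slow == fast # prints true
```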
And although at a certain level of abstraction a file and a
String are pretty much the same (sequence of chars / bytes) it may
actually be a good thing to keep the API separate in order to treat
both appropriately. Stefan, what's your experience?

As you see, I disagree :)
However, what you say is of course correct. Using FileString means you have to keep in mind that you're dealing with a file.
But: if you know you're dealing with a file, it can even help you make things faster. For example, if you indeed want to compare two files for equality, FileString#== will be faster and less memory intensive than doing File.read(a) == File.read(b) yourself if the two files are big.
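One way such a comparison could work, as a sketch only: files_equal? and its chunk size are made up for illustration, and this is not necessarily how FileString#== is implemented. It compares sizes first and then reads both files in fixed-size chunks:

```ruby
# Hypothetical helper, not FileString#== itself: compares two files
# chunk by chunk so neither is loaded into memory as a whole.
def files_equal?(path_a, path_b, chunk_size = 64 * 1024)
  # Different sizes can never be equal; this check reads no data at all.
  return false unless File.size(path_a) == File.size(path_b)

  File.open(path_a, "rb") do |a|
    File.open(path_b, "rb") do |b|
      while (chunk_a = a.read(chunk_size))
        # Stop at the first differing chunk.
        return false unless chunk_a == b.read(chunk_size)
      end
    end
  end
  true
end
```

Compared with File.read(a) == File.read(b), this holds at most two chunks in memory at a time and can stop at the first difference.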
Kind regards

robert

Thanks for your thoughts robert, much appreciated

regards
Stefan
 
Eleanor McHugh

Robert,

RK> I am still trying to wrap my head around the question whether hiding
RK> file IO behind a String API is a good idea.

As the PickAxe book points out, by having file i/o represented by a
String ... that is, making it irrelevant whether one is talking to a
String or a File ... makes for some nice unit testing.

Using a given representation just because it's unit testing friendly
isn't necessarily a good idea...


Ellie

Eleanor McHugh
Games With Brains
http://slides.games-with-brains.net
 
Robert Klemme

-------- Original Message --------
As all highlevel things. If you don't know the things you're dealing with you can easily kill performance. Consider e.g. ary.any? { |obj| other.include?(obj) } - there, just accidentally created an O(n^2) algorithm. It can happen everywhere and it can look totally innocent.
That's not a problem that's specific to FileString but to everything that's abstract.
True.


As you see, I disagree :)
However, what you say is of course correct. Using FileString means
you have to keep in mind that you're dealing with a file.
But: if you know you're dealing with a file, it can even help you
make things faster. For example, if you indeed want to compare two
files for equality, FileString#== will be faster and less memory
intensive than you doing File.read(a) == File.read(b) if the two
files are big.

A good point! You're probably right and I was too pessimistic. I'd
love to see

fs[/foo(\w+)/, 1] = "bar"
fs.gsub! /foo/, "bar"

etc. because those would be the ones that would make FileString
convenient for me. :)
Thanks for your thoughts robert, much appreciated

Thanks for listening and sharing!

Kind regards

robert
 
apeiros

-------- Original Message --------
Date: Tue, 10 Nov 2009 06:25:08 +0900
From: Robert Klemme (e-mail address removed)
Subject: Re: FileString - request for comments
A good point! You're probably right and I was too pessimistic. I'd
love to see

fs[/foo(\w+)/, 1] = "bar"
fs.gsub! /foo/, "bar"

etc. because those would be the ones that would make FileString
convenient for me. :)

Those already exist. Unfortunately, optimizing regex matching is too involved for me to have done it in 24h :)
That means fs[/foo(\w+)/, 1] = "bar" is currently just more convenient than writing:
data = File.read(path)
data[/foo(\w+)/, 1] = "bar"
File.open(path, "w") { |fh| fh.write(data) }
But I think that's already quite worth it :)
I mean - that's just lots of boilerplate.
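For comparison, the same boilerplate can be folded into a small block helper without FileString. The name edit_file is hypothetical and this is only a sketch of the read, modify, write-back pattern described above:

```ruby
# Hypothetical helper wrapping the read / modify / write-back boilerplate.
def edit_file(path)
  data = File.read(path)
  yield data             # the block mutates the String in place
  File.write(path, data) # write the modified content back
end
```

Usage would be edit_file(path) { |data| data[/foo(\w+)/, 1] = "bar" }. Like the snippet above it still reads and rewrites the whole file, so it saves typing rather than I/O.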
Thanks for listening and sharing!

Always :D
The listening part has made me change the docs, by the way: they now hint at thinking about performance, and suggest that you should probably just use a String and write it back when all is done.

Regards
Stefan
 
apeiros

-------- Original Message --------
Date: Tue, 10 Nov 2009 06:25:08 +0900
From: Robert Klemme (e-mail address removed)
Subject: Re: FileString - request for comments
A good point! You're probably right and I was too pessimistic. I'd
love to see

fs[/foo(\w+)/, 1] = "bar"

I just noticed that I actually didn't have that functionality in. I added it now in the way described in the earlier reply.

Also a small correction of one of my earlier statements (typo):
You can use File.read or FileString#to_s (or to_str) instead of the FileString instance. FileString#to_s returns the contents of the file.

Regards
Stefan
 
David Masover

Hi Joel
What would the advantage over mmap[1] be? FileString is pure ruby
(right?) and hence more portable, but probably mmap is much more
efficient? Any other tradeoffs?

Interesting, I was looking if a solution existed already and didn't find
mmap. Yes, FileString is pure ruby and should therefore run on all ruby
implementations. And yes, I'd expect mmap to be more efficient on the other
hand.

I'd have looked for mmap first, knowing the concept from Linux. I'd also expect
that with mmap, you should be able to implement an efficient regex, though I'm
not sure how well gsub! would work, unless you can guarantee the match is
always exactly the length of the target string.

(And for gsub to be efficient, you'd need some fancy copy-on-write stuff, which
would make it that much more difficult to chain them.)

But if you were looking for comments, it looks awesome. Thanks!
 
