Reasons to use a buffer in IO::read?

Steve Midgley · Dec 6, 2007

Hi Ruby people,

I'm wondering what the functional and performance differences might be
between the two statements below? Assume 'io' is an IO instance with
gobs of data in it. Assume 'file' is an open file instance with write
access:

until io.eof? do
  file.write(io.read(10485760))
end

buffer = ''
until io.eof? do
  buffer = io.read(10485760)
  file.write(buffer)
end

I see that Ruby provides for a buffer and I'm wondering what the
reason is? I read this article but am still not clear on the benefit
of a buffer at all:

http://rcoder.net/content/fast-ruby-io

I'm wondering if providing a buffer might reduce malloc issues and
speed things up? I can't see any other reason to use one.

Thanks in advance for any information!

Steve

MonkeeSage · Dec 6, 2007

$ ri IO#read
---------------------------------------------------------------- IO#read
ios.read([length [, buffer]]) => string, buffer, or nil
------------------------------------------------------------------------
Reads at most _length_ bytes from the I/O stream, or to the end of
file if _length_ is omitted or is +nil+. _length_ must be a
non-negative integer or nil. If the optional _buffer_ argument is
present, it must reference a String, which will receive the data.

At end of file, it returns +nil+ or +""+ depend on _length_.
+_ios_.read()+ and +_ios_.read(nil)+ returns +""+.
+_ios_.read(_positive-integer_)+ returns nil.

f = File.new("testfile")
f.read(16) #=> "This is line one"

So...

buffer = ""
file.write(io.read(nil, buffer))
print "I read this stuff ", buffer, "\n"

Regards,
Jordan
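
To make the end-of-file rule in the ri text concrete, here is a minimal sketch. The Tempfile, the sample data, and the method name eof_results are assumptions made only so the example is self-contained; they are not from the thread. It reads the data once, then reads again at end of file with and without a length.

```ruby
require 'tempfile'

# Collect what IO#read returns before and at end of file.
# The Tempfile and sample text are assumptions for illustration only.
def eof_results
  Tempfile.create('eof-demo') do |f|
    f.write('This is line one')
    f.rewind

    first        = f.read(16)   # the 16 bytes of data
    at_eof_sized = f.read(16)   # positive length at end of file
    at_eof_all   = f.read       # no length at end of file
    [first, at_eof_sized, at_eof_all]
  end
end

first, at_eof_sized, at_eof_all = eof_results
puts first.inspect          # "This is line one"
puts at_eof_sized.inspect   # nil
puts at_eof_all.inspect     # ""
```

The nil result for a sized read is what lets a loop of the form "while io.read(n, buffer)" stop on its own.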

Steve Midgley · Dec 7, 2007

Thanks Jordan. How is your code different (if at all) from:

buffer = io.read
file.write(buffer)
print "I read this stuff ", buffer, "\n"

Am I missing something? I just don't see why buffer is useful - is it
a performance benefit or some kind of syntax improvement that I'm
missing? The only thing I can see is that it has some kind of low
level malloc optimization if the same string size is passed in
repeatedly during partial writes.

Steve

MonkeeSage · Dec 7, 2007

I don't know if there is any optimization in the back end, but it lets
you pass the results of io.read to another method and also put them in
buffer at the same time. But since you can do that with assignment, I
don't really see any point to it (I was just trying to give an example
as the docs describe). To me, unless, as you say, there is some
optimization going on in the backend, this code...

buffer = ""
file.write(io.read(nil, buffer))
print "I read this stuff ", buffer, "\n"

...looks the same as this code...

file.write(buffer = io.read)
print "I read this stuff ", buffer, "\n"

Regards,
Jordan
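
One observable difference between the two forms is object identity, which the following sketch tries to show. It assumes a Tempfile as a stand-in for 'io', and the method name compare_read_styles is hypothetical. With the buffer argument, IO#read fills the String you passed and returns that same object; with plain assignment, the variable is rebound to a newly created String.

```ruby
require 'tempfile'

# Compare read(nil, buffer) with plain assignment.
# The Tempfile and its contents are assumptions standing in for 'io'.
def compare_read_styles
  Tempfile.create('buffer-demo') do |io|
    io.write('some data')
    io.rewind

    buffer = String.new                 # empty, unfrozen String to receive the data
    result = io.read(nil, buffer)       # fills buffer in place
    same_with_buffer = result.equal?(buffer)

    io.rewind
    original = buffer
    buffer = io.read                    # assignment rebinds the name to a new String
    same_with_assignment = buffer.equal?(original)

    [same_with_buffer, same_with_assignment, original, buffer]
  end
end

same_with_buffer, same_with_assignment, original, rebound = compare_read_styles
puts same_with_buffer       # true
puts same_with_assignment   # false
puts original == rebound    # true: same contents, different objects
```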

Robert Klemme · Dec 7, 2007

Steve Midgley wrote:
> buffer = ''

This line above is completely superfluous.

> until io.eof? do
>   buffer = io.read(10485760)
>   file.write(buffer)
> end

> Thanks Jordan. How is your code different (if at all) from:
>
> buffer = io.read
> file.write(buffer)
> print "I read this stuff ", buffer, "\n"
>
> Am I missing something? I just don't see why buffer is useful - is it
> a performance benefit or some kind of syntax improvement that I'm
> missing?

Yes, the string referenced by buffer is reused. This leads to
improved performance for the typical application which is like this:

buffer = ""
while ( io.read(1024, buffer) )
  file.write buffer
end

> The only thing I can see is that it has some kind of low
> level malloc optimization if the same string size is passed in
> repeatedly during partial writes.

Exactly (see above). Note that it is very inefficient to read with
such a large chunk size as you use in your original posting. If you
want to read the whole file you can simply do io.read.

Kind regards

robert
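
The loop above can be made runnable roughly as follows. This is only a sketch: StringIO objects stand in for 'io' and 'file', and the names CHUNK_SIZE and copy_stream_with_buffer are hypothetical, not from the thread.

```ruby
require 'stringio'

# Copy everything from src to dst in fixed-size chunks, reusing one String.
# CHUNK_SIZE and the method name are hypothetical choices for this sketch.
CHUNK_SIZE = 1024

def copy_stream_with_buffer(src, dst, chunk_size = CHUNK_SIZE)
  buffer = String.new
  copied = 0
  # read(n, buffer) returns nil at end of file, which ends the loop.
  while src.read(chunk_size, buffer)
    dst.write(buffer)
    copied += buffer.bytesize
  end
  copied
end

src = StringIO.new('x' * 5000)   # stand-in for 'io'
dst = StringIO.new               # stand-in for 'file'
copied = copy_stream_with_buffer(src, dst)
puts copied
```

If the goal is only to copy one stream into another, IO.copy_stream(src, dst) (Ruby 1.9 and later) does the chunking for you.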

Jano Svitok · Dec 7, 2007

MonkeeSage wrote:
> I don't know if there is any optimization in the back end, but it lets
> you pass the results of io.read to another method and also put them in
> buffer at the same time. But since you can do that with assignment, I
> don't really see any point to it (I was just trying to give an example
> as the docs describe). To me, unless, as you say, there is some
> optimization going on in the backend, this code...
>
> buffer = ""
> file.write(io.read(nil, buffer))
> print "I read this stuff ", buffer, "\n"
>
> ...looks the same as this code...
>
> file.write(buffer = io.read)
> print "I read this stuff ", buffer, "\n"

I'd *assume* the former saves you a bunch of allocations when looping
through a file (I assume the buffer is reused instead of allocating a
new one for each iteration).

i.e.

buffer = ""
File.open('xxx','r') do |f|
  while f.read(1024, buffer) do
    process(buffer)
  end
end

vs.

File.open('xxx','r') do |f|
  while true do
    buffer = f.read(1024)
    break if buffer.nil?   # read(1024) returns nil, not "", at end of file
    process(buffer)
  end
end
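
Besides allocations, there is one functional difference between the two loops that is easy to trip over: with a reused buffer, every iteration hands out the same String object. The sketch below assumes a StringIO as a stand-in for the file and uses a chunk size of 4 so the effect is visible; neither is from the thread. If the chunks need to outlive the iteration, they have to be copied.

```ruby
require 'stringio'

io = StringIO.new('aaaabbbbcc')   # stand-in for the file; an assumption

buffer = String.new
aliased = []
copied = []
while io.read(4, buffer)
  aliased << buffer        # the same String object every time
  copied << buffer.dup     # a snapshot of the current chunk
end

# Every entry in 'aliased' is the buffer itself, so all entries show
# whatever the buffer holds after the final read call.
p aliased.all? { |s| s.equal?(buffer) }
p copied
```

When each chunk is consumed immediately (written out or processed) and not kept, this aliasing does no harm.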

MonkeeSage · Dec 7, 2007

Jano Svitok wrote:
> I'd *assume* the former saves you a bunch of allocations when looping
> through a file (I assume the buffer is reused instead of allocating a
> new one for each iteration).

I'm not the smartest C programmer (or the smartest anything
programmer), but I'm not seeing any optimization in the actual C code.
Please correct me if I'm wrong.

First, io_read() is the function called in the backend from IO#read.
The relevant lines are:

====
rb_scan_args(argc, argv, "02", &length, &str);

if (NIL_P(length)) {
    if (!NIL_P(str)) StringValue(str);
    GetOpenFile(io, fptr);
    rb_io_check_readable(fptr);
    return read_all(fptr, remain_size(fptr), str);
}
len = NUM2LONG(length);
if (len < 0) {
    rb_raise(rb_eArgError, "negative length %ld given", len);
}

if (NIL_P(str)) {
    str = rb_tainted_str_new(0, len);
}
else {
    StringValue(str);
    rb_str_modify(str);
    rb_str_resize(str,len);
}
====

So we see that we get a new string from rb_tainted_str_new if buffer
is not passed in to IO#read; otherwise str is used and we call
StringValue on it.

So what is StringValue? A macro defined in ruby.h:

====
#define StringValue(v) rb_string_value(&(v))
====

And what is rb_string_value()? A function from string.c:

====
static char *null_str = "";

VALUE
rb_string_value(ptr)
    volatile VALUE *ptr;
{
    VALUE s = *ptr;
    if (TYPE(s) != T_STRING) {
        s = rb_str_to_str(s);
        *ptr = s;
    }
    if (!RSTRING(s)->ptr) {
        FL_SET(s, ELTS_SHARED);
        RSTRING(s)->ptr = null_str;
    }
    return s;
}
====

So if it's not a string, we convert it to one, otherwise we zero it
out.

But the interesting lines are back up in io_read():

====
rb_str_modify(str);
rb_str_resize(str,len);
====

Now rb_str_modify() (string.c) is called with our zeroed string. And
it in turn calls str_make_independent():

====
static void
str_make_independent(str)
    VALUE str;
{
    char *ptr;

    ptr = ALLOC_N(char, RSTRING(str)->len+1);
    if (RSTRING(str)->ptr) {
        memcpy(ptr, RSTRING(str)->ptr, RSTRING(str)->len);
    }
    ptr[RSTRING(str)->len] = 0;
    RSTRING(str)->ptr = ptr;
    RSTRING(str)->aux.capa = RSTRING(str)->len;
    FL_UNSET(str, STR_NOCAPA);
}
====

And finally, rb_str_resize is called:

====
VALUE
rb_str_resize(str, len)
    VALUE str;
    long len;
{
    if (len < 0) {
        rb_raise(rb_eArgError, "negative string size (or size too big)");
    }

    rb_str_modify(str);
    if (len != RSTRING(str)->len) {
        if (RSTRING(str)->len < len || RSTRING(str)->len - len > 1024) {
            REALLOC_N(RSTRING(str)->ptr, char, len+1);
            if (!FL_TEST(str, STR_NOCAPA)) {
                RSTRING(str)->aux.capa = len;
            }
        }
        RSTRING(str)->len = len;
        RSTRING(str)->ptr[len] = '\0'; /* sentinel */
    }
    return str;
}
====

Now, like I said, I'm not the greatest C programmer...but I fail to
see how, if I'm reading the code above correctly, passing in a buffer
string to IO#read is any more optimal than creating a new string (even
when looping many times), since it appears to me to be doing the same
thing (compare str_new from string.c, which is what rb_tainted_str_new
calls).

Regards,
Jordan

----
References:

http://svn.ruby-lang.org/repos/ruby/branches/ruby_1_8/io.c
http://svn.ruby-lang.org/repos/ruby/branches/ruby_1_8/ruby.h
http://svn.ruby-lang.org/repos/ruby/branches/ruby_1_8/string.c

MonkeeSage · Dec 7, 2007

Oh...wait...I'm completely dense. Duh! io_read() is going to create or
re-initialize a string anyway to put its results in. So if I create
a new string independently to store the return value of IO#read, then
I'm causing an extra allocation and copy. Sorry for wasting space.
Have pity on mentally handicapped people like me.

Regards,
Jordan

Robert Klemme · Dec 7, 2007

LOL

Also, allocating a String instance involves not only the raw malloc of
the memory but also the bookkeeping needed for GC, so it is more
expensive than a simple resize. Note also that if you loop with code
like the one I showed, the length of the string instance is adjusted
only *once*, because all chunks have the same length or are shorter
(potentially the last one).

Kind regards

robert
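
The allocation argument can be checked roughly with the sketch below. It assumes CRuby, where GC.stat(:total_allocated_objects) is available (Ruby 2.2 and later); the Tempfile, the sizes, and the method names allocations and count_read_allocations are hypothetical choices for this example. It counts objects created while reading the same file once with a reused buffer and once without. Exact numbers vary by Ruby version, so none are shown.

```ruby
require 'tempfile'

# Number of objects allocated while the block runs (CRuby counter).
def allocations
  before = GC.stat(:total_allocated_objects)
  yield
  GC.stat(:total_allocated_objects) - before
end

# Read the same temporary file twice: with one reused buffer, then with
# a fresh String per chunk. Sizes and names are assumptions for the sketch.
def count_read_allocations(chunk_size, chunks)
  Tempfile.create('alloc-demo') do |f|
    f.write('x' * (chunk_size * chunks))
    f.rewind

    buffer = String.new
    reused = allocations do
      while f.read(chunk_size, buffer)
        # discard the chunk; only the allocation count matters here
      end
    end

    f.rewind
    fresh = allocations do
      while f.read(chunk_size)
        # each call returns a newly allocated String
      end
    end

    [reused, fresh]
  end
end

reused, fresh = count_read_allocations(1024, 1000)
puts "reused buffer: #{reused} objects allocated"
puts "fresh strings: #{fresh} objects allocated"
```

This measures object counts only, not wall-clock time; whether the difference matters for a given program is a separate question.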
