NOT reading an entire file into memory

Devi Web Development · Oct 27, 2007

I am trying to write a parser for a text-based file format. Files in
this format frequently become very large. While the specification
specifically allows applications to crash on large files, I know
several people who have taken to editing these files by hand in
Notepad or other basic text editors. This format is not at all
friendly for this type of editing, and it is extremely tedious work,
but their programs all crash due to the size of these files.
What I really want to know is:
I had been using File.readline and saving a lot of temporary files via
tempfile.rb (http://www.ruby-doc.org/stdlib/libdoc/tempfile/rdoc/index.html).
However, I have heard that File.readline is in fact equivalent to
File.read.split('\n').each, which would really ruin my purpose of not
loading the whole file. I'd really like to keep this in ruby, as I
want to package the whole thing via the wonderful rubyscript2exe, as
well as, of course, a standard rubygem.
What I would actually really love is if there was a way to read lines
4 through 7 without reading the whole file.
My current method has made the program not nearly as beautiful as ruby
ought to be.

Konrad Meyer · Oct 28, 2007

f = File.open("myfile")
# skip the first 3 lines
3.times { f.readline }

# read lines 4 through 7
lines = Array.new(4) { f.readline }
f.close
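One caveat with the snippet above: IO#readline raises EOFError when the file has fewer lines than expected, whereas IO#gets returns nil at end of file. A minimal sketch of the same idea using gets follows; the helper name `lines_after` and its parameters are made up for illustration and are not part of any library.

```ruby
require "tempfile"

# Hypothetical helper: skips `skip` lines, then returns up to `count` lines.
# IO#gets returns nil at end of file, where IO#readline would raise EOFError.
def lines_after(path, skip, count)
  File.open(path) do |f|
    skip.times { f.gets }          # discard the leading lines
    lines = []
    count.times do
      line = f.gets
      break if line.nil?           # file ended early; stop quietly
      lines << line
    end
    lines
  end
end

# Usage with a throwaway 10-line file: prints lines 4 through 7.
Tempfile.create("myfile") do |tmp|
  10.times { |i| tmp.puts("line #{i + 1}") }
  tmp.flush
  puts lines_after(tmp.path, 3, 4)
end
```

The file is still read sequentially from the start, but only up to line 7.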


7stud -- · Oct 28, 2007

Devi said:
I have heard that File.readline is in fact equivalent to
File.read.split('\n').each, which would really ruin my purpose of not
loading the whole file.

I doubt that is true, but as is often the case with Ruby there is no
easily locatable documentation that describes File I/O buffering. Just
in case, here is another solution:

#create a data file containing:
#line 1
#line 2
#...
#line 10

File.open("data.txt", "w") do |file|
  10.times do |i|
    file.puts("line #{i+1}")
  end
end

#read lines 4-7 and display them:
File.open("data.txt") do |file|
  file.each_with_index do |line, i|
    i = i + 1 #i starts at 0

    if i < 4
      next
    elsif i < 8
      puts line
    else
      break
    end

  end
end

7stud -- · Oct 28, 2007

--output:--
line 4
line 5
line 6
line 7

Konrad Meyer · Oct 28, 2007

IO#each_with_index and IO#readline are probably the same internally, so the
real answer here is that NO, IO#readline is NOT the same as
File.read.split('\n'); that's IO#readlines.
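To make the distinction concrete, here is a small sketch: IO#readline returns a single String per call, while IO#readlines returns an Array of every remaining line. The helper name `first_and_rest` and the temporary file are invented for this illustration.

```ruby
require "tempfile"

# Hypothetical helper: returns the first line, then all remaining lines.
def first_and_rest(path)
  File.open(path) do |file|
    first = file.readline   # a single String: just the next line
    rest  = file.readlines  # an Array holding every remaining line
    [first, rest]
  end
end

Tempfile.create("sample") do |tmp|
  3.times { |i| tmp.puts("line #{i + 1}") }
  tmp.flush
  first, rest = first_and_rest(tmp.path)
  p first  # "line 1\n"
  p rest   # ["line 2\n", "line 3\n"]
end
```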


7stud -- · Oct 28, 2007

Konrad said:
IO#each_with_index and IO#readline are probably the same internally, so
the real answer here is that NO, IO#readline is NOT the same as
File.read.split('\n'), that's IO#readlines.

The real question is: does readline do any buffering? What about
each()? If a file has ten lines in it, does ruby access the file ten
times? Or, does ruby read some reasonable amount of data into a buffer?

Konrad Meyer · Oct 28, 2007


Performance isn't everything. If it was, you wouldn't be using ruby. The
idea is that this will work "well enough", shouldn't take too much thought
on the programmer's behalf, and doesn't load the entire (huge) file into ram.


Robert Klemme · Oct 28, 2007

7stud said:
The real question is: does readline do any buffering? What about
each()? If a file has ten lines in it, does ruby access the file ten
times? Or, does ruby read some reasonable amount of data into a buffer?

Ruby does buffering but will not read the whole file unless asked to do so.

There are several ways to access only lines 4 through 7. For example:

# 1
require 'enumerator' # pre 1.9
File.to_enum(:foreach, "foo.dat").each_with_index do |line, idx|
  case idx
  when 0...3
    # ignore
  when 3...7
    puts line
  else
    break # or return or exit
  end
end

# 2
File.open("foo.dat") do |io|
  io.each do |line|
    case io.lineno
    when 1...4
      # ignore
    when 4..7
      puts line
    else
      break
    end
  end
end

# 3
File.foreach "foo.dat" do |line|
  case $.
  when 1...4
    # ignore
  when 4..7
    puts line
  else
    break
  end
end
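The three variants above can be wrapped in one reusable method. The following is only a sketch: the name `line_range` is hypothetical, and it assumes Ruby 1.9 or later, where File.foreach without a block returns an Enumerator.

```ruby
require "tempfile"

# Hypothetical helper: returns lines first..last (1-based, inclusive) and
# stops reading as soon as the last wanted line has been collected.
def line_range(path, first, last)
  wanted = []
  return wanted if last < first
  File.foreach(path).with_index(1) do |line, number|
    next if number < first         # not there yet
    wanted << line
    break if number >= last        # done; do not read the rest of the file
  end
  wanted
end

# Usage with a throwaway 10-line file: prints lines 4 through 7.
Tempfile.create("foo") do |tmp|
  10.times { |i| tmp.puts("line #{i + 1}") }
  tmp.flush
  puts line_range(tmp.path, 4, 7)
end
```

Lines 1 through 3 still have to be read and discarded, because line boundaries are only known once the preceding bytes have been scanned.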

Kind regards

robert

Ken Bloom · Oct 29, 2007

7stud said:
The real question is: does readline do any buffering?

It must. There's no POSIX call that can read until the end of a line, so
you have to read(2) a bunch of data, look for a newline, and if there's
no newline in it you have to read more. If there is a newline in it, then
you have to buffer everything you read that comes after the newline.
That's life with POSIX.
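The read-scan-buffer cycle described above can be sketched in Ruby itself. This is an illustration only, not how Ruby's IO is implemented: the class name `ChunkedLineReader` and the default chunk size are assumptions made for the example.

```ruby
require "tempfile"

# Hypothetical class for illustration; Ruby's IO already does this for you.
class ChunkedLineReader
  def initialize(io, chunk_size = 4096) # chunk size is an arbitrary choice
    @io = io
    @chunk_size = chunk_size
    @buffer = String.new                # binary buffer of bytes read so far
  end

  # Returns the next line (including its "\n"), or nil at end of input.
  # Lines come back as binary (ASCII-8BIT) strings in this sketch.
  def next_line
    # Keep reading chunks until the buffer contains a newline or input ends.
    until (newline_at = @buffer.index("\n"))
      chunk = @io.read(@chunk_size)     # returns nil at end of file
      break if chunk.nil?
      @buffer << chunk
    end

    if newline_at
      @buffer.slice!(0..newline_at)     # hand out one line, keep the rest
    elsif @buffer.empty?
      nil                               # nothing left at all
    else
      @buffer.slice!(0..-1)             # final line without trailing newline
    end
  end
end

# Usage: a deliberately tiny chunk size forces several reads per line.
Tempfile.create("data") do |tmp|
  tmp.write("ab\ncd\nef")
  tmp.flush
  File.open(tmp.path, "rb") do |file|
    reader = ChunkedLineReader.new(file, 2)
    while (line = reader.next_line)
      p line
    end
  end
end
```

Whatever follows the newline stays in the buffer for the next call, which is why the whole file never has to be in memory at once.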

The standard C library has fgets(3) which can find a newline, but it
probably does its own buffering internally, for the same reasons that
other POSIX apps would.

Ruby uses fread(3), the C library's equivalent of read(2), so ruby has to
do its own buffering.

7stud said:
What about each()? If a file has ten lines in it, does ruby access the
file ten times? Or, does ruby read some reasonable amount of data into a
buffer?

rb_io_each_line implements IO#each_line and IO#each. It boils down to a
loop:

while (!NIL_P(str = rb_io_getline(rs, io))) {
    rb_yield(str);
}

and rb_io_getline reads only as much as it feels is necessary to find
that newline. It doesn't put the whole file in memory at once.

--Ken
