NOT reading an entire file into memory

  • Thread starter Devi Web Development

Devi Web Development

I am trying to write a parser for a text-based file format. Files in
this format frequently become very large. While the specification
specifically allows applications to crash on large files, I know
several people who have taken to editing these files by hand in
Notepad or other basic text editors. This format is not at all
friendly for this type of editing, and it is extremely tedious work,
but their programs all crash due to the size of these files.
What I really want to know is:
I had been using File.readline and saving a lot of temporary files via
tempfile.rb (http://www.ruby-doc.org/stdlib/libdoc/tempfile/rdoc/index.html).
However, I have heard that File.readline is in fact equivalent to
File.read.split('\n').each, which would really ruin my purpose of not
loading the whole file. I'd really like to keep this in Ruby, as I
want to package the whole thing via the wonderful rubyscript2exe, as
well as, of course, a standard rubygem.
What I would actually really love is if there was a way to read lines
4 through 7 without reading the whole file.
My current method has made the program not nearly as beautiful as ruby
ought to be.
 
Konrad Meyer

Quoth Devi Web Development:
What I would actually really love is if there was a way to read lines
4 through 7 without reading the whole file.
My current method has made the program not nearly as beautiful as ruby
ought to be.
-------------------------------------------
Daniel Brumbaugh Keeney
Devi Web Development
(e-mail address removed)
-------------------------------------------

lines = File.open("myfile") do |f|
  # skip lines 1 through 3
  3.times { f.readline }
  # read lines 4 through 7 (readline raises EOFError if the file is shorter)
  Array.new(4) { f.readline }
end

--
Konrad Meyer (e-mail address removed) http://konrad.sobertillnoon.com/


7stud --

Devi said:
I have heard that File.readline is in fact equivalent to
File.read.split('\n').each, which would really ruin my purpose of not
loading the whole file.

I doubt that is true, but as is often the case with Ruby there is no
easily locatable documentation that describes File I/O buffering. Just
in case, here is another solution:

# create a data file containing:
# line 1
# line 2
# ...
# line 10

File.open("data.txt", "w") do |file|
  10.times do |i|
    file.puts("line #{i+1}")
  end
end


# read lines 4-7 and display them:
File.open("data.txt") do |file|
  file.each_with_index do |line, i|
    i = i + 1 # i starts at 0

    if i < 4
      next
    elsif i < 8
      puts line
    else
      break
    end

  end
end
 

Konrad Meyer

Quoth 7stud --:
I doubt that is true, but as is often the case with Ruby there is no
easily locatable documentation that describes File I/O buffering. Just
in case, here is another solution:

IO#each_with_index and IO#readline are probably the same internally, so the
real answer here is that NO, IO#readline is NOT the same as
File.read.split('\n'), that's IO#readlines.
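To make the difference concrete, here is a minimal sketch. The helper names and the temporary ten-line sample file are assumptions for illustration, not anything from the original poster's parser. IO#readline hands back one line per call, while IO#readlines collects every line into an Array:

```ruby
require "tempfile"

# Hypothetical helper names, used only for this illustration.
def first_line(path)
  File.open(path) { |f| f.readline }   # one String: only the first line is returned
end

def every_line(path)
  File.open(path) { |f| f.readlines }  # an Array holding every line of the file
end

# Assumed sample data: a temporary ten-line file.
sample = Tempfile.new("data")
10.times { |i| sample.puts("line #{i + 1}") }
sample.close

p first_line(sample.path)          # "line 1\n"
p every_line(sample.path).length   # 10
```

So readlines is the call to avoid on huge files; readline (or gets, or each) only ever returns one line at a time.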


7stud --

Konrad said:
IO#each_with_index and IO#readline are probably the same internally, so
the real answer here is that NO, IO#readline is NOT the same as
File.read.split('\n'), that's IO#readlines.

The real question is: does readline do any buffering? What about
each()? If a file has ten lines in it, does ruby access the file ten
times? Or, does ruby read some reasonable amount of data into a buffer?
 

Konrad Meyer

Quoth 7stud --:
The real question is: does readline do any buffering? What about
each()? If a file has ten lines in it, does ruby access the file ten
times? Or, does ruby read some reasonable amount of data into a buffer?

Performance isn't everything. If it was, you wouldn't be using Ruby. The idea
is that this will work "well enough", shouldn't take too much thought on the
programmer's behalf, and doesn't load the entire (huge) file into RAM.


Robert Klemme

The real question is: does readline do any buffering? What about
each()? If a file has ten lines in it, does ruby access the file ten
times? Or, does ruby read some reasonable amount of data into a buffer?

Ruby does buffering but will not read the whole file unless asked to do so.

There are several ways to access only lines 4 through 7. For example:

# 1
require 'enumerator' # needed only before Ruby 1.9
File.to_enum(:foreach, "foo.dat").each_with_index do |line, idx|
  case idx
  when 0...3
    # ignore lines 1 through 3 (idx is zero-based)
  when 3...7
    puts line
  else
    break # or return or exit
  end
end


# 2
File.open("foo.dat") do |io|
  io.each do |line|
    case io.lineno
    when 1...4
      # ignore
    when 4..7
      puts line
    else
      break
    end
  end
end

# 3
File.foreach "foo.dat" do |line|
  case $.
  when 1...4
    # ignore
  when 4..7
    puts line
  else
    break
  end
end
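On newer Rubies (2.0 and later, which postdates this thread) the same idea can probably be written with a lazy enumerator. This is only a sketch: lines_between is a hypothetical helper name, and the sample file is made up for the demonstration.

```ruby
require "tempfile"

# Hypothetical helper: returns lines first..last (1-based) as an Array.
def lines_between(path, first, last)
  File.foreach(path)             # without a block this returns an Enumerator
      .lazy                      # evaluate on demand, one line at a time
      .drop(first - 1)           # skip the lines before the range
      .first(last - first + 1)   # stop iterating once the range is complete
end

# Assumed sample data: a temporary ten-line file.
sample = Tempfile.new("foo")
10.times { |i| sample.puts("line #{i + 1}") }
sample.close

puts lines_between(sample.path, 4, 7)
```

As with the three variants above, lines 1 through 3 are still read and thrown away; the saving is that nothing after line 7 is touched.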

Kind regards

robert
 

Ken Bloom

The real question is: does readline do any buffering?

It must. There's no POSIX call that can read until the end of a line, so
you have to read(2) a bunch of data, look for a newline, and if there's
no newline in it you have to read more. If there is a newline in it, then
you have to buffer everything you read that comes after the newline.
That's life with POSIX.

The standard C library has fgets(3) which can find a newline, but it
probably does its own buffering internally, for the same reasons that
other POSIX apps would.

Ruby uses fread(3), the C library's equivalent of read(2), so ruby has to
do its own buffering.
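The read-a-chunk, look-for-a-newline, keep-the-leftovers cycle described above can be sketched in Ruby itself. To be clear, this is an illustration of the technique and not how Ruby's IO is actually implemented; ChunkedLineReader and its 4096-byte default are made-up names and values.

```ruby
require "stringio"

# Hypothetical class for illustration only.
class ChunkedLineReader
  def initialize(io, chunk_size = 4096)
    @io = io
    @chunk_size = chunk_size
    @buffer = String.new   # bytes read from the source but not yet returned
    @eof = false
  end

  # Returns the next line (including its newline), or nil at end of input.
  def gets
    loop do
      newline_at = @buffer.index("\n")
      # slice! hands back the line and leaves the remainder in the buffer
      return @buffer.slice!(0..newline_at) if newline_at
      break if @eof
      fill_buffer
    end
    # End of input: return a final line that has no trailing newline, if any.
    @buffer.empty? ? nil : @buffer.slice!(0..-1)
  end

  private

  # Reads one more chunk; sysread raises EOFError when nothing is left.
  def fill_buffer
    @buffer << @io.sysread(@chunk_size)
  rescue EOFError
    @eof = true
  end
end

# A tiny chunk size forces several reads per line.
reader = ChunkedLineReader.new(StringIO.new("line 1\nline 2\nline 3"), 4)
while (line = reader.gets)
  puts line
end
```

At no point does the buffer hold more than one chunk plus whatever partial line came before it, which is why this style of reading copes with very large files.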

What about
each()? If a file has ten lines in it, does ruby access the file ten
times? Or, does ruby read some reasonable amount of data into a buffer?

rb_io_each_line implements IO#each_line and IO#each. It boils down to a
loop:

while (!NIL_P(str = rb_io_getline(rs, io))) {
    rb_yield(str);
}

and rb_io_getline reads only as much as it feels is necessary to find
that newline. It doesn't put the whole file in memory at once.
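A rough Ruby-level analogue of that C loop, as a sketch only (each_line_sketch is a hypothetical name, and a StringIO stands in for a real file), shows that breaking out early leaves the rest of the input unread:

```ruby
require "stringio"

# Hypothetical Ruby analogue of the loop in rb_io_each_line.
def each_line_sketch(io)
  while (line = io.gets)   # gets returns nil at end of input, like the NIL_P check
    yield line
  end
end

io = StringIO.new("a\nb\nc\nd\n")
seen = []
each_line_sketch(io) do |line|
  seen << line
  break if seen.size == 2   # stop after two lines
end
remaining = io.gets

p seen        # ["a\n", "b\n"]
p remaining   # "c\n": the third line was still waiting to be read
```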

--Ken
 
