regexing a file's contents without reading the whole thing?

Roger Pack

I see that it is possible currently to parse through a file without
reading the whole thing into RAM, a la

a = File.open('a', 'r')
a.each_line { |line|   # IO#lines was removed in Ruby 3.0; each_line is the equivalent
  if line =~ /some regex/
    # ...
  end
}

But what if I want to do something like
a = File.read('a').scan /some regex/

is that possible?

Thanks.
-r
 
Joel VanderWerf

File.open('/usr/share/dict/words').grep /ruby/i
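The one-liner above works without slurping the file because IO mixes in Enumerable, so grep walks the file through each, one line at a time. A minimal sketch of the same idea that also closes the file when iteration finishes, assuming Ruby 1.8.7 or later (where File.foreach without a block returns an enumerator). The helper name matching_lines is made up for illustration:

```ruby
# Hypothetical helper: returns the lines of the file at +path+ that match +regex+.
# File.foreach without a block returns an Enumerator that reads one line at a
# time, so only the matching lines are held in memory, and the file is closed
# once the iteration is done.
def matching_lines(path, regex)
  File.foreach(path).grep(regex)
end

# Usage (path taken from the post above):
#   matching_lines('/usr/share/dict/words', /ruby/i)
```

Like the original, this only finds matches that sit entirely within one line.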
 
Robert Klemme

2009/11/30 Roger Pack said:
I see that it is possible currently to parse through a file without
reading the whole thing into RAM, a la

a = File.open('a', 'r')
a.lines{|line|
  if line =~ /some regex/
    ...
  end
}

But what if I can to do something like
a = File.read('a').scan /some regex/

is that possible?

If you know that matches will never cross line breaks you can do

a = []
File.foreach("a") do |line|
  line.scan(/regex/) do |m|
    a << m
  end
  # alternative to the inner block (use one or the other, not both):
  # a.concat(line.scan(/regex/))
end

If matches can cross line breaks, the whole story becomes more
complicated, and your solution with File.read is probably the simplest
way to do it (if the files aren't too large).
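
To give an idea of what "more complicated" can look like, here is a rough sketch of one possible approach, not something proposed in this thread: read the file in fixed-size chunks and keep a small tail of the buffer around so that a match straddling a chunk boundary is still seen whole. The helper name scan_file and its max_len and chunk_size parameters are invented for this example. It assumes that no match is longer than max_len bytes, that the pattern never matches the empty string, that it uses no anchors or lookbehind, and that the pattern is plain ASCII (the buffer is treated as raw bytes).

```ruby
# Hypothetical helper: returns every match of +regex+ in the file at +path+
# while keeping only about chunk_size + max_len bytes of the file in memory.
# Assumptions: no match exceeds max_len bytes, the pattern never matches the
# empty string, uses no anchors or lookbehind, and is ASCII-only.
def scan_file(path, regex, max_len: 4096, chunk_size: 64 * 1024)
  matches = []
  buffer = String.new                      # binary (ASCII-8BIT) buffer
  File.open(path, "rb") do |io|
    loop do
      chunk = io.read(chunk_size)          # nil once the end of file is reached
      eof = chunk.nil?
      buffer << chunk unless eof
      # A match starting at or after +limit+ might still grow when more data
      # arrives, so it is deferred to the next round unless we are at EOF.
      limit = eof ? buffer.length : [buffer.length - max_len, 0].max
      pos = 0
      while (m = regex.match(buffer, pos)) && m.begin(0) < limit
        matches << m[0]
        pos = m.end(0)
      end
      break if eof
      # Drop everything already consumed or provably free of match starts.
      buffer = buffer[[pos, limit].max..-1]
    end
  end
  matches
end

# Usage, under the assumptions above:
#   scan_file('a', /foo\nbar/, max_len: 7)
```

For files that fit comfortably in RAM, File.read('a').scan(...) remains the simpler choice.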

Kind regards

robert

-- 
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/
 
Caleb Clausen

The library which makes this possible is sequence. I'm coding this
from memory, so I'm likely to get something wrong, but the equivalent
in sequence looks more or less like this:

require 'rubygems'
require 'sequence'
require 'sequence/file'

seq=Sequence.new(File.open('a'))
seq.scan_until(/some regex/)

Keep the following in mind:
1) Sequence#scan works like StringScanner#scan, not String#scan.
2) The pattern to be matched must have a maximum length (4k by default, I
think; it can be changed).
3) If your pattern is guaranteed not to contain a newline, you're better
off reading line by line, as Robert said.
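
To illustrate point 1, here is a small sketch of how the two scan styles differ. It uses the standard library's StringScanner on an in-memory string, not the sequence gem, so it only shows the StringScanner behaviour that Sequence is said above to follow. The sample text is made up.

```ruby
require 'strscan'

text = "one 22 three 4444"

# String#scan collects every match in the whole string at once.
all = text.scan(/\d+/)

# StringScanner#scan only tries to match at the current scan position and
# returns nil if the pattern does not match right there.
ss = StringScanner.new(text)
at_start = ss.scan(/\d+/)       # position 0 is "o", so no match here

# scan_until moves forward to the end of the next match and returns
# everything from the old position up to and including that match.
upto = ss.scan_until(/\d+/)
last = ss.matched               # just the matched part
```

Here all is ["22", "4444"], at_start is nil, upto is "one 22" and last is "22". So to collect all matches with a scanner-style API you call scan_until in a loop and pick up the matched text each time.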
 
