Record-separator is a regular expression

William James

=begin

Unlike Gawk and Mawk, Ruby won't accept a regular expression as a
record-separator. Let's fix that. The substring matched by the
record-separator is automatically removed from the record, but it
can be obtained by RecSep#terminator.

Typical usage:

File.open("stuff.txt"){|handle|
  reader = RecSep.new( handle, /^\d+\.\n/ )
  reader.each {|x| p x }
}

=end


class RecSep

  def initialize( file_handle, record_separator )
    @handle = file_handle
    @buffer = ""
    @rec_sep = record_separator
    @terminator = nil
  end

  def get_rec
    ## The record-separator may be something like /\n\s*\n/,
    ## so we read enough to let it match as much as possible.
    loop do
      @rec_sep.match( @buffer )
      break if $~ && $~.post_match.size > 0
      s = @handle.gets( "\n" )
      break if not s
      @buffer << s
    end

    if $~
      @buffer = $~.post_match
      @terminator = $~.to_s
      $~.pre_match
    else
      @terminator = nil
      return nil if "" == @buffer
      s, @buffer = @buffer, ""
      s
    end
  end

  def each
    while s = self.get_rec
      yield s
    end
  end

  def terminator
    @terminator
  end

end
 
Gavin Kistner

Unlike Gawk and Mawk, Ruby won't accept a regular expression as a
record-separator.

Er, huh?

data = <<ENDDATA
name-----age-----size
Gavin 32 33
ENDDATA

p data.split( /-+| +|\n/ )
#=> ["name", "age", "size", "Gavin", "32", "33"]
 
James Edward Gray II

Unlike Gawk and Mawk, Ruby won't accept a regular expression as a
record-separator.

Er, huh?

data = <<ENDDATA
name-----age-----size
Gavin 32 33
ENDDATA

p data.split( /-+| +|\n/ )
#=> ["name", "age", "size", "Gavin", "32", "33"]

William is talking about the separator used by IO objects, $/.
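
To make the distinction concrete, here is a minimal sketch. It uses a StringIO in place of a real file, and the sample string is made up for illustration. The IO-style readers take a String separator (the same role $/ plays by default), while a Regexp in that position raises a TypeError:

```ruby
require "stringio"

# StringIO stands in for a file handle; the data is invented for the demo.
io = StringIO.new("alpha--beta--gamma")

# A String separator works; each record keeps its trailing separator.
records = io.each_line("--").to_a
p records            #=> ["alpha--", "beta--", "gamma"]

# A Regexp in the separator position is rejected.
io.rewind
regexp_rejected =
  begin
    io.gets(/-+/)
    false
  rescue TypeError
    true
  end
p regexp_rejected    #=> true
```

String#split, by contrast, accepts a Regexp, but it needs the whole input in memory first.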

James Edward Gray II
 
Robert Klemme

William James said:
=begin

Unlike Gawk and Mawk, Ruby won't accept a regular expression as a
record-separator. Let's fix that. The substring matched by the
record-separator is automatically removed from the record, but it
can be obtained by RecSep#terminator.

Typical usage:

File.open("stuff.txt"){|handle|
reader = RecSep.new( handle, /^\d+\.\n/ )
reader.each {|x| p x }
}

I'd prefer something integrated with IO, e.g.

File.open("foo") {|io| io.each_chunk(/:/) {|ch| p ch}}

module RegularIOChunks
  def each_chunk(rx, read_buffer = 1024)
    buff = ""
    loop do
      until ( match = ( rx.match( buff ) ) )
        part = read(read_buffer)

        if part.nil?
          yield buff
          return self
        end

        buff << part
      end

      yield match.pre_match
      buff = match.post_match
    end
  end
end

class IO
  include RegularIOChunks
end

Kind regards

robert
 
Daniel Berger

Robert said:
I'd prefer something integrated with IO, e.g.

File.open("foo") {|io| io.each_chunk(/:/) {|ch| p ch}}

module RegularIOChunks
def each_chunk(rx, read_buffer = 1024)
buff = ""
loop do
until ( match = ( rx.match( buff ) ) )
part = read(read_buffer)

if part.nil?
yield buff
return self
end

buff << part
end

yield match.pre_match
buff = match.post_match
end
end
end

class IO
include RegularIOChunks
end

Kind regards

robert

This would *not* be easy to implement. Consider backtracking (do we put it
back in the stream?) and greediness (how much do we read?). Unless you want to
forbid greedy regular expressions and ignore backtracking (not to mention
certain switches), this gets real ugly, real quick.

This has come up wrt Perl as well on p5p. Take a look here for one thread in
midstream:

http://www.nntp.perl.org/group/perl.perl5.porters/64830

Rumor has it that setting $/ to a regex will be legal in Perl 6, but I think
there will be several restrictions.

Regards,

Dan
 
Robert Klemme

Daniel said:
This would *not* be easy to implement. Consider backtracking (do we
put it back in the stream?) and greediness (how much do we read?).
Unless you want to forbid greedy regular expressions and ignore
backtracking (not to mention certain switches), this gets real ugly,
real quick.

Right! My main point was that I'd prefer a solution that is integrated
with IO, i.e. no extra instance needs to be created (at least not
explicitly). Just a question of usability.

One implementation option would be to continue reading not until the first
match but until matches don't differ any more. That would deal at least
with cases like /a{3,10}/ where the sequence is cut in the middle of a
sequence of 10 "a"'s. And you would get a match for the first half while
you wanted to match the whole sequence.
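
The truncation described above can be reproduced without any IO at all. This is only a sketch: the sample data and the split point (8 characters) are made up to force the cut in the middle of the run of "a"s.

```ruby
sep  = /a{3,10}/
data = "one" + "a" * 10 + "two"

# Pretend the first read ended in the middle of the run of a's.
buffer = data[0, 8]            # "oneaaaaa"
early  = sep.match(buffer)
p early[0].size                #=> 5
p early.post_match             #=> ""

# Once the rest arrives, the same regexp matches the whole run.
buffer << data[8..-1]
full = sep.match(buffer)
p full[0].size                 #=> 10
p full.post_match              #=> "two"
```

Note that the early match ends exactly at the end of the buffer (its post_match is empty). That is one possible hint that more input should be read before the match is trusted.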
This has come up wrt Perl as well on p5p. Take a look here for one
thread in midstream:

http://www.nntp.perl.org/group/perl.perl5.porters/64830

Rumor has it that setting $/ to a regex will be legal in Perl 6, but
I think there will be several restrictions.

As you mention, the general problem with applying regexps is a conceptual
one: because of greedy quantifiers in the worst case the whole file is
read into memory (just consider using /.+/ as delimiter) which doesn't fit
well with the streaming approach. :)
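
A small sketch of that worst case, with invented chunk contents. One assumption to flag: the /m flag is added here so that "." also matches newlines. Without it, /.+/ stops at the first newline and the effect only shows up on input that contains no newlines.

```ruby
sep    = /.+/m
buffer = String.new
chunks = ["first chunk\n", "second chunk\n", "third chunk\n"]

leftovers = chunks.map do |chunk|
  buffer << chunk
  m = sep.match(buffer)
  # The greedy match always runs to the end of whatever has been read.
  m.post_match.size
end

p leftovers    #=> [0, 0, 0]
```

Since nothing is ever left over after the match, a reader that waits for the match to stop growing has no choice but to keep reading until EOF.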

Kind regards

robert
 
William James

This version reads farther ahead in an attempt to cope
with greedy regular expressions.

=begin

Unlike Gawk and Mawk, Ruby won't accept a regular expression as a
record-separator. Let's fix that. The substring matched by the
record-separator is automatically removed from the record, but it
can be obtained by RecSep#terminator.

Typical usage:

File.open("stuff.txt"){|handle|
  reader = RecSep.new( handle, /^\d+\.\n/ )
  reader.each {|x| p x }
}

=end


class RecSep

  def initialize( file_handle, record_separator, chunk_size=10_000 )
    @handle = file_handle
    @rec_sep = record_separator
    @chunk_size = chunk_size
    @buffer = ""
    @terminator = nil
  end

  attr_reader :terminator, :buffer

  def get_rec
    ## The record-separator may be something like /\n\s*\n/,
    ## so we read until there's something left over in the buffer
    ## after the match.
    loop do
      @rec_sep.match( @buffer )
      break if $~ && $~.post_match.size > 0
      s = @handle.read( @chunk_size )
      break if not s
      @buffer << s
    end

    if $~
      @buffer = $~.post_match
      @terminator = $~.to_s
      $~.pre_match
    else
      @terminator = nil
      return nil if "" == @buffer
      s, @buffer = @buffer, ""
      s
    end
  end

  def each
    while s = self.get_rec
      yield s
    end
  end

end
 
William James

Third version. And here's an example of using it to remove
all html tags from a file:

File.open("data1.htm"){|handle|
  reader = RecSep.new( handle, /<.*?>/m )
  reader.each {|x| print x }
}

-----------------------------------------------------------

=begin

Unlike Gawk and Mawk, Ruby won't accept a regular expression as a
record-separator. Let's fix that. The substring matched by the
record-separator is automatically removed from the record, but it
can be obtained by RecSep#terminator.

Typical usage:

File.open("stuff.txt"){|handle|
  reader = RecSep.new( handle, /^\d+\.\n/ )
  reader.each {|x| p x }
}

Sometimes it may be necessary to keep the regular expression from
matching less than it should by increasing the look-ahead distance
(measured in characters):

File.open("stuff.txt"){|handle|
  reader = RecSep.new( handle, /(^.*\n)\1+/m, 4096 )
  reader.each {|x| p x }
}

=end

class RecSep

  def initialize( file_handle, record_separator,
                  minimal_look_ahead = 1024 )
    @handle = file_handle
    @rec_sep = record_separator
    @min_look_ahead = minimal_look_ahead
    @buffer = ""
    @terminator = nil
    @count = 0
  end

  attr_reader :terminator, :count, :buffer

  def get_rec
    ## Make sure the buffer has a reasonable amount of material.
    if @buffer.size < (3 * @min_look_ahead / 2) && !@handle.eof?
      @buffer << @handle.read( 2 * @min_look_ahead - @buffer.size)
    end
    ## To cope with all kinds of greedy regular expressions,
    ## we read until there are at least @min_look_ahead bytes
    ## left over in the buffer after the match.
    loop do
      @rec_sep.match( @buffer )
      break if $~ && $~.post_match.size >= @min_look_ahead
      s = @handle.read( @min_look_ahead )
      break if not s
      @buffer << s
    end

    if $~
      @buffer = $~.post_match
      @terminator = $~.to_s
      @count += 1
      $~.pre_match
    else
      @terminator = nil
      return nil if "" == @buffer
      @count += 1
      s, @buffer = @buffer, ""
      s
    end
  end

  def each
    while s = self.get_rec
      yield s
    end
  end

end
 
