Simple regexp question

Tom

I'm having trouble doing something with regular expressions in Ruby
that should be simple.

All I want to do is find each successive regexp, and its offset in a
string. The regexp may have multiple capture groups in it.

The obvious answers of split/scan/index/match all fail, as each of
them fails to return some necessary piece of data.

This is trivial to do in other languages, so I feel I must be missing
something.

Example:

If I have a string like " blah blah 7pm something happens 8pm
something else happens 9pm something different"

I want to use the times to split, and get the text between the times.

Why scan doesn't work
I have no trouble getting the times:

string = " blah blah 7pm something happens 8pm something else happens 9pm something different"
timepattern = /(\d{1,2})(:\d\d)?\s?([aApP]\.?[mM]\.?)/

irb(main):005:0> string.scan(timepattern)
=> [["7", nil, "pm"], ["8", nil, "pm"], ["9", nil, "pm"]]

This gives me exactly what I want about the times, but no way to find
what was between the matches.
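
One possible way around this, offered as a sketch rather than a definitive answer: when scan is given a block, the MatchData for the current match is available through Regexp.last_match, so the offsets can be collected as the scan runs. The helper name matches_with_text is hypothetical, and the pattern is my reading of the one above (hour, optional ":mm", am/pm marker).

```ruby
# Hypothetical helper: returns [MatchData, text up to the next match] pairs.
def matches_with_text(string, pattern)
  matches = []
  # Inside the block, Regexp.last_match is the MatchData of the current match.
  string.scan(pattern) { matches << Regexp.last_match }
  matches.each_with_index.map do |m, i|
    # The text runs from the end of this match to the start of the next one,
    # or to the end of the string after the last match.
    stop = i + 1 < matches.size ? matches[i + 1].begin(0) : string.length
    [m, string[m.end(0)...stop]]
  end
end

string = " blah blah 7pm something happens 8pm something else happens 9pm something different"
# Assumed form of the time pattern.
timepattern = /(\d{1,2})(:\d\d)?\s?([aApP]\.?[mM]\.?)/

matches_with_text(string, timepattern).each do |m, text|
  puts "#{m[0]} at offset #{m.begin(0)}: #{text.inspect}"
end
```

Each pair carries the full MatchData, so the capture groups and the offsets stay together with the text that follows the match.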

Why split doesn't work

If I use split, I can get everything, but in a format that is useless
to me (and to anybody, I'd guess).

irb(main):006:0> string.split(timepattern)
=> [" blah blah ", "7", "pm", " something happens ", "8", "pm", "
something else happens ", "9", "pm", " something different"]

This gives me everything mixed together, and since capture groups
that did not match are left out, you can't figure out which elements
are parts of a regexp match and which are text between matches.

Why index() doesn't work

Using string.index(timepattern) allows me to walk through the string
by passing the offset, but doesn't return the regexp, so I can get
the data, but no times.
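
A minimal sketch of a possible workaround, assuming String#index updates the regexp match globals as a side effect (it does in the Ruby versions I am aware of): Regexp.last_match can be read right after each index call to recover the MatchData. The helper name each_match_by_index is hypothetical.

```ruby
# Hypothetical helper: yields each MatchData in turn by walking with index.
def each_match_by_index(string, pattern)
  offset = 0
  while (pos = string.index(pattern, offset))
    m = Regexp.last_match   # set by String#index when given a Regexp
    yield m
    # Resume after this match; step one extra character on an empty match
    # so the loop cannot stall.
    offset = m.end(0) > pos ? m.end(0) : pos + 1
  end
end

string = " blah blah 7pm something happens 8pm something else happens 9pm something different"
# Assumed form of the time pattern.
timepattern = /(\d{1,2})(:\d\d)?\s?([aApP]\.?[mM]\.?)/

each_match_by_index(string, timepattern) do |m|
  puts "#{m[0]} spans #{m.begin(0)}...#{m.end(0)}"
end
```

The offsets reported by the MatchData are relative to the whole string, not to the search offset, so no substring copies are needed to locate the text in between.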

Why match doesn't work
timepattern.match(string) returns the regexp, so I get the times, and
I get a starting offset, so I can find the data, but I can't figure
out how to do a "next match", since match doesn't take an offset, so
this is of no use. This is where I really feel I must be missing
something, since it's hard to believe something so fundamental is missing.

The Java equivalent of MatchData has a next-match function that is
commonly used, so I don't quite understand why it's missing here.

What's wrong with post_match & slice
One can traverse the matches like this

def reg_split(r, string)
  # Each pass yields one match plus the text that follows it.
  while (match = r.match(string))
    rest = match.post_match
    next_match = r.match(rest)
    # The text runs up to the next match, or to the end of the string.
    length = next_match ? next_match.begin(0) : rest.length
    text = rest.slice(0, length)
    yield(match, text)
    # Continue from the start of the next match.
    string = rest.slice(length, rest.length - length)
  end
end

But each slice is creating (I believe) a new string object, so the
copying grows roughly as n*n/2, which is horrible with any large string.

What I'd really like
If the Regexp class did a yield on matches, it would be a very nice
thing. It would be more Ruby-like, and would give people an easy way
to iterate through matches.

For example:
r = /foo/
r.match(string) { | matchdata | puts matchdata[0]}

Or even just a regex.match(string, offset)
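
For what it's worth, later Ruby versions (1.9 onward, as far as I know) do accept a start position as a second argument to Regexp#match. Under that assumption, a traversal that never copies the remainder of the string could look like the sketch below. The name reg_split_at is hypothetical, and the sketch assumes the pattern never matches the empty string.

```ruby
# Hypothetical helper: yields (MatchData, following text) for each match.
# Assumes Regexp#match(string, pos) is available and that the pattern
# cannot match an empty string (otherwise the loop would not advance).
def reg_split_at(pattern, string)
  match = pattern.match(string)
  while match
    # Search for the next match starting where this one ended.
    next_match = pattern.match(string, match.end(0))
    stop = next_match ? next_match.begin(0) : string.length
    yield match, string[match.end(0)...stop]
    match = next_match
  end
end

string = " blah blah 7pm something happens 8pm something else happens 9pm something different"
# Assumed form of the time pattern.
timepattern = /(\d{1,2})(:\d\d)?\s?([aApP]\.?[mM]\.?)/

reg_split_at(timepattern, string) do |m, text|
  puts "#{m[0]} -> #{text.inspect}"
end
```

Only the text between two matches is copied on each step, so the total copying stays proportional to the length of the string.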

Any suggestions?