problems with oniguruma lookahead


Xiong Chiamiov

Ruby 1.8.6 with Oniguruma installed and working (everywhere else, this
seems to be my problem).

Let me preface this by saying that I am new to Ruby (and kinda jumped
in, rather than learning it properly), and regexes are not my thing -
that's why I have nifty regex-checkers.

I am trying to extract some parts out of a string
("<p><b>'Algebra'</b><br>") that I scraped from some html. I'm getting
nil returned from the expression:

Oniguruma::ORegexp.new("(?<=<p><b>').*(?='</b><br>)").scan(scraped_html)

with scraped_html being the string mentioned above.

Doing some experimenting, I have found that the first part works just as
planned (i.e., everything except the lookahead). Using wildcards (. and
*) in the lookahead works as well:

Oniguruma::ORegexp.new("(?<=<p><b>').*(?=.)").scan(scraped_html)

returns [#<MatchData "Foo'</b><br">, #<MatchData "Bar'</b><br">], as
expected. However, anything else (<, b, \w, etc.) causes the regex to
not match.

I am quite befuddled about this, though I (almost certainly) know it is
my fault. Any help would be much appreciated.

Also, if I am violating any mailing-list netiquette, I would like to
know as well.
 

Robert Klemme

2008/9/10 Xiong Chiamiov said:
> I'm getting nil returned from the expression:
>
> Oniguruma::ORegexp.new("(?<=<p><b>').*(?='</b><br>)").scan(scraped_html)
>
> with scraped_html being the string mentioned above.

With 1.9:

irb(main):001:0> s="<p><b>'Algebra'</b><br>"
=> "<p><b>'Algebra'</b><br>"
irb(main):002:0> s.scan %r{(?<=<p><b>').*(?='</b><br>)}
=> []
irb(main):003:0> s.scan %r{(?<=<p><b>').*?(?='</b><br>)}
=> ["Algebra"]

Note the non-greedy match. In those cases I usually do this instead:

irb(main):005:0> s.scan %r{<p><b>'(.*?)'</b><br>}
=> [["Algebra"]]

I.e. use groups to extract the part that I am interested in.
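
To make the greedy/non-greedy difference concrete, here is a minimal sketch using Ruby's built-in Regexp (1.9 or later, so no Oniguruma gem is needed). The two-entry input string and the variable names are assumptions made for illustration; they are not from the original post. The outputs quoted elsewhere in this thread come from 2008-era builds and may differ on current Ruby.

```ruby
# Hypothetical input with two entries, so that the difference between
# greedy and non-greedy matching becomes visible.
html = "<p><b>'Algebra'</b><br><p><b>'Geometry'</b><br>"

# Greedy: .* runs to the end of the line, then backtracks only as far as
# the LAST closing marker, so both entries end up in a single capture.
greedy = html.scan(%r{<p><b>'(.*)'</b><br>}).flatten

# Non-greedy: .*? stops at the FIRST closing marker after each opening one.
lazy = html.scan(%r{<p><b>'(.*?)'</b><br>}).flatten

# Lookaround variant of the non-greedy pattern. With no capture group,
# scan returns the matched text itself instead of nested arrays.
lookaround = html.scan(%r{(?<=<p><b>').*?(?='</b><br>)})

p greedy      # one long string spanning both entries
p lazy        # ["Algebra", "Geometry"]
p lookaround  # ["Algebra", "Geometry"]
```

The capture-group form has the advantage of not depending on lookbehind support, which is why it also suits older regex engines.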

Kind regards

robert
 

Xiong Chiamiov

Robert said:
> irb(main):003:0> s.scan %r{(?<=<p><b>').*?(?='</b><br>)}
> => ["Algebra"]
>
> Note the non-greedy match.

Ah, thank you very much. My regex learning was with PHP (PCRE, not
POSIX), which has some odd rules, especially regarding
greedy/non-greedy, so I'm still trying to recover from that.

Thanks again.
 
