problems with oniguruma lookahead

Discussion in 'Ruby' started by Xiong Chiamiov, Sep 9, 2008.

  1. Ruby 1.8.6 with Oniguruma installed and working (everywhere else, this
    seems to be my problem).

    Let me preface this by saying that I am new to Ruby (and kinda jumped
    in, rather than learning it properly), and regexes are not my thing -
    that's why I have nifty regex-checkers.

    I am trying to extract some parts out of a string
    ("<p><b>'Algebra'</b><br>") that I scraped from some html. I'm getting
    nil returned from the expression:

    Oniguruma::ORegexp.new("(?<=<p><b>').*(?='</b><br>)").scan(scraped_html)

    with scraped_html being the string mentioned above.

    Doing some experimenting, I have found that the first part works just as
    planned (i.e., everything except the lookahead). Using wildcards (. and
    *) works as well:

    Oniguruma::ORegexp.new("(?<=<p><b>').*(?=.)").scan(scraped_html)

    returns [#<MatchData "Foo'</b><br">, #<MatchData "Bar'</b><br">], as
    expected. However, anything else (<, b, \w, etc.) causes the regex to
    not match.

    I am quite befuddled about this, though I (almost certainly) know it is
    my fault. Any help would be much appreciated.

    Also, if I am violating any mailing-list netiquette, I would like to
    know as well.
    --
    Posted via http://www.ruby-forum.com/.
     
    Xiong Chiamiov, Sep 9, 2008
    #1

  2. 2008/9/10 Xiong Chiamiov <>:
    > Ruby 1.8.6 with Oniguruma installed and working (everywhere else, this
    > seems to be my problem).
    >
    > Let me preface this by saying that I am new to Ruby (and kinda jumped
    > in, rather than learning it properly), and regexes are not my thing -
    > that's why I have nifty regex-checkers.
    >
    > I am trying to extract some parts out of a string
    > ("<p><b>'Algebra'</b><br>") that I scraped from some html. I'm getting
    > nil returned from the expression:
    >
    > Oniguruma::ORegexp.new("(?<=<p><b>').*(?='</b><br>)").scan(scraped_html)
    >
    > with scraped_html being the string mentioned above.
    >
    > Doing some experimenting, I have found that the first part works just as
    > planned (i.e., everything except the lookahead). Using wildcards (. and
    > *) works as well:
    >
    > Oniguruma::ORegexp.new("(?<=<p><b>').*(?=.)").scan(scraped_html)
    >
    > returns [#<MatchData "Foo'</b><br">, #<MatchData "Bar'</b><br">], as
    > expected. However, anything else (<, b, \w, etc.) causes the regex to
    > not match.
    >
    > I am quite befuddled about this, though I (almost certainly) know it is
    > my fault. Any help would be much appreciated.


    With 1.9:

    irb(main):001:0> s="<p><b>'Algebra'</b><br>"
    => "<p><b>'Algebra'</b><br>"
    irb(main):002:0> s.scan %r{(?<=<p><b>').*(?='</b><br>)}
    => []
    irb(main):003:0> s.scan %r{(?<=<p><b>').*?(?='</b><br>)}
    => ["Algebra"]

    Note the non-greedy match. In such cases I usually prefer to do this:

    irb(main):005:0> s.scan %r{<p><b>'(.*?)'</b><br>}
    => [["Algebra"]]

    I.e. use groups to extract the part that I am interested in.
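
    As a minimal sketch of both approaches side by side, assuming a
    hypothetical input with two scraped entries (the sample string and the
    variable names are made up for illustration, and the built-in Regexp of
    Ruby 1.9 or later is used rather than Oniguruma::ORegexp):

    ```ruby
    # Hypothetical sample input: two scraped entries instead of one.
    html = "<p><b>'Foo'</b><br><p><b>'Bar'</b><br>"

    # Lazy quantifier between the lookbehind and the lookahead:
    # each match stops at the first closing '</b><br> it can reach.
    lazy = html.scan(%r{(?<=<p><b>').*?(?='</b><br>)})

    # Capture group instead of lookarounds: scan returns one array per
    # match, so flatten turns the result into a plain list of strings.
    grouped = html.scan(%r{<p><b>'(.*?)'</b><br>}).flatten

    p lazy
    p grouped
    ```

    Both calls should yield the same list of names here. The group version
    avoids lookbehind entirely, which tends to be easier to read and port
    to other regex engines.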

    Kind regards

    robert

    --
    use.inject do |as, often| as.you_can - without end
     
    Robert Klemme, Sep 10, 2008
    #2

  3. Robert Klemme wrote:
    > With 1.9:
    >
    > irb(main):001:0> s="<p><b>'Algebra'</b><br>"
    > => "<p><b>'Algebra'</b><br>"
    > irb(main):002:0> s.scan %r{(?<=<p><b>').*(?='</b><br>)}
    > => []
    > irb(main):003:0> s.scan %r{(?<=<p><b>').*?(?='</b><br>)}
    > => ["Algebra"]
    >
    > Note the non-greedy match. In such cases I usually prefer to do this:
    >
    > irb(main):005:0> s.scan %r{<p><b>'(.*?)'</b><br>}
    > => [["Algebra"]]
    >
    > I.e. use groups to extract the part that I am interested in.
    >
    > Kind regards
    >
    > robert


    Ah, thank you very much. My regex learning was with PHP (PCRE, not
    POSIX), which has some odd rules, especially regarding
    greedy/non-greedy, so I'm still trying to recover from that.

    Thanks again.
    --
    Posted via http://www.ruby-forum.com/.
     
    Xiong Chiamiov, Sep 10, 2008
    #3
