str.scan

Discussion in 'Ruby' started by Colin Summers, Jun 15, 2007.

  1. I have a page of html, the usual thing. It has an ordered list. So it has
    <ol>
    <li>item</li>
    <li>item</li>
    <li>item</li>
    <li>item</li>
    </ol>

    Well, I am going through this my usual way, which is just brute force
    string manipulation. It's still my first day with Ruby. Then I see

    str.scan
    Both forms iterate through str, matching the pattern (which may be a
    Regexp or a String). For each match, a result is generated and either
    added to the result array or passed to the block. If the pattern
    contains no groups, each individual result consists of the matched
    string, $&. If the pattern contains groups, each individual result is
    itself an array containing one entry per group.
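As a small illustration of the documentation quoted above, here is a sketch of both forms of scan, with and without groups. The sample string and variable names are made up for the example and are not from the original post.

```ruby
text = "cat hat"

# No groups: each result is the matched string itself.
words = text.scan(/\w+/)
# words == ["cat", "hat"]

# With groups: each result is an array with one entry per group.
pairs = text.scan(/(\w)(at)/)
# pairs == [["c", "at"], ["h", "at"]]

# Block form: each result is passed to the block instead of being collected.
seen = []
text.scan(/\w+/) { |word| seen << word.upcase }
# seen == ["CAT", "HAT"]
```

The block form is handy when the matches only need to be processed once and no result array is wanted.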



    And I think, oooh, I bet that would be cool to use here. But my regexp
    is rusty and I'm not sure how I would set it up
    items = page.scan('<li>*</li>')
    something like that? Then items would be an array of the text in the items?

    Looked cool, anyway. I love how terse it can be.

    There's probably also an html/xml parsing library, but I don't have
    THAT much of this stuff to do, so I think a little manual work is
    probably simpler/easier to learn.

    --Colin
     
    Colin Summers, Jun 15, 2007
    #1

  2. Colin Summers

    Peter Szinek Guest

    Colin,

    > But my regexp
    > is rusty and I'm not sure how I would set it up
    > items = page.scan('<li>*</li>')
    > something like that? Then items would be an array of the text in the items?


    Yes, they will be.

    However, first things first:

    1) items = page.scan('<li>*</li>')

    What I believe you want instead is

    items = page.scan(/<li>.*<\/li>/)

    ( or maybe items = page.scan(/<li>.+<\/li>/) if you are not interested in
    empty <li>s)

    2) What I really believe you want is

    items = page.scan(/<li>.*?<\/li>/)

    ? adds greediness to your regexp - so instead of matching the first
    <li>, then matching as much as possible of anything, then matching the
    *last* </li>, 2) will match as little as possible.

    Let's try:

    stuff = <<HTML
    <li>aaa</li>
    <li>bbb</li>
    HTML

    >> stuff.scan(/<li>.*?<\/li>/)

    => ["<li>aaa</li>", "<li>bbb</li>"]

    3) Maybe you want even this:

    >> stuff.scan(/<li>(.*?)<\/li>/)

    => [["aaa"], ["bbb"]]

    or, even more friendly:

    >> stuff.scan(/<li>(.*?)<\/li>/).flatten

    => ["aaa", "bbb"]
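One detail worth checking when trying these: scan accepts either a String or a Regexp, but a String pattern is matched as literal text, so regexp operators such as .*? only take effect inside a Regexp literal. A minimal sketch, using a made-up html variable that mirrors the heredoc above:

```ruby
html = "<li>aaa</li>\n<li>bbb</li>\n"

# A String pattern is matched literally: this looks for the exact
# characters "<li>.*?</li>", which do not occur in the input.
literal_hits = html.scan('<li>.*?</li>')
# literal_hits == []

# A literal String still works when the text really is literal.
open_tags = html.scan('<li>')
# open_tags == ["<li>", "<li>"]

# A Regexp literal gives .*? its usual meaning.
regexp_hits = html.scan(/<li>.*?<\/li>/)
# regexp_hits == ["<li>aaa</li>", "<li>bbb</li>"]
```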

    HTH,
    Peter
    _
    http://www.rubyrailways.com :: Ruby and Web2.0 blog
    http://scrubyt.org :: Ruby web scraping framework
    http://rubykitchensink.ca/ :: The indexed archive of all things Ruby.
     
    Peter Szinek, Jun 15, 2007
    #2

  3. Colin Summers

    Peter Szinek Guest

    Peter Szinek, Jun 15, 2007
    #3
  4. Colin Summers

    Phrogz Guest

    On Jun 15, 12:13 am, Peter Szinek wrote:
    > 2) What I really believe you want is
    >
    > items = page.scan(/<li>.*?<\/li>/)
    >
    > ? adds greediness to your regexp - so instead of matching the first
    > <li>, then matching as much as possible of anything, then matching the
    > *last* </li>, 2) will match as little as possible.


    Minor pedantic correction: .* is greedy (it grabs as much as it can).
    The question mark makes it non-greedy (stop as soon as you've found a
    match).
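To make the greedy versus non-greedy difference concrete, here is a short sketch. The sample strings are invented for the example; the one-line input is what exposes the difference, since by default . does not match a newline.

```ruby
# Two items on a single line.
line = "<li>aaa</li><li>bbb</li>"

# Greedy: .* runs to the last </li> on the line, giving one big match.
greedy = line.scan(/<li>.*<\/li>/)
# greedy == ["<li>aaa</li><li>bbb</li>"]

# Non-greedy: .*? stops at the first </li> it can, giving one match per item.
lazy = line.scan(/<li>.*?<\/li>/)
# lazy == ["<li>aaa</li>", "<li>bbb</li>"]

# One item per line: without the /m flag, . does not cross the newline,
# so even the greedy pattern yields one match per line here.
multi = "<li>aaa</li>\n<li>bbb</li>\n"
greedy_multi = multi.scan(/<li>.*<\/li>/)
# greedy_multi == ["<li>aaa</li>", "<li>bbb</li>"]
```

The non-greedy form is the safer default for this kind of extraction because it does not depend on how the markup happens to be split across lines.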
     
    Phrogz, Jun 15, 2007
    #4
  5. stuff.scan(/<li>(.*?)<\/li>/).flatten

    is exactly what I was hoping for. I peeked at scRUBYt and I know that
    I am duplicating work in there, but I am trying to do a bunch of things
    at once and one is learning Ruby. scRUBYt is doing so much work for me
    that I wouldn't learn very much.

    The tcl code that stuff.scan(/<li>(.*?)<\/li>/).flatten replaces is so long.
    That's great.

    Day 2: Have my pickaxe. Bought Pine's book because it was fun to read
    on the web and I like having books. Bought another copy of Lenz' Rails
    book because a friend liked it so much he took it. 115 lines and I am
    ahead of where the professional consultant was with the .NET
    application (after a month of programming).

    Thanks,
    --Colin
     
    Colin Summers, Jun 15, 2007
    #5
