Listing all the HTML links

Discussion in 'Ruby' started by Dado, May 3, 2006.

  1. Dado

    Dado Guest

    How can I use Ruby to list all the HTML links on a site?

    Thanks
     
    Dado, May 3, 2006
    #1

  2. Dado

    Dado Guest

    After running this code I get:


    :~$ ruby list.rb
    list.rb:5: Invalid char `\302' in expression
    list.rb:5: Invalid char `\240' in expression
    list.rb:5: Invalid char `\302' in expression
    list.rb:5: Invalid char `\240' in expression
    list.rb:5: Invalid char `\302' in expression
    list.rb:5: Invalid char `\240' in expression
    list.rb:6: Invalid char `\302' in expression
    list.rb:6: Invalid char `\240' in expression
    list.rb:6: Invalid char `\302' in expression
    list.rb:6: Invalid char `\240' in expression
    list.rb:6: Invalid char `\302' in expression
    list.rb:6: Invalid char `\240' in expression
    list.rb:6: Invalid char `\302' in expression
    list.rb:6: Invalid char `\240' in expression
    list.rb:6: Invalid char `\302' in expression
    list.rb:6: Invalid char `\240' in expression
    list.rb:7: Invalid char `\302' in expression
    list.rb:7: Invalid char `\240' in expression
    list.rb:7: Invalid char `\302' in expression
    list.rb:7: Invalid char `\240' in expression
    list.rb:7: Invalid char `\302' in expression
    list.rb:7: Invalid char `\240' in expression
    list.rb:7: Invalid char `\302' in expression
    list.rb:7: Invalid char `\240' in expression
    list.rb:7: Invalid char `\302' in expression
    list.rb:7: Invalid char `\240' in expression
    list.rb:8: Invalid char `\302' in expression
    list.rb:8: Invalid char `\240' in expression
    list.rb:8: Invalid char `\302' in expression
    list.rb:8: Invalid char `\240' in expression
    list.rb:8: Invalid char `\302' in expression
    list.rb:8: Invalid char `\240' in expression
    list.rb:8: Invalid char `\302' in expression
    list.rb:8: Invalid char `\240' in expression
    list.rb:8: Invalid char `\302' in expression
    list.rb:8: Invalid char `\240' in expression
    list.rb:9: Invalid char `\302' in expression
    list.rb:9: Invalid char `\240' in expression
    list.rb:9: Invalid char `\302' in expression
    list.rb:9: Invalid char `\240' in expression
    list.rb:9: Invalid char `\302' in expression
    list.rb:9: Invalid char `\240' in expression
    list.rb:9: Invalid char `\302' in expression
    list.rb:9: Invalid char `\240' in expression
    list.rb:9: Invalid char `\302' in expression
    list.rb:9: Invalid char `\240' in expression
    list.rb:9: Invalid char `\302' in expression
    list.rb:9: Invalid char `\240' in expression
    list.rb:9: Invalid char `\302' in expression
    list.rb:9: Invalid char `\240' in expression
    list.rb:10: Invalid char `\302' in expression
    list.rb:10: Invalid char `\240' in expression
    list.rb:10: Invalid char `\302' in expression
    list.rb:10: Invalid char `\240' in expression
    list.rb:10: Invalid char `\302' in expression
    list.rb:10: Invalid char `\240' in expression
    list.rb:10: Invalid char `\302' in expression
    list.rb:10: Invalid char `\240' in expression
    list.rb:10: Invalid char `\302' in expression
    list.rb:10: Invalid char `\240' in expression
    list.rb:10: Invalid char `\302' in expression
    list.rb:10: Invalid char `\240' in expression
    list.rb:10: Invalid char `\302' in expression
    list.rb:10: Invalid char `\240' in expression
    list.rb:11: Invalid char `\302' in expression
    list.rb:11: Invalid char `\240' in expression
    list.rb:11: Invalid char `\302' in expression
    list.rb:11: Invalid char `\240' in expression
    list.rb:11: Invalid char `\302' in expression
    list.rb:11: Invalid char `\240' in expression
    list.rb:11: Invalid char `\302' in expression
    list.rb:11: Invalid char `\240' in expression
    list.rb:11: Invalid char `\302' in expression
    list.rb:11: Invalid char `\240' in expression
    list.rb:12: Invalid char `\302' in expression
    list.rb:12: Invalid char `\240' in expression
    list.rb:12: Invalid char `\302' in expression
    list.rb:12: Invalid char `\240' in expression
    list.rb:12: Invalid char `\302' in expression
    list.rb:12: Invalid char `\240' in expression

    Jeffrey Schwab wrote:

    > Dado wrote:
    >> How can I use Ruby to list all the HTML links on a site?

    >
    > require 'open-uri'
    >
    > def scrape(url)
    >   open(url) do |uri|
    >     href = /href\s*=(\s*(?:"(.*?)"|[^>\s]))/
    >     m = href.match(uri.read)
    >     while m
    >       puts m[1]
    >       m = href.match(m.post_match)
    >     end
    >   end
    > end
    >
    > scrape('http://www.ruby-lang.org/en/')
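
The `\302` and `\240` in the errors above are the octal bytes 0xC2 0xA0, which is the UTF-8 encoding of a non-breaking space (U+00A0). That character typically sneaks in when code is copied from a web page that indents with non-breaking spaces. A minimal sketch of cleaning up pasted source, assuming a hypothetical helper name `normalize_spaces` (not from the thread):

```ruby
# Hypothetical helper (not from the thread): swap each non-breaking space
# (U+00A0, bytes \302\240 in UTF-8) for an ordinary space.
NBSP = "\u00A0"

def normalize_spaces(source)
  source.gsub(NBSP, ' ')
end

# A pasted line whose indentation is made of non-breaking spaces.
pasted  = "def scrape(url)\n\u00A0\u00A0open(url) do |uri|\n"
cleaned = normalize_spaces(pasted)
puts cleaned.include?(NBSP)   # prints false
```

To repair a file in place, something like `File.write('list.rb', normalize_spaces(File.read('list.rb', encoding: 'UTF-8')))` should do, assuming the file is UTF-8.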
     
    Dado, May 4, 2006
    #2

  3. Dado

    anne001 Guest

    require 'open-uri'

    def scrape(url)
      open(url) do |uri|
        href = /href\s*=(\s*(?:"(.*?)"|[^>\s]))/
        m = href.match(uri.read)
        while m
          puts m[1]
          m = href.match(m.post_match)
        end
      end
    end

    scrape('http://www.ruby-lang.org/en/')
    works for me
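
On Ruby 3.0 and later, plain `open` no longer accepts a URL even with open-uri loaded, so the snippet above needs `URI.open` there. A rough sketch of an updated version follows; the names `HREF_PATTERN` and `extract_hrefs` are made up for illustration, and the pattern is a variant of the one in the thread (the unquoted branch uses `[^>\s]+`, since `[^>\s]` alone matches only a single character):

```ruby
require 'open-uri'

# Variant of the thread's pattern (an assumption, not the original):
# group 1 holds a double-quoted value, group 2 an unquoted one.
HREF_PATTERN = /href\s*=\s*(?:"(.*?)"|([^>\s]+))/

# Return every href value found in an HTML string.
def extract_hrefs(html)
  html.scan(HREF_PATTERN).map { |quoted, bare| quoted || bare }
end

# Fetch a page and list its href values (needs network access).
def scrape(url)
  URI.open(url) { |page| extract_hrefs(page.read) }
end
```

Usage would be along the lines of `puts scrape('http://www.ruby-lang.org/en/')`. This is still a regexp over raw HTML, so single-quoted attributes and other edge cases are not handled cleanly.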

    Regular expression: href = /href\s*=(\s*(?:"(.*?)"|[^>\s]))/
    What is it saying? \s is whitespace, () captures a group, [] identifies
    a character set...

    How does the loop work?
    I found post_match in Programming Ruby, page 538.
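
As a rough illustration of the loop: each pass matches once, then searches again in `post_match`, the part of the string after the previous match, until `match` returns nil. The sketch below runs the same pattern against a made-up HTML string; the names `HREF_LOOP` and `each_href` are hypothetical.

```ruby
HREF_LOOP = /href\s*=(\s*(?:"(.*?)"|[^>\s]))/

# Collect group 1 of every match, restarting the search after each hit.
def each_href(html)
  found = []
  m = HREF_LOOP.match(html)
  while m
    found << m[1]                       # text captured by the outer ()
    m = HREF_LOOP.match(m.post_match)   # search only what follows the match
  end
  found
end

html = '<a href="a.html">A</a> <a href="b.html">B</a>'
p each_href(html)
# String#scan walks the string the same way in a single call.
p html.scan(HREF_LOOP).map(&:first)
```

Both `p` calls print the same two values, quotes included, because group 1 wraps the quoted alternative.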

    I added some puts calls. The first time around, m and m[1] are:
    href="mailto:"
    "mailto:"
    Why is the second line m[1]? Is it because of the set of
    parentheses?
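
A small sketch of how the groups line up, using the same pattern on a made-up tag: `m[0]` is the whole match, `m[1]` is the outer pair of parentheses (which still contains the quotes), `(?:...)` groups without capturing, and `m[2]` is the inner `(.*?)`. The names `HREF_GROUPS` and `href_groups` are hypothetical.

```ruby
HREF_GROUPS = /href\s*=(\s*(?:"(.*?)"|[^>\s]))/

# Return the whole match plus capture groups 1 and 2 for the first href.
def href_groups(html)
  m = HREF_GROUPS.match(html)
  [m[0], m[1], m[2]]
end

whole, group1, group2 = href_groups('<a href="mailto:">mail</a>')
puts whole    # href="mailto:"
puts group1   # "mailto:"
puts group2   # mailto:
```

So printing `m[2]` instead of `m[1]` would give the link without the surrounding quotes, at least for double-quoted attributes.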

    Thanks for your help.
     
    anne001, May 5, 2006
    #3
  4. Dado

    Ross Bamford Guest

    On Wed, 03 May 2006 22:27:35 +0100, Dado wrote:

    > How can I use Ruby to list all the HTML links on a site?
    >


    An alternative to the regexp approach, if you don't mind using external
    libraries:

    # Option 1: Rubyful Soup
    require 'open-uri'
    require 'rubyful_soup' # [1]
    page = BeautifulSoup.new(URI('http://ruby-lang.org').read)
    page.find_all('a').each { |l| puts l['href'] }

    # Option 2: Mechanize
    # (newer Mechanize releases name the class Mechanize, without WWW::)
    require 'mechanize' # [2]
    m = WWW::Mechanize.new
    page = m.get('http://ruby-lang.org')
    page.links.each { |l| puts l.href }

    --
    [1] http://www.crummy.com/software/RubyfulSoup/
    [2] http://mechanize.rubyforge.org/

    Ross Bamford
     
    Ross Bamford, May 5, 2006
    #4
  5. Vincent Foley

    require 'open-uri'
    URI.extract(open(<url>).read)
     
    Vincent Foley, May 5, 2006
    #5
  6. Dado

    Ross Bamford Guest

    On Fri, 05 May 2006 19:16:05 +0100, Vincent Foley wrote:

    > require 'open-uri'
    > URI.extract(open(<url>).read)
    >


    Unfortunately, that pulls in a lot of false positives, and it doesn't
    differentiate between links and other URIs (e.g. link and src elements,
    DTD refs, etc.).

    pp URI.extract(URI('http://www.google.com').read)
    ["font-family:arial,sans-serif;",
    "font-size:",
    "color:#0000cc;",
    "http://www.google.co.uk/ig%3Fhl%3Den",
    "https://www.google.com/accounts/Login?continue=http://www.google.co.uk/&hl=en",
    "http://groups.google.co.uk/grphp?hl=en&tab=wg&ie=UTF-8",
    "http://news.google.co.uk/nwshp?hl=en&tab=wn&ie=UTF-8",
    "http://froogle.google.co.uk/frghp?hl=en&tab=wf&ie=UTF-8",
    "Search:",
    "http://www.google.com/ncr"]


    --
    Ross Bamford
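
One way to cut down the false positives mentioned above is the optional second argument of `URI.extract`, which limits matches to the given schemes. This is only a sketch on a made-up HTML string (the names `SAMPLE_HTML` and `http_uris` are hypothetical), and `URI.extract` is marked obsolete in current Ruby, so treat it as an illustration rather than a recommendation:

```ruby
require 'uri'

# Made-up page fragment: a style attribute, a link and an image.
SAMPLE_HTML = '<p style="color:#0000cc;">' \
              '<a href="http://example.com/page">x</a>' \
              '<img src="http://example.com/logo.png"></p>'

# Keep only http/https URIs; "color:..." style fragments are dropped.
def http_uris(text)
  URI.extract(text, %w[http https])
end

puts http_uris(SAMPLE_HTML)
```

The scheme filter removes things like `color:#0000cc;`, but the image URL still comes back alongside the link, so it does not solve the second problem of telling links apart from other URIs.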
     
    Ross Bamford, May 5, 2006
    #6
  7. Dado

    anne001 Guest

    Thank you for your clear explanations.
     
    anne001, May 6, 2006
    #7