Listing all the HTML links

Dado

After running this code I get:


:~$ ruby list.rb
list.rb:5: Invalid char `\302' in expression
list.rb:5: Invalid char `\240' in expression
list.rb:5: Invalid char `\302' in expression
list.rb:5: Invalid char `\240' in expression
list.rb:5: Invalid char `\302' in expression
list.rb:5: Invalid char `\240' in expression
list.rb:6: Invalid char `\302' in expression
list.rb:6: Invalid char `\240' in expression
list.rb:6: Invalid char `\302' in expression
list.rb:6: Invalid char `\240' in expression
list.rb:6: Invalid char `\302' in expression
list.rb:6: Invalid char `\240' in expression
list.rb:6: Invalid char `\302' in expression
list.rb:6: Invalid char `\240' in expression
list.rb:6: Invalid char `\302' in expression
list.rb:6: Invalid char `\240' in expression
list.rb:7: Invalid char `\302' in expression
list.rb:7: Invalid char `\240' in expression
list.rb:7: Invalid char `\302' in expression
list.rb:7: Invalid char `\240' in expression
list.rb:7: Invalid char `\302' in expression
list.rb:7: Invalid char `\240' in expression
list.rb:7: Invalid char `\302' in expression
list.rb:7: Invalid char `\240' in expression
list.rb:7: Invalid char `\302' in expression
list.rb:7: Invalid char `\240' in expression
list.rb:8: Invalid char `\302' in expression
list.rb:8: Invalid char `\240' in expression
list.rb:8: Invalid char `\302' in expression
list.rb:8: Invalid char `\240' in expression
list.rb:8: Invalid char `\302' in expression
list.rb:8: Invalid char `\240' in expression
list.rb:8: Invalid char `\302' in expression
list.rb:8: Invalid char `\240' in expression
list.rb:8: Invalid char `\302' in expression
list.rb:8: Invalid char `\240' in expression
list.rb:9: Invalid char `\302' in expression
list.rb:9: Invalid char `\240' in expression
list.rb:9: Invalid char `\302' in expression
list.rb:9: Invalid char `\240' in expression
list.rb:9: Invalid char `\302' in expression
list.rb:9: Invalid char `\240' in expression
list.rb:9: Invalid char `\302' in expression
list.rb:9: Invalid char `\240' in expression
list.rb:9: Invalid char `\302' in expression
list.rb:9: Invalid char `\240' in expression
list.rb:9: Invalid char `\302' in expression
list.rb:9: Invalid char `\240' in expression
list.rb:9: Invalid char `\302' in expression
list.rb:9: Invalid char `\240' in expression
list.rb:10: Invalid char `\302' in expression
list.rb:10: Invalid char `\240' in expression
list.rb:10: Invalid char `\302' in expression
list.rb:10: Invalid char `\240' in expression
list.rb:10: Invalid char `\302' in expression
list.rb:10: Invalid char `\240' in expression
list.rb:10: Invalid char `\302' in expression
list.rb:10: Invalid char `\240' in expression
list.rb:10: Invalid char `\302' in expression
list.rb:10: Invalid char `\240' in expression
list.rb:10: Invalid char `\302' in expression
list.rb:10: Invalid char `\240' in expression
list.rb:10: Invalid char `\302' in expression
list.rb:10: Invalid char `\240' in expression
list.rb:11: Invalid char `\302' in expression
list.rb:11: Invalid char `\240' in expression
list.rb:11: Invalid char `\302' in expression
list.rb:11: Invalid char `\240' in expression
list.rb:11: Invalid char `\302' in expression
list.rb:11: Invalid char `\240' in expression
list.rb:11: Invalid char `\302' in expression
list.rb:11: Invalid char `\240' in expression
list.rb:11: Invalid char `\302' in expression
list.rb:11: Invalid char `\240' in expression
list.rb:12: Invalid char `\302' in expression
list.rb:12: Invalid char `\240' in expression
list.rb:12: Invalid char `\302' in expression
list.rb:12: Invalid char `\240' in expression
list.rb:12: Invalid char `\302' in expression
list.rb:12: Invalid char `\240' in expression

Jeffrey said:
Dado said:
How can I use Ruby to list all the HTML links on a site?

require 'open-uri'

def scrape(url)
  open(url) do |uri|
    href = /href\s*=(\s*(?:"(.*?)"|[^>\s]))/
    m = href.match(uri.read)
    while m
      puts m[1]
      m = href.match(m.post_match)
    end
  end
end

scrape('http://www.ruby-lang.org/en/')
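
A note on the errors above: the octal bytes \302 \240 are the UTF-8 encoding of a non-breaking space (U+00A0), which often sneaks in when code is copied from a web page, so the indentation in list.rb is probably made of those rather than ordinary spaces. A minimal sketch for cleaning such a file, assuming a Ruby version that understands "\u" escapes (1.9 or later); strip_nbsp is a made-up helper name, not something from this thread:

```ruby
# U+00A0 (non-breaking space) is stored in UTF-8 as the bytes 0xC2 0xA0,
# which the old parser reports as the octal pair \302 \240.
NBSP = "\u00A0"

# Return a copy of source with every non-breaking space replaced
# by an ordinary space.
def strip_nbsp(source)
  source.gsub(NBSP, ' ')
end

# Possible usage (rewrites the file in place, so keep a backup):
#   File.write('list.rb', strip_nbsp(File.read('list.rb', encoding: 'UTF-8')))
```

Retyping the indentation by hand in a plain-text editor should have the same effect.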
 
anne001

require 'open-uri'
def scrape(url)
  open(url) do |uri|
    href = /href\s*=(\s*(?:"(.*?)"|[^>\s]))/
    m = href.match(uri.read)
    while m
      puts m[1]
      m = href.match(m.post_match)
    end
  end
end

scrape('http://www.ruby-lang.org/en/')
works for me

About the regular expression: href = /href\s*=(\s*(?:"(.*?)"|[^>\s]))/
What is it saying? \s matches whitespace, () captures a group, and [] defines a character set...

How does the loop work?
I found post_match in Programming Ruby, page 538.

I added some puts calls. The first time around, m and m[1] print as:
href="mailto:[email protected]"
"mailto:[email protected]"
Why is the second line m[1]? Is it because of the set of parentheses?

thanks for your help
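
To see what the groups hold, here is a small sketch that runs the same regexp and the same match / post_match loop over a fixed string instead of a live page. The sample HTML and the href_groups name are made up for illustration. m[0] is the whole match, m[1] is the outer parenthesized group (quotes included, which is why the second line printed above still has them), and m[2] is the inner (.*?) group without the quotes. The (?: ... ) part only groups and does not capture, so it gets no number.

```ruby
HREF = /href\s*=(\s*(?:"(.*?)"|[^>\s]))/

# Collect [m[0], m[1], m[2]] for every href found in html, using the
# same match / post_match loop as the scrape method in this thread.
def href_groups(html)
  found = []
  m = HREF.match(html)
  while m
    found << [m[0], m[1], m[2]]
    # post_match is the text after the current match, so the next
    # search starts where this one ended.
    m = HREF.match(m.post_match)
  end
  found
end

# Made-up sample input with two quoted links.
sample = '<a href="mailto:someone@example.com">mail</a> <a href="/en/">home</a>'
href_groups(sample).each do |whole, outer, inner|
  puts "m[0]=#{whole}  m[1]=#{outer}  m[2]=#{inner}"
end
```

The loop ends when match returns nil, that is, when no href is left in the remaining text. Printing m[2] instead of m[1] in scrape would drop the surrounding quotes for quoted links.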
 
Ross Bamford

How can I use Ruby to list all the HTML links on a site?

An alternative to the regexp approach, if you don't mind using external
libraries:

require 'open-uri'
require 'rubyful_soup' # [1]
page = BeautifulSoup.new(URI('http://ruby-lang.org').read)
page.find_all('a').each { |l| puts l['href'] }

require 'mechanize' # [2]
m = WWW::Mechanize.new
page = m.get('http://ruby-lang.org')
page.links.each { |l| puts l.href }
 
Ross Bamford

require 'open-uri'
URI.extract(open(<url>).read)

Unfortunately, that pulls in a lot of false positives, and it doesn't
differentiate between links and other URIs (e.g. link src elements, DTD
refs, etc.).
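
One way to cut down on the false positives is the optional second argument to URI.extract, a list of schemes to keep. A rough sketch, with a made-up http_links helper and sample string; note that URI.extract is marked obsolete in newer Ruby versions, and this still does not tell href links apart from other http URLs on the page:

```ruby
require 'uri'

# Return only the http and https URIs found in text. The scheme list
# is the optional second argument of URI.extract.
def http_links(text)
  URI.extract(text, %w[http https])
end

# Made-up sample mixing a CSS declaration with a real link.
sample = 'font-size: <a href="http://www.google.com/ncr">Google</a>'
puts http_links(sample)
```

Without the scheme list, anything shaped like "word:" can come back as a URI, as the output that follows shows.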

pp URI.extract(URI('http://www.google.com').read)
["font-family:arial,sans-serif;",
"font-size:",
"color:#0000cc;",
"http://www.google.co.uk/ig?hl=en",
"https://www.google.com/accounts/Login?continue=http://www.google.co.uk/&hl=en",
"http://groups.google.co.uk/grphp?hl=en&tab=wg&ie=UTF-8",
"http://news.google.co.uk/nwshp?hl=en&tab=wn&ie=UTF-8",
"http://froogle.google.co.uk/frghp?hl=en&tab=wf&ie=UTF-8",
"Search:",
"http://www.google.com/ncr"]
 
