website screen scraping with Mechanize or Rubyful Soup

Dan Kohn · Sep 12, 2005

I'm trying to get some website screen scraping working, but I'm
suffering from a lack of examples and documentation for either
WWW::Mechanize or Rubyful Soup.

With WWW::Mechanize, the only example I found was
http://www.zenspider.com/pipermail/ruby/2005-July/002068.html. I tried
to simplify this to the script below, but it just prints out "My wife
is ".

Rubyful Soup <http://www.crummy.com/software/RubyfulSoup/> also seems
like a great library, but there doesn't seem to be a single example
(only Python ones
<http://www.crummy.com/software/BeautifulSoup/examples.html>).

#!/usr/bin/env ruby

require 'mechanize'

agent = WWW::Mechanize.new
agent.user_agent_alias = 'Windows IE 6'

# get first page
page = agent.get('http://www.dankohn.com/')
md = page.body.match /My wife, (\w+\s\w+)<\/a>/m

printf "My wife is ", md

Thanks in advance for any help you can offer.

Lyndon Samson · Sep 12, 2005

Not quite what you wanted, but

This can turn HTML into well-formed XML

http://rubyforge.org/projects/tidy/

which is much easier to parse.

Maybe worth a look.

James Britt · Sep 12, 2005

Dan said:
I'm trying to get some website screen scraping working, but I'm
suffering from a lack of examples and documentation for either
WWW::Mechanize or Rubyful Soup.

With WWW::Mechanize, the only example I found was
http://www.zenspider.com/pipermail/ruby/2005-July/002068.html. I tried
to simplify this to the script below, but it just prints out "My wife
is ".

What are you actually trying to accomplish?

James

--

http://www.ruby-doc.org - The Ruby Documentation Site
http://www.rubyxml.com - News, Articles, and Listings for Ruby & XML
http://www.rubystuff.com - The Ruby Store for Ruby Stuff
http://www.jamesbritt.com - Playing with Better Toys

Dan Kohn · Sep 12, 2005

My ultimate goal is to create a series of screen scrapers that are able
to access airline websites (including entering username and password,
dealing with redirects, etc.), find my mileage and recent flights,
parse the data, put it in some variables, and save it to MySQL (with
rails).

I was trying to start with baby steps to understand the methods these
libraries support. Specifically, I was trying to fetch my own web
page, and then use a regex to match to my wife's name, "Julie Pullen",
since I have link text on www.dankohn.com saying "My wife, Julie
Pullen". I was then going to gradually increase the complexity of the
scraping.

Thanks in advance for any example scripts or documentation that you can
provide showing web scraping in ruby.

And Lyndon, I'm a huge fan of Tidy for cleaning up my own web pages,
but I'm not sure it's helpful here, as I was aiming to use regexes to
parse the HTML rather than the DOM.

Michel Martens · Sep 12, 2005

Hi

I'm trying to get some website screen scraping working, but I'm
suffering from a lack of examples and documentation for either
WWW::Mechanize or Rubyful Soup.

With WWW::Mechanize, the only example I found was
http://www.zenspider.com/pipermail/ruby/2005-July/002068.html. I tried
to simplify this to the script below, but it just prints out "My wife
is ".

Rubyful Soup <http://www.crummy.com/software/RubyfulSoup/> also seems
like a great library, but there doesn't seem to be a single example
(only Python ones
<http://www.crummy.com/software/BeautifulSoup/examples.html>).

#!/usr/bin/env ruby

require 'mechanize'

agent = WWW::Mechanize.new
agent.user_agent_alias = 'Windows IE 6'

# get first page
page = agent.get('http://www.dankohn.com/')
md = page.body.match /My wife, (\w+\s\w+)<\/a>/m

printf "My wife is ", md

print "My wife is ", md[1]
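
For anyone following along: printf treats its first argument as a format string, so without a %s placeholder the extra argument is silently ignored, which is why only "My wife is " came out. md[1] is the first capture group of the match. Below is a minimal sketch against a hard-coded snippet rather than the live page; SAMPLE_HTML and the helper name wife_name are made up for illustration, and the real markup on the page may differ.

```ruby
# Hypothetical stand-in for page.body; the real page markup may differ.
SAMPLE_HTML = '<a href="http://example.com/">My wife, Julie Pullen</a>'

# Return the captured name, or nil when the pattern does not match.
def wife_name(html)
  md = html.match(/My wife, (\w+\s\w+)<\/a>/m)
  md && md[1] # md[0] is the whole match, md[1] the first capture group
end

name = wife_name(SAMPLE_HTML)
printf("My wife is %s\n", name) # printf needs a %s placeholder to show the argument
print "My wife is ", name, "\n" # print simply writes each argument in turn
```

Checking for a nil match before indexing is worth keeping once you point this at real pages, since a layout change would otherwise raise NoMethodError on md[1].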

Michel.

Dan Kohn · Sep 12, 2005

Thanks, Michel.

Michel Martens · Sep 12, 2005

You're welcome!

Lyndon Samson · Sep 12, 2005

And Lyndon, I'm a huge fan of Tidy for cleaning up my own web pages,
but I'm not sure it's helpful here, as was aiming to use regexes to
parse the HTML rather than the DOM.

Well, a DOM allows you to use XPath, which is a powerful query mechanism.

This article
http://www-128.ibm.com/developerworks/java/library/j-jtp03225.html?ca=dgr-jw26XQuery
is XQuery specific, but relies on XPath.

An example from the article:

//td[contains(a/small/text(), "New York, NY")]
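
In Ruby, an expression like that could be run with REXML, which ships with the standard distribution. This is only a rough sketch: the markup in SAMPLE_XHTML and the helper name cities_matching are invented for illustration, and it assumes the page has already been turned into well-formed XML (for example by Tidy), since REXML will not parse sloppy HTML.

```ruby
require 'rexml/document'

# Hypothetical, already well-formed markup (e.g. the output of Tidy).
SAMPLE_XHTML = <<~XHTML
  <table>
    <tr>
      <td><a href="/nyc"><small>New York, NY</small></a></td>
      <td><a href="/bos"><small>Boston, MA</small></a></td>
    </tr>
  </table>
XHTML

# Return the <small> text of every <td> matched by the article's XPath.
# The city is interpolated as-is, so it must not contain a double quote.
def cities_matching(xml, city)
  doc = REXML::Document.new(xml)
  xpath = %(//td[contains(a/small/text(), "#{city}")])
  REXML::XPath.match(doc, xpath).map { |td| td.elements['a/small'].text }
end

puts cities_matching(SAMPLE_XHTML, 'New York, NY')
```

The point of the XPath route is that the query follows the document structure, so it keeps working if whitespace or attribute order changes, where a regex over the raw HTML might not.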

dave.burt · Sep 13, 2005

If you can use a Windows box, Watir might be the best fit.

Install it (using RubyGems or the Windows setup program available at
http://wtr.rubyforge.org/) and try this in IRB:

require 'watir'
ie = Watir::IE.new
ie.goto "dankohn.com"
julie_lines = ie.text.scan(/.*Julie.*/)
link = ie.link(:text, /Julie/)
link.click

Cheers,
Dave

Dan Kohn · Sep 13, 2005

Looks cool, thanks. I'm developing on a Windows machine, but plan to
move to a "real" machine for production. So it looks like Mechanize,
REXML, and XQuery will be the best bet.
