Screen scraping an aspx site with Mechanize

  • Thread starter Sofie Willander

Sofie Willander

Hi,

I've been googling for over a week now, but I can't find out how to
screen scrape an aspx site with Mechanize (and I must use Mechanize).

The site (https://portal.if.se/kopforsakring/ProductList.aspx?linkID=4)
contains a button with the text "Bil" that I want to click, and after
that click the continue button (green one with the text "Fortsätt").

I've tried using Firebug to find viewstate, but I don't know what to do
with it once I've found it.

Am I on the right track?
Can anybody help me?

--
Posted via http://www.ruby-forum.com/.
 
Alex Stahl

[Note: parts of this message were removed to make it a legal post.]

Have you looked into nokogiri at all? You can use mechanize for the
server interaction (GET, POST, etc), then parse the response
object's .body with nokogiri.

As long as you don't have to deliberately replicate a "click", this
would work fine (use Selenium if you actually need a click event).
Otherwise, GETing and POSTing to the links produces the same results.
Make sense?

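require 'mechanize'   # also loads Nokogiri

req  = Mechanize.new
resp = req.get("/path/to/desired/url")   # placeholder; the first request needs an absolute URL
page = Nokogiri::HTML(resp.body)
# Select the href attribute rather than the element: #get needs a URL string
link = page.xpath("//xpath/to/link/@href").to_s
resp = req.get(link)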



________________________________________________________________________

Alex Stahl | Sr. Quality Engineer | hi5 Networks, Inc. | (e-mail address removed)
| m: 415.710.6961
 
Mike Dalessio

You should watch Ryan Bates's excellent screencast on scraping data with
Mechanize:

http://railscasts.com/episodes/191-mechanize
 
Sofie Willander

Mike Dalessio wrote in post #965776:
You should watch Ryan Bates's excellent screencast on scraping data with
Mechanize:

http://railscasts.com/episodes/191-mechanize

I've already watched that railscast (and the one about screen scraping
with nokogiri) and it worked fine on a non-aspx site, but did not work
at all on an aspx site (I got posted back to the same page over and
over). Have you screen scraped an aspx site with the method Ryan Bates
shows?
 
Sofie Willander

Thank you for your reply! I haven't gotten it to work yet though. I get
an error on the following:

Alex Stahl wrote in post #965773:
resp = req.get(link)

The error read:
Mechanize::ResponseCodeError: 400 => Net::HTTPBadRequest
        from /usr/local/lib/ruby/gems/1.8/gems/mechanize-1.0.0/lib/mechanize.rb:259:in `get'
        from (irb):8
        from :0

It seems to find the button (I assumed that the xpath to the link was
actually the button's xpath. Correct?). I get the button as an object,
can I use req.get() on it? What am I not doing correctly?

I've attached a textfile with the output of the two last commands. I
would be so glad if you could help me once again.

Attachments:
http://www.ruby-forum.com/attachment/5509/HTTPBadRequest.txt
 
Alex Stahl

Glad I could help... a few things to know:

- Mechanize raises an exception for any response that is not an HTTP 200
  or 302. So the error you're receiving, HTTP 400, is not handled by
  Mechanize and needs to be handled by your own code.
- #get takes a URL as its parameter, so link should be a URL string.
  (Actually, there's more than one way to pass the URL. Check the
  following link if that's not what you want:
  http://mechanize.rubyforge.org/mechanize/Mechanize.html#M000231)
- Starting an xpath with "//*" causes the parser to look at *every*
  element until it finds one which has the @id you supplied. Better to
  replace "*" with the actual HTML element.

Based on the xpath in the error at the link, you're not extracting a URL
- you're getting an HTML object (or, more specifically, an XML
node/nodeset). Instead, what you want is the "href" property of the
<a> tag located at the xpath. (In the below example, '//path/to' would
be the unique HTML element(s) which is/are the parent of the anchor
tag). Access the property like so:

link = page.xpath("//path/to/a/@href").to_s
p link

It's also helpful to output the link prior to using it as a param to
#get to see what you'll ask for.

 
Josh Cheek

I don't think Mechanize will work for this. Mechanize can't process
JavaScript. I'd recommend Watir or Selenium, which actually launch and drive
instances of the browser.
 
Sofie Willander

Alex Stahl wrote in post #965922:
Based on the xpath in the error at the link, you're not extracting a URL
- you're getting an HTML object (or, more specifically, an XML
node/nodeset). Instead, what you want is the "href" property of the
<a> tag located at the xpath. (In the below example, '//path/to' would
be the unique HTML element(s) which is/are the parent of the anchor
tag). Access the property like so:

link = page.xpath("//path/to/a/@href").to_s
p link

Now I'm even more confused.. Have you got any examples to show me?
How do I find the href for a button?
 
Alex Stahl

Sorry, I haven't looked too closely at the site you're scraping and had
assumed the button was wrapped by a link. But upon closer inspection
that doesn't appear to be the case. Looking at the page source, the
button is a form input element which doesn't actually cause a request to
be sent. In this case, as I noted in my first email, you will in fact
need to generate a click event. Unfortunately, that's not really what
nokogiri is for. Since you need that specific event to fire, as was
previously recommended, watir or selenium are the more appropriate
tools.

Another option would be to use wireshark to sniff for any requests which
are sent, and then try to reconstruct and send those requests via
mechanize. But this would be a little more complex than just using the
right tools.

Of course, I just checked the mechanize docs again... and there is a
#click_button method in the form object, so that could be a solution as
well. (http://mechanize.rubyforge.org/mechanize/Mechanize/Form.html)

 
Piyush Ranjan

Even though you can probably do this with mechanize, I'd advise against it.

Aspx websites keep lots of data in view state params and other things. Use
either
1. celerity in jruby or
2. watir or selenium.

You can also dump mechanize and use only nokogiri to parse and post to
https://portal.if.se/kopforsakring/ProductList.aspx?linkID=4 with all the
form data. That could be messy, though.
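
To give an idea of what "all the form data" means for an ASP.NET page: the hidden __VIEWSTATE field (and __EVENTVALIDATION, when the page has one) has to be sent back unchanged, together with the name and value of the button being pressed. A rough sketch using only the standard library follows. The hidden field values and the button name ctl00$btnBil are placeholders, not taken from the real page, and postback_body is a made-up helper.

```ruby
require 'uri'

# Hypothetical helper: builds the URL-encoded body of a postback from the
# hidden fields copied out of the page plus the button that was "clicked".
def postback_body(hidden_fields, button_name, button_value)
  URI.encode_www_form(hidden_fields.merge(button_name => button_value))
end

# Placeholder values; the real ones come from the page's hidden inputs
hidden = {
  "__VIEWSTATE"       => "dDwtMTIzNDU2Nzg5Ozs+",
  "__EVENTVALIDATION" => "AbCdEf=="
}

puts postback_body(hidden, "ctl00$btnBil", "Bil")
```

The resulting string would go out as the body of a POST to the same ProductList.aspx URL, with content type application/x-www-form-urlencoded and any cookies the first GET returned. The view state changes with every response, so it has to be read fresh from each page before posting again.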

My 2 cents

Piyush
 
