Mechanize file save on generated link

Dan Mansfield

Hi there,
I'm working on a project to automate retrieving content and downloading a
PDF bill from an ISP's website. I have 30 connections with this ISP, and
each has its own username and password. So far I have been able to get the
content I need that is actually stored within the page.

You have to log in and go to a specific page to be able to click on the
link.
The link itself isn't the PDF, as it's generated on the fly.
Source code of the link:
<a href="/be-portal/downloadPdf"
onclick="cmCreatePageElementTag('Download PDF', 'Member Centre');"
class="btn_Link">Download PDF</a>

I'm selecting it with:
link = invoice_page.links_with(:href => "/be-portal/downloadPdf")

How can I click this link to download the PDF and store it on the
filesystem?

By the way, I only started with Ruby and Mechanize less than 24 hours ago.
TIA
Regards,
Dan
 
Dan Mansfield

thanks, so my script is now:
....
invoice_page = agent.get(URL_BILLING)
link = invoice_page.links_with(:href => "/be-portal/downloadPdf")

File.open('myfile', 'w+') do |file|
  file << agent.get_file(link['href'])
end

results in:
C:/ruby/test.rb:27:in `[]': can't convert String into Integer (TypeError)
        from C:/ruby/test.rb:27:in `block in <main>'
        from C:/ruby/test.rb:26:in `open'
        from C:/ruby/test.rb:26:in `<main>'

what is the proper method to see the response back from clicking the
link?
 
Mike Dalessio

Dan said:
thanks, so my script is now:
....
invoice_page = agent.get(URL_BILLING)
link = invoice_page.links_with(:href => "/be-portal/downloadPdf")

File.open('myfile', 'w+') do |file|
  file << agent.get_file(link['href'])
end

results in:
C:/ruby/test.rb:27:in `[]': can't convert String into Integer (TypeError)
        from C:/ruby/test.rb:27:in `block in <main>'
        from C:/ruby/test.rb:26:in `open'
        from C:/ruby/test.rb:26:in `<main>'

what is the proper method to see the response back from clicking the
link?

links_with returns an array, which is why link['href'] raises the
TypeError. Try using .first to pick out the first result, so:

link = invoice_page.links_with(:href => "/be-portal/downloadPdf").first
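
For anyone who wants to see the error without logging in to anything, here is a minimal sketch of the same mistake. It assumes a plain Array holding one Hash as a stand-in for the result of links_with; the real method returns Mechanize link objects, and on those you can read the target with link.href.

```ruby
# Stand-in for the result of links_with: an Array with one link-like entry.
# (A Hash is used purely for illustration; it is not what Mechanize returns.)
links = [{ 'href' => '/be-portal/downloadPdf' }]

error_message = nil
begin
  links['href']            # Array#[] wants an Integer index, not a String
rescue TypeError => e
  error_message = e.message
end

# Pick out the first element, then look up the attribute on that element.
link = links.first
href = link['href']

puts "Indexing the array failed with: #{error_message}"
puts "Indexing the first element gave: #{href}"
```

The exact wording of the TypeError message depends on the Ruby version, so the sketch only prints whatever message was raised.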
 
Dan Mansfield

Mike said:
links_with returns an array. Try using .first to pick out the first
result, so:

link = invoice_page.links_with(:href => "/be-portal/downloadPdf").first

Ok, so some progress:
invoice_page = agent.get(URL_BILLING)
link = invoice_page.links_with(:href => "/be-portal/downloadPdf").first
page = agent.click(link)
#pp page
File.open('myfile.pdf', 'w+') do |file|
  file << page
end

If I look at the content of page now it contains the stream of PDF data
as well as:
@code="200",
@filename="Invoice_14051844_08/09/2010.pdf",
@response=
{"date"=>"Wed, 15 Sep 2010 19:13:04 GMT",
"server"=>"Apache",
"expires"=>"Wed, 15 Sep 2010 19:14:04 GMT",
"cache-control"=>"max-age=60",
"content-disposition"=>
"attachment;filename=Invoice_14051844_08/09/2010.pdf",
"vary"=>"Accept-Encoding",
"content-encoding"=>"gzip",
"content-length"=>"7950",
"keep-alive"=>"timeout=5, max=92",
"connection"=>"Keep-Alive",
"content-type"=>"application/octet-stream"},
@uri=
#<URI::HTTPS:0x335bc38
URL:https://www.bethere.co.uk/be-portal/downloadPdf>>
 
Dan Mansfield

I tried this too:
File.open('myfile.pdf', 'w+') do |file|
  file << page.body
end

This almost works, but it produces a corrupt PDF file. I can see the
document properties of the PDF file, but there is no content.

I have also tried using:
agent.pluggable_parser.pdf = Mechanize::FileSaver
agent.click(link)

which did not raise an error, but did not produce a PDF file either.
 
Dan Mansfield

Dan said:
I tried this too:
File.open('myfile.pdf', 'w+') do |file|
  file << page.body
end

Which almost works but presents a corrupt PDF file. I can see the
document properties of the PDF file but there is no content.

SO CLOSE!

By doing a binary compare of a working version downloaded through the
browser and the one saved through Mechanize, I have found that my script
is saving line breaks as 0D 0A in hex, versus just 0A in the working file.

I'm digging around to find out how to stop Mechanize/Ruby from doing that.
Has anyone else come across this and found a solution?
Thanks
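
A note for later readers: the extra 0D bytes are what you would expect from text-mode output on Windows, where Ruby translates each "\n" to "\r\n" as it writes. Putting a 'b' in the mode string should switch that translation off. A minimal sketch, assuming a made-up byte string in place of the real page.body and a temp-file name chosen just for the demo:

```ruby
require 'tmpdir'

# Hypothetical bytes standing in for a downloaded PDF body.
data = "%PDF-1.4\nline one\nline two\n%%EOF\n".b

# Demo path in the system temp directory (the name is an assumption).
path = File.join(Dir.tmpdir, 'binary_mode_demo.pdf')

# 'wb' opens the file for writing in binary mode, so bytes go out untouched.
File.open(path, 'wb') { |file| file.write(data) }

# Read the raw bytes back and compare them with what was written.
written = File.binread(path)
same = (written == data)

puts "bytes written: #{written.bytesize}, identical to source: #{same}"
```

On Linux or macOS the text-mode and binary-mode results are normally the same, so the difference only shows up when the script runs on Windows, as in this thread.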
 
Dan Mansfield

Thanks to everyone who helped. Writing the file in binary mode did the
trick.

In case anyone has this problem in the future here is my full script:

require 'rubygems'
require 'mechanize'

URL_LOGIN =
'https://www.bethere.co.uk/cas/login?service=https://www.bethere.co.uk/c/portal/login'
URL_BILLING = 'https://www.bethere.co.uk/group/beportal/billsandpayment'

abort "Usage: #{$0} <username> <password>" unless ARGV.length == 2

agent = Mechanize.new
agent.follow_meta_refresh = true
agent.redirect_ok = true
agent.user_agent = 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; en-US; rv:1.9.2) Gecko/20100115 Firefox/3.6'
login_page = agent.get(URL_LOGIN)

login_form = login_page.forms.first
login_form.username = ARGV[0]
login_form.password = ARGV[1]

redirect_page = agent.submit(login_form)

invoice_page = agent.get(URL_BILLING)
link = invoice_page.links_with(:href => "/be-portal/downloadPdf").first
page = agent.click(link)

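# The server-supplied filename contains slashes, so replace them before
# saving, and open the file in binary mode ('b') to avoid newline translation.
File.open(page.filename.gsub("/", "_"), 'w+b') do |file|
  file << page.body.strip
end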
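
The gsub in that last step matters because the filename sent by the server contains slashes, which would otherwise be read as directory separators. A small sketch of the same idea, using a hypothetical helper called safe_filename (not part of Mechanize) that also replaces backslashes:

```ruby
# Hypothetical helper: make a server-supplied name safe to use as a single
# file name by replacing forward slashes and backslashes with underscores.
def safe_filename(name)
  name.gsub(%r{[/\\]}, '_')
end

# The filename reported in the response headers earlier in the thread.
safe = safe_filename('Invoice_14051844_08/09/2010.pdf')
puts safe
```

With the invoice name from this thread, the result is Invoice_14051844_08_09_2010.pdf.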
 
