Mechanize file save on generated link

Dan Mansfield · Sep 12, 2010

Hi there,
I'm working on a project to automate retrieval of content and the
download of a PDF bill from an ISP's website. I have 30 connections with
this ISP and each has its own username and password. So far I have been
able to get the content I need that is actually stored within the page.

You have to log in and go to a specific page to be able to click on the
link.
The link itself isn't the PDF, as it's generated on the fly.
Source code of the link:
<a href="/be-portal/downloadPdf"
onclick="cmCreatePageElementTag('Download PDF', 'Member Centre');"
class="btn_Link">Download PDF</a>

I'm selecting it with:
link = invoice_page.links_with(:href => "/be-portal/downloadPdf")

How can I click this link to download the PDF and store it on the
filesystem?

Btw, I only started with Ruby and Mechanize less than 24 hours ago.
TIA
Regards,
Dan

Andrea Dallera · Sep 12, 2010

Hi Dan,

try with

File.open('myfile', 'w+') do |file|
  file << agent.get_file(link['href'])
end

where agent is the Mechanize agent you used to log in and get the link.

--

Andrea Dallera
http://github.com/bolthar/freightrain
http://usingimho.wordpress.com


Dan Mansfield · Sep 12, 2010

Thanks, so my script is now:
....
invoice_page = agent.get(URL_BILLING)
link = invoice_page.links_with(:href => "/be-portal/downloadPdf")

File.open('myfile', 'w+') do |file|
  file << agent.get_file(link['href'])
end

results in:
C:/ruby/test.rb:27:in `[]': can't convert String into Integer
(TypeError)
from C:/ruby/test.rb:27:in `block in <main>'
from C:/ruby/test.rb:26:in `open'
from C:/ruby/test.rb:26:in `<main>'

what is the proper method to see the response back from clicking the
link?

Mike Dalessio · Sep 13, 2010

Dan said:
thanks, so my script is now:
....
invoice_page = agent.get(URL_BILLING)
link = invoice_page.links_with(:href => "/be-portal/downloadPdf")

File.open('myfile', 'w+') do |file|
  file << agent.get_file(link['href'])
end

results in:
C:/ruby/test.rb:27:in `[]': can't convert String into Integer
(TypeError)
from C:/ruby/test.rb:27:in `block in <main>'
from C:/ruby/test.rb:26:in `open'
from C:/ruby/test.rb:26:in `<main>'

what is the proper method to see the response back from clicking the
link?

links_with returns an array. Try using .first to pick out the first result,
so:

link = invoice_page.links_with(:href => "/be-portal/downloadPdf").first
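
The TypeError in the traceback can be reproduced without Mechanize, since it comes from indexing an Array with a String. A minimal sketch, assuming a hypothetical FakeLink struct as a stand-in for the link objects that links_with returns (the struct and its data are invented for illustration and are not Mechanize's own class):

```ruby
# Hypothetical stand-in for a Mechanize link object; only the href
# reader is modelled here.
FakeLink = Struct.new(:href)

# links_with returns an array of matches, even when only one link matches.
links = [FakeLink.new('/be-portal/downloadPdf')]

error = nil
begin
  links['href']          # Array#[] expects an Integer index, not a String
rescue TypeError => e
  error = e              # same error class as in the traceback above
end

link = links.first       # pick the single link out of the array
puts link.href
```

Once a single link has been picked out with .first, it can be passed on to the agent, as the later posts in this thread do with agent.click(link).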

Dan Mansfield · Sep 15, 2010

Mike said:
links_with returns an array. Try using .first to pick out the first
result, so:

link = invoice_page.links_with(:href => "/be-portal/downloadPdf").first

Ok, so some progress:
invoice_page = agent.get(URL_BILLING)
link = invoice_page.links_with(:href => "/be-portal/downloadPdf").first
page = agent.click(link)
#pp page
File.open('myfile.pdf', 'w+') do |file|
  file << page
end

If I look at the content of page now it contains the stream of PDF data
as well as:
@code="200",
@filename="Invoice_14051844_08/09/2010.pdf",
@response=
{"date"=>"Wed, 15 Sep 2010 19:13:04 GMT",
"server"=>"Apache",
"expires"=>"Wed, 15 Sep 2010 19:14:04 GMT",
"cache-control"=>"max-age=60",
"content-disposition"=>
"attachment;filename=Invoice_14051844_08/09/2010.pdf",
"vary"=>"Accept-Encoding",
"content-encoding"=>"gzip",
"content-length"=>"7950",
"keep-alive"=>"timeout=5, max=92",
"connection"=>"Keep-Alive",
"content-type"=>"application/octet-stream"},
@uri=
#<URI::HTTPS:0x335bc38
URL:https://www.bethere.co.uk/be-portal/downloadPdf>>

Dan Mansfield · Sep 15, 2010

I tried this too:
File.open('myfile.pdf', 'w+') do |file|
  file << page.body
end

Which almost works but produces a corrupt PDF file. I can see the
document properties of the PDF file but there is no content.

I have also tried using:
agent.pluggable_parser.pdf = Mechanize::FileSaver
agent.click(link)

which did not produce an error, but did not produce a PDF file
either.

Dan Mansfield · Sep 20, 2010

Dan said:
I tried this too:
File.open('myfile.pdf', 'w+') do |file|
  file << page.body
end

Which almost works but presents a corrupt pdf file. I can see the
document properties of the pdf file but there is no content.

SO CLOSE!

By doing a binary compare of a working version downloaded through the
browser and the one through Mechanize, I have found that it is saving
the line breaks as 0D 0A in hex versus just 0A in the working file.

While I dig around to find out how to stop Mechanize/Ruby from doing
that, has anyone else come across this and found a solution?
Thanks
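
The same kind of comparison can be done from Ruby by counting 0D 0A pairs in the raw bytes. A minimal sketch, assuming two invented byte strings in place of the browser download and the Mechanize download (the sample data is not real PDF content, and crlf_count is a hypothetical helper name):

```ruby
# Invented sample bytes: the same content with LF and with CRLF line endings.
good = "%PDF-1.4\nobj\nendobj\n".b
bad  = "%PDF-1.4\r\nobj\r\nendobj\r\n".b

# Hypothetical helper: count CR LF pairs (0D 0A) in a binary string.
def crlf_count(bytes)
  bytes.scan("\r\n".b).length
end

puts crlf_count(good)
puts crlf_count(bad)
puts bad.bytesize - good.bytesize   # one extra byte per translated newline
```

To check real files, the two strings could be loaded with File.binread, which reads the bytes without any newline translation.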

Dan Mansfield · Sep 20, 2010

Thanks to everyone who helped. Writing the file in binary mode did the
trick.

In case anyone has this problem in the future here is my full script:

require 'rubygems'
require 'mechanize'

URL_LOGIN = 'https://www.bethere.co.uk/cas/login?service=https://www.bethere.co.uk/c/portal/login'
URL_BILLING = 'https://www.bethere.co.uk/group/beportal/billsandpayment'

abort "Usage: #{$0} <username> <password>" unless ARGV.length == 2

agent = Mechanize.new
agent.follow_meta_refresh = true
agent.redirect_ok = true
agent.user_agent = 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; en-US; rv:1.9.2) Gecko/20100115 Firefox/3.6'
login_page = agent.get(URL_LOGIN)

login_form = login_page.forms.first
login_form.username = ARGV[0]
login_form.password = ARGV[1]

redirect_page = agent.submit(login_form)

invoice_page = agent.get(URL_BILLING)
link = invoice_page.links_with(:href => "/be-portal/downloadPdf").first
page = agent.click(link)

File.open(page.filename.gsub("/","_"), 'w+b') do |file|
  file << page.body.strip
end
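
The two details that make the script above work, replacing the slashes in the server-supplied filename and opening the file in binary mode, can be tried in isolation. A minimal sketch, assuming an invented byte string in place of page.body and a temporary directory for the output (neither is part of the original script):

```ruby
require 'tmpdir'

# Filename as reported in the response shown earlier in the thread.
filename  = 'Invoice_14051844_08/09/2010.pdf'
safe_name = filename.gsub('/', '_')   # slashes would be read as directories

# Invented stand-in for page.body (not real PDF data).
body = "%PDF-1.4\nstream\nendstream\n".b

written = nil
Dir.mktmpdir do |dir|
  path = File.join(dir, safe_name)

  # 'wb' opens the file in binary mode, so Windows does not translate
  # "\n" into "\r\n" while writing.
  File.open(path, 'wb') { |file| file.write(body) }

  # Read the bytes back, again without newline translation.
  written = File.binread(path)
end

puts safe_name
puts written == body
```

On platforms other than Windows, text mode does not rewrite newlines, which is why the corruption only shows up there; using binary mode for downloaded files is harmless everywhere.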
