How to use the links() method

PP

In Watir there is a method named links(). It returns a Links object.
I want to collect certain links from a web page into an array and then
visit the pages behind those links. My code is below; the result
suggests that a2[j] stores something, but not links. Can anyone help me
find the errors? Best regards
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
require 'watir'
ie = Watir::IE.new
ie.goto("www.baidu.com")
n = ie.links.length
puts n
$i = 1
$j = 1
$k = 1
a1 = Array.new # a1 stores all the links in the page
a2 = Array.new # a2 stores the links that contain the string 'baidu'
while $i <= n
  a1[$i] = ie.links[$i].to_s
  if /(www.baidu.com)/.matches(a1[$i])
    a2[$j] = ie.links[$i]
    $j = $j + 1
  end
  $i = $i + 1
end
while a2[$k]
  ie.goto(a2[$k])
  ie.back
  $k = $k + 1
end
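One error worth flagging right away: Ruby's Regexp has no `matches` method, so that `if` line raises a NoMethodError. Matching is done with `=~` (returns the match position or nil) or `Regexp#match` (returns MatchData or nil). A minimal plain-Ruby sketch of the intended filtering, with made-up sample URLs:

```ruby
# =~ returns nil on no match, so select keeps only matching strings.
urls = ['http://www.baidu.com/s', 'http://example.com/']
hits = urls.select { |u| /www\.baidu\.com/ =~ u }
p hits  # only the baidu URL remains
```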
 
ChrisH

links returns a Links object which mixes in Enumerable, so you should
be able to get an array by using to_a:

require 'watir'
ie = Watir::IE.new
ie.goto("www.baidu.com")
linksArray = ie.links.to_a
baiduArray = linksArray.select { |x| /(www.baidu.com)/ =~ x.to_s }
baiduArray.each { |link|
  ie.goto(link)
  ie.back
}
 
PP

Your code is terser than mine, but after trying it, it still didn't
work. If I print linksArray to the screen, I can see that each entry
contains not only a URL but also the id, name, value, innertext and
type. I think all of this makes the parameter unusable for ie.goto().
Thanks for your replies, and I look forward to advice about the problem.
Best wishes
 
Bret Pettichord

links itself is a collection of evanescent link/COM objects on the
current page. These references become stale as soon as a new page is
loaded.

require 'watir'
ie = Watir::IE.new
ie.goto("www.baidu.com")

hrefs = Array.new
ie.links.each do |link|
  hrefs << link.href if /(www.baidu.com)/ =~ link.href
end

hrefs.each do |href|
  ie.goto(href)
end

However, I think WWW::Mechanize may be a better tool (faster, at least)
if you are only interested in link checking.

Bret
 
PP

Actually, what I want is to save the web pages whose URLs contain a
certain string. Whether the page is displayed is not important. Thanks
for your advice. Best regards.
 
ChrisH

Hi PP, I saw your other post about saving files via IE and the Win32 API.

If the point is really just to download the pages, then doing it via
Watir/Win32 is a bit like using a lever and pulleys to move a sheet of
paper.

It can be done much more easily, simply and quickly via one of the HTTP
libraries (e.g. Mechanize, mentioned above) or even using the standard
library's Net::HTTP, URI and open-uri.

cheers
 
PP

Hi ChrisH, thank you for giving me so much wonderful advice. My
purpose is just to download some pages whose URLs contain a certain
string. As I only started with Ruby and Watir three weeks ago, the
methods I have found so far all make the program act just like a human
would.

Can you show me some information about the Net library and example
methods suited to my purpose?
Thanks
 
ChrisH

You're welcome, PP.

Here is a quick example. It pulls the links off www.baidu.com and
prints them to standard output.
Note that it downloads a GIF file that is linked, so if you only want
HTML files you will need to add some filtering.

Also note that the last link it tries to process (for me, anyway) is
http://www.baidu.com
and it gets an error:
d:/ruby/lib/ruby/1.8/net/http.rb:1556:in `read_status_line': wrong
status line: "<!DOCTYPE HTML PUBLIC \...."
Not sure why.

Cheers
Chris

require 'net/http'
require 'uri'

h = Net::HTTP.new('www.baidu.com')
resp = h.get('/', nil)
if resp.message == "OK"
  URI.extract(resp.body, ['http']) { |lnk|
    if /www\.baidu\.com/ =~ lnk
      p "LINK: #{lnk}"
      urilnk = URI.parse(lnk)
      p "PATH: #{urilnk.path}"
      r = Net::HTTP.new(urilnk.host).get(urilnk.path || '/')
      if r.message == "OK"
        p "START", r.body, "END"
      else
        p "BAD LINK #{lnk.to_s}"
      end
    end
  }
end
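The HTML-only filtering mentioned above can be done by checking the response's Content-Type header. A small hedged sketch; the helper name `html_response?` is my own:

```ruby
require 'net/http'

# True when a Net::HTTPResponse declares an HTML body.
def html_response?(resp)
  resp['content-type'].to_s.include?('text/html')
end
```

In the loop above, the body would then only be printed when `html_response?(r)` is true, which skips the linked GIF.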
 
PP

Hi Chris

I have tried your code and got the same result as you. In my opinion
it's almost the last step of my job. I have got the URLs with your
help. Now a module called webfetcher has shown me a way to save the
pages. Code like the following can easily save a page to "E:/inText":
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
require 'webfetcher'
book = WebFetcher::Page.url('http://wtr.rubyforge.org/rdoc/index.html')
book.recurse.save("E:/inText")
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
But a question still exists: the URL can only use the http protocol,
while the URL of the page I want to save is "https://*****". What can
I do about this problem? Can any function change the protocol from
"https" to "http"? I have tried using "http" to visit these pages and
save them, and the results are acceptable. What do you think?
Best regards, and I look forward to your response.
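Rewriting the scheme itself is a one-liner; a minimal sketch (the helper name `to_http` is made up, and this only helps when the server actually serves the same content over plain http):

```ruby
# Replace a leading https:// with http://; other URLs pass through unchanged.
def to_http(url)
  url.sub(%r{\Ahttps://}, 'http://')
end

p to_http('https://wtr.rubyforge.org/')  # => "http://wtr.rubyforge.org/"
```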
 
ChrisH

HTTPS applies encryption to the traffic between the client and the
web server.
If HTTP also works, then the only question is how important security
is for this info.
Since you are just copying the files, I'd guess you are not submitting
any sensitive info (like a userid & password), so HTTP should be fine.

webfetcher looks nice, better than writing our own, eh?

Cheers

PS There is a net/https library, but it seems to be totally undocumented...
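For completeness, fetching over HTTPS with the standard library looks roughly like this; a hedged sketch, and on 1.8-era Rubies you also need `require 'net/https'` before setting `use_ssl`. The helper names are my own:

```ruby
require 'net/http'
require 'uri'

# Whether a URL needs TLS.
def use_ssl?(url)
  URI.parse(url).scheme == 'https'
end

# Fetch a page, switching TLS on for https URLs (hits the network when called).
def fetch(url)
  uri = URI.parse(url)
  http = Net::HTTP.new(uri.host, uri.port)
  http.use_ssl = use_ssl?(url)
  path = uri.path.empty? ? '/' : uri.path
  http.get(path).body
end
```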
 
