How come this code doesn't work as designed?

an an

Hi,

I found this web crawler code online; it uses Mechanize:

require 'mechanize'

agent = WWW::Mechanize.new
page = agent.get('http://example.com/')

stack = page.links
counter = 0

out = File.open("out.txt", "w")
while l = stack.pop
  begin
    next unless l.uri.host == agent.history.first.uri.host
    if not agent.visited? l.href
      counter += 1
      out.puts l.href
      stack.push(*(agent.click(l).links))
    end
  rescue
    #puts "Error encountered"
  end
end

puts "Total unique links: " + counter.to_s

So I gave it a try, and although it seemed to work, I noticed that
the stack size grew very quickly. After examining the output, I found
many duplicates: for example, one output file had over 50k URLs, but
only a bit over 9k remained after I removed the duplicates. So I
modified the code to use a Hash to avoid duplicates (although this
design means that I am storing multiple copies of all the URLs), but
the same thing happened. Can anyone figure out what I am doing wrong?
Here is the modified code:

require 'mechanize'

agent = WWW::Mechanize.new
page = agent.get('http://example.com/')

stack = page.links
hash = Hash.new
counter = 0

out = File.open("out.txt", "w")
while l = stack.pop
  begin
    next unless l.uri.host == agent.history.first.uri.host
    if not agent.visited? l.href
      counter += 1
      out.puts "url:1 " + l.href
      agent.click(l).links.each do |link|
        if(hash[link] == nil)
          hash.store(link,link)
          stack.push(link)
        end
      end
      #stack.push(*(agent.click(l).links))
    end
  rescue
    #puts "Error encountered"
  end
end

puts "Total unique links: " + counter.to_s

Note: I am aware that crawling arbitrary sites is not acceptable, and
this script is not intended for that; I am crawling personal sites.
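
A possible explanation for the duplicates in the Hash version, offered as a guess rather than a confirmed diagnosis: the Hash is keyed on the link objects themselves, and Ruby objects that do not override `hash` and `eql?` are compared by identity, so two link objects pointing to the same URL would count as different keys. Relative hrefs and fragments can also make one page show up under several spellings. The sketch below uses a hypothetical `FakeLink` class in place of Mechanize's link objects (whether Mechanize's own link class overrides equality is not checked here), and the helpers `normalized_url` and `unique_urls` are made up for illustration. It keys the seen-set on the resolved URL string instead of on the object.

```ruby
require 'set'
require 'uri'

# Hypothetical stand-in for a crawler link object. It does not override
# `hash` or `eql?`, so Ruby compares instances by identity.
class FakeLink
  attr_reader :href

  def initialize(href)
    @href = href
  end
end

# Resolve an href against the page it was found on and drop the fragment,
# so different spellings of the same page map to one string.
def normalized_url(base, href)
  uri = URI.join(base, href)
  uri.fragment = nil
  uri.to_s
end

# Return each distinct target URL once, in first-seen order.
def unique_urls(base, links)
  seen = Set.new
  links.each_with_object([]) do |link, queue|
    url = normalized_url(base, link.href)
    # Set#add? returns nil when the URL was already present.
    queue << url if seen.add?(url)
  end
end

base = 'http://example.com/docs/'
links = [
  FakeLink.new('page.html'),
  FakeLink.new('http://example.com/docs/page.html'),
  FakeLink.new('page.html#top'),
  FakeLink.new('/about')
]

# Keyed on objects: every link looks new, even when the target repeats.
by_object = {}
links.each { |link| by_object[link] = true }
puts by_object.size                    # 4

# Keyed on the normalized URL string: repeats are filtered out.
puts unique_urls(base, links).inspect
```

In the crawler, the same idea would mean checking and storing the resolved URL string before pushing a link onto the stack, rather than storing the link object itself.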
 
