spidering a website to build a sitemap


Bill Guindon

I need to spider a site and build a sitemap for it. I've looked
around on rubyforge, and RAA, and don't see an exact match. Has
anybody done this, or is there a library out there that I missed?

-- 
Bill Guindon (aka aGorilla)
 

Shad Sterling

I have a site mapping tool I'm working on which does not yet read
remote files but does map links between local files.

http://sterfish.com/lab/sitemapper/

I've been putting off announcing it until I have an actual page there,
but I guess I'm too slow.


- Shad




Bill Guindon

I have a site mapping tool I'm working on which does not yet read
remote files but does map links between local files.

http://sterfish.com/lab/sitemapper/

I've been putting off announcing it until I have an actual page there,
but I guess I'm too slow.

- Shad

Thanks much. I need one that works remotely, but I'll certainly poke
around in there, and see what I can do with it.



-- 
Bill Guindon (aka aGorilla)
 

Shad Sterling

Thanks much. I need one that works remotely, but I'll certainly poke
around in there, and see what I can do with it.


Yeah. I made this to help me work on a site I'm now maintaining,
which was a hideous mess when I got to it. I do plan to make it map
remote pages as well, but it will probably be awhile.


Bill Guindon

I'll throw my little snippet in, in case anyone finds it useful.

I just wrote this up to spider my rails app to give me a list of all
the urls so I can use them later in a stress test.

Not terribly advanced, but gives you the format of:

http://www.blah.com/foo.html
{tab} http://www.blah.com/bar.html

Where tabbed-out children of foo.html are pages foo.html points to.

http://snippets.textdrive.com/posts/show/74

-Matt

Good stuff! It's missing a couple of features for stock sites
(handling javascript:, mailto:, #name links etc.), but those can
easily be added.

Thanks much for posting it.
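The missing features mentioned above amount to a small filter before the spider follows a link. A minimal sketch (the `crawlable?` helper is illustrative, not part of the posted snippet):

```ruby
# Skip links a spider can't (or shouldn't) follow:
# same-page anchors, javascript: and mailto: pseudo-links.
def crawlable?(href)
  return false if href.nil? || href.empty?
  return false if href.start_with?('#')            # #name fragment link
  return false if href =~ /\A(javascript|mailto):/ # non-HTTP schemes
  true
end

links = ['/about.html', '#top', 'mailto:me@example.com',
         'javascript:void(0)', 'http://www.blah.com/bar.html']
puts links.select { |l| crawlable?(l) }
```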


-- 
Bill Guindon (aka aGorilla)
 

Bill Guindon

I noticed webfetcher in RPAbase, haven't had a chance to play with it:

Should've thought to scan RPA. Wish it was still being updated, I
sure do miss it.

Gave it a couple test drives, and it's quite nice. The following gave
me exactly what I was looking for.

require 'webfetcher'

page = WebFetcher::page.url('http://www.somedomain.com/')
links = page.recurse.links
File.open('links.txt', 'w+') {|f| f.puts links.uniq}

Thanks much for tracking it down.
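Since webfetcher itself may be hard to come by, the href-collection step it performs can be sketched with just the stdlib. The `extract_links` helper and its regex are illustrative (a real spider would use an HTML parser), not webfetcher's API:

```ruby
require 'uri'

# Pull absolute URLs out of anchor tags in one page of HTML.
# A crude regex pass, good enough for a rough sitemap.
def extract_links(html, base)
  html.scan(/<a\s[^>]*href=["']([^"']+)["']/i).flatten.map do |href|
    URI.join(base, href).to_s rescue nil  # skip malformed hrefs
  end.compact.uniq
end

html = '<a href="/foo.html">foo</a> <a href="bar.html">bar</a>'
p extract_links(html, 'http://www.somedomain.com/')
```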

-- 
Bill Guindon (aka aGorilla)
 

Gene Tani

Try it: you'll get Errno::ECONNREFUSED (Net::HTTP) or it will time out
(open-uri) on a lot of large commercial websites, like ruby-lang.org.

So either I have to rewrite headers to emulate, say, a Mozilla browser,
or throttle down the number of GETs it's firing out, so as not to
offend the websites' firewalls. It's not clear from the stdlib doc link
in Marcus' post.
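The first of those two options is just a request header. A minimal sketch with Net::HTTP (the User-Agent string here is only an example; the actual request is commented out so the snippet runs offline):

```ruby
require 'net/http'
require 'uri'

uri = URI.parse('http://www.ruby-lang.org/')
req = Net::HTTP::Get.new(uri.request_uri)
# Present a browser-like User-Agent instead of Net::HTTP's default.
req['User-Agent'] = 'Mozilla/5.0 (compatible; MySpider/0.1)'

# res = Net::HTTP.start(uri.host, uri.port) { |http| http.request(req) }
# sleep 1  # throttling between GETs is the other option Gene mentions
```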
 

Bill Pennington

When detecting 404s, watch out for servers that return a 200 code
with a pretty "Not found" page. Those can throw a real curve ball
depending on what you are trying to do.


I was looking for something to trap 404-type errors, kind of like
Mertz' code (but in ruby):

http://gnosis.cx/TPiP/069.code

does this sound familiar to anybody?



- Bill
 

Gene Tani

Right, the point of Mertz' code is to parse <TITLE>, <META>, <BODY> for
phrases like "not found", "not available", "does not exist" when the
HTTP/FTP lib gives you a 200. But at this point I'd settle for
responses different from "timed out" or Errno::ECONNREFUSED.
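The heuristic Gene describes can be sketched in a few lines: when the lib hands back a 200, scan the page text for not-found phrases. The `soft_404?` helper and its phrase list are illustrative (and prone to false positives on pages that merely mention those phrases), not Mertz' actual code:

```ruby
# Phrases that suggest a "pretty" error page served with a 200.
NOT_FOUND_PHRASES = ['not found', 'not available', 'does not exist']

def soft_404?(html)
  text = html.downcase
  NOT_FOUND_PHRASES.any? { |phrase| text.include?(phrase) }
end

puts soft_404?('<title>404 Not Found</title>')  # prints true
```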

I have 2 apps, one is simply validating personal bookmarks, one is
commercial. For the commercial one, I'd be happy to register a spider
per O'Reilly's "Spidering Hacks". For my bookmarks, I figured this
would be easy...
 
