spidering script


David Waizer

Hello..

I'm looking for a script (Perl, Python, sh, ...) or a program (such as wget)
that will help me get a list of ALL the links on a website.

For example, ./magicscript.pl www.yahoo.com would output the links to a file;
it would be kind of like spidering software.

Any suggestions would be appreciated.

David
 

Jonathan Curran


David, this is a touchy topic but whatever :p Look into sgmllib, and you can
filter on the "A" tag. The book 'Dive Into Python' covers it quite nicely:
http://www.diveintopython.org/html_processing/index.html

Jonathan
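(Jonathan's sgmllib suggestion is from the Python 2 era; sgmllib was removed in Python 3, but the same filter-on-the-"a"-tag idea works with the standard-library html.parser. A minimal sketch, with a made-up snippet of HTML standing in for a downloaded page:)

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag fed to it."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the start tag
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

# A small made-up page in place of a real download.
html = '<p><a href="http://example.com/">home</a> <a href="/about">about</a></p>'
parser = LinkCollector()
parser.feed(html)
print(parser.links)
```

In practice you would feed the parser the HTML fetched from the site rather than an inline string.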
 

Bernard

4 easy steps to get the links:

1. Download BeautifulSoup and import it in your script file.
2. Use urllib2 to download the HTML of the URL.
3. Parse the HTML with BeautifulSoup into a soup object.
4. Print every anchor tag:

Code:
for tag in soup.findAll('a'):
    print tag
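(Bernard's four steps can be sketched end to end. This assumes the modern bs4 package, the successor to the BeautifulSoup module he names, and Python 3, where urllib2 became urllib.request; the page content here is inlined in place of a real download:)

```python
from bs4 import BeautifulSoup  # assumes the bs4 package is installed

def get_links(html):
    """Return the href of every <a> tag in the given HTML (steps 3 and 4)."""
    soup = BeautifulSoup(html, "html.parser")
    return [a.get("href") for a in soup.find_all("a", href=True)]

# In the thread's scenario the HTML would come from the network,
# e.g. urllib.request.urlopen(url).read(); a made-up page is used here.
page = '<a href="http://www.yahoo.com/">Yahoo</a> <a href="/news">News</a>'
print(get_links(page))
```

Writing the printed list to a file instead of stdout gives the ./magicscript.pl behavior David asked for.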

 

Nikita the Spider


David,
In addition to others' suggestions about Beautiful Soup, you might also
want to look at the HTMLData module:

http://oregonstate.edu/~barnesc/htmldata/
 
