Need a spider library

Laszlo Zsolt Nagy · Oct 12, 2005

Hi All,

I'm writing a spider program. I need to visit several URLs and extract
information from the HTML source, including links.
I was using FancyURLopener and my own function that extracts the links
from an HTML page. The problem is that I constantly need to change it,
because some sites use lower-case tag names and others upper-case ones.
Some write href="page.html", others omit the quotation marks
(href=page.html), and I have even found unclosed quotations
(<a href="page.html>), doubly opened and unclosed <a tags, and so on.
There are many kinds of malformed HTML pages out there, and it seems
I cannot handle all of them.

The question: is there a good Python library for extracting links and
images from (possibly malformed) HTML source code, like the "references"
function in Lynx? I need to handle relative and absolute references, and
I also need the anchor text and the position of the anchor inside the
HTML source file.

For example this malformed link:

<a href="page.html>Sample link</a>

could be converted to:

['page.html','http://samplesite.current_location/page.html','Sample link']
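
To make the desired behaviour concrete, here is a minimal sketch using only
the Python standard library (re and urllib.parse.urljoin). The function name
extract_links, the pattern LINK_RE, and the fourth "offset" field are
assumptions made for illustration; this is not an existing library, and a
regular expression like this covers only the malformations listed above
(tag case, missing quotes, one unclosed quote), not malformed HTML in general.

```python
import re
from urllib.parse import urljoin

# Hypothetical pattern: tolerant of tag case, missing quotes, and an unclosed quote.
LINK_RE = re.compile(
    r"""<a\b[^>]*?\bhref\s*=\s*     # opening tag up to href=
        (?: "([^">]*)"?             # double-quoted value; closing quote optional
          | '([^'>]*)'?             # single-quoted value; closing quote optional
          | ([^\s"'>]+) )           # bare, unquoted value
        [^>]*>                      # remainder of the opening tag
        (.*?)                       # anchor content
        </a\s*>                     # closing tag
    """,
    re.IGNORECASE | re.DOTALL | re.VERBOSE,
)


def extract_links(html, base_url):
    """Return [href, absolute_url, anchor_text, offset] for each <a href> found."""
    links = []
    for m in LINK_RE.finditer(html):
        # Exactly one of the three href groups took part in the match.
        href = next(g for g in m.group(1, 2, 3) if g is not None).strip()
        # Drop nested tags and collapse whitespace in the anchor text.
        text = " ".join(re.sub(r"<[^>]*>", " ", m.group(4)).split())
        # m.start() is the character offset of "<a" in the source string.
        links.append([href, urljoin(base_url, href), text, m.start()])
    return links


if __name__ == "__main__":
    page = '<a href="page.html>Sample link</a>'
    print(extract_links(page, "http://samplesite.current_location/"))
```

A known limitation of this sketch: with an unclosed quote, everything up to
the next ">" is taken as the href value, so any attributes that follow the
broken href inside the same tag would end up in the URL.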

Thanks in advance

Les
