HTML Structure Extraction

dayzman

Hi,

I'm going to write a program that extracts the structure of HTML
documents. The structure would be in the form of a tree, separating the
tags and grouping the start and end tags. I think I will use
htmllib.HTMLParser. Is it appropriate for my application? If so, I
believe I will need to keep track of the depth reached.

Any tips for such an application would be much appreciated.

Cheers,
Michael
 
Fredrik Lundh

> I'm going to write a program that extracts the structure of HTML
> documents. The structure would be in the form of a tree, separating the
> tags and grouping the start and end tags. I think I will use
> htmllib.HTMLParser. Is it appropriate for my application? If so, I
> believe I will need to keep track of the depth reached.

you mean like:

http://www.crummy.com/software/BeautifulSoup/
http://effbot.org/zone/element-tidylib.htm
http://utidylib.berlios.de/
http://www.xmlsoft.org/
http://effbot.org/zone/pythondoc-elementtree-HTMLTreeBuilder.htm

and a few dozen others?

</F>
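
For anyone who still wants to roll their own, here is a minimal sketch of the depth-tracking idea from the question. It is not taken from any of the libraries listed above. It assumes Python 3, where htmllib no longer exists and html.parser.HTMLParser is the standard-library replacement. The names Node, StructureParser, parse_structure, outline and VOID_TAGS are made up for illustration. A stack of open elements stands in for the "depth reached": the stack length is the current depth.

```python
from html.parser import HTMLParser

# Elements that never take an end tag, so they are recorded as children
# but never pushed onto the stack of open elements.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class Node:
    """One element in the extracted structure tree (hypothetical helper)."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []


class StructureParser(HTMLParser):
    """Builds a tree of tag names; text and attributes are ignored."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.stack = [self.root]  # currently open elements, root first

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.stack[-1])
        self.stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Close the nearest open element with this name, together with
        # anything left open inside it. Stray end tags are ignored.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break


def parse_structure(html):
    """Parse an HTML string and return the root Node of its tag tree."""
    parser = StructureParser()
    parser.feed(html)
    parser.close()
    return parser.root


def outline(node, depth=0):
    """Return the tree as a list of tag names indented by depth."""
    lines = []
    for child in node.children:
        lines.append("  " * depth + child.tag)
        lines.extend(outline(child, depth + 1))
    return lines


if __name__ == "__main__":
    doc = "<html><body><p>a<br>b</p><p>c</p></body></html>"
    print("\n".join(outline(parse_structure(doc))))
```

This only copes with mildly broken markup (unclosed elements are closed when an enclosing end tag arrives). Real-world tag soup is the reason the libraries above exist, so treat the sketch as a way to understand the approach rather than a replacement for them.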
 
