libxml2dom - parsing malformed html

bruce

Hi...

I'm running a quick test with libxml2dom:

===============
import libxml2dom

# foo is a string holding the input text shown below
aa = libxml2dom.parseString(foo)
ff = libxml2dom.toString(aa)

print(ff)
===============

----------------------------------
when i start, foo is:
<html>
<body>
</body>
</html>

<html>
<body>
..
..
..
</body>
</html>
-------------------------------
when i print ff it's:
<html>
<body>
</body>
</html>
-------------------------------

so it's as if the parseString only reads the initial "html" tree. i've
reviewed as much as i can find regarding libxml2dom to try to figure out how
i can get it to read/parse/handle both html trees/nodes.

i know, the html is malformed/screwed-up, but i can't seem to find any app
(tidy/beautifulsoup) that can "know" which one of the html trees to throw
out/remove!!

technically, both html trees are valid, it's just that they both shouldn't
be in the file!!!

thoughts/comments appreciated

thanks
 
Paul Boddie

bruce said:
so it's as if the parseString only reads the initial "html" tree. i've
reviewed as much as i can find regarding libxml2dom to try to figure out how
i can get it to read/parse/handle both html trees/nodes.

Maybe there's some possibility to have libxml2 read directly from a
file descriptor and to stop after parsing the first document, leaving
the descriptor open; currently, this isn't supported by libxml2dom,
however. Another possibility is to feed text to libxml2 until it can
return a well-formed document, which I do as part of the
libxml2dom.xmpp module, but I don't really support this feature in the
public API.
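
The "feed text until a complete document comes back" idea can be sketched roughly as follows. This is not how libxml2dom.xmpp does it; it is a stand-in built on Python's standard html.parser module, and the names FirstDocumentWatcher and take_first_document are made up for illustration. It feeds the input one line at a time and stops as soon as the outermost <html> element has been closed, so it assumes the two documents do not share a line.

```python
from html.parser import HTMLParser


class FirstDocumentWatcher(HTMLParser):
    """Hypothetical helper: notes when the outermost <html> element closes."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.complete = False

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag names in lower case.
        if tag == "html":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "html" and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                self.complete = True


def take_first_document(lines):
    """Feed lines one by one; stop once the first <html> tree is closed.

    Returns (text_of_first_document, remaining_lines).
    """
    watcher = FirstDocumentWatcher()
    lines = list(lines)
    consumed = []
    for index, line in enumerate(lines):
        watcher.feed(line)
        consumed.append(line)
        if watcher.complete:
            return "".join(consumed), lines[index + 1:]
    # No complete <html> tree was seen: everything counts as consumed.
    return "".join(consumed), []


# Same shape as the input from the original post.
foo = "<html>\n<body>\n</body>\n</html>\n\n<html>\n<body>\n..\n</body>\n</html>\n"
first, rest = take_first_document(foo.splitlines(keepends=True))
print(first)
```

Here first holds only the first tree and rest holds the lines that were never fed, so the second tree could be handled by calling take_first_document again on rest. Each piece could then be handed to libxml2dom.parseString as in the original test.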

Again, improvements to libxml2dom may happen if I find the time to do
them.

Paul
 
Stefan Behnel

bruce said:
so it's as if the parseString only reads the initial "html" tree. i've
reviewed as much as i can find regarding libxml2dom to try to figure out how
i can get it to read/parse/handle both html trees/nodes.

i know, the html is malformed/screwed-up, but i can't seem to find any app
(tidy/beautifulsoup) that can "know" which one of the html trees to throw
out/remove!!

technically, both html trees are valid, it's just that they both shouldn't
be in the file!!!

What about splitting the string on "<html" and then parsing each part on its own?
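
A minimal sketch of that split, using only the standard re module; the function name split_html_documents is made up for illustration. It cuts the text at every position where an opening "<html" tag starts, keeping the tag with the chunk it begins. It is a plain text search, so an "<html" inside a comment or script would also trigger a cut.

```python
import re


def split_html_documents(text):
    """Hypothetical helper: cut text before each opening <html tag."""
    starts = [m.start() for m in re.finditer(r"<html\b", text, flags=re.IGNORECASE)]
    if len(starts) < 2:
        # Zero or one <html tag: nothing to split.
        return [text]
    # Keep anything before the first <html (e.g. a doctype) with the first chunk.
    starts[0] = 0
    ends = starts[1:] + [len(text)]
    return [text[start:end] for start, end in zip(starts, ends)]


# Same shape as the input from the original post.
foo = "<html>\n<body>\n</body>\n</html>\n\n<html>\n<body>\n..\n</body>\n</html>\n"
parts = split_html_documents(foo)
for part in parts:
    print(repr(part))
```

Each element of parts could then go through libxml2dom.parseString separately, as in the original test. Deciding which tree to keep (for example, the one whose body is not empty) is still up to the caller.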

Stefan
 
