Help with libxml2dom

Nuno Santos · Aug 19, 2009

I have just started using libxml2dom to read HTML files, and I have some
questions I hope you can answer.

The page I am working on (teste.htm):
<html>
<head>
<title>
Title
</title>
</head>
<body bgcolor = 'FFFFF'>
<table>
<tr bgcolor="#EEEEEE">
<td nowrap="nowrap">
<font size="2" face="Tahoma, Arial"> <a name="1375048"></a>
</font>
</td>
<td nowrap="nowrap">
<font size="-2" face="Verdana"> 8/15/2009</font>
</td>
</tr>
</table>
</body>
u'a'

It seems like there are sometimes 'hidden' text elements. This is
probably standard DOM behaviour that I am simply not familiar with, and I
would very much appreciate it if anyone could explain it to me.

Thanks.

Diez B. Roggisch · Aug 19, 2009

Nuno said:
I have just started using libxml2dom to read HTML files, and I have some
questions I hope you can answer.

It seems like there are sometimes 'hidden' text elements. This is
probably standard DOM behaviour that I am simply not familiar with, and I
would very much appreciate it if anyone could explain it to me.

Without a schema or something similar, a parser can't tell if whitespace is
significant or not. So if you have

<root>
  <child/>
</root>

you will have not 2 but 4 nodes: root, a text node containing a newline
plus 2 spaces, child, and another text node containing a newline.
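
To make the node count concrete, here is a minimal sketch using the standard library's xml.dom.minidom rather than libxml2dom (that swap is my assumption, to keep it runnable without extra installs; minidom names text nodes '#text', while the session in this thread shows libxml2dom reporting u'text').

```python
from xml.dom import minidom

# The same three-line document, with a two-space indent before <child/>.
doc = minidom.parseString("<root>\n  <child/>\n</root>")
root = doc.documentElement

# Collect (nodeName, text data) for every child of <root>.
# Element nodes have no .data attribute, so they get None.
children = [(node.nodeName, getattr(node, "data", None)) for node in root.childNodes]

for name, data in children:
    print(name, repr(data))
```

The root element has three children: the indentation text, the child element, and the trailing newline text. Together with root itself that makes the four nodes.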

You have to skip over the nodes you are not interested in, or use a
different XML library such as ElementTree (e.g. in the form of lxml), which
takes a different approach to text nodes.
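
Both options can be sketched roughly as follows. This assumes the standard library's xml.dom.minidom and xml.etree.ElementTree as stand-ins for libxml2dom and lxml, and element_children is a hypothetical helper name, not part of any of these libraries.

```python
from xml.dom import Node, minidom
import xml.etree.ElementTree as ET

SOURCE = "<root>\n  <child/>\n</root>"


def element_children(node):
    """Return only the element children of a DOM node (hypothetical helper)."""
    return [child for child in node.childNodes
            if child.nodeType == Node.ELEMENT_NODE]


# Option 1: stay with the DOM and skip the text nodes explicitly.
dom_root = minidom.parseString(SOURCE).documentElement
dom_names = [child.nodeName for child in element_children(dom_root)]

# Option 2: ElementTree only yields elements when iterating; the whitespace
# is stored on the .text and .tail attributes of the elements instead.
et_root = ET.fromstring(SOURCE)
et_names = [child.tag for child in et_root]

print(dom_names, et_names)
print(repr(et_root.text), repr(et_root[0].tail))
```

Either way the whitespace is still kept; the difference is only whether it shows up as separate sibling nodes.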

Diez

Paul Boddie · Aug 19, 2009

I have just started using libxml2dom to read HTML files, and I have some
questions I hope you can answer.
[...]

>>> table = body.firstChild
>>> table.nodeName
u'text' #?! Why!? Shouldn't it be a table? (1)

You answer this yourself just below.

>>> table = body.firstChild.nextSibling  # why this works? is there a text element hidden? (2)
>>> table.nodeName
u'table'

Yes, in the DOM, the child nodes of elements include text nodes, and
even though one might regard the whitespace before the first child
element and that appearing after the last child element as
unimportant, the DOM keeps it around in case it really is important.

[...]

It seems like there are sometimes 'hidden' text elements. This is
probably standard DOM behaviour that I am simply not familiar with, and I
would very much appreciate it if anyone could explain it to me.

Well, the nodes are actually there: they're whitespace used to provide
the indentation in your example. I recommend using XPath to get actual
elements:

table = body.xpath("*")[0]  # get child elements and then select the first

Although people make a big "song and dance" about the DOM being a
nasty API, it's quite bearable if you use it together with XPath
queries.
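
For anyone without libxml2dom to hand, the same idea can be tried with the standard library's xml.etree.ElementTree, which supports a limited subset of XPath. This is only a sketch under assumptions: PAGE is a cut-down, well-formed stand-in for teste.htm (ElementTree is an XML parser, not an HTML parser), and find("*") plays the role of xpath("*")[0].

```python
import xml.etree.ElementTree as ET

# Cut-down, well-formed stand-in for the page in the question (an assumption).
PAGE = """<html>
<body bgcolor="FFFFF">
<table>
<tr><td>8/15/2009</td></tr>
</table>
</body>
</html>"""

body = ET.fromstring(PAGE).find("body")

# "*" matches child elements only, so the whitespace around <table> is skipped.
table = body.find("*")

# A path query also reaches deeper elements without walking siblings by hand.
cells = [td.text for td in body.findall(".//td")]

print(table.tag, cells)
```

The point is the same as with the libxml2dom call: a path expression selects elements directly, so the whitespace text nodes never get in the way.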

Paul
