Why treat text nodes as nodes?

Xamle Eng · May 13, 2005

One of the things I find most unnatural about most XML APIs is that
they try to abstract both elements and text into some kind of "node"
object when they have virtually nothing in common. The reason these
APIs do it is to make it possible for both text and elements to be
children of elements.

But there is another way.

The XPath/XQuery data model does not allow two consecutive text nodes.
As far as I can tell, most XML processing software automatically merges
consecutive text nodes. This means that the number of text segments
directly under an element is bounded by the number of sub-elements plus 1
(PIs and comments may be treated as "pseudo-elements" for this
purpose). As a result, it is always possible to associate each text
segment with the element immediately preceding it within the parent and
associate the first text segment with the parent itself.

No more text nodes.

The only API I know that uses this trick is the ElementTree API for
Python by Fredrik Lundh (http://effbot.org/zone/element-index.htm).
Each Element object has a text and tail property for the text
immediately inside the element and text following it within its parent
element. Elements always have a tag, attributes and zero or more
children - which are always other elements. No mixed types. The text
and tail attributes are always strings (or None when there is no text).
This model should be very
convenient for statically-typed languages like Java or C++. I find it
ironic that this idea is probably used only in Python, a dynamically
typed language that is much more comfortable with mixed data types.
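
To make the text/tail split concrete, here is a minimal sketch. It assumes the xml.etree.ElementTree module from the current Python standard library (the descendant of the effbot package linked above); the sample document and the describe() helper are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: three text segments at the top level, two child elements.
root = ET.fromstring("<a>one<b>two</b>three<c/>four</a>")

def describe(elem):
    """Return (tag, text, tail) for an element and each of its direct children."""
    return [(e.tag, e.text, e.tail) for e in [elem, *elem]]

for row in describe(root):
    print(row)
# ('a', 'one', None)     first segment lives on the parent as .text
# ('b', 'two', 'three')  text after </b> is stored as b.tail
# ('c', None, 'four')    an empty element has .text of None
```

Note that a missing text or tail shows up as None rather than an empty string, so callers still need a check.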

This form of API is very suitable for data-oriented XML applications
that don't use mixed elements: for leaf elements just use the .text
attribute and ignore everything else. Container elements use the
element's children which are always other elements. The text attribute
of an element can be ignored if it has children. No need to explicitly
skip it. Tails are always ignored, unless used to indent the output,
which can be done easily without disturbing the rest of the data.
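
A rough sketch of that data-oriented pattern, again assuming the standard-library xml.etree.ElementTree module. The order document and the to_data() helper are hypothetical, and the dict-based result assumes sibling tags are unique.

```python
import xml.etree.ElementTree as ET

# Hypothetical data-oriented document: no mixed content, only indentation.
DOC = """<order>
  <id>42</id>
  <customer>
    <name>Ada</name>
    <city>London</city>
  </customer>
</order>"""

def to_data(elem):
    """Convert a leaf element to its string and a container to a dict.

    Container .text and every .tail hold only indentation here,
    so they are never looked at. Repeated sibling tags would
    overwrite each other in the dict (an accepted limit of this sketch).
    """
    if len(elem) == 0:          # leaf: no child elements
        return elem.text or ""
    return {child.tag: to_data(child) for child in elem}

order = to_data(ET.fromstring(DOC))
print(order)
# {'id': '42', 'customer': {'name': 'Ada', 'city': 'London'}}
```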

For document-oriented XML it may be slightly awkward to look at both
the text and tail but I don't think it should be any more difficult
than dealing with mixed data types.
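
For the document-oriented case, a sketch of what "looking at both the text and tail" amounts to, assuming the standard-library xml.etree.ElementTree module. The segments() helper and the sample paragraph are illustrative only.

```python
import xml.etree.ElementTree as ET

def segments(elem):
    """Yield the text pieces inside elem in document order."""
    if elem.text:
        yield elem.text              # text before the first child
    for child in elem:
        yield from segments(child)   # everything inside the child
        if child.tail:
            yield child.tail         # text following the child

# Hypothetical mixed-content paragraph.
p = ET.fromstring("<p>Hello <b>brave <i>new</i></b> world</p>")
plain = "".join(segments(p))
print(plain)
# Hello brave new world
```

Current versions of the module provide Element.itertext(), which walks the tree in the same order.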

The only real downside seems to be that this API is non-standard. But
the advantages can easily compensate for that.

Would you like to see an API like this in Java? Do you know of any
implementations of this idea in any language other than Python?

XE

Richard Tobin · May 13, 2005

Xamle Eng said:
For document-oriented XML it may be slightly awkward to look at both
the text and tail but I don't think it should be any more difficult
than dealing with mixed data types.

It seems very unnatural to me. If you have

<p>See <a href="...">my page</a> for more details</p>

why on earth would you want to associate the text " for more details"
with the <a> element preceding it? The usual way of handling it -
some text, followed by an <a> element, followed by some more text - is
exactly right.

There are some applications where whitespace can usefully be
associated with the preceding element, but a general-purpose API
should not assume even that.

-- Richard

Xamle Eng · May 14, 2005

Richard said:
It seems very unnatural to me. If you have

<p>See <a href="...">my page</a> for more details</p>

why on earth would you want to associate the text " for more details"
with the <a> element preceding it?

As I said, this model is probably more natural for data-oriented XML,
but I think it's perfectly usable for document-oriented XML, too. It
preserves the structural information and makes it accessible to your
code in a form where everything has exactly one type, known in advance
at compile time. The tail association is totally arbitrary but it works
very well in practice. Try it. Write some code. Don't always trust your
initial gut reaction. I find that code using the ElementTree API is far
shorter and easier to read than with DOM or DOM-like APIs.

There are some applications where whitespace can usefully be
associated with the preceding element, but a general-purpose API
should not assume even that.

It doesn't assume that. And it isn't "usefully" associated - it's
just a place to put it that is consistent, easy to access when you need
it and easier to ignore when you don't.

XE

Richard Tobin · May 15, 2005

Try it. Write some code.

I don't think so. I have perfectly good interfaces already, I'm not going
to switch to an obviously silly interface because someone says "try it".

It doesn't assume that. And it isn't "usefully" associated - it's
just a place to put it that is consistent, easy to access when you need
it and easier to ignore when you don't.

How is it "easy to access" when I have to keep hold of the previous item
to access it? And I have to do something different for the first text node
than all the others.

-- Richard

Soren Kuula · May 16, 2005

Xamle said:
One of the things I find most unnatural about most XML APIs is that
they try to abstract both elements and text into some kind of "node"
object when they have virtually nothing in common. The reason these
APIs do it is to make it possible for both text and elements to be
children of elements.

With seven node types (element, attribute, text, NS node, comment, PI
and document/root), it won't be that much of a cleanup to remove one?

But there is another way.

The XPath/XQuery data model does not allow two consecutive text nodes.
As far as I can tell, most XML processing software automatically merges
consecutive text nodes. This means that the number of text segments
directly under an element is bounded by the number of sub-elements plus 1
(PIs and comments may be treated as "pseudo-elements" for this
purpose). As a result, it is always possible to associate each text
segment with the element immediately preceding it within the parent and
associate the first text segment with the parent itself.

....then the first text segment is sort of semantically different from
the rest? It will be found on the parent -- the rest on its children?

This model should be very
convenient for statically-typed languages like Java or C++. I find it
ironic that this idea is probably used only in Python, a dynamically
typed language that is much more comfortable with mixed data types.

Yes, the general Node type can make things look clumsy sometimes.
Polymorphism is for solving that, or generics:

Iterator<Element> children()

For document-oriented XML it may be slightly awkward to look at both
the text and tail but I don't think it should be any more difficult
than dealing with mixed data types.

It could get confusing that the first text segment under a parent is
treated differently from the rest -- you have to look it up on the parent.

The only real downside seems to be that this API is non-standard. But
the advantages can easily compensate for that.

Instead of mixed representation types in mixed contents, don't you just
get a pile of .tail references that you have to check for nullity as you
iterate over element contents? Not all that much better, I think (and
harder to describe).
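
A sketch of the iteration in question, assuming Python's standard-library xml.etree.ElementTree module; mixed_content() is a hypothetical helper, not part of that API. It shows both the special case for the first segment and the null checks on each tail.

```python
import xml.etree.ElementTree as ET

def mixed_content(parent):
    """Return the parent's content as an ordered list of strings and elements."""
    items = []
    if parent.text is not None:        # first segment: looked up on the parent
        items.append(parent.text)
    for child in parent:
        items.append(child)
        if child.tail is not None:     # later segments: looked up on the child
            items.append(child.tail)
    return items

# Hypothetical sample with two adjacent elements and no text between them.
p = ET.fromstring("<p>one<b/><c/>two</p>")
items = mixed_content(p)
print([i if isinstance(i, str) else "<%s>" % i.tag for i in items])
# ['one', '<b>', '<c>', 'two']
```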

Would you like to see an API like this in Java? Do you know of any
implementations of this idea in any language other than Python?

No, I don't know of any. But the idea of replacing some parent to child
relationships in trees by sibling to sibling relationships is not at all
new.

Soren

Andy Dingley · May 17, 2005

As a result, it is always possible to associate each text
segment with the element immediately preceding it within the parent and
associate the first text segment with the parent itself.

I'll hold him down, someone else can break his fingers.

That's the most fuckwittedly stupid idea I've read on the whole of
usenet in the last week.

The web is a great thing. Even "internet time" is quite fun, when it's
all rolling along nicely. But can we _please_ do without the clueless
muppet teenage genius code-jockeys who don't have the first bloody clue
about what's a good design and what's blecherous. Back in the day you'd
have written maybe 100k+ lines of something before you even got near
writing anything as fun as DOM-walking code. You might not be an expert
yet, but you gained some sense of smell for stinking bad designs.

Now any bloody idiot thinks they can re-invent important back-end
components, IE can't work out how to render a simple rectangular box and
my credit card gets pwned by Ukrainians because some muppet thought that
raw PHP made for a k00l file include mechanism.

Peter Flynn · May 27, 2005

Xamle said:
One of the things I find most unnatural about most XML APIs is that
they try to abstract both elements and text into some kind of "node"
object when they have virtually nothing in common. The reason these
APIs do it is to make it possible for both text and elements to be
children of elements.

It's because computer scientists feel compelled to treat the world as
tree-shaped.

I agree it's wholly unnatural if you consider the
classical text document (a book) but XML -- unlike SGML -- isn't just
for text documents any more. This has had the unfortunate effect that
many otherwise level-headed people find it fashionable now to pretend
that XML isn't used for text documents at all any more, so they need
not be taken into consideration. You will even find programmers being
shocked to discover XML can be used for text documents.

But there is another way.

The XPath/XQuery data model does not allow two consecutive text nodes.

Worse, the wholly extraordinary decision in XSLT to elide white-space
nodes between adjacent element nodes *in mixed content* as part of the
"strip-space" feature is very strongly to be deprecated, as it breaks
the model of almost any heavily-marked text document.

[...]

No more text nodes.

The only API I know that uses this trick is the ElementTree API for
Python by Fredrik Lundh (http://effbot.org/zone/element-index.htm).
Each Element object has a text and tail property for the text
immediately inside the element and text following it within its parent
element. Elements always have a tag, attributes and zero or more
children - which are always other elements. No mixed types.

This has been tried many times and found wanting. The most notorious
was perhaps the EuroMath DTD, which was possibly the only project to
implement it successfully!

[...]

Would you like to see an API like this in Java? Do you know of any
implementations of this idea in any language other than Python?

I think there are many other things I'd rather see first. YMMV.

///Peter

Fredrik Lundh · May 28, 2005

clueless muppet teenage genius code-jockeys

lovely ;-)

mind if I quote you on the elementtree page?

</F>

Fredrik Lundh · May 28, 2005

How is it "easy to access" when I have to keep hold of the previous item
to access it? And I have to do something different for the first text node
than all the others.

if you don't understand how it works, how can you be so sure that it's
"obviously silly"?

</F>
