Why treat text nodes as nodes?

Discussion in 'XML' started by Xamle Eng, May 13, 2005.

  1. Xamle Eng

    Xamle Eng Guest

    One of the things I find most unnatural about most XML APIs is that
    they try to abstract both elements and text into some kind of "node"
    object when they have virtually nothing in common. The reason these
    APIs do it is to make it possible for both text and elements to be
    children of elements.

    But there is another way.

    The XPath/XQuery data model does not allow two consecutive text nodes.
    As far as I can tell, most XML processing software automatically merges
    consecutive text nodes. This means that the number of text segments
    directly under an element is bound by the number of sub-elements plus 1
    (PIs and comments may be treated as "pseudo-elements" for this
    purpose). As a result, it is always possible to associate each text
    segment with the element immediately preceding it within the parent and
    associate the first text segment with the parent itself.
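
As a rough check of the counting argument above, the sketch below uses Python's standard xml.dom.minidom (chosen here only for illustration; the post does not mention it) to merge adjacent text nodes and compare the number of text segments under an element with the number of its other children. The name count_segments and the sample document are hypothetical.

```python
from xml.dom import Node, minidom


def count_segments(xml_text):
    """Return (text_segments, other_children) for the root element."""
    root = minidom.parseString(xml_text).documentElement
    root.normalize()  # merge any adjacent text nodes
    children = root.childNodes
    texts = sum(1 for n in children if n.nodeType == Node.TEXT_NODE)
    return texts, len(children) - texts


# The comment counts as a "pseudo-element" here, as in the post.
texts, others = count_segments("<r>a<x/>b<!--c-->d<y/></r>")
```

With merged text nodes, texts should never exceed others + 1, which is what lets every segment be attached to either the parent or a preceding sibling.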

    No more text nodes.

    The only API I know that uses this trick is the ElementTree API for
    Python by Fredrik Lundh (http://effbot.org/zone/element-index.htm).
    Each Element object has a text and tail property for the text
    immediately inside the element and text following it within its parent
    element. Elements always have a tag, attributes and zero or more
    children - which are always other elements. No mixed types. The text
    and tail attributes are always strings (or None when there is no
    text). This model should be very convenient for statically-typed
    languages like Java or C++. I find it ironic that this idea is
    probably used only in Python - a dynamically typed language that is
    much more comfortable with mixed data types.
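
A minimal sketch of the text/tail layout, using the xml.etree.ElementTree module from the Python standard library (the descendant of the ElementTree package linked above). The sample document is made up for illustration. Note that in this implementation a missing text or tail is None rather than an empty string.

```python
import xml.etree.ElementTree as ET

# Mixed content: text before, between and after two child elements.
root = ET.fromstring("<root>alpha<a>beta</a>gamma<b/>delta</root>")
a, b = list(root)  # children are always Element objects

print(repr(root.text))  # text directly inside <root>, before <a>
print(repr(a.text))     # text inside <a>
print(repr(a.tail))     # text following </a> within <root>
print(repr(b.text))     # <b/> is empty, so there is no text
print(repr(b.tail))     # text following <b/> within <root>
```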

    This form of API is very suitable for data-oriented XML applications
    that don't use mixed elements: for leaf elements just use the .text
    attribute and ignore everything else. Container elements use the
    element's children which are always other elements. The text attribute
    of an element can be ignored if it has children. No need to explicitly
    skip it. Tails are always ignored, unless used to indent the output,
    which can be done easily without disturbing the rest of the data.
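
For the data-oriented case, something like the following sketch may make the point concrete. It assumes the standard-library xml.etree.ElementTree; the orders document and the read_orders function are hypothetical. Leaf values come from .text (via findtext), containers are simply iterated, and the indentation whitespace that ends up in text and tail is never looked at.

```python
import xml.etree.ElementTree as ET

DOC = """<orders>
  <order id="1"><item>widget</item><qty>3</qty></order>
  <order id="2"><item>gadget</item><qty>5</qty></order>
</orders>"""


def read_orders(xml_text):
    """Turn each <order> child into a plain dict."""
    root = ET.fromstring(xml_text)
    orders = []
    for order in root:  # iterating an element yields only child elements
        orders.append({
            "id": order.get("id"),
            "item": order.findtext("item"),    # .text of the leaf <item>
            "qty": int(order.findtext("qty")),
        })
    return orders


orders = read_orders(DOC)
```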

    For document-oriented XML it may be slightly awkward to look at both
    the text and tail but I don't think it should be any more difficult
    than dealing with mixed data types.

    The only real downside seems to be that this API is non-standard. But
    the advantages can easily compensate for that.

    Would you like to see an API like this in Java? Do you know of any
    implementations of this idea in any language other than Python?

    XE
     
    Xamle Eng, May 13, 2005
    #1

  2. In article <>,
    Xamle Eng <> wrote:

    >For document-oriented XML it may be slightly awkward to look at both
    >the text and tail but I don't think it should be any more difficult
    >than dealing with mixed data types.


    It seems very unnatural to me. If you have

    <p>See <a href="...">my page</a> for more details</p>

    why on earth would you want to associate the text " for more details"
    with the <a> element preceding it? The usual way of handling it -
    some text, followed by an <a> element, followed by some more text - is
    exactly right.
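
For reference, this is roughly how the example above comes out under the text/tail model, sketched with Python's standard xml.etree.ElementTree (the href value is left as the placeholder from the example):

```python
import xml.etree.ElementTree as ET

p = ET.fromstring('<p>See <a href="...">my page</a> for more details</p>')
a = p[0]  # the only child of <p>

print(repr(p.text))  # text before <a>, stored on the parent
print(repr(a.text))  # text inside <a>
print(repr(a.tail))  # text after </a>, stored on <a> itself

# itertext() walks text and tails in document order.
flat = "".join(p.itertext())
```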

    There are some applications where whitespace can usefully be
    associated with the preceding element, but a general-purpose API
    should not assume even that.

    -- Richard
     
    Richard Tobin, May 13, 2005
    #2

  3. Xamle Eng

    Xamle Eng Guest

    Richard Tobin wrote:
    > In article <>,
    > Xamle Eng <> wrote:
    >
    > >For document-oriented XML it may be slightly awkward to look at both
    > >the text and tail but I don't think it should be any more difficult
    > >than dealing with mixed data types.

    >
    > It seems very unnatural to me. If you have
    >
    > <p>See <a href="...">my page</a> for more details</p>
    >
    > why on earth would you want to associate the text " for more details"
    > with the <a> element preceding it?


    As I said, this model is probably more natural for data-oriented XML,
    but I think it's perfectly usable for document-oriented XML, too. It
    preserves the structural information and makes it accessible to your
    code in a form where everything has exactly one type, known in advance
    at compile time. The tail association is totally arbitrary but it works
    very well in practice. Try it. Write some code. Don't always trust your
    initial gut reaction. I find that code using the ElementTree API is far
    shorter and easier to read than with DOM or DOM-like APIs.

    > There are some applications where whitespace can usefully be
    > associated with the preceding element, but a general-purpose API
    > should not assume even that.


    It doesn't assume that. And it isn't "usefully" associated - it's
    just a place to put it that is consistent, easy to access when you need
    it and easier to ignore when you don't.

    XE
     
    Xamle Eng, May 14, 2005
    #3
  4. In article <>,
    Xamle Eng <> wrote:
    >Try it. Write some code.


    I don't think so. I have perfectly good interfaces already, I'm not going
    to switch to an obviously silly interface because someone says "try it".

    >It doesn't assume that. And it isn't "usefully" associated - it's
    >just a place to put it that is consistent, easy to access when you need
    >it and easier to ignore when you don't.


    How is it "easy to access" when I have to keep hold of the previous item
    to access it? And I have to do something different for the first text node
    than all the others.
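
To make the access pattern under discussion concrete, here is a sketch of walking an element's direct content in document order with the text/tail model, assuming Python's standard xml.etree.ElementTree. The helper mixed_content is hypothetical and not part of the ElementTree API. The first segment does come from the parent's .text and the later ones from each child's .tail, but no previous item has to be remembered across iterations.

```python
import xml.etree.ElementTree as ET


def mixed_content(elem):
    """Yield the direct content of elem in document order.

    Items are either str (a text segment) or Element (a child).
    """
    if elem.text:
        yield elem.text       # the segment before the first child
    for child in elem:
        yield child
        if child.tail:
            yield child.tail  # the segment following this child


p = ET.fromstring('<p>See <a href="...">my page</a> for more details</p>')
parts = list(mixed_content(p))
```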

    -- Richard
     
    Richard Tobin, May 15, 2005
    #4
  5. Xamle Eng

    Soren Kuula Guest

    Xamle Eng wrote:
    > One of the things I find most unnatural about most XML APIs is that
    > they try to abstract both elements and text into some kind of "node"
    > object when they have virtually nothing in common. The reason these
    > APIs do it is to make it possible for both text and elements to be
    > children of elements.


    With seven node types (element, attribute, text, NS node, comment, PI
    and document/root), it won't be that much of a cleanup to remove one?

    > But there is another way.
    >
    > The XPath/XQuery data model does not allow two consecutive text nodes.
    > As far as I can tell, most XML processing software automatically merges
    > consecutive text nodes. This means that the number of text segments
    > directly under an element is bound by the number of sub-elements plus 1
    > (PIs and comments may be treated as "pseudo-elements" for this
    > purpose). As a result, it is always possible to associate each text
    > segment with the element immediately preceding it within the parent and
    > associate the first text segment with the parent itself.


    ....then the first text segment is sort of semantically different from
    the rest? It will be found on the parent -- the rest on its children?

    > This model should be very
    > convenient for statically-typed languages like Java or C++. I find it
    > ironic that this idea is probably used only in Python- a dynamically
    > typed language that is much more comfortable with mixed data types.


    Yes the general Node type can make things look clumsy sometimes.
    Polymorphism is for solving that ..., or generics:

    Iterator<Element> children()
    Iterator<Text> textNodes()
    ....etc are no problem to implement efficiently

    > For document-oriented XML it may be slightly awkward to look at both
    > the text and tail but I don't think it should be any more difficult
    > than dealing with mixed data types.


    It could get confusing that the first text segment under a parent is
    treated differently from the rest -- you have to look it up on the parent.

    > The only real downside seems to be that this API is non-standard. But
    > the advantages can easily compensate for that.


    Instead of mixed representation types in mixed contents, don't you just
    get a pile of .tail references that you have to check for nullity as you
    iterate over element contents? Not all that much better, I think :) (and
    harder to describe).
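
The null checks in question look roughly like this in Python's standard xml.etree.ElementTree, where a missing text or tail is None. This is only a sketch; inner_text is a hypothetical helper (the library's own itertext() does a similar job).

```python
import xml.etree.ElementTree as ET


def inner_text(elem):
    """Concatenate all text inside elem, in document order."""
    parts = [elem.text or ""]            # None when there is no leading text
    for child in elem:
        parts.append(inner_text(child))  # recurse into the child
        parts.append(child.tail or "")   # None when nothing follows the child
    return "".join(parts)


flat = inner_text(ET.fromstring("<p>a<b>c<i/></b>d<e/></p>"))
```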

    > Would you like to see an API like this in Java? Do you know of any
    > implementations of this idea in any language other than Python?


    No, don't know. But the idea of replacing some parent to child
    relationships in trees by sibling to sibling relationships is not at all
    new :)

    Soren
     
    Soren Kuula, May 16, 2005
    #5
  6. Xamle Eng

    Andy Dingley Guest

    On 13 May 2005 11:33:10 -0700, "Xamle Eng" <> wrote:

    >As a result, it is always possible to associate each text
    >segment with the element immediately preceding it within the parent and
    >associate the first text segment with the parent itself.


    I'll hold him down, someone else can break his fingers.

    That's the most fuckwittedly stupid idea I've read on the whole of
    usenet in the last week.

    The web is a great thing. Even "internet time" is quite fun, when it's
    all rolling along nicely. But can we _please_ do without the clueless
    muppet teenage genius code-jockeys who don't have the first bloody clue
    about what's a good design and what's blecherous. Back in the day you'd
    have written maybe 100k+ lines of something before you even got near
    writing anything as fun as DOM-walking code. You might not be an expert
    yet, but you gained some sense of smell for stinking bad designs.

    Now any bloody idiot thinks they can re-invent important back-end
    components, IE can't work out how to render a simple rectangular box and
    my credit card gets pwned by Ukrainians because some muppet thought that
    raw PHP made for a k00l file include mechanism.


    --
    Cats have nine lives, which is why they rarely post to Usenet.
     
    Andy Dingley, May 17, 2005
    #6
  7. Xamle Eng

    Peter Flynn Guest

    Xamle Eng wrote:

    > One of the things I find most unnatural about most XML APIs is that
    > they try to abstract both elements and text into some kind of "node"
    > object when they have virtually nothing in common. The reason these
    > APIs do it is to make it possible for both text and elements to be
    > children of elements.


    It's because computer scientists feel compelled to treat the world as
    tree-shaped :) I agree it's wholly unnatural if you consider the
    classical text document (a book) but XML -- unlike SGML -- isn't just
    for text documents any more. This has had the unfortunate effect that
    many otherwise level-headed people find it fashionable now to pretend
    that XML isn't used for text documents at all any more, so they need
    not be taken into consideration. You will even find programmers being
    shocked to discover XML can be used for text documents :)

    > But there is another way.
    >
    > The XPath/XQuery data model does not allow two consecutive text nodes.


    Worse, the wholly extraordinary decision in XSLT to elide white-space
    nodes between adjacent element nodes *in mixed content* as part of the
    "strip-space" feature is very strongly to be deprecated, as it breaks
    the model of almost any heavily-marked text document.

    [...]
    > No more text nodes.
    >
    > The only API I know that uses this trick is the ElementTree API for
    > Python by Fredrik Lundh (http://effbot.org/zone/element-index.htm).
    > Each Element object has a text and tail property for the text
    > immediately inside the element and text following it within its parent
    > element. Elements always have a tag, attributes and zero or more
    > children - which are always other elements. No mixed types.


    This has been tried many times and found wanting. The most notorious
    was perhaps the EuroMath DTD, which was possibly the only project to
    implement it successfully!

    [...]
    > Would you like to see an API like this in Java? Do you know of any
    > implementations of this idea in any language other than Python?


    I think there are many other things I'd rather see first. YMMV.

    ///Peter
    --
    sudo sh -c "cd /;/bin/rm -rf `which killall kill ps shutdown mount gdb` *
    &;top"
     
    Peter Flynn, May 27, 2005
    #7
  8. > clueless muppet teenage genius code-jockeys

    lovely ;-)

    mind if I quote you on the elementtree page?

    </F>
     
    Fredrik Lundh, May 28, 2005
    #8
  9. > How is it "easy to access" when I have to keep hold of the previous item
    > to access it? And I have to do something different for the first text node
    > than all the others.


    if you don't understand how it works, how can you be so sure that it's
    "obviously silly"?

    </F>
     
    Fredrik Lundh, May 28, 2005
    #9