lxml/ElementTree and .tail

Chas Emerick

I looked around for an ElementTree-specific mailing list, but found
none -- my apologies if this is too broad a forum for this question.

I've been using the lxml variant of the ElementTree API, which I
understand works in much the same way (with some significant
additions). In particular, it shares the use of a .tail attribute.
I ran headlong into this aspect of the API while doing some DOM
manipulations, and it's got me pretty confused.

Example:
>>> from lxml import etree as ET
>>> frag = ET.XML('<a>head<b>inside</b>tail</a>')
>>> b = frag.xpath('//b')[0]
>>> b.text
'inside'
>>> b.tail
'tail'
>>> frag.remove(b)
>>> ET.tostring(frag)
'<a>head</a>'

As you can see, the .tail text is removed as part of the <b> element
-- but it IS NOT part of the <b> element. I understand the use of
the .tail attribute given the desire to simplify the API by avoiding
pure text nodes, but it seems entirely inappropriate for the tail
text to disappear into the ether when what is technically a sibling
node is removed.

Performing the same operations with the Java DOM api (crimson, in
this case it turns out) yields what I would expect (here I'm using
JPype to access a v1.4.2 JVM through python -- which makes things
somewhat less painful):
u'<a>headtail</a>'

(Sorry for the Java comparison, but that's where I first cut my teeth
on XML, and that's where my expectations were formed.)

That's a pretty significant mismatch in functionality. I certainly
understand the motivations of Mr. Lundh to make the ET API as
pythonic as possible, but ET's behaviour in this specific context is
flatly wrong as far as I can see. I would have expected that a
removal operation would have appended <b>'s tail text to the text of
<a> (or perhaps to the tail text of <b>'s closest preceding sibling)
-- something that I think I'm going to have to do in order to
continue using lxml / ElementTree.
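
For anyone who lands here later, a minimal sketch of the workaround described above: before the element is removed, its tail is appended to the tail of its closest preceding sibling, or to the parent's text if there is no preceding sibling. The name remove_keep_tail is hypothetical, and the sketch assumes the standard library's xml.etree.ElementTree (which shares the .text/.tail model) instead of lxml, so that it runs without extra packages.

```python
import xml.etree.ElementTree as ET


def remove_keep_tail(parent, child):
    """Remove child from parent but leave child's tail text in the tree.

    Hypothetical helper, not part of ElementTree or lxml.
    """
    tail = child.tail
    if tail:
        index = list(parent).index(child)
        if index == 0:
            # No preceding sibling: the tail joins the parent's own text.
            parent.text = (parent.text or "") + tail
        else:
            # Otherwise it joins the tail of the preceding sibling.
            previous = parent[index - 1]
            previous.tail = (previous.tail or "") + tail
    parent.remove(child)


frag = ET.XML("<a>head<b>inside</b>tail</a>")
remove_keep_tail(frag, frag.find("b"))
result = ET.tostring(frag, encoding="unicode")
print(result)  # <a>headtail</a>
```

With lxml the same idea should carry over, since its elements expose the same .text and .tail attributes.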

I ran this issue past a few people I know who've worked with and
written about ElementTree, and their response to this apparent
divergence between the ET DOM API and "standard" DOM APIs was
roughly: "that's just the way it is".

Comments, thoughts?

Chas Emerick
Founder, Snowtide Informatics Systems
Enterprise-class PDF content extraction

(e-mail address removed)
http://snowtide.com | +1 413.519.6365
 
Stefan Behnel

Hi,

Chas said:
I looked around for an ElementTree-specific mailing list, but found none
-- my apologies if this is too broad a forum for this question.

The lxml mailing list is always happy to receive feedback, but it's fine to
ask here if it's not lxml specific.

I've been using the lxml variant of the ElementTree API.
it shares the use of a .tail attribute. I
ran headlong into this aspect of the API while doing some DOM
manipulations, and it's got me pretty confused.

Example:
>>> from lxml import etree as ET
>>> frag = ET.XML('<a>head<b>inside</b>tail</a>')
>>> b = frag.xpath('//b')[0]
>>> b.text
'inside'
>>> b.tail
'tail'
>>> frag.remove(b)
>>> ET.tostring(frag)
'<a>head</a>'

As you can see, the .tail text is removed as part of the <b> element --
but it IS NOT part of the <b> element.

Yes, it is. Just look at the API. It's an attribute of an Element, isn't it?
What other API do you know where removing an element from a data structure
leaves part of the element behind?

If you want to copy part of the removed element back into the tree, feel free
to do so.

Performing the same operations with the Java DOM api
(Sorry for the Java comparison, but that's where I first cut my teeth on
XML, and that's where my expectations were formed.)

That's a pretty significant mismatch in functionality.

IMHO, DOM has a pretty significant mismatch with Python.

I ran this issue past a few people I know who've worked with and written
about ElementTree, and their response to this apparent divergence
between the ET DOM API and "standard" DOM APIs was roughly: "that's just
the way it is".

It's just a matter of understanding (or getting used to) the API. You might
want to stop thinking in terms of '<' and '>' and rather embrace the API
itself as a way to work with the XML Infoset (rather than the XML DOM).

Stefan
 
Paul Boddie

[Remove an element, remove following nodes]
Yes, it is. Just look at the API. It's an attribute of an Element, isn't it?
What other API do you know where removing an element from a data structure
leaves part of the element behind?

I guess it depends on what you regard an element to be...

[...]
IMHO, DOM has a pretty significant mismatch with Python.

...in the DOM or otherwise:

http://www.w3.org/TR/2006/REC-xml-20060816/#sec-logical-struct

Paul
 
Fredrik Lundh

Paul said:
I guess it depends on what you regard an element to be...

Stefan said "Element", not "element".

"Element" is a class in the "ElementTree" module, which can be used to
*represent* an XML element in an XML infoset, including all the data
*inside* the XML element, and any data *between* that XML element and
the next one (which is always character data, of course).
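
To make that layout concrete, here is a small sketch (an illustration added to the thread, assuming the standard library's xml.etree.ElementTree; the sample document is made up) that lists where each run of character data ends up:

```python
import xml.etree.ElementTree as ET

# Mixed content: text runs sit before, between and after the child elements.
root = ET.XML("<p>one<i>two</i>three<b>four</b>five</p>")

# For every Element, .text is the data right after its start tag and
# .tail is the data between its end tag and the next tag.
layout = [(el.tag, el.text, el.tail) for el in root.iter()]
for tag, text, tail in layout:
    print(tag, repr(text), repr(tail))
# p 'one' None
# i 'two' 'three'
# b 'four' 'five'
```

Note that "three" and "five" belong to the i and b Element objects, even though in the serialized form they sit inside p.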

It's not very difficult, really; especially if you, as Stefan said,
think in infoset terms rather than "a sequence of little piggies" terms.

</F>
 
Paul Boddie

Fredrik said:
It's not very difficult, really; especially if you, as Stefan said,
think in infoset terms rather than "a sequence of little piggies" terms.

Are piggies part of the infoset too? Does the Piggie class represent a
piggie from the infoset plus a stretch of the road to the market? ;-)

Paul
 
Chas Emerick

Thanks for the comments and thoughts. I must admit that I have an
overwhelming feeling of having just stepped into the middle of a
complex, heated conversation without having heard the preamble.

(FYI, this reply is only an attempt to help those that come
afterwards -- I'm not looking to advocate much of anything here.)

Fredrik's invocation of the "infoset" term led me to a couple of
quick searches that clarified the state of play. Here he sets the
stage for the .tail behaviour that I originally posted about:

http://effbot.org/zone/element-infoset.htm

And it looks like there have been tussles over other mismatches in
expectations before, specifically around how namespaces are handled:

http://groups.google.com/group/comp.lang.python/browse_thread/thread/31b2e9f4a8f7338c
http://nixforums.org/ntopic43901.html

From what I can see, there are more than a few people that have
stumbled with ElementTree's API because of their preexisting
expectations, which others have probably correctly bucketed as
"implementation details". This comes as quite a shock to those who
have stumbled (including myself) who have, lo these many years, come
to view those details as the only standard that matters (perhaps
simply because those details have been so consistent in our experience).

Which, in my view, is just fine -- different strokes for different
folks, and all that. When I originally started poking around the
python xml world, I was somewhat confused as to why 4suite/Domlette
existed, as it seemed pretty clear that ElementTree had crystallized
a lot of mindshare, and has a very attractive API to boot.
Thankfully, I can now see its appeal, and am very glad it's around,
as it seems to have all of those comfortable implementation details
that I've been looking for. :)

As for the infoset vs. "sequence of piggies" nut: if ElementTree's
infoset approach is technically correct, then wouldn't it also be
correct to use a .head attribute instead of a .tail attribute? Example:

<a>first<b>middle</b>last</a>

might be represented as:

<Element a: head='', text='last'>
<Element b: head='first', text='middle'>

If I'm wrong, just chalk it up to the fact that this is the first
time I've ever looked at the Infoset spec, and I'm simply confused.
If that IS a technically-valid way to represent the above xml
fragment . . . then I guess I'll make sure to tread more carefully in
the future around tools that work in infoset terms. For me, it turns
out that sequences of piggies really are important, at least in
contexts where XML is merely a means to an end (either because of the
attractiveness of the toolsets or because we must cope with what
we're provided as input) and where consistency with existing tools
(like those that adhere to DOM level 2/3) and expectations are
critical. I think this is what Paul was nodding towards with his
original response to Stefan's response.

Cheers,

- Chas
 
Fredrik Lundh

Chas said:
> might be represented as:
>
> <Element a: head='', text='last'>
> <Element b: head='first', text='middle'>

sure, and you could use a text subtype instead that kept track of the
elements above it, and let the elements be sequences of their siblings
instead of their children, and perhaps stuff everything in a dictionary.
such a construct would also be able to hold the same data, and be very
hard to use in most normal situations.

> If I'm wrong, just chalk it up to the fact that this is the first
> time I've ever looked at the Infoset spec, and I'm simply confused.

the Infoset spec *is* the essence of XML; if you don't realize that an
XML document is just a serialization of a very simple data model, you're
bound to be fighting with XML all the time.

but ET doesn't implement the Infoset spec as it is, of course: it uses a
*simplified* model, carefully optimized for the large percentage of all
XML formats that simply doesn't use mixed content. if you're doing
document-style processing, you sometimes need to add an extra assignment
or two, but unless you're doing *only* document-style processing, ET's
API gives you a net win. (and even if you're doing only document-style
processing, ET's speed and memory footprint gives you a net win over
most competing technologies).
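
As an illustration of the "extra assignment or two" (a sketch added to the thread, assuming the standard library's xml.etree.ElementTree and a made-up paragraph): inserting an element into the middle of a text run means splitting the text between the parent's .text and the new element's .tail.

```python
import xml.etree.ElementTree as ET

p = ET.XML("<p>line one line two</p>")

# Split the existing text at the point where the new element should go.
before, after = "line one ", "line two"

br = ET.Element("br")
br.tail = after   # the extra assignment: text that follows the new element
p.text = before   # text that precedes it stays on the parent
p.append(br)

result = ET.tostring(p, encoding="unicode")
print(result)  # <p>line one <br />line two</p>
```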

</F>
 
Chas Emerick

Fredrik said:
the Infoset spec *is* the essence of XML; if you don't realize that an
XML document is just a serialization of a very simple data model, you're
bound to be fighting with XML all the time.

The principle and the practice diverge significantly in our neck of
the woods. The current project involves consuming and making sense
of extraordinarily (and typically unnecessarily) complex XHTML. Of
course, as you say, those documents are still serializations of a
simple data model, but the types of manipulations we do happen to
butt up very uncomfortably with the way ET does things.

but ET doesn't implement the Infoset spec as it is, of course: it uses a
*simplified* model, carefully optimized for the large percentage of all
XML formats that simply doesn't use mixed content. if you're doing
document-style processing, you sometimes need to add an extra assignment
or two, but unless you're doing *only* document-style processing, ET's
API gives you a net win. (and even if you're doing only document-style
processing, ET's speed and memory footprint gives you a net win over
most competing technologies).

Yeah, documents are all we do -- XML just happens to be a pleasant
intermediate format, and something we need to consume. The notion of
nicely-formatted XML is entirely foreign to the work that we do --
in fact, our current focus is (in part) dragging decidedly
unstructured data out of those XHTML documents (among other source
formats) and putting them into a reasonable, useful structure.

I took some time last night to bang out some functions that squeezed
ET's model (via lxml) into doing what we need, and it ended up
requiring a lot more B&D than I like. At that point, I swung over to
4suite, which dropped into place quite nicely.

*shrug* I guess we're just in the minority with regard to our API
requirements -- we happen to live in the corner cases. I'm certainly
glad to have made the detour on a different path for a bit though.

- Chas
 
Fredrik Lundh

Chas said:
The principle and the practice diverge significantly in our neck of
the woods. The current project involves consuming and making sense
of extraordinarily (and typically unnecessarily) complex XHTML.

wasn't your original complaint that ET didn't do the "right thing" when
you removed elements from a mixed-content tree? (something that can be
trivially handled with a 2-line helper function)

why mutate the tree if all you want is to extract information from it?
doesn't sound very efficient to me...
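
A rough sketch of what non-mutating extraction could look like (added for illustration; the function name visible_text and the choice of "script" as the tag to skip are assumptions, and the standard library's xml.etree.ElementTree stands in for lxml): the tree is walked once, unwanted elements are skipped, and their tails are still collected because the tail belongs to the surrounding content.

```python
import xml.etree.ElementTree as ET


def visible_text(elem, skip=frozenset({"script"})):
    """Collect text under elem, ignoring the content of skipped tags.

    Hypothetical helper; the tree itself is never modified.
    """
    parts = [elem.text or ""]
    for child in elem:
        if child.tag not in skip:
            parts.append(visible_text(child, skip))
        # The tail follows the child but is part of elem's content,
        # so it is kept even when the child itself is skipped.
        parts.append(child.tail or "")
    return "".join(parts)


doc = ET.XML("<div>Hello <script>x()</script>world<b>!</b></div>")
text = visible_text(doc)
print(text)  # Hello world!
```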

</F>
 
Chas Emerick

Fredrik said:
wasn't your original complaint that ET didn't do the "right thing" when
you removed elements from a mixed-content tree? (something that can be
trivially handled with a 2-line helper function)

Yes, that was the initial issue, but the delta between Elements and
DOM-style elements leads to other issues. There's no doubt that the
needed helpers are simple, but all things being equal, not having to
carry them around anywhere we're doing DOM manipulations is a big plus.

why mutate the tree if all you want is to extract information from it?
doesn't sound very efficient to me...

Because we're far from doing anything that is regular or one-off in
nature. We're systematizing the extraction of data from functionally
unstructured content, and it's flatly necessary to normalize the
XHTML into something that can be easily consumed by the processes
we've built that can do that content->data extraction/conversion from
plain text, XML, PDF, and now XHTML.

Remember, corner cases. :)

- Chas
 
Stefan Behnel

Chas said:
the delta between Elements and DOM-style elements leads to other issues.
There's no doubt that the needed helpers are simple, but all things being
equal, not having to carry them around anywhere we're doing DOM
manipulations is a big plus.

Because we're far from doing anything that is regular or one-off in nature.
We're systematizing the extraction of data from functionally unstructured
content, and it's flatly necessary to normalize the XHTML into something
that can be easily consumed by the processes we've built that can do that
content->data extraction/conversion from plain text, XML, PDF, and now
XHTML.

Remember, corner cases. :)

Hmm, then I really don't get why you didn't just write a customised XHTML API
on top of lxml's custom Element classes feature. Hiding XML language specific
behaviour directly in the Element classes really helps in getting your code
clean, especially in larger code bases.

Stefan
 
Fredrik Lundh

Paul said:
Are piggies part of the infoset too? Does the Piggie class represent a
piggie from the infoset plus a stretch of the road to the market? ;-)

no, they just appear in serialized XML. if you want concrete piggies, you have
to wrap ET's iterparse function, or perhaps the XMLParser class.
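
For readers who do want the event sequence, here is a minimal sketch of the XMLParser route (added for illustration; the class name EventRecorder is hypothetical, and the standard library's xml.etree.ElementTree is assumed). A custom parser target receives start, data and end callbacks in document order:

```python
import xml.etree.ElementTree as ET


class EventRecorder:
    """Parser target that records tags and text runs as a flat sequence."""

    def __init__(self):
        self.events = []

    def start(self, tag, attrib):
        self.events.append(("start", tag))

    def end(self, tag):
        self.events.append(("end", tag))

    def data(self, text):
        # The parser may deliver one text run in several chunks; merge them.
        if self.events and self.events[-1][0] == "text":
            self.events[-1] = ("text", self.events[-1][1] + text)
        else:
            self.events.append(("text", text))

    def close(self):
        return self.events


parser = ET.XMLParser(target=EventRecorder())
parser.feed("<a>head<b>inside</b>tail</a>")
events = parser.close()  # returns whatever the target's close() returns
for event in events:
    print(event)
```

In this view "tail" is simply a text event between the end of b and the end of a, with no owner.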

</F>
 
Uche Ogbuji

Fredrik said:
the Infoset spec *is* the essence of XML; if you don't realize that an
XML document is just a serialization of a very simple data model, you're
bound to be fighting with XML all the time.

I certainly have never liked the aspects of the ElementTree API under
present discussion. But that's not as important as the fact that I
think the above statement is misleading. There has always been a
battle in XML between the people who think the serialization is
preeminent, and those who believe some data model is preeminent, but
the reality is that XML 1.0 (and 1.1) is a spec *defined* by its
serialization. Infoset is a secondary and optional spec. In fact, I
think it's clear that Infoset is not even the preeminent *data model*
of the XML world. That distinction goes to the XPath data model, which
is quite different from the Infoset.
 
Fredrik Lundh

Uche said:
I certainly have never liked the aspects of the ElementTree API under
present discussion. But that's not as important as the fact that I
think the above statement is misleading. There has always been a
battle in XML between the people who think the serialization is
preeminent, and those who believe some data model is preeminent, but
the reality is that XML 1.0 (and 1.1) is a spec *defined* by its
serialization.

sure, the computing world is and has always been full of people who want
the simplest thing to look a lot harder than it actually is. after all,
*they* spent lots of time reading all the specifications, they've bought
all the books, and went to all the seminars, so it's simply not fair
when others are cheating.

in reality, *all* interchange formats are easier to understand and use
if you focus on a (complete or intentionally simplified) data model of
the things being interchanged, and treat various artifacts of the
byte-stream used by the wire format as artifacts, historical accidents
based on what specification happened to be written before the other, or
what some guy did or did not do in the seventies, as accidents, and
esoteric arcana disseminated on limited-distribution mailing lists as
about as relevant for your customer as last week's episode of American Idol.

(XML is a bit unusual in this respect, but that's probably just some
variation of the bikeshed effect. it's just text, and everyone with
a keyboard knows what that is, so we don't need to use established
software engineering practices, or think about security *at all*
(Billion laughs? XXE?) or, for that matter, learn from people who've
been doing data interchange in other domains since the dawn of time.
and when they do appear anyway, and mess with our technology in ways
that we haven't authorized, without reading our books or going to our
seminars or subscribing to our mailing lists, we can write them off as
"clueless muppet teenage genius code-jockeys", and keep patting
ourselves on the back, while the rest of the world is busy routing around
us, switching to well-understood XML subsets or other serialization
formats, simpler and more flexible data models, simpler API:s, and
more robust code. and Python ;-)

</F>
 
Paul McGuire

Fredrik Lundh said:
(XML is a bit unusual in this respect, but that's probably just some
variation of the bikeshed effect. it's just text, and everyone with
a keyboard knows what that is, so we don't need to use established
software engineering practices, or think about security *at all* (Billion
laughs? XXE?) or, for that matter, learn from people who've
been doing data interchange in other domains since the dawn of time. and
when they do appear anyway, and mess with our technology in ways that we
haven't authorized, without reading our books or going to our seminars or
subscribing to our mailing lists, we can write them off as "clueless
muppet teenage genius code-jockeys", and keep patting ourselves on the
back, while the rest of the world is busy routing around us, switching to
well-understood XML subsets or other serialization formats, simpler and
more flexible data models, simpler API:s, and
more robust code. and Python ;-)

maybe time to switch to decaf... :)
 
Paul McGuire

Fredrik Lundh said:
do you disagree with my characterization of the state of the XML universe?

</F>

Thankfully, I'm largely on the periphery of that universe (except for being
a sometimes victim). But it is certainly frustrating to see many of the OMG
concepts of the 90's reimplemented in Java services, and then again in
XML/SOAP, with no detectable awareness that these messaging and
serialization problems have been considered before, and much more
thoroughly.

I liked XML when I could read it and hack it out in Notepad. I like
attributes, which puts me on the outs with most XML zealots who forswear the
use of attributes on purely academic grounds (they defeat the future
possible expansion of an attribute's value into more complex substructure).
I dislike namespaces, especially the default xmlns kind, as they make me
take extra steps when retrieving nodes via Xpaths; and everyone seems to
think their application needs namespaces, when there is no threat that these
tags will ever get mixed up with anyone else's.
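
A small sketch of those extra steps (added for illustration, assuming the standard library's xml.etree.ElementTree and a made-up XHTML snippet; the prefix "x" is an arbitrary choice): with a default xmlns, an unqualified path finds nothing, and the lookup has to carry the namespace.

```python
import xml.etree.ElementTree as ET

XHTML = "http://www.w3.org/1999/xhtml"
doc = ET.XML('<html xmlns="%s"><body><p>hi</p></body></html>' % XHTML)

# The unqualified name does not match: the element is really {namespace}body.
plain = doc.find("body")
print(plain)  # None

# Either map a prefix to the namespace for the lookup...
qualified = doc.find("x:body/x:p", {"x": XHTML})
print(qualified.text)  # hi

# ...or spell out the qualified name in Clark notation.
clark = doc.find("{%s}body/{%s}p" % (XHTML, XHTML))
print(clark.text)  # hi
```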

No, I was mostly amused (which I thought was your intent, given the trailing
smiley) at your breathless, quasi-rant against the XML milieu in general - I
think your one sentence went on for about 15 lines!

-- Paul
 
Chas Emerick

Uche said:
I certainly have never liked the aspects of the ElementTree API under
present discussion. But that's not as important as the fact that I
think the above statement is misleading. There has always been a
battle in XML between the people who think the serialization is
preeminent, and those who believe some data model is preeminent, but
the reality is that XML 1.0 (and 1.1) is a spec *defined* by its
serialization.

sure, the computing world is and has always been full of people who want
the simplest thing to look a lot harder than it actually is. after all,
*they* spent lots of time reading all the specifications, they've bought
all the books, and went to all the seminars, so it's simply not fair
when others are cheating.
[snip]

and keep patting ourselves on the back, while the rest of the world is
busy routing around
us, switching to well-understood XML subsets or other serialization
formats, simpler and more flexible data models, simpler API:s, and
more robust code. and Python ;-)

That's flatly unrealistic. If you'll remember, I'm not one of "those
people" that are specification-driven -- I hadn't even *heard* of
Infoset until earlier this week! However, I am driven to ensure that
the code I (and we) write works *as others expect* when confronted by
any of the billions of XML documents out there. Simpler is better,
and better is better (thus why I am in python-land), unless that
simplicity makes it difficult to play nicely with others. Shrugging
off the way everyone else does things reminds me of various CSS
fanatics I know of that simply won't use tables or IE CSS
compatibility hacks, even if that's what's needed to get things to work.

I've never been involved in any "XML battles", but to Uche's point, I
would speculate (only on the basis of personal interactions and
anecdotes) that some overwhelming majority of the developers out
there care for nothing but the serialization, simply because that's
how one plays nicely with others. I would count myself in that group
as well, although I do recognize that there is a worthy academic
exercise in exploring the data-model-centric XML worldview.

OT: Uche, 4suite XML is tops! Thank you very much for that.

- Chas
 
