How to parse an XML doc with HTML tags within the texts

Francesco Moi · Feb 20, 2005

Hi.

I must parse this XML document:
--------------
<doc>
<item>
<name>Jerry</name>
<message>Hi My name is Jerry</message>
</item>
</doc>
--------

When I try to get the 'message' value by using:
getElementsByTagName('message')->item(0)->getFirstChild->getNodeValue;

I get only:
Hi

Any suggestion to get the whole text? I'm using Xerces+Perl.
Thank you very much.

Martin Honnen · Feb 20, 2005

Francesco Moi wrote:

I must parse this XML document:
--------------
<doc>
<item>
<name>Jerry</name>
<message>Hi My name is Jerry</message>
</item>
</doc>

That is not XML as it is not well-formed, there needs to be a closing

When I try to get the 'message' value by using:
getElementsByTagName('message')->item(0)->getFirstChild->getNodeValue;

I get only:
Hi

That is odd, if you really parse with an XML parser then you shouldn't
get to a DOM at all, parsing should throw an error.
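
A minimal sketch of that behaviour, using Python's standard-library `xml.dom.minidom` as a stand-in for Xerces-Perl. The `<br>` in the sample is an assumption about which unclosed tag was in the original message, and `try_parse` is a hypothetical helper name.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError

# Hypothetical input: <br> is an assumed stand-in for the unclosed HTML tag.
BROKEN = (
    "<doc><item><name>Jerry</name>"
    "<message>Hi<br>My name is Jerry</message>"
    "</item></doc>"
)


def try_parse(text):
    """Return (document, None) on success or (None, error message) on failure."""
    try:
        return minidom.parseString(text), None
    except ExpatError as err:
        return None, str(err)


doc, error = try_parse(BROKEN)
if doc is None:
    print("rejected:", error)
else:
    print("parsed")
```

A conforming parser stops at the first well-formedness error, so no DOM is built for this input.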

Andy Dingley · Feb 20, 2005

I must parse this XML document:
--------------
<doc>
<item>
<name>Jerry</name>
<message>Hi My name is Jerry</message>
</item>
</doc>
--------

That's not a well-formed XML document.

I assume that <message> is from your own schema, and that you want to
embed some HTML fragment within it. At this point I usually start
wondering if I can use RSS instead, and save myself a lot of effort.

Your failure here is that the HTML fragment isn't a well-formed XML
fragment. You have several choices:

- Use XHTML instead of HTML. This _might_ work, but you still need to
only include balanced and well-formed fragments. If it's generated
within your own system it might be workable, but it's not a general
solution to reading other people's content (which will always break
sometime).

- Write a parser that can handle tag soup. This is what you need to do
when reading other people's RSS feeds, because they're so often
mis-formed.

- Use HTML, but mangle into well-formed XML (i.e. becomes
 ) This is ugly, worse than using XHTML and has nothing to
commend it.

- Embed the HTML into the XML, either by encoding it, or by using a
CDATA section.

Read the infamous RSS versions note
http://diveintomark.org/archives/2004/02/04/incompatible-rss
It gives some useful background on these issues.

As well as tag / element formation issues, watch out for HTML entity
references that aren't in core XML (like &eacute;) and for embedded
CDATA sections too.
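
The last choice (encoding the HTML or wrapping it in CDATA) could look roughly like this. It is a sketch in Python's standard library rather than Xerces-Perl. The sample fragment with `<br>` is an assumption, and `wrap_cdata` and `message_text` are hypothetical helper names.

```python
from xml.dom import minidom
from xml.sax.saxutils import escape

# Hypothetical HTML fragment; it is not well-formed as XML.
HTML = "Hi<br>My name is Jerry"

# Option A: encode &, < and > so the fragment becomes plain character data.
encoded = "<message>" + escape(HTML) + "</message>"


def wrap_cdata(html):
    """Wrap html in CDATA, splitting any literal ']]>' across two sections."""
    return "<![CDATA[" + html.replace("]]>", "]]]]><![CDATA[>") + "]]>"


# Option B: leave the fragment untouched inside a CDATA section.
cdata = "<message>" + wrap_cdata(HTML) + "</message>"


def message_text(xml_text):
    """Join the text and CDATA children of the root element."""
    root = minidom.parseString(xml_text).documentElement
    return "".join(
        node.data
        for node in root.childNodes
        if node.nodeType in (node.TEXT_NODE, node.CDATA_SECTION_NODE)
    )


print(message_text(encoded))
print(message_text(cdata))
```

Either way the XML parser only ever sees character data. The consumer gets the HTML back as a string and has to hand it to an HTML parser separately.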

Malte · Feb 20, 2005

Francesco said:
Hi.

I must parse this XML document:
--------------
<doc>
<item>
<name>Jerry</name>
<message>Hi My name is Jerry</message>
</item>
</doc>
--------

When I try to get the 'message' value by using:
getElementsByTagName('message')->item(0)->getFirstChild->getNodeValue;

I get only:
Hi

Any suggestion to get the whole text? I'm using Xerces+Perl.
Thank you very much.

We have an application that outputs this kind of rubbish (rubbish being
!xhtml ;-)).
I had to take out all the unbalanced tags before being able to parse the
results.
Much easier, if you can enforce xhtml, IMHO.

francescomoi · Feb 20, 2005

Sorry, it's a instead of .
-----------------------
<doc>
<item>
<name>Jerry</name>
<message>Hi My name is Jerry</message>
</item>
</doc>
----------------------

William Park · Feb 21, 2005

Sorry, it's a instead of .
-----------------------
<doc>
<item>
<name>Jerry</name>
<message>Hi My name is Jerry</message>
</item>
</doc>
----------------------

sed 's, ,,g'

Andy Dingley · Feb 21, 2005

Sorry, it's a instead of .

It's not a parsing problem either, it's a DOM problem.

"Hi" is the first child of <message>, that's what you asked for,
that's what you got.

item(0) & getFirstChild are effectively duplicates here. So instead
of getting the content of the first <message>, you're getting the
first member (one text node) of this content.

To get "the whole text" is a common requirement, but not particularly
meaningful in a pure XML sense. So it's not part of the standard DOM.
You can usually use a .text property, or else you'll have to iterate /
collect all the text nodes yourself and concatenate them.
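
A sketch of the iterate-and-concatenate approach, written against Python's `xml.dom.minidom` instead of Xerces-Perl, although the DOM calls map across fairly directly. The sample uses `<br/>` as an assumed well-formed version of the message, and `collect_text` is a hypothetical helper name.

```python
from xml.dom import Node, minidom

# Hypothetical well-formed input; <br/> is an assumed stand-in for the HTML tag.
XML = (
    "<doc><item><name>Jerry</name>"
    "<message>Hi<br/>My name is Jerry</message>"
    "</item></doc>"
)


def collect_text(node):
    """Concatenate every text and CDATA node beneath node, in document order."""
    if node.nodeType in (Node.TEXT_NODE, Node.CDATA_SECTION_NODE):
        return node.data
    return "".join(collect_text(child) for child in node.childNodes)


message = minidom.parseString(XML).getElementsByTagName("message").item(0)

first = message.firstChild.nodeValue  # only the text node before <br/>
whole = collect_text(message)         # every text node under <message>

print(first)  # Hi
print(whole)  # HiMy name is Jerry
```

Note that the element itself contributes no text, so the two runs are joined with nothing between them. If the line break matters, the walker would need a special case for that element.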

Johannes Koch · Feb 21, 2005

Andy said:
To get "the whole text" is a common requirement, but not particularly
meaningful in a pure XML sense. So it's not part of the standard DOM.

In DOM3 Core (W3C Recommendation since 07 April 2004) there is
textContent
<http://www.w3.org/TR/2004/REC-DOM-Level-3-Core-20040407/core.html#Node3-textContent>.
But I don't know about its implementation in XML parsers.

Martin Honnen · Feb 21, 2005

Johannes Koch wrote:

In DOM3 Core (W3C Recommendation since 07 April 2004) there is
textContent
<http://www.w3.org/TR/2004/REC-DOM-Level-3-Core-20040407/core.html#Node3-textContent>.
But I don't know about its implementation in XML parsers.

The XML parser in Java 1.5 (alias Java 5) has support for that, and I
think it is based on Xerces Java from Apache.
Mozilla has no DOM Level 3 Core support in general but has textContent
support.
Not sure whether the Xerces C++ that the OP uses with Perl is also up to
DOM Level 3 Core.
