How to parse an XML doc with HTML tags within the text

Francesco Moi

Hi.

I must parse this XML document:
--------------
<doc>
<item>
<name>Jerry</name>
<message>Hi<br>My name is Jerry</message>
</item>
</doc>
--------

When I try to get the 'message' value by using:
getElementsByTagName('message')->item(0)->getFirstChild->getNodeValue;

I get only:
Hi

Any suggestion to get the whole text? I'm using Xerces+Perl.
Thank you very much.
 
Martin Honnen

Francesco Moi wrote:

I must parse this XML document:
--------------
<doc>
<item>
<name>Jerry</name>
<message>Hi<br>My name is Jerry</message>
</item>
</doc>

That is not XML as it is not well-formed; there needs to be a closing
</br> tag.

When I try to get the 'message' value by using:
getElementsByTagName('message')->item(0)->getFirstChild->getNodeValue;

I get only:
Hi

That is odd: if you really parse with an XML parser, then you shouldn't
get to a DOM at all; parsing should throw an error.
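
As a quick check of that claim, here is a minimal sketch in Python rather than Xerces+Perl (an assumption made purely for illustration; the standard-library xml.dom.minidom is built on the expat parser). The helper name is_well_formed is hypothetical. It shows a conforming parser refusing the document with the unclosed <br> and accepting the same document with <br/>.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError

# The document from the question, with the unclosed <br>.
BROKEN = ("<doc><item><name>Jerry</name>"
          "<message>Hi<br>My name is Jerry</message></item></doc>")
# The same document with an empty-element tag instead.
FIXED = BROKEN.replace("<br>", "<br/>")


def is_well_formed(xml_text):
    """Return True if the parser accepts xml_text, False if it raises."""
    try:
        minidom.parseString(xml_text)
        return True
    except ExpatError:
        return False


print(is_well_formed(BROKEN))  # False: </message> arrives while <br> is still open
print(is_well_formed(FIXED))   # True
```

If a parser hands back a DOM for the first document, it is probably running in some lenient or HTML mode rather than as a strict XML parser.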
 
Andy Dingley

I must parse this XML document:
--------------
<doc>
<item>
<name>Jerry</name>
<message>Hi<br>My name is Jerry</message>
</item>
</doc>
--------

That's not a well-formed XML document.

I assume that <message> is from your own schema, and that you want to
embed some HTML fragment within it. At this point I usually start
wondering if I can use RSS instead, and save myself a lot of effort.

Your failure here is that the HTML fragment isn't a well-formed XML
fragment. You have several choices:

- Use XHTML instead of HTML. This _might_ work, but you still need to
only include balanced and well-formed fragments. If it's generated
within your own system it might be workable, but it's not a general
solution to reading other people's content (which will always break
sometime).

- Write a parser that can handle tag soup. This is what you need to do
when reading other people's RSS feeds, because they're so often
malformed.

- Use HTML, but mangle it into well-formed XML (i.e. <br> becomes
<br />). This is ugly, worse than using XHTML and has nothing to
commend it.

- Embed the HTML into the XML, either by encoding it, or by using a
CDATA section.
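
The last choice can be sketched roughly as follows. This is a Python illustration using the standard-library xml.dom.minidom, not Xerces+Perl, and the names HTML_FRAGMENT, cdata_doc, escaped_doc and message_text are made up for the example. Both variants keep the HTML as opaque character data, so the XML parser never sees a <br> element at all.

```python
from xml.dom import minidom
from xml.sax.saxutils import escape

HTML_FRAGMENT = "Hi<br>My name is Jerry"

# Variant 1: wrap the fragment in a CDATA section.
# (This breaks if the fragment itself contains "]]>".)
cdata_doc = "<doc><message><![CDATA[" + HTML_FRAGMENT + "]]></message></doc>"

# Variant 2: encode the markup characters (&, <, >) as entity references.
escaped_doc = "<doc><message>" + escape(HTML_FRAGMENT) + "</message></doc>"


def message_text(xml_text):
    """Join the text and CDATA children of the first <message> element."""
    message = minidom.parseString(xml_text).getElementsByTagName("message")[0]
    wanted = (minidom.Node.TEXT_NODE, minidom.Node.CDATA_SECTION_NODE)
    return "".join(c.data for c in message.childNodes if c.nodeType in wanted)


print(message_text(cdata_doc))    # Hi<br>My name is Jerry
print(message_text(escaped_doc))  # Hi<br>My name is Jerry
```

Either way the consumer gets the original HTML string back and has to hand it to an HTML parser if it wants structure.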


Read the infamous RSS versions note
http://diveintomark.org/archives/2004/02/04/incompatible-rss
It gives some useful background on these issues.

As well as tag / element formation issues, watch out for HTML entity
references that aren't in core XML (like &eacute;) and for embedded
CDATA sections too.
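
A small sketch of the entity problem, again in Python's standard-library xml.dom.minidom as a stand-in for whatever parser you use (parse_error, NAMED and NUMERIC are hypothetical names). Without a DTD that declares it, &eacute; is an undefined entity and the parse fails; a numeric character reference for the same character is always allowed.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError


def parse_error(xml_text):
    """Return the parser's error message, or None if parsing succeeds."""
    try:
        minidom.parseString(xml_text)
        return None
    except ExpatError as exc:
        return str(exc)


NAMED = "<message>caf&eacute;</message>"   # HTML entity, not declared in XML
NUMERIC = "<message>caf&#233;</message>"   # numeric character reference

print(parse_error(NAMED) is not None)  # True: undefined entity is rejected
print(parse_error(NUMERIC))            # None

text = minidom.parseString(NUMERIC).documentElement.firstChild.data
print(text == "caf\u00e9")             # True
```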
 
Malte

Francesco said:
Hi.

I must parse this XML document:
--------------
<doc>
<item>
<name>Jerry</name>
<message>Hi<br>My name is Jerry</message>
</item>
</doc>
--------

When I try to get the 'message' value by using:
getElementsByTagName('message')->item(0)->getFirstChild->getNodeValue;

I get only:
Hi

Any suggestion to get the whole text? I'm using Xerces+Perl.
Thank you very much.

We have an application that outputs this kind of rubbish (rubbish being
!xhtml ;-)).
I had to take out all the unbalanced tags before being able to parse the
results.
Much easier, if you can enforce xhtml, IMHO.
 
francescomoi

Sorry, it's a <br/> instead of <br>.
-----------------------
<doc>
<item>
<name>Jerry</name>
<message>Hi<br/>My name is Jerry</message>
</item>
</doc>
----------------------
 
William Park

Sorry, it's a <br/> instead of <br>.
-----------------------
<doc>
<item>
<name>Jerry</name>
<message>Hi<br/>My name is Jerry</message>
</item>
</doc>
----------------------

sed 's,<br/>,,g'
 
Andy Dingley

Sorry, it's a <br/> instead of <br>.

It's not a parsing problem either, it's a DOM problem.

"Hi" is the first child of <message>, that's what you asked for,
that's what you got.

item(0) & getFirstChild are effectively duplicates here. So instead
of getting the content of the first <message>, you're getting the
first member (one text node) of this content.

To get "the whole text" is a common requirement, but not particularly
meaningful in a pure XML sense. So it's not part of the standard DOM.
You can usually use a .text property, or else you'll have to iterate /
collect all the text nodes yourself and concatenate them.
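
The "collect and concatenate" approach looks roughly like this. It is a sketch in Python's standard-library xml.dom.minidom, whose method names follow the W3C DOM, and not the Xerces+Perl API from the question; collect_text is a hypothetical helper name.

```python
from xml.dom import minidom

XML = ("<doc><item><name>Jerry</name>"
       "<message>Hi<br/>My name is Jerry</message></item></doc>")


def collect_text(node):
    """Concatenate every text node below node, in document order."""
    parts = []
    for child in node.childNodes:
        if child.nodeType in (child.TEXT_NODE, child.CDATA_SECTION_NODE):
            parts.append(child.data)
        elif child.nodeType == child.ELEMENT_NODE:
            # Descend into child elements; an empty <br/> contributes nothing.
            parts.append(collect_text(child))
    return "".join(parts)


message = minidom.parseString(XML).getElementsByTagName("message")[0]

print(message.firstChild.nodeValue)  # Hi
print(collect_text(message))         # HiMy name is Jerry
```

Note that <message> has three children here (text, <br/>, text), and the two text nodes are joined with nothing in between. If the line break matters, the loop would have to emit something for <br/> itself.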
 
Martin Honnen

Johannes Koch wrote:

In DOM3 Core (W3C Recommendation since 07 April 2004) there is
textContent
<http://www.w3.org/TR/2004/REC-DOM-Level-3-Core-20040407/core.html#Node3-textContent>.
But I don't know about its implementation in XML parsers.

The XML parser in Java 1.5 (alias Java 5) has support for that, and I
think it is based on Xerces Java from Apache.
Mozilla has no DOM Level 3 Core support in general but has textContent
support.
Not sure whether the Xerces C++ that the OP uses with Perl is also up to
DOM Level 3 Core.
 
