How to parse an XML doc with HTML tags within the text

Discussion in 'XML' started by Francesco Moi, Feb 20, 2005.

  1. Hi.

    I must parse this XML document:
    --------------
    <doc>
    <item>
    <name>Jerry</name>
    <message>Hi<br>My name is Jerry</message>
    </item>
    </doc>
    --------

    When I try to get the 'message' value by using:
    getElementsByTagName('message')->item(0)->getFirstChild->getNodeValue;

    I get only:
    Hi

    Any suggestion to get the whole text? I'm using Xerces+Perl.
    Thank you very much.
     
    Francesco Moi, Feb 20, 2005
    #1

  2. Francesco Moi wrote:


    > I must parse this XML document:
    > --------------
    > <doc>
    > <item>
    > <name>Jerry</name>
    > <message>Hi<br>My name is Jerry</message>
    > </item>
    > </doc>


    That is not XML as it is not well-formed, there needs to be a closing
    </br> tag.

    > When I try to get the 'message' value by using:
    > getElementsByTagName('message')->item(0)->getFirstChild->getNodeValue;
    >
    > I get only:
    > Hi


    That is odd, if you really parse with an XML parser then you shouldn't
    get to a DOM at all, parsing should throw an error.
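
    A minimal sketch of that behaviour, using Python's xml.dom.minidom as
    a stand-in for Xerces (an assumption; the OP's Perl binding may report
    the error differently). The names BROKEN, FIXED and try_parse are made
    up for illustration. The unclosed <br> should make the parser fail
    before any DOM exists, while the <br/> variant parses.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError

# The document from the question, with the unclosed <br>.
BROKEN = (
    "<doc><item><name>Jerry</name>"
    "<message>Hi<br>My name is Jerry</message></item></doc>"
)
# Same document with the empty-element form <br/>.
FIXED = BROKEN.replace("<br>", "<br/>")


def try_parse(xml_text):
    """Return (document, None) on success or (None, error message) on failure."""
    try:
        return minidom.parseString(xml_text), None
    except ExpatError as err:
        return None, str(err)


for label, text in (("broken", BROKEN), ("fixed", FIXED)):
    doc, error = try_parse(text)
    if doc is None:
        print(label, "-> not well-formed:", error)
    else:
        print(label, "-> parsed, root element is", doc.documentElement.tagName)
```

    If a parser hands back a DOM for the <br> version, it is probably
    running in some lenient or HTML mode rather than as a strict XML
    parser.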

    --

    Martin Honnen
    http://JavaScript.FAQTs.com/
     
    Martin Honnen, Feb 20, 2005
    #2

  3. Andy Dingley (Guest)

    On 20 Feb 2005 06:32:22 -0800, (Francesco Moi)
    wrote:

    >I must parse this XML document:
    >--------------
    ><doc>
    ><item>
    ><name>Jerry</name>
    ><message>Hi<br>My name is Jerry</message>
    ></item>
    ></doc>
    >--------


    That's not a well-formed XML document.

    I assume that <message> is from your own schema, and that you want to
    embed some HTML fragment within it. At this point I usually start
    wondering if I can use RSS instead, and save myself a lot of effort.

    Your failure here is that the HTML fragment isn't a well-formed XML
    fragment. You have several choices:

    - Use XHTML instead of HTML. This _might_ work, but you still need to
    only include balanced and well-formed fragments. If it's generated
    within your own system it might be workable, but it's not a general
    solution to reading other people's content (which will always break
    sometime).

    - Write a parser that can handle tag soup. This is what you need to do
    when reading other people's RSS feeds, because they're so often
    malformed.

    - Use HTML, but mangle it into well-formed XML (e.g. <br> becomes
    <br />). This is ugly, worse than using XHTML and has nothing to
    commend it.

    - Embed the HTML into the XML, either by encoding it, or by using a
    CDATA section.
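
    The last option can be sketched as follows, again with Python's
    xml.dom.minidom standing in for Xerces (an assumption). HTML_FRAGMENT,
    encoded_xml, cdata_xml and message_text are illustrative names only.
    Either way the parser sees the HTML as character data, so the <br>
    no longer has to be well-formed.

```python
from xml.dom import minidom
from xml.sax.saxutils import escape

HTML_FRAGMENT = "Hi<br>My name is Jerry"

# Option A: entity-encode the markup (< becomes &lt;, > becomes &gt;).
encoded_xml = "<doc><message>" + escape(HTML_FRAGMENT) + "</message></doc>"

# Option B: wrap the fragment in a CDATA section.
# Assumes the fragment never contains "]]>", which would end the section early.
cdata_xml = "<doc><message><![CDATA[" + HTML_FRAGMENT + "]]></message></doc>"


def message_text(xml_text):
    """Return the character data held directly inside the first <message>."""
    message = minidom.parseString(xml_text).getElementsByTagName("message")[0]
    # Text and CDATA nodes both expose their content as .data
    return "".join(child.data for child in message.childNodes)


print(message_text(encoded_xml))
print(message_text(cdata_xml))
```

    Both variants hand the original HTML string back to the application,
    which then has to deal with it as HTML.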


    Read the infamous RSS versions note
    http://diveintomark.org/archives/2004/02/04/incompatible-rss
    It gives some useful background on these issues.

    As well as tag / element formation issues, watch out for HTML entity
    references that aren't in core XML (like &eacute;) and for embedded
    CDATA sections too.
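
    A short sketch of the entity problem, with Python's standard library
    as a stand-in for whatever toolchain is actually in use (an
    assumption). The names parses, raw, direct and converted are
    illustrative. It treats the content as plain text: any tags in it
    would be escaped too, not preserved as markup.

```python
import html
from xml.dom import minidom
from xml.parsers.expat import ExpatError
from xml.sax.saxutils import escape


def parses(xml_text):
    """True if the text is accepted by the XML parser, False otherwise."""
    try:
        minidom.parseString(xml_text)
        return True
    except ExpatError:
        return False


raw = "caf&eacute; &amp; bar"

# Pasted in directly: &eacute; is not declared in plain XML.
direct = "<message>" + raw + "</message>"

# Resolve HTML entity names to characters, then re-escape only what XML needs.
converted = "<message>" + escape(html.unescape(raw)) + "</message>"

print("direct parses:   ", parses(direct))
print("converted parses:", parses(converted))
```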

    --
    Smert' spamionam
     
    Andy Dingley, Feb 20, 2005
    #3
  4. Malte (Guest)

    Francesco Moi wrote:
    > Hi.
    >
    > I must parse this XML document:
    > --------------
    > <doc>
    > <item>
    > <name>Jerry</name>
    > <message>Hi<br>My name is Jerry</message>
    > </item>
    > </doc>
    > --------
    >
    > When I try to get the 'message' value by using:
    > getElementsByTagName('message')->item(0)->getFirstChild->getNodeValue;
    >
    > I get only:
    > Hi
    >
    > Any suggestion to get the whole text? I'm using Xerces+Perl.
    > Thank you very much.


    We have an application that outputs this kind of rubbish (rubbish being
    !XHTML ;-)).
    I had to take out all the unbalanced tags before being able to parse the
    results.
    Much easier if you can enforce XHTML, IMHO.
     
    Malte, Feb 20, 2005
    #4
  5. Francesco Moi

    Guest

    Sorry, it's a <br/> instead of <br>.
    -----------------------
    <doc>
    <item>
    <name>Jerry</name>
    <message>Hi<br/>My name is Jerry</message>
    </item>
    </doc>
    ----------------------
     
    , Feb 20, 2005
    #5
  6. William Park (Guest)

    wrote:
    > Sorry, it's a <br/> instead of <br>.
    > -----------------------
    > <doc>
    > <item>
    > <name>Jerry</name>
    > <message>Hi<br/>My name is Jerry</message>
    > </item>
    > </doc>
    > ----------------------


    sed 's,<br/>,,g'

    --
    William Park, Toronto, Canada
    Slackware Linux -- because I can type.
     
    William Park, Feb 21, 2005
    #6
  7. Andy Dingley (Guest)

    On 20 Feb 2005 14:00:03 -0800, wrote:

    >Sorry, it's a <br/> instead of <br>.


    It's not a parsing problem either, it's a DOM problem.

    "Hi" is the first child of <message>, that's what you asked for,
    that's what you got.

    item(0) selects the first <message> element, and getFirstChild then
    selects only the first child of that element. So instead of getting
    the whole content of the first <message>, you're getting the first
    member (one text node) of this content.
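
    To make the node structure concrete, here is a small sketch in Python's
    xml.dom.minidom (an assumption: it is used here only because it exposes
    the same DOM calls; Xerces should build an equivalent tree). XML,
    message and children are illustrative names.

```python
from xml.dom import minidom

XML = (
    "<doc><item><name>Jerry</name>"
    "<message>Hi<br/>My name is Jerry</message></item></doc>"
)

message = minidom.parseString(XML).getElementsByTagName("message")[0]

# Describe each child: text nodes by their data, elements by their tag name.
children = [
    ("text", node.data) if node.nodeType == node.TEXT_NODE
    else ("element", node.tagName)
    for node in message.childNodes
]

print(children)
# Equivalent of getFirstChild->getNodeValue: only the first of three children.
print(message.firstChild.nodeValue)
```

    <message> has three children here: a text node, the <br/> element and
    a second text node, so the first child alone can only ever yield "Hi".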

    To get "the whole text" is a common requirement, but not particularly
    meaningful in a pure XML sense. So it's not part of the standard DOM.
    You can usually use a .text property, or else you'll have to iterate /
    collect all the text nodes yourself and concatenate them.
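
    The iterate-and-concatenate approach might look like this; a sketch in
    Python's xml.dom.minidom rather than Xerces+Perl (an assumption), with
    collect_text as a made-up helper name. It walks the subtree and joins
    every text and CDATA node in document order.

```python
from xml.dom import minidom


def collect_text(node):
    """Concatenate all text and CDATA data below `node`, in document order."""
    parts = []
    for child in node.childNodes:
        if child.nodeType in (child.TEXT_NODE, child.CDATA_SECTION_NODE):
            parts.append(child.data)
        elif child.nodeType == child.ELEMENT_NODE:
            # Recurse so text inside nested elements is included too.
            parts.append(collect_text(child))
    return "".join(parts)


XML = (
    "<doc><item><name>Jerry</name>"
    "<message>Hi<br/>My name is Jerry</message></item></doc>"
)
message = minidom.parseString(XML).getElementsByTagName("message")[0]
print(collect_text(message))
```

    Note that the empty <br/> contributes nothing, so the two pieces of
    text run together. If the line break matters, the walk would need to
    emit a separator when it meets a <br> element.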

    --
    Die Gotterspammerung - Junkmail of the Gods
     
    Andy Dingley, Feb 21, 2005
    #7
  8. Andy Dingley wrote:

    > To get "the whole text" is a common requirement, but not particularly
    > meaningful in a pure XML sense. So it's not part of the standard DOM.


    In DOM3 Core (W3C Recommendation since 07 April 2004) there is
    textContent
    <http://www.w3.org/TR/2004/REC-DOM-Level-3-Core-20040407/core.html#Node3-textContent>.
    But I don't know about its implementation in XML parsers.

    --
    Johannes Koch
    In te domine speravi; non confundar in aeternum.
    (Te Deum, 4th cent.)
     
    Johannes Koch, Feb 21, 2005
    #8
  9. Johannes Koch wrote:


    > In DOM3 Core (W3C Recommendation since 07 April 2004) there is
    > textContent
    > <http://www.w3.org/TR/2004/REC-DOM-Level-3-Core-20040407/core.html#Node3-textContent>.
    > But I don't know about its implementation in XML parsers.


    The XML parser in Java 1.5 (alias Java 5) has support for that, and I
    think it is based on Xerces Java from Apache.
    Mozilla has no DOM Level 3 Core support in general but has textContent
    support.
    Not sure whether the Xerces C++ that the OP uses with Perl is also up to
    DOM Level 3 Core.


    --

    Martin Honnen
    http://JavaScript.FAQTs.com/
     
    Martin Honnen, Feb 21, 2005
    #9