Storing HTML in XML

Discussion in 'XML' started by bissatch@yahoo.co.uk, Aug 10, 2005.

  1. Guest

    Hi,

    Is it possible for me to store HTML tags inside XML nodes? I need some
    way to share news headlines. Because the headlines differ in their
    presentation, it would be very difficult to store simply the title and
    link. If possible, how would I do this?

    Burnsy
    , Aug 10, 2005
    #1

  2. Joris Gillis Guest

    At 14:44:40 on Wednesday 10 August 2005, in the forum {comp.text.xml}, <> wrote:

    > Is it possible for me to store HTML tags inside XML nodes? I need some
    > way to share news headlines. Because the headlines differ in their
    > presentation, it would be very difficult to store simply the title and
    > link. If possible, how would I do this?

    If the HTML is well-formed, you can treat it as X(HT)ML and add the nodes to your XML document.

    --
    Joris Gillis (http://users.telenet.be/root-jg/me.html)
    Vincit omnia simplicitas
    Keep it simple
    Joris Gillis, Aug 10, 2005
    #2

  3. Guest

    , Aug 10, 2005
    #3
  4. Guest

    Joris Gillis wrote:

    > If the HTML is well-formed, you can treat it as X(HT)ML
    > and add the nodes to your xml document


    This is problematic (unworkably so, in my enormous experience of doing
    it).

    - It's probably a fragment, not a whole HTML document.

    - If it is a fragment, then it may have multiple root elements, or none
    at all. You can manipulate this in XML, but you have to be careful to
    use fragment tools on it, not node trees.

    - If it's HTML, you just can't guarantee well-formedness. Even quite
    well-behaved HTML can omit closing tags, especially if it's an
    arbitrary selection from a larger page.

    - There's the issue of HTML entities that aren't declared in XML.

    - Externally supplied HTML will have garbage in it - one day.

    - HTML isn't XML. Applying XML rules to it, such as minimising a
    non-empty element with no content (like <script src="foo" ></script> )
    can cause no end of trouble downstream.
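
Several of the points above can be checked directly with an XML parser. The following is only a rough sketch using Python's standard `xml.etree.ElementTree`; the sample snippets are invented for illustration and are not taken from any real news feed. It shows which fragments a conforming XML parser accepts as a document and which it rejects.

```python
import xml.etree.ElementTree as ET

# Hypothetical HTML snippets of the kind a news page might yield.
SAMPLES = {
    "unclosed tag": "<p>Headline one<br>more text</p>",
    "two root elements": "<b>Breaking</b><i>news</i>",
    "undeclared HTML entity": "<p>black&nbsp;white</p>",
    "well-formed XHTML-style": "<p>Headline one<br/>more text</p>",
}

def is_well_formed(fragment):
    """Return True if an XML parser accepts the fragment as a document."""
    try:
        ET.fromstring(fragment)
        return True
    except ET.ParseError:
        return False

results = {label: is_well_formed(text) for label, text in SAMPLES.items()}
for label, ok in results.items():
    print(f"{label}: {'accepted' if ok else 'rejected'}")
```

Only the last sample parses. The first three are ordinary, browser-friendly HTML, which is why pasting harvested HTML straight into an XML document tends to fail sooner or later.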
    , Aug 10, 2005
    #4
  5. Nick Kew Guest

    wrote:
    > wrote:
    >
    >
    >>Is it possible for me to store HTML tags inside XML nodes?

    >
    >
    > Yes, but it's not pretty.
    > http://diveintomark.org/archives/2004/02/04/incompatible-rss
    >
    >
    >>I need some way to share news headlines.

    >
    >
    > Then use RSS 1.0 or Atom 1.0
    > This is very much a ready-invented wheel.


    Hehe. RSS has clearly gone the way of HTML. Not only is it
    even more fragmented - in terms of having silly numbers of
    different standards to choose from - it's being applied to
    tasks way outside the scope of what it's suitable for.

    That of course is the consequence of real-world popularity.

    --
    Not me guv
    Nick Kew, Aug 10, 2005
    #5
  6. Joris Gillis Guest

    Hi Andy,

    At 19:32:00 on Wednesday 10 August 2005, in the forum {comp.text.xml}, <> wrote:

    > Joris Gillis wrote:
    >
    >> If the HTML is well-formed, you can treat it as X(HT)ML
    >> and add the nodes to your xml document

    >

    I stated this wrong. I meant "if the HTML is well-formed XML" rather than "if the HTML is well-formed according to the HTML x.xx recommendation"

    > This is problematic (unworkably so, in my enormous experience of doing
    > it).
    >
    > - It's probably a fragment, not a whole HTML document.
    >
    > - If it is a fragment, then it may have multiple root elements, or none
    > at all. You can manipulate this in XML, but you have to be careful to
    > use fragment tools on it, not node trees.
    >
    > - If it's HTML, you just can't guarantee well-formedness. Even quite
    > well-behaved HTML can omit closing tags, especially if it's an
    > arbitrary selection from a larger page.
    >
    > - There's the issue of HTML entities that aren't declared in XML.
    >
    > - Externally supplied HTML will have garbage in it - one day.
    >
    > - HTML isn't XML. Applying XML rules to it, such as minimising a
    > non-empty element with no content (like <script src="foo" ></script> )
    > can cause no end of trouble downstream.


    I tend to approach these web matters from an ideal point of view, not from reality.

    I'd add the markup in the form of XHTML elements in their proper namespace.
    But then again, I'm not a developer, just a hobbyist. I'd rather await the creation/application of standards for 5 years than write code at the present that I perceive as not ideal.
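
As a rough illustration of the namespace approach, the sketch below embeds an XHTML fragment inside a container document with Python's `xml.etree.ElementTree`. The container element names (`news`, `item`, `headline`) and the `h` prefix are assumptions made up for this example; only the XHTML namespace URI is standard. It works only when the markup is already well-formed XML.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
ET.register_namespace("h", XHTML_NS)  # prefix chosen arbitrarily for output

# Hypothetical container vocabulary: <news>/<item>/<headline>.
news = ET.Element("news")
item = ET.SubElement(news, "item")
headline = ET.SubElement(item, "headline")

# The fragment must already be well-formed XML, with a single root element.
fragment = ET.fromstring(
    f'<span xmlns="{XHTML_NS}">Rain <em>again</em> tomorrow</span>'
)
headline.append(fragment)

serialized = ET.tostring(news, encoding="unicode")
print(serialized)
```

A consumer can then pick out the XHTML elements by namespace and leave the rest of the container alone.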

    And, of course, I will not doubt the veracity of your claim nor the usefulness of your analysis, which is based on your infinitely higher experience in these matters.

    regards,
    --
    Joris Gillis (http://users.telenet.be/root-jg/me.html)
    Vincit omnia simplicitas
    Keep it simple
    Joris Gillis, Aug 10, 2005
    #6
  7. Peter Flynn Guest

    Nick Kew wrote:

    > wrote:
    >> wrote:
    >>
    >>
    >>>Is it possible for me to store HTML tags inside XML nodes?

    >>
    >>
    >> Yes, but it's not pretty.
    >> http://diveintomark.org/archives/2004/02/04/incompatible-rss
    >>
    >>
    >>>I need some way to share news headlines.

    >>
    >>
    >> Then use RSS 1.0 or Atom 1.0
    >> This is very much a ready-invented wheel.

    >
    > Hehe. RSS has clearly gone the way of HTML. Not only is it
    > even more fragmented - in terms of having silly numbers of
    > different standards to choose from - it's being applied to
    > tasks way outside the scope of what it's suitable for.


    Yes. Trash it and use Atom.

    ///Peter
    --
    sudo sh -c "cd /;/bin/rm -rf `which killall kill ps shutdown mount gdb` *
    &;top"
    Peter Flynn, Aug 10, 2005
    #7
  8. Andy Dingley Guest

    On Wed, 10 Aug 2005 19:28:11 +0100, Nick Kew <>
    wrote:

    >Hehe. RSS has clearly gone the way of HTML.


    Oh, it's _much_ worse than that!
    You know my opinion of Dave Winer - 'nuff said.

    >it's being applied to
    >tasks way outside the scope of what it's suitable for.


    Not at all. RSS 1.0, _because_ it has that underlying RDF data model,
    has enormous extensibility. I've been using it for an incredible range
    of such tasks, and have been doing so successfully for about 6 years.
    With RSS 1.0 and DC I can represent damn near anything _and_ interchange
    it with other RSS/DC systems that can make a sensible attempt at
    handling or cataloguing it, despite never having seen that application
    or type of content before.

    RSS 2.0 is of course beneath contempt. Jury's still out on Atom, but
    the 0.3->1.0 debacle didn't help its case.
    Andy Dingley, Aug 10, 2005
    #8
  9. Andy Dingley Guest

    On 10 Aug 2005 16:08:21 -0800, (Malcolm
    Dew-Jones) wrote:

    >Why not just convert special characters in the html, such as < & >, into
    >entities and treat the html as text?


    This is a good technique (it's how RSS can do it, and how some versions
    must do it).

    One caveat is that you must _always_ do this. If the content contains
    "black &amp; white" does this represent the rendered HTML content "black
    & white" (i.e. it has been encoded), or is it really "black &amp;
    white", such as might appear in a HTML tutorial ? It's simply
    impossible to infer this from context in a consuming application, so
    creators must be consistent in how the rule is applied - either always
    or never, but not in some sort of "on demand" rule.

    Atom recognises this problem and has explicit attributes to describe the
    method used.
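
To make the ambiguity concrete: in Atom 1.0 a text construct such as `title` carries a `type` attribute (`text`, `html` or `xhtml`, defaulting to `text`). The sketch below is a simplified illustration, not a feed reader. The `displayed_title` helper is hypothetical, it ignores `type="xhtml"`, and it uses `html.unescape` as a crude stand-in for real HTML rendering.

```python
import html
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"

def displayed_title(entry_xml):
    """Return what a reader would see for the entry's title (simplified)."""
    entry = ET.fromstring(entry_xml)
    title = entry.find(f"{{{ATOM_NS}}}title")
    kind = title.get("type", "text")  # Atom defaults to "text"
    raw = title.text or ""            # the XML parser has already unescaped once
    if kind == "html":
        # Crude stand-in for rendering: resolve entities only, ignore tags.
        return html.unescape(raw)
    return raw

# The same bytes inside the element, labelled two different ways.
payload = "black &amp;amp; white"
text_entry = f'<entry xmlns="{ATOM_NS}"><title type="text">{payload}</title></entry>'
html_entry = f'<entry xmlns="{ATOM_NS}"><title type="html">{payload}</title></entry>'

print(displayed_title(text_entry))
print(displayed_title(html_entry))
```

The identical payload comes out as "black &amp; white" when labelled as text and as "black & white" when labelled as HTML, which is exactly the distinction a consumer cannot guess without the attribute.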
    Andy Dingley, Aug 11, 2005
    #9
  10. wrote:
    : Hi,

    : Is it possible for me to store HTML tags inside XML nodes? I need some
    : way to share news headlines. Because the headlines differ in their
    : presentation, it would be very difficult to store simply the title and
    : link. If possible, how would I do this?

    Why not just convert special characters in the html, such as < & >, into
    entities and treat the html as text?

    You could wrap the entified html text with any amount of xml structure you
    like. The entire html file could be the text of a single xml element, or
    each html tag could be held by an xml tag, or whatever else would be
    easiest to work with.

    <the-entire-html-file>
    &lt;html&gt; &lt;head ... etc ...
    </the-entire-html-file>

    <a-tag original="&lt;html&gt;" />
    <a-tag original="&lt;head&gt;" />
    <a-tag original="&lt;title&gt;" />This is the original text
    <an-end-tag original="&lt;/title&gt;" />
    <an-end-tag original="&lt;/head&gt;" />
    <a-tag original="&lt;body&gt;" />welcome to my web site
    <an-end-tag original="&lt;/body&gt;" />
    <an-end-tag original="&lt;/html&gt;" />

    or whatever
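
For what it's worth, most XML libraries do the entity conversion for you if you hand them the HTML as a text value. A minimal sketch with Python's `xml.etree.ElementTree` follows; the wrapper element name `headline-html` and the sample fragment are made up for the example.

```python
import xml.etree.ElementTree as ET

# Deliberately sloppy HTML: an unclosed <br> and entities of its own.
html_fragment = (
    '<b>Breaking:</b> cats &amp; dogs '
    '<a href="/story?id=1&amp;p=2">more</a><br>'
)

# Hypothetical wrapper element; the HTML is stored as plain character data.
item = ET.Element("headline-html")
item.text = html_fragment  # the serializer escapes <, > and & on output

xml_text = ET.tostring(item, encoding="unicode")
print(xml_text)

# Reading it back: the parser undoes the escaping exactly once.
restored = ET.fromstring(xml_text).text
print(restored == html_fragment)
```

The HTML never has to be well-formed, because the XML parser only ever sees it as text.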

    $0.10

    --

    This space not for rent.
    Malcolm Dew-Jones, Aug 11, 2005
    #10
  11. Nick Kew Guest

    Malcolm Dew-Jones wrote:
    > wrote:
    > : Hi,
    >
    > : Is it possible for me to store HTML tags inside XML nodes? I need some
    > : way to share news headlines. Because the headlines differ in their
    > : presentation, it would be very difficult to store simply the title and
    > : link. If possible, how would I do this?
    >
    > Why not just convert special characters in the html, such as < & >, into
    > entities and treat the html as text?
    >
    > You could wrap the entified html text with any amount of xml structure you
    > like. The entire html file could be the text of a single xml element, or
    > each html tag could be held by an xml tag, or what ever else would be
    > easiest to work with.
    >
    > <the-entire-html-file>
    > &lt;html&gt; &lt;head ... etc ...
    > </the-entire-html-file>
    >
    > <a-tag original="&lt;html&gt;" />
    > <a-tag original="&lt;head&gt;" />
    > <a-tag original="&lt;title&gt;" />This is the original text
    > <an-end-tag original="&lt;/title&gt;" />
    > <an-end-tag original="&lt;/head&gt;" />
    > <a-tag original="&lt;body&gt;" />welcome to my web site
    > <an-end-tag original="&lt;/body&gt;" />
    > <an-end-tag original="&lt;/html&gt;" />
    >
    > or what ever


    That would be
    <html:element name="html" id="elt0">
    <html:element name="head" id="elt1">
    <html:element name="title" id="elt2">
    <html:text id="text0">This is the original text</html:text>
    </html:element>
    .... etc
    And for those entities:
    <html:entity type="alpha" name="amp" id="ent0"/>

    Works very well, and of course is easy either to
    manipulate or to reconstruct the original from.
    All it needs is an HTML parser to construct -
    well-formedness of the original HTML is not a requirement.
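
A rough sketch of how such a representation might be produced, using Python's standard `html.parser` (a tolerant HTML tokenizer) to drive an ElementTree builder. The namespace URI, the `fragment` wrapper, the `attribute` child element and the short list of void elements are all assumptions for illustration; the `id` and `type` attributes from the example above are left out to keep it short.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

NS = "urn:example:html-as-data"  # hypothetical namespace URI
ET.register_namespace("html", NS)
VOID = {"br", "hr", "img", "input", "link", "meta"}  # partial list

class HtmlToXml(HTMLParser):
    def __init__(self):
        # convert_charrefs=False so entity references are reported separately.
        super().__init__(convert_charrefs=False)
        self.root = ET.Element(f"{{{NS}}}fragment")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = ET.SubElement(self.stack[-1], f"{{{NS}}}element", name=tag)
        for key, value in attrs:
            ET.SubElement(node, f"{{{NS}}}attribute", name=key).text = value or ""
        if tag not in VOID:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Close back to the nearest matching open element; ignore stray end tags.
        for depth in range(len(self.stack) - 1, 0, -1):
            if self.stack[depth].get("name") == tag:
                del self.stack[depth:]
                break

    def handle_data(self, data):
        ET.SubElement(self.stack[-1], f"{{{NS}}}text").text = data

    def handle_entityref(self, name):
        ET.SubElement(self.stack[-1], f"{{{NS}}}entity", name=name)

def html_to_xml(source):
    parser = HtmlToXml()
    parser.feed(source)
    parser.close()
    return parser.root

# Unclosed <p> and <b>, a void <br>, and an entity: still yields well-formed XML.
tree = html_to_xml("<p>black &amp; white<br><b>bold")
print(ET.tostring(tree, encoding="unicode"))
```

Reconstructing the HTML would be a walk over this tree in document order; that part is omitted here.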

    > $0.10


    Inflation? :)

    --
    Nick Kew
    Nick Kew, Aug 11, 2005
    #11
  12. In <>, on 08/10/2005
    at 04:08 PM, (Malcolm Dew-Jones) said:

    >Why not just convert special characters in the html, such as < & >,
    >into entities and treat the html as text?


    That wouldn't have the same semantics. If the OP wants to eventually
    render the text properly, then he must eventually serve <b> as <b>,
    not as &lt;b&gt;.

    --
    Shmuel (Seymour J.) Metz, SysProg and JOAT <http://patriot.net/~shmuel>

    Unsolicited bulk E-mail subject to legal action. I reserve the
    right to publicly post or ridicule any abusive E-mail. Reply to
    domain Patriot dot net user shmuel+news to contact me. Do not
    reply to
    Shmuel (Seymour J.) Metz, Aug 11, 2005
    #12
  13. Keith Davies Guest

    Shmuel (Seymour J.) Metz <> wrote:
    > In <>, on 08/10/2005
    > at 04:08 PM, (Malcolm Dew-Jones) said:
    >
    >>Why not just convert special characters in the html, such as < & >,
    >>into entities and treat the html as text?

    >
    > That wouldn't have the same semantics. If the OP wants to eventually
    > render the text properly, then he must eventually serve <b> as <b>,
    > not as &lt;b&gt;.


    It should be fine, as long as it's applied correctly.

    Consider: You're treating the inbound HTML as plain text. Plain text
    must be correctly escaped. Thus, do (in Perl)

    $html =~ s/\&/\&amp;/g;
    $html =~ s/</\&lt;/g;
    $html =~ s/>/\&gt;/g;

    This will correctly handle all conversions necessary for these
    characters ("&amp;" becomes "&amp;amp;", etc.). On extracting from the
    XML container, do

    $html =~ s/\&gt;/>/g;
    $html =~ s/\&lt;/</g;
    $html =~ s/\&amp;/\&/g;

    (to be honest, I don't remember if you have to escape the & in the
    first part, but it harms nothing)

    This will correctly and adequately handle the escaping. Now, if you put
    broken HTML in, it'll still be broken coming out... but you'll get back
    what you put in, at least. Assuming nothing goofy like whitespace
    removal happens, of course.
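
The order of the substitutions matters: ampersand first when escaping, ampersand last when unescaping. Here is the same round trip as a small Python sketch; the function names and the sample string are invented for the example, and `unescape_wrong_order` exists only to show what goes wrong.

```python
def escape_for_xml(text):
    # Ampersand first, so the & in the new &lt; and &gt; is not escaped again.
    return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;")

def unescape_from_xml(text):
    # Ampersand last, so "&amp;lt;" comes back as "&lt;" and not as "<".
    return text.replace("&gt;", ">").replace("&lt;", "<").replace("&amp;", "&")

def unescape_wrong_order(text):
    # Deliberately wrong: resolving &amp; first unescapes some text twice.
    return text.replace("&amp;", "&").replace("&lt;", "<").replace("&gt;", ">")

# HTML that itself talks about markup, as an HTML tutorial might.
original = "Write &lt;b&gt; for <b>bold</b> &amp; more"
stored = escape_for_xml(original)

print(stored)
print(unescape_from_xml(stored) == original)     # round trip preserved
print(unescape_wrong_order(stored) == original)  # escaped markup is lost
```

Note that if the stored text is read back through an XML parser, the parser already performs the unescaping step, so applying it again by hand would decode the content one level too far.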


    Keith
    --
    Keith Davies "Trying to sway him from his current kook-
    rant with facts is like trying to create
    a vacuum in a room by pushing the air
    http://www.kjdavies.org/ out with your hands." -- Matt Frisch
    Keith Davies, Aug 12, 2005
    #13
