Converting HTML elements into XML/RSS

Discussion in 'XML' started by mickjames@gmail.com, Jan 6, 2005.

  1. Guest

    Hi,

    I'd like to include the whole web page content (as opposed to just the
    headlines) into RSS/XML to enable people to read them via rss feed
    readers.

    Question: how to convert HTML elements such as href, img, b, p, etc
    into XML?
    I've seen someone use the following in their RSS feed but I don't like
    it because <pre> doesn't produce a nice format:

    <content:encoded><![CDATA[
    <PRE>
    blah blah blah..

    Here is a sample HTML code. What would be the best way to put it into
    XML, more specifically, convert those HTML elements.

    ----------------
    <b>CAESAR</b> Et tu, Brute! Then fall,
    <a
    href=http://www.epilepsiemuseum.de/raum6/caesar.jpg>Caesar</a>.<br>
    Dies
    <p>
    <b>CINNA</b> Liberty! Freedom! Tyranny is dead!
    Run hence, proclaim, cry it about the streets.
    <a href=http://www.shakespeare-online.com/>Read more</a>.
    -----------------

    Thanks for all the help!

    Mick James
     
    , Jan 6, 2005
    #1

  2. Andy Dingley Guest

    Andy Dingley, Jan 6, 2005
    #2

  3. Guest

    Thanks.

    So all the HTML needs to be enclosed in <description> and tags need to
    be escaped with &lt; and &gt;?
     
    , Jan 6, 2005
    #3
  4. Nick Kew Guest

    In article <>,
    writes:

    > I'd like to include the whole web page content (as opposed to just the
    > headlines) into RSS/XML to enable people to read them via rss feed
    > readers.


    Uh, that's a lot of content for what users are expecting to be a summary.
    Why use a feed if it doesn't save your users anything?

    > Question: how to convert HTML elements such as href, img, b, p, etc
    > into XML?


    Bearing in mind the above, freely mix it, just using namespaces to
    distinguish the elements. Since you're already breaking the purpose
    of a feed, working normally with conventional client software presumably
    isn't an issue.

    > Here is a sample HTML code. What would be the best way to put it into


    Looks more like tag-soup to me.

    --
    Nick Kew
     
    Nick Kew, Jan 7, 2005
    #4
  5. Guest

    Thanks for your reply. Yes, I understand that RSS is meant for summary,
    not the whole content, but a lot of readers ask for the whole thing.
    I imagine they prefer to read in an RSS feed reader rather than in
    a web browser.

    One question I didn't get the answer to in all my searching is: how to
    code HTML tags such as href, img, p, b, etc when converting an HTML
    page to .rss page?

    Putting everything in CDATA or is there a better way?
    A short example would be helpful.

    Thanks a lot!
     
    , Jan 7, 2005
    #5
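A minimal sketch of the CDATA approach asked about above, assuming Python on the Linux side. The helper name `cdata_section` is made up for illustration. The one trap with CDATA is that the sequence `]]>` ends the section, so any occurrence in the HTML has to be split across two sections. The fragment is taken from the sample in the first post and is passed through unchanged, unquoted attribute included.

```python
import xml.etree.ElementTree as ET


def cdata_section(text):
    """Wrap text in a CDATA section (hypothetical helper).

    "]]>" cannot appear inside CDATA, so it is split across two sections.
    """
    return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>"


# Fragment from the sample HTML in the first post.
html_fragment = (
    "<b>CINNA</b> Liberty! Freedom! Tyranny is dead! "
    "<a href=http://www.shakespeare-online.com/>Read more</a>."
)

item_xml = (
    "<item><description>"
    + cdata_section(html_fragment)
    + "</description></item>"
)

# An XML parser hands the consumer the original HTML as plain character data.
recovered = ET.fromstring(item_xml).find("description").text
```

To an XML parser, CDATA and entity escaping produce the same character data, so this is a matter of convenience rather than a different encoding.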
  6. Andy Dingley Guest

    On 6 Jan 2005 15:15:54 -0800, wrote:

    >So all the HTML needs to be enclosed in <description> and tags need to
    >be escaped with &lt; and &gt;?


    Yes. Ampersands might also cause problems and should already have been
    escaped, but it's common in HTML that they aren't.

    You should also "fix" any entity references that are in the HTML,
    such as &eacute; or &nbsp;. This needs to be done whether there are
    tags involved or not - they're one of the most common intermittent
    reasons for an RSS feed to become invalid. Such entities are defined
    in HTML, but aren't already defined in XML or RSS.

    "Fixing" them can be either replacing the initial ampersand with &amp;
    or replacing the "named" form of the entity reference with the
    corresponding numeric form. The numeric form is probably best to use,
    because that will render correctly even if the consumer doesn't
    properly expand the encoded entities.

    --
    Smert' spamionam
     
    Andy Dingley, Jan 7, 2005
    #6
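The two steps described in the post above (turn named entity references into numeric ones, then escape the markup) can be sketched in Python with the standard library alone. The function names `named_to_numeric` and `html_to_description_text` are invented for this example. A name that Python's entity table does not know is left alone and simply has its ampersand escaped.

```python
import re
from html.entities import name2codepoint
from xml.sax.saxutils import escape

# Matches named references such as &eacute; or &nbsp; (not numeric ones).
NAMED_ENTITY = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def named_to_numeric(html_text):
    """Replace HTML named entity references with numeric references."""
    def repl(match):
        codepoint = name2codepoint.get(match.group(1))
        if codepoint is None:
            # Unknown name: keep it as-is; escape() handles the ampersand.
            return match.group(0)
        return "&#%d;" % codepoint
    return NAMED_ENTITY.sub(repl, html_text)


def html_to_description_text(html_text):
    """Return text that can be placed inside an RSS <description> element."""
    # escape() replaces &, < and > with &amp;, &lt; and &gt;.
    return escape(named_to_numeric(html_text))


encoded = html_to_description_text("<b>Caf&eacute;</b>&nbsp;R&D")
```

A stray ampersand such as the one in "R&D" comes out as `&amp;`, so the feed stays well-formed even when the incoming HTML is sloppy.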
  7. Andy Dingley Guest

    On Fri, 7 Jan 2005 01:25:36 +0000, (Nick Kew)
    wrote:

    >Why use a feed if it doesn't save your users anything?


    Why do you assume the function of my RSS feed ? I've built many
    feeds that are anything but "newsfeeds". I think my record was 20MB
    content size in a <description> element, for a very
    application-specific intranet task. However it's still perfectly
    compliant RSS 1.0

    >> Question: how to convert HTML elements such as href, img, b, p, etc
    >> into XML?

    >
    >Bearing in mind the above, freely mix it, just using namespaces to
    >distinguish the elements.


    You can't use namespacing, because the content is HTML rather than
    XHTML. Apart from the standards-based argument and the fact that
    namespacing just doesn't make sense for HTML, it's also impractical to
    expect the incoming HTML content to be well-formed as an XML fragment
    (or even valid HTML!).

    Remember that RSS is a _feed_, not a one-off document (I wish Winer
    would recognise this). Like all layered protocols you have to be very
    careful that your implementations are not only correct for one
    demonstration example, they have to be demonstrably correct for all
    possible inputs.


    > Since you're already breaking the purpose of a feed,


    Rubbish. RSS does _NOT_ define any notion of "purpose", or what's
    "appropriate" to use it for. Besides which, the notion of content
    encoding HTML fragments within the <description> element is very well
    established.


    --
    Smert' spamionam
     
    Andy Dingley, Jan 7, 2005
    #7
  8. Nick Kew Guest

    In article <>,
    writes:
    > One imagines, they prefer to read using an rss feed reader instead of
    > using a web browser.


    Hmmm. I think it should be the job of the Client to present it
    sensibly. An RSS feed is to the Web as a newsgroup or mail folder
    listing (from, subject, date) is to Usenet or Email. IMHO.

    (you've presumably seen how Opera presents RSS feeds?)

    > One question I didn't get the answer to in all my searching is: how to
    > code HTML tags such as href, img, p, b, etc when converting an HTML
    > page to .rss page?


    The core Site Valet tools offer options to present reports as RDF.
    Since these are markup analysis tools, the more verbose options
    embed the original markup, so all system messages can be properly
    referenced to it. This uses a namespace to describe it, and
    looks a little like XSLT with things like:
    <ml:element name="a">
    <ml:attribute name="href">foo</ml:attribute>

    > Putting everything in CDATA or is there a better way?
    > A short example would be helpful.


    I don't think the above reply is really relevant to your question:
    I was solving a different problem! But you already have Andy's reply.

    --
    Nick Kew
     
    Nick Kew, Jan 7, 2005
    #8
  9. Colin Guest

    Hey,

    >I'd like to include the whole web page content (as opposed to just the
    >headlines) into RSS/XML to enable people to read them via rss feed
    >readers.
    >
    >Question: how to convert HTML elements such as href, img, b, p, etc
    >into XML?


    Why don't you just use software to create the feed? It will do the
    conversion for you so that you don't have to worry about it. There are
    a couple of options; I know FeedForAll http://www.feedforall.com has a
    WYSIWYG editor that will do this.

    Best,
    Colin
     
    Colin, Jan 7, 2005
    #9
  10. Guest

    WYSIWYG is not an option. I need to do it via script on Linux.

    Would someone tell me how the following HTML snippet should be encoded
    in an RSS file:

    <b>This is a test.</a>
    <a href=foo.html>Bar</a>.
    <img src=baz.jpg>
    <p>

    I tried using &lt; etc. but RSS readers simply display the HTML
    source, rather than rendering it.
     
    , Jan 7, 2005
    #10
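For the snippet in the post above, one scripted option is to let an XML library do the escaping. This is a sketch using Python's `xml.etree.ElementTree`; the title and link values are placeholders, not taken from the thread. The serializer escapes `&`, `<` and `>` in element text by itself, so the HTML is assigned unescaped.

```python
import xml.etree.ElementTree as ET

# The HTML from the post, unchanged (mismatched tag and all).
snippet = "\n".join([
    "<b>This is a test.</a>",
    "<a href=foo.html>Bar</a>.",
    "<img src=baz.jpg>",
    "<p>",
])

item = ET.Element("item")
ET.SubElement(item, "title").text = "Test item"                   # placeholder
ET.SubElement(item, "link").text = "http://example.com/foo.html"  # placeholder
# Assign the raw HTML; ElementTree escapes it once when serializing.
ET.SubElement(item, "description").text = snippet

item_xml = ET.tostring(item, encoding="unicode")

# Reading the item back yields the original HTML as character data.
round_trip = ET.fromstring(item_xml).find("description").text
```

Escaping exactly once matters. If the text is escaped by hand and then again by the library, `&lt;` becomes `&amp;lt;`, and a reader will show literal markup instead of rendering it.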
  11. Henri Sivonen Guest

    In article <>,
    (Nick Kew) wrote:

    > > Here is a sample HTML code. What would be the best way to put it into

    >
    > Looks more like tag-soup to me.


    "Entity-encoded HTML" *is* tag soup transported over XML character data.
    To make things worse, RSS provides no way of communicating whether the
    characters reported by the XML processor are presentable text or tag
    soup source that needs another level of parsing.

    To make matters still worse, the problem has propagated from RSS 0.92
    and 2.0 descriptions to titles and even to RSS 0.91 and RSS 1.0
    processing, even though there is no spec text supporting "entity-encoded
    HTML" in titles in any version of RSS or in descriptions in RSS 0.91 and
    RSS 1.0.

    For example, Sage misrenders the title "Tag Soup: How Mac IE 5 and
    Safari handle <x> <y> </x> </y>" in http://www.hut.fi/u/hsivonen/feed.xml

    --
    Henri Sivonen

    http://iki.fi/hsivonen/
     
    Henri Sivonen, Jan 7, 2005
    #11
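The ambiguity described in the post above can be shown in a few lines of Python. This is an illustration only, and the `TextOnly` class and `render_as_html` function are invented for it. The XML parser reports the title as plain text containing literal angle brackets. A consumer that assumes the text is HTML source and parses it a second time drops those characters as unknown tags.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

title_xml = (
    "<title>Tag Soup: How Mac IE 5 and Safari handle "
    "&lt;x&gt; &lt;y&gt; &lt;/x&gt; &lt;/y&gt;</title>"
)

# What the XML processor reports: plain text with literal "<x>" and so on.
plain_text = ET.fromstring(title_xml).text


class TextOnly(HTMLParser):
    """Collects only text content, as a tag-stripping renderer would."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def render_as_html(text):
    """Treat text as HTML source and return what remains after the tags."""
    parser = TextOnly()
    parser.feed(text)
    parser.close()
    return "".join(parser.chunks)


# A second level of parsing silently loses the "<x> <y> </x> </y>" part.
misread = render_as_html(plain_text)
```

Nothing in the feed tells the consumer which of the two readings the author intended.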
  12. Andy Dingley Guest

    On Sat, 08 Jan 2005 00:22:20 +0200, Henri Sivonen <>
    wrote:

    >To make matters still worse, the problem has propagated from RSS 0.92
    >and 2.0 descriptions to titles and even to RSS 0.91 and RSS 1.0
    >processing,


    No, RSS 1.0 is clear over this - although the others do have a
    problem. The RSS 1.0 spec wasn't written in the sloppy manner of the
    others.
     
    Andy Dingley, Jan 8, 2005
    #12
  13. Henri Sivonen Guest

    In article <>,
    Andy Dingley <> wrote:

    > On Sat, 08 Jan 2005 00:22:20 +0200, Henri Sivonen <>
    > wrote:
    >
    > >To make matters still worse, the problem has propagated from RSS 0.92
    > >and 2.0 descriptions to titles and even to RSS 0.91 and RSS 1.0
    > >processing,

    >
    > No, RSS 1.0 is clear over this - although the others do have a
    > problem. The RSS 1.0 spec wasn't written in the sloppy manner of the
    > others.


    My point was that the problem has propagated to RSS 1.0 *processing*.
    That is, there's software that assumes "entity-escaped HTML" in RSS 1.0
    *titles*, even though there is no spec text to back it up.

    --
    Henri Sivonen

    http://iki.fi/hsivonen/
     
    Henri Sivonen, Jan 8, 2005
    #13
  14. Guest

    So can anyone show me how to put this HTML fragment into RSS/XML?
     
    , Jan 8, 2005
    #14
  15. Nick Kew Guest

    In article <>,
    Andy Dingley <> writes:
    > On Fri, 7 Jan 2005 01:25:36 +0000, (Nick Kew)
    > wrote:
    >
    >>Why use a feed if it doesn't save your users anything?

    >
    > Why do you assume the function of my RSS feed ? I've built many


    I don't. I made an inference from the wording of the OP.

    >>Bearing in mind the above, freely mix it, just using namespaces to
    >>distinguish the elements.

    >
    > You can't use namespacing, because the content is HTML rather than
    > XHTML.


    Nonsense. Just map the HTML trivially to XHTML.

    > Apart from the standards-based argument and the fact that
    > namespacing just doesn't make sense for HTML, it's also impractical to
    > expect the incoming HTML content to be well-formed as an XML fragment
    > (or even valid HTML!).


    Not necessary. There's no shortage of software that'll parse HTML
    and XHTML to the same representation or event stream.

    > Remember that RSS is a _feed_, not a one-off document (I wish Winer
    > would recognise this). Like all layered protocols you have to be very
    > careful that your implementations are not only correct for one
    > demonstration example, they have to be demonstrably correct for all
    > possible inputs.


    Yes, and?

    >> Since you're already breaking the purpose of a feed,

    >
    > Rubbish. RSS does _NOT_ define any notion of "purpose", or what's
    > "appropriate" to use it for.


    Erm, I read the OP as implying a conventional/familiar purpose. What
    in the references to "web page content", "rss *feed* readers", or the
    ugly-tagsoup-html sample, leads you to suppose otherwise?

    --
    Nick Kew
     
    Nick Kew, Jan 8, 2005
    #15
  16. Guest

    Nick Kew wrote:
    > In article <>,
    > Andy Dingley <> writes:

    [..]
    > Not necessary. There's no shortage of software that'll parse HTML
    > and XHTML to the same representation or event stream.

    [..]

    what's meant by "event stream," please?


    thanks,

    Thufir Hawat
     
    , Jan 12, 2005
    #16
  17. Nick Kew Guest

    In article <>,
    writes:
    > [..]
    >> Not necessary. There's no shortage of software that'll parse HTML
    >> and XHTML to the same representation or event stream.

    > [..]
    >
    > what's meant by "event stream," please?


    Google for SAX.

    --
    Nick Kew
     
    Nick Kew, Jan 12, 2005
    #17
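As a rough illustration of "event stream": a SAX-style parser does not build a tree but calls back for each start tag, end tag and run of text. The sketch below uses Python's standard `html.parser` for tag-soup HTML and `xml.sax` for the XHTML equivalent. The class and function names are made up for the example, and the input is a shortened, simplified version of the sample in the first post.

```python
import xml.sax
from html.parser import HTMLParser


class HtmlEvents(HTMLParser):
    """Records parse events from (possibly sloppy) HTML."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag, dict(attrs)))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("text", data))


class XmlEvents(xml.sax.ContentHandler):
    """Records the same kind of events from well-formed XML."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(("start", name, dict(attrs.items())))

    def endElement(self, name):
        self.events.append(("end", name))

    def characters(self, content):
        self.events.append(("text", content))


def collect_html_events(text):
    parser = HtmlEvents()
    parser.feed(text)
    parser.close()
    return parser.events


def collect_xml_events(text):
    handler = XmlEvents()
    xml.sax.parseString(text.encode("utf-8"), handler)
    return handler.events


# HTML with an unquoted attribute, and the XHTML spelling of the same content.
html_events = collect_html_events("<p><b>CAESAR</b> <a href=caesar.jpg>Caesar</a></p>")
xml_events = collect_xml_events('<p><b>CAESAR</b> <a href="caesar.jpg">Caesar</a></p>')
```

For this small input both parsers report the same sequence of events, which is the sense in which HTML and XHTML can be mapped to one representation. Real tag soup with unclosed or mismatched tags needs a more forgiving HTML parser than this sketch assumes.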
  18. Venkata Srinivasulu, Feb 19, 2005
    #18
