RSS Feeds - How to Add to Web Pages

Todd Shillam

I'm interested in learning how to add RSS feeds to a web page; that way, I don't have to update
content all the time. I would like to add RSS feeds from some of the popular news sources: MSNBC,
CNET, ZDNET, etc. Does anyone have recommendations on how to do this, or on where I can find the
information to learn? Any examples would be great.

Thanks,

Todd
 
Andy Dingley

I'm interested in learning how to add RSS feeds to a web page;

Almost any way you like. Usual way involves server-side scripting,
with some platform that includes an XML DOM. What do you have
available ?

You can do it purely client-side, but it's far from ideal.
 
Todd Shillam

Andy Dingley said:
Almost any way you like. Usual way involves server-side
scripting, with some platform that includes an XML DOM.
What do you have available ?

You can do it purely client-side, but it's far from ideal.

I've got an IIS server with PHP, Perl, and mysql--does that answer your question? I can script just
about anything since I've got a good grasp on programming concepts: loops, arrays, etc. I just
need to get an idea of how it's done, just a general direction. Hope that helps--thanks for the
reply.

Best regards,

Todd
 
SpaceGirl

Todd said:
I've got an IIS server with PHP, Perl, and mysql--does that answer your question? I can script just
about anything since I've got a good grasp on programming concepts: loops, arrays, etc. I just
need to get an idea of how it's done, just a general direction. Hope that helps--thanks for the
reply.

Best regards,

Todd

RSS is just a raw ASCII text document containing XML - so you can use ASP
to create and write the data to a file, which can then be read from any
RSS client. In the case of one of our sites, every time our news
database (SQLServer) is updated, a script automatically regenerates our
RSS file and writes it to the server root. While not ideal, it works!

Well, sort of. For some reason our server is down right now :/



--


x theSpaceGirl (miranda)

# lead designer @ http://www.dhnewmedia.com #
# remove NO SPAM to email, or use form on website #
 
Andy Dingley

I've got an IIS server with PHP, Perl, and mysql

This isn't a dumb question, but it's a hugely open-ended one and thus
hard to answer. There are certainly no neat snappy little short
answers. Do some Googling, because almost everything is already solved
for you - with PHP, you should have no trouble finding ready-built
almost-solutions.

There are two things you can do with RSS; present it and aggregate it.

- Presenting it is simpler. You connect to one entire feed, supplied
externally, and you transform the whole thing, probably into HTML. One
feed is used, and you use the whole feed.

- Aggregating it is harder (and may become really complicated). You
take more than one feed and combine them. To be really useful, you
start filtering items from each; killing duplicates (many interesting
articles soon get replicated around many feeds) and selecting items
that are "of interest" to you. Identifying "of interest" really well
could turn into a PhD thesis. If you combine without filtering, it
soon turns into the Usenet "infinite monkeys" scenario. You can even
offer your aggregator output as its own RSS feed, perhaps
"Coffee-related news compiled from commodity trading news sources
around the world, and the latest roasting recipes from old Havana".
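
To make the aggregation idea above concrete, here is a minimal Python sketch (an illustration, not code from anyone in this thread). It merges items from several feeds, drops duplicates by link, and applies a crude keyword test standing in for "of interest". The sample feeds, the `aggregate` function and the keyword filter are all assumptions, and it only handles namespace-free RSS 0.9x/2.0-style `<item>` elements; RSS 1.0 items live in a namespace and would need a namespaced lookup.

```python
import xml.etree.ElementTree as ET

# Two made-up feeds; the second one repeats an article from the first.
FEED_A = """<rss version="2.0"><channel><title>Feed A</title>
<item><title>Coffee prices rise</title><link>http://example.com/coffee</link></item>
<item><title>Tea news</title><link>http://example.com/tea</link></item>
</channel></rss>"""

FEED_B = """<rss version="2.0"><channel><title>Feed B</title>
<item><title>Coffee prices rise (syndicated)</title><link>http://example.com/coffee</link></item>
<item><title>Roasting recipes</title><link>http://example.com/roast</link></item>
</channel></rss>"""


def aggregate(feed_documents, keyword=None):
    """Combine items from several feeds, skipping duplicate links.

    If keyword is given, keep only items whose title contains it
    (a very crude stand-in for real "of interest" filtering).
    """
    seen_links = set()
    merged = []
    for document in feed_documents:
        root = ET.fromstring(document)
        for item in root.iter("item"):
            title = item.findtext("title", default="")
            link = item.findtext("link", default="")
            if link in seen_links:
                continue  # the same article already arrived via another feed
            if keyword and keyword.lower() not in title.lower():
                continue  # not "of interest"
            seen_links.add(link)
            merged.append((title, link))
    return merged


all_items = aggregate([FEED_A, FEED_B])
coffee_items = aggregate([FEED_A, FEED_B], keyword="coffee")
```

Matching duplicates on the link alone is the simplest choice; real feeds often repeat an article under different URLs, which is part of why this gets hard.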

There's also the question of caching. You should cache feed content
that your site downloads and serve it to your users from a local copy.
If you retrieve the feed each time your site gets a request, then this
is firstly a bad thing to do and contrary to the spirit of
syndication, and secondly you'll find your server soon gets locked out
of many feed servers for being "greedy".

You can present without caching, but aggregation really needs caching
to work. In-memory caching is pretty simple with just an XML DOM, but
serious work needs a database.
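
As a rough illustration of the caching point, here is a small in-memory cache sketched in Python. The `FeedCache` class, the injected `fetch` function and the 30-minute maximum age are assumptions made for the example; a real fetch function might wrap `urllib.request.urlopen`, and anything serious would persist the copies in a database as described above.

```python
import time


class FeedCache:
    """Serve feed bodies from a local copy, refetching only when stale."""

    def __init__(self, fetch, max_age_seconds=1800, clock=time.time):
        self._fetch = fetch            # function: url -> feed body
        self._max_age = max_age_seconds
        self._clock = clock            # injectable so the demo needs no waiting
        self._entries = {}             # url -> (fetched_at, body)

    def get(self, url):
        now = self._clock()
        entry = self._entries.get(url)
        if entry is not None and now - entry[0] < self._max_age:
            return entry[1]            # fresh enough: no request to the feed server
        body = self._fetch(url)
        self._entries[url] = (now, body)
        return body


# Demo with a fake fetcher and a fake clock (both hypothetical).
calls = []


def fake_fetch(url):
    calls.append(url)
    return "<rss/>"


fake_now = [0.0]
cache = FeedCache(fake_fetch, max_age_seconds=1800, clock=lambda: fake_now[0])

cache.get("http://example.com/feed.xml")   # first request: fetched
cache.get("http://example.com/feed.xml")   # second request: served from the cache
fake_now[0] = 3600.0                       # an hour later the copy is stale
cache.get("http://example.com/feed.xml")   # fetched again
```

However many visitors hit the page, the remote server sees at most one request per feed per half hour.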


Aggregators shouldn't present content, except as RSS. It's easier (and
much more flexible) to couple a presenter to the output of an
aggregator than it is to make the aggregator present the content
itself.

Presentation consists of turning unreadable RSS geek-speak into pretty
HTML (or whatever). You can do this any way you like, but it's a
natural task for simple XSLT.
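
XSLT is the tool named above; Python's standard library has no XSLT processor, so the sketch below swaps in plain DOM-style code (`xml.etree.ElementTree`) to show the same feed-to-HTML step. The sample feed, the `feed_to_html` function and the output layout are assumptions for illustration, and only a namespace-free RSS 0.9x/2.0-style structure is handled.

```python
import html
import xml.etree.ElementTree as ET

# A made-up feed with characters that need escaping in the HTML output.
SAMPLE_FEED = """<rss version="2.0"><channel>
<title>Example News</title>
<item><title>Fish &amp; chips</title><link>http://example.com/1?a=1&amp;b=2</link></item>
<item><title>Second story</title><link>http://example.com/2</link></item>
</channel></rss>"""


def feed_to_html(feed_xml):
    """Turn one feed into a heading plus a list of links."""
    channel = ET.fromstring(feed_xml).find("channel")
    lines = ["<h2>%s</h2>" % html.escape(channel.findtext("title", default="")), "<ul>"]
    for item in channel.findall("item"):
        # The parser hands back decoded text, so it must be escaped again for HTML.
        title = html.escape(item.findtext("title", default=""))
        link = html.escape(item.findtext("link", default=""), quote=True)
        lines.append('<li><a href="%s">%s</a></li>' % (link, title))
    lines.append("</ul>")
    return "\n".join(lines)


page_fragment = feed_to_html(SAMPLE_FEED)
```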



As for feed and RSS versions, RSS 1.0 is by far the best and
should be used whenever you create a feed. However, the input stage of
a presenter / aggregator should always be widely accepting in what it
can take. This is one of the best arguments for not writing your own
from scratch !

PS - Dave Winer is mad, bad, and dangerous to host with. Shun him.
http://www.theregister.co.uk/2004/06/15/winer_weblog_wipeout/


There's a lot I haven't mentioned here. Content encoding, metadata,
filtering algorithms, adaptive aggregation. Come back (or to
comp.text.xml) whenever you have more queries.


I suggest that you put an environment together on your servers that
will allow you to do simple XSLT transforms. Then try writing a few,
and make some incoming RSS appear as HTML. This isn't a useful
technique, but it will give you a feel for the protocols. After that,
to do it for real, look for an open-source aggregator for PHP & MySQL
and try installing that.

Aggregators are hard. _Good_ aggregators are really hard.
 
Andy Dingley

SpaceGirl said:
RSS is just a raw ASCII text document containing XML

No it isn't 8-(

You've been reading that bloody Dave Winer's specs again, haven't you.


(ASCII and "flat file" have no place anywhere near RSS)
 
Kris

You can do it purely client-side, but it's far from ideal.

Todd said:
I've got an IIS server with PHP, Perl, and mysql--does that answer your question? I can script just
about anything since I've got a good grasp on programming concepts: loops, arrays, etc. I just
need to get an idea of how it's done, just a general direction. Hope that helps--thanks for the
reply.

If you understand the Dutch language, you may like
<http://www.naarvoren.nl/artikel/rss.html>
 
Kris

SpaceGirl said:
RSS is just a raw ASCII text document containing XML

As Andy points out, it isn't. The XML markup itself, though, is written
using characters from the 128 characters of US-ASCII, which happen to be
the first 128 characters of roughly every character set.
 
SpaceGirl

Kris said:
As Andy points out, it isn't. The XML markup itself, though, is written
using characters from the 128 characters of US-ASCII, which happen to be
the first 128 characters of roughly every character set.

That's what I meant... I was trying to make the point that RSS feeds are
hardly rocket science. All you have to do is work out a way of
generating the data in the first place. Then it's just XML.


 
Andy Dingley

SpaceGirl said:
I was trying to make the point that RSS feeds are
hardly rocket science. All you have to do is work out a way of
generating the data in the first place. Then it's just XML.

RSS feeds are _incredibly_ complicated and difficult to get right.
However this is all done for you by The Power Of XML (tm) and using a
ready-built DOM component that was slaved over by people who really
understood the nuances of the XML spec. The fact that the feed
builder doesn't need to carry out or understand any of this work
themselves doesn't mean that the protocol itself is trivial.

Trying to generate them directly "as files" is a disaster area. I am
heartily sick of mucking out people's feeds when they're basically not
even well-formed XML. Dodgy character entity references, dodgy
character encoding, the lot.

DO NOT WRITE YOUR OWN XML SERIALISERS !!!

There are only a handful of people truly good enough to do it, and
since they have already written them for every platform out there,
there's no earthly reason to write more of the things. I don't know how
to write an accurate serialiser, and I don't even know anyone else who
does (or at least they won't answer my questions in comp.text.xml on
the parts I don't understand).
 
Els

Andy said:
RSS feeds are _incredibly_ complicated and difficult to get
right. However this is all done for you by The Power Of XML
(tm) and using a ready-built DOM component that was slaved
over by people who really understood the nuances of the XML
spec. The fact that the feed builder doesn't need to carry
out or understand any of this work themselves doesn't mean
that the protocol itself is trivial.

Trying to generate them directly "as files" is a disaster
area. I am heartily sick of mucking out people's feeds when
they're basically not even well-formed XML. Dodgy character
entity references, dodgy character encoding, the lot.

DO NOT WRITE YOUR OWN XML SERIALISERS !!!

There are only a handful of people truly good enough to do
it, and since they have already written them for every
platform out there, there's no earthly reason to write more
of the things. I don't know how to write an accurate
serialiser, and I don't even know anyone else who does (or
at least they won't answer my questions in comp.text.xml on
the parts I don't understand).

What do you mean by XML serialiser?
I have an RSS feed on my site, and I think it's well formed,
but I haven't used any serialiser. I'm probably
misunderstanding you, but could you explain it?
 
Matthias Gutfeldt

Andy said:
RSS feeds are _incredibly_ complicated and difficult to get right.
However this is all done for you by The Power Of XML (tm) and using a
ready-built DOM component that was slaved over by people who really
understood the nuances of the XML spec. The fact that the feed
builder doesn't need to carry out or understand any of this work
themselves doesn't mean that the protocol itself is trivial.

Trying to generate them directly "as files" is a disaster area. I am
heartily sick of mucking out people's feeds when they're basically not
even well-formed XML. Dodgy character entity references, dodgy
character encoding, the lot.

DO NOT WRITE YOUR OWN XML SERIALISERS !!!

Bah, that's what CDATA is for - let's just drop everything in the big
content:encoded CDATA soup pot, stir vigorously, and serve :).


Matthias
 
SpaceGirl

Els wrote:

What do you mean by XML serialiser?
I have an RSS feed on my site, and I think it's well formed,
but I haven't used any serialiser. I'm probably
misunderstanding you, but could you explain it?

Likewise... I have a self generated feed on one of our sites, and it
validates perfectly against the handful of RSS validators out there too.

 
Andy Dingley

What do you mean by XML serialiser?

If you use an XML DOM, it will generally have either a .xml property,
or a .save method. These take the DOM contents and turn them into a
well-formed XML document. That's "serialisation".
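
For a feel of what that looks like in practice, here is a hedged Python sketch using the standard library's `xml.etree.ElementTree`, where `ET.tostring` plays the role of the `.xml` property or `.save` method. The `build_feed` function and the element layout are assumptions for illustration; the structure is a simplified, namespace-free skeleton and not a complete feed in any RSS version.

```python
import xml.etree.ElementTree as ET


def build_feed(channel_title, items):
    """Build the tree in memory, then let the library serialise it."""
    rss = ET.Element("rss")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = channel_title
    for title, link in items:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = title
        ET.SubElement(item, "link").text = link
    # Serialisation: the library, not our code, escapes &, < and >.
    return ET.tostring(rss, encoding="unicode")


feed = build_feed(
    "Caf\u00e9 news & views",
    [("1 < 2", "http://example.com/?a=1&b=2")],
)
```

The calling code never writes an angle bracket or an entity reference itself; awkward characters in titles and links are the serialiser's problem.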

I have an RSS feed on my site, and I think it's well formed,

That's the trouble - you _think_ it is, but how much have you really
tested it ? What about some pathological combination of obscure
characters ? I work in magazine publishing, where we publish
magazines that deal with web publishing - trying to feed an article
_about_ XML through an XML-based content management system can turn up
all sorts of problems.
 
Els

Andy said:
If you use an XML DOM, it will generally have either a .xml
property, or a .save method. These take the DOM contents
and turn them into a well-formed XML document. That's
"serialisation"

Hmm... guess I haven't used one. Don't know about any .save
method. I just typed it all by hand in my text editor.

Andy said:
That's the trouble - you _think_ it is, but how much have
you really tested it ?

I tested it how I knew to test it, but apparently there's more
to it.

Andy said:
What about some pathological
combination of obscure characters ?

"a pathological combination of obscure characters"...
Instead of me trying to figure out what that could mean, could
you do me a favor and look at my feed to see if it is well
formed?
http://locusmeus.com/rss.xml

Andy said:
I work in magazine
publishing, where we publish magazines that deal with web
publishing - trying to feed an article _about_ XML through
an XML-based content management system can turn up all
sorts of problems.

I believe you :-\
 
Andy Dingley

I just typed it all by hand in my text editor.
I tested it how I knew to test it, but apparently there's more
to it.

Yes, you could say that...

You have created a _document_ here, not a feed. To validate your
document, we can look at one copy of it and study that. This is quite
easy.

Validating a feed is much harder. A feed is the automatic output of
some larger process. We not only need to look at one example of the
feed's output, on one occasion, but we need to study _all_ the
possible outputs of the feed, for all possible inputs. This is much
harder.

We now get into software engineering, and particularly software
testing. This is a complex subject, rarely done, and impossibly rare
for web application software. There's no time for me to go into it
here, but it involves many techniques, and certainly more than just
looking at one instance of output.

Testing may be "black box" or "white box" testing. Here we should
think about white box, which involves knowing how the insides of the
process are carried out. With the knowledge we have of XML, this would
lead me to test "corner cases", such as the sequences "]]>"
"&eacute;" , "&apos;" "<foo>" "<br>" "<br ></br>" and similar. Test
plans for an RSS feed could easily take a week to produce, let alone
carry out.
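
A tiny slice of such a test plan could be sketched like this in Python: push each corner case through a library serialiser, parse the result back, and check that the text survives unchanged. The `serialise_description` and `round_trips` helpers are hypothetical names, and `xml.etree.ElementTree` stands in for whatever DOM a real feed generator uses.

```python
import xml.etree.ElementTree as ET

# The awkward sequences listed above.
CORNER_CASES = ["]]>", "&eacute;", "&apos;", "<foo>", "<br>", "<br ></br>"]


def serialise_description(text):
    """Serialise one <description> element via the library."""
    element = ET.Element("description")
    element.text = text
    return ET.tostring(element, encoding="unicode")


def round_trips(text):
    """True if parsing the serialised form gives back exactly the input."""
    return ET.fromstring(serialise_description(text)).text == text


failures = [case for case in CORNER_CASES if not round_trips(case)]
```

An empty `failures` list says only that these six inputs survive this one serialiser; it is one test among the many a real plan would need.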
"a pathological combination of obscure characters"...
Instead of me trying to figure out what that could mean,

Try "]]>" How would you represent that in your <description>
element?

you do me a favor and look at my feed to see if it is well
formed?
http://locusmeus.com/rss.xml

The current version of your feed is well-formed XML, but it is bogus,
invalid, and meta-invalid. (Admittedly two of these terms are rarely
used)

It's bogus. This means that it has no defined schema, so any concept
of "valid" is simply impossible - for there is nothing to validate it
against. There is no DTD or namespace referenced. There is a root
element of <rss version="2.0"> which _might_ mean something to a tool
that already expects to receive RSS - although that's bad XML design
(not your fault - I blame Winer, for he is truly clueless when it
comes to spec writing)

It's bogus - there is no version 2.0 of RSS
Here's why
http://diveintomark.org/archives/2004/02/04/incompatible-rss


It's invalid. Assuming the "best-guess" definition of what RSS 2.0
might have been, your date formats don't comply with RFC 822.

Admittedly I didn't eyeball this - I used the handy RSS / Atom
validator over here:
http://feedvalidator.org/check.cgi?url=http://locusmeus.com/rss.xml

I can't see _why_ they're not valid 822 dates either, but it's 2 in
the morning, so that's not surprising.


It's meta-invalid. You're using CDATA sections to encode HTML
content, rather than entity-encoding like this:
"&lt;p&gt;Hello world&lt;/p&gt;"

This is _not_ the same thing !
Yes, Andrew Urquhart, I'm talking to you ! Remember the SEI ? :cool:

Maybe this use of CDATA is "valid" RSS 2.0. But it's not the effect
you're after. Look at the results of your stylesheet for one thing -
the <p> markup is appearing on the page, and as a single paragraph.
It's not what you wanted, but it's what you (literally) asked for.
Your stylesheet is acting correctly here (as far as I know RSS 2.0),
even though the result is b0rken.
 
Els

Andy said:
Yes, you could say that...

You have created a _document_ here, not a feed. To
validate your document, we can look at one copy of it and
study that. This is quite easy.

Validating a feed is much harder. A feed is the automatic
output of some larger process. We not only need to look at
one example of the feed's output, on one occasion, but we
need to study _all_ the possible outputs of the feed, for
all possible inputs. This is much harder.

Okay, I guess that's true. Am not doing any automatic output yet
though, so, by the time I get a blog and would want it
automated, where should I start to read up on the info I need?

Andy said:
We now get into software engineering, and particularly
software testing. This is a complex subject, rarely done,
and impossibly rare for web application software. There's
no time for me to go into it here, but it involves many
techniques, and certainly more than just looking at one
instance of output.

Testing may be "black box" or "white box" testing. Here we
should think about white box, which involves knowing how
the insides of the process are carried out. With the
knowledge we have of XML, this would lead me to test
"corner cases", such as the sequences "]]>" "&eacute;" ,
"&apos;" "<foo>" "<br>" "<br ></br>" and similar. Test
plans for an RSS feed could easily take a week to produce,
let alone carry out.
"a pathological combination of obscure characters"...
Instead of me trying to figure out what that could mean,

Try "]]>" How would you represent that in your
<description> element?

Uh... ??

Andy said:
The current version of your feed is well-formed XML, but it
is bogus, invalid, and meta-invalid. (Admittedly two of
these terms are rarely used)

It's bogus. This means that it has no defined schema, so
any concept of "valid" is simply impossible - for there is
nothing to validate it against. There is no DTD or
namespace referenced. There is a root element of <rss
version="2.0"> which _might_ mean something to a tool that
already expects to receive RSS - although that's bad XML
design (not your fault - I blame Winer, for he is truly
clueless when it comes to spec writing)

Who's Winer?

Andy said:
It's bogus - there is no version 2.0 of RSS
Here's why
http://diveintomark.org/archives/2004/02/04/incompatible-rss

I'll look into that, thanks.

Andy said:
It's invalid. Assuming the "best-guess" definition of what
RSS 2.0 might have been, your date formats don't comply
with RFC 822.

What should the date format look like? I first 'guessed' the way
I put it, but the validator said it should be like the way I
have it now.

Andy said:
Admittedly I didn't eyeball this - I used the handy RSS /
Atom validator over here:
http://feedvalidator.org/check.cgi?url=http://locusmeus.com%2Frss.xml

Ah, right. That's the validator I tested it in, but not after
the last two items.

But I don't see it:
What is the difference between
<pubDate>Sun, 13 June 2004 01:00:00 +0200</pubDate>
and
<pubDate>Mon, 7 June 2004 09:18:00 +0200</pubDate>
given that the latter is valid and the former isn't?

I honestly don't see it.
If anything, the latter would be wrong, cause it should be 07
June instead of 7 June...
Ah, I see it already. The validator counts the amount of digits,
and complained about the time. It got a false "correct" on 7
June, which should be 07 Jun.

Changed it, it's valid RSS2.0 now. :)

Andy said:
I can't see _why_ they're not valid 822 dates either, but
it's 2 in the morning, so that's not surprising.

I seem to be awake then ;-)

Andy said:
It's meta-invalid. You're using CDATA sections to encode
HTML content, rather than entity-encoding like this:
"&lt;p&gt;Hello world&lt;/p&gt;"

This is _not_ the same thing !
Yes, Andrew Urquhart, I'm talking to you ! Remember the
SEI ? :cool:

Will look into that, although I did try it first with the
entity-encoding instead of CDATA, and iirc it gave the same
result in the 'with stylesheet' version.

Andy said:
Maybe this use of CDATA is "valid" RSS 2.0. But it's not
the effect you're after.

Very true.

Andy said:
Look at the results of your
stylesheet for one thing - the <p> markup is appearing on
the page, and as a single paragraph. It's not what you
wanted, but it's what you (literally) asked for. Your
stylesheet is acting correctly here (as far as I know RSS
2.0), even though the result is b0rken.

Yep, I figured it was most important to get the desired effect
in Feed readers (have only tested in Feedreader and Awasu), and
the stylesheet is really only to make it look a little bit
better than it would be without one.

If anyone knows how to get links in the document to display
correctly in a feedreader _and_ the original doc, please tell
me?
 
Andy Dingley

Els said:
Andy Dingley wrote:
"a pathological combination of obscure characters"...
Instead of me trying to figure out what that could mean,

Try "]]>" How would you represent that in your
<description> element?

Uh... ??

You're using CDATA sections to encode the marked-up content. A CDATA
section is ended by the sequence "]]>". So what happens if you try to
represent such a sequence _inside_ your content ?

The sequence:
<![CDATA[ before ']]>' after ]]>
isn't valid. It appears to be a shorter CDATA section, with some
rubbish after it. It'll probably appear to readers as " before ''
after ]]>" rather than
" before ']]>' after " as it ought to.

Your content management (or in this case, you) should encode it as
something like

<![CDATA[ before ']]]><![CDATA[]>' after ]]>
instead (Alan or Jukka might well correct me here).

This is a very common software testing technique. Whenever there's a
magic "escape sequence" or "end of document" marker, then you try it
with one of them in the data part.
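
As a rough Python sketch of that escaping rule: the hypothetical `to_cdata` helper below splits every "]]>" across two CDATA sections, and a parser then confirms the original text comes back. It splits after "]]" rather than after the first "]" as in the hand-written example above; either split point yields the same text once parsed.

```python
import xml.etree.ElementTree as ET


def to_cdata(text):
    """Wrap text in CDATA, splitting any ']]>' so it cannot end the section early."""
    # "]]>" becomes "]]" + end of section + start of new section + ">"
    return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>"


original = " before ']]>' after "
wrapped = to_cdata(original)

# Apply the same escape-sequence test: put the terminator in the data, parse it back.
parsed = ET.fromstring("<description>" + wrapped + "</description>").text
```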


Who's Winer?

Dave Winer. Clueless guy who doesn't understand how to write a
protocol specification.
http://www.theregister.co.uk/2004/06/15/winer_weblog_wipeout/

What should the date format look like?

No idea. RFC 822 (or RFC 2822) will describe the format as BNF
(Backus-Naur). From this you can work it out - with a little coffee
and head scratching.
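
One way to skip the head scratching is to let a library produce the date. The Python sketch below uses the standard `email.utils` functions, which emit RFC 2822 style dates; the example timestamp is taken from the feed discussed in this thread, and the variable names are assumptions. Note that the RFC grammar abbreviates month names to three letters ("Jun", not "June").

```python
from datetime import datetime, timedelta, timezone
from email.utils import format_datetime, parsedate_to_datetime

# 13 June 2004, 01:00 local time, two hours ahead of UTC.
published = datetime(2004, 6, 13, 1, 0, 0, tzinfo=timezone(timedelta(hours=2)))

# Format for a <pubDate> element.
pub_date = format_datetime(published)

# Parsing goes the other way, e.g. when reading someone else's feed.
parsed_back = parsedate_to_datetime(pub_date)
```

Here `pub_date` comes out as "Sun, 13 Jun 2004 01:00:00 +0200".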

Will look into that, although I did try it first with the
entity-encoding instead of CDATA, and iirc it gave the same
result in the 'with stylesheet' version.

As a general rule, you can't process RSS for rendering as HTML by the
use of an XSLT stylesheet alone. It's also awkward to process HTML
into RSS by XSLT alone.

XSLT and XPath work with elements and text nodes - they don't look
within a text node (and that includes entity references and CDATA
sections). To render encoded HTML in RSS, you must do this. Your
options are to either:

- Use something other than XSLT
(code over an XML DOM is a good approach,
if you're already on a suitable platform)

- Use XSLT and suffer with the minimal string handling functions you
do have. O'Reilly's "XSLT Cookbook"
<http://www.amazon.co.uk/exec/obidos/ASIN/0596003722/codesmiths>
shows some amazing examples of what's possible, but it'll make your
head explode and the code is nightmarish to maintain afterwards
(hello again Mr Urquhart !)

- Use XSLT with extension functions (or XSLT 2.0).
This is the best option for XSLT. String handling with some easy
JavaScript and regexes makes the whole problem easy.

Yep, I figured it was most important to get the desired effect
in Feed readers (have only tested in Feedreader and Awasu),

No, the important thing is to DO IT RIGHT, according to the
specification. And choose a sensible specification to follow, not
Winer's dog's-breakfast.

Little published RSS is valid, and almost none of it (that uses
<description> content beyond simple plaintext) is meta-valid like
this. As a result, many of the feed readers have taken to accepting
the RSS version of tag-soup and treating it as not only valid, but
inferring "what you meant" rather than what you stated.

This is bad. It's wrong, and it encourages feed authors to be wrong
too.

It's particularly bad if you're trying to syndicate articles about
XML. Maybe I _wanted_ to talk about the "<p>" tag in the middle of my
<p>paragraph</p> ?

Incidentally, the correct RSS markup for this is:
<description>&lt;p&gt;A paragraph about
the &amp;lt;p&amp;gt; tag &lt;/p&gt;</description>

You may find that this:
<description><![CDATA[<p>A paragraph about
the &lt;p&gt; tag </p>]]></description>

Also works, but it's wrong and actually means something else. It ought
to render into HTML as:
&lt;p&gt;A paragraph about
the &amp;lt;p&amp;gt; tag &lt;/p&gt;

Which will _render_ on-screen as:
<p>A paragraph about the &lt;p&gt; tag </p>
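
The entity-encoded form does not have to be written by hand. As a hedged Python sketch: give the HTML to an XML library as plain text and let its serialiser do the encoding. `xml.etree.ElementTree` is used here as a stand-in for any XML DOM, and the variable names are assumptions.

```python
import xml.etree.ElementTree as ET

# The HTML we want to syndicate: a paragraph that talks about the <p> tag.
html_content = "<p>A paragraph about the &lt;p&gt; tag </p>"

description = ET.Element("description")
description.text = html_content          # handed over as text, not as markup

# The serialiser entity-encodes the text, including the existing "&".
markup = ET.tostring(description, encoding="unicode")

# A consumer parsing the feed gets the original HTML string back.
recovered = ET.fromstring(markup).text
```

The serialised `markup` matches the entity-encoded `<description>` shown above, apart from the line break.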
 
Els

[snip the difficult part, will try to understand it later ;-)]

Andy said:
As a general rule, you can't process RSS for rendering as
HTML by the use of an XSLT stylesheet alone. It's also
awkward to process HTML into RSS by XSLT alone.

I am not using an XSLT stylesheet. :)

Andy said:
XSLT and XPath work with elements and text nodes - they
don't look within a text node (and that includes entity
references and CDATA sections). To render encoded HTML in
RSS, you must do this. Your options are to either:

- Use something other than XSLT
(code over an XML DOM is a good approach,
if you're already on a suitable platform)

I suppose I did it all wrong using CSS for the styles then...

Andy said:
- Use XSLT and suffer with the minimal string handling
functions you
do have. O'Reilly's "XSLT Cookbook"
<http://www.amazon.co.uk/exec/obidos/ASIN/0596003722/codesmiths>
shows some amazing examples of what's possible, but
it'll make your head explode and the code is nightmarish to
maintain afterwards (hello again Mr Urquhart !)

Exploding head... hmm... maybe later ;-)

Andy said:
- Use XSLT with extension functions (or XSLT 2.0).
This is the best option for XSLT. String handling with some
easy JavaScript and regexes makes the whole problem easy.

I don't want to use JavaScript, but I could look into XSLT
2.0, if it really is the best way.

Andy said:
No, the important thing is to DO IT RIGHT,

That's what I would like to accomplish.

Andy said:
according to the
specification. And choose a sensible specification to
follow, not Winer's dog's-breakfast.

Any online sensible specification you could recommend?

Andy said:
Little published RSS is valid, and almost none of it (that
uses <description> content beyond simple plaintext) is
meta-valid like this. As a result, many of the feed readers
have taken to accepting the RSS version of tag-soup and
treating it as not only valid, but inferring "what you
meant" rather than what you stated.

This is bad. It's wrong, and it encourages feed authors to
be wrong too.

This part I understood. It's like what IE does with FrontPage
docs ;-)

Andy said:
It's particularly bad if you're trying to syndicate
articles about XML. Maybe I _wanted_ to talk about the
"<p>" tag in the middle of my <p>paragraph</p> ?

I see your point.

Andy said:
Incidentally, the correct RSS markup for this is:
<description>&lt;p&gt;A paragraph about
the &amp;lt;p&amp;gt; tag &lt;/p&gt;</description>

You may find that this:
<description><![CDATA[<p>A paragraph about
the &lt;p&gt; tag </p>]]></description>

Also works, but it's wrong and actually means something
else. It ought to render into HTML as:
&lt;p&gt;A paragraph about
the &amp;lt;p&amp;gt; tag &lt;/p&gt;

Which will _render_ on-screen as:
<p>A paragraph about the &lt;p&gt; tag </p>

I'll certainly look into all of this one of these days, thanks
for the explanation :)
 
