an idiot question about a disallowed entity

lkrubner

Can't get this RSS feed clean:

http://www.whatisliberalism.com/pdsFiles/page2533.xml


Why is it dying?

Some users write posts in Microsoft Word, then copy and paste their
post to the web browser and paste it in and hit submit and create a
weblog entry. This is what I just did myself.

I've written a PHP function that I thought would clean this feed, it
goes through the whole feed one byte at a time, and makes sure every
byte has an ascii value between 32 and 126. I thought that might give
me some garbage characters but they'd all be safe for RSS.

No. The feed is still dying. How do I find out what entity is killing
it?
 
Malcolm Dew-Jones

(e-mail address removed) wrote:

: Can't get this RSS feed clean:

: http://www.whatisliberalism.com/pdsFiles/page2533.xml


: Why is it dying?

: Some users write posts in Microsoft Word, then copy and paste their
: post to the web browser and paste it in and hit submit and create a
: weblog entry. This is what I just did myself.

: I've written a PHP function that I thought would clean this feed, it
: goes through the whole feed one byte at a time, and makes sure every
: byte has an ascii value between 32 and 126. I thought that might give
: me some garbage characters but they'd all be safe for RSS.

: No. The feed is still dying. How do I find out what entity is killing
: it?

First I would feed it through an xml validator. It should tell you where
the xml goes wrong.

If it fails then you know what's wrong. If it passes - well, worry about
that after the first test.
 
Malcolm Dew-Jones

Malcolm Dew-Jones (e-mail address removed) wrote:
: (e-mail address removed) wrote:

: : Can't get this RSS feed clean:

: : http://www.whatisliberalism.com/pdsFiles/page2533.xml


: : Why is it dying?

: : Some users write posts in Microsoft Word, then copy and paste their
: : post to the web browser and paste it in and hit submit and create a
: : weblog entry. This is what I just did myself.

: : I've written a PHP function that I thought would clean this feed, it
: : goes through the whole feed one byte at a time, and makes sure every
: : byte has an ascii value between 32 and 126. I thought that might give
: : me some garbage characters but they'd all be safe for RSS.

: : No. The feed is still dying. How do I find out what entity is killing
: : it?

: First I would feed it through an xml validator. It should tell you where
: the xml goes wrong.

: If it fails then you know what's wrong. If it passes - well, worry about
: that after the first test.

In fact I realized I had a validator in "easy reach" so I used it on the
above url. I got

XML error: undefined entity, at line 22, column 23535

Using my handy dandy editor, I have cut and pasted some text from around
the offending section.

<description>I've ...

that our activities as feminists &acirc;'' including the
                                 ^^^^^^^
                                 ERROR

... of new ideas.</description>


You can see which entity is causing a problem. It fails on the first
error, so there could be other errors after that.
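For anyone without a validator in easy reach, the same kind of check can be sketched in a few lines. This is only an illustration in Python (the logic would port to PHP's XML parser functions), the function name first_xml_error and the sample string are made up for the example, and it reports only the first well-formedness error, just like the validator above. It also shows why the byte filter did not help: every byte of "&acirc;" is plain ASCII between 32 and 126.

```python
import xml.parsers.expat


def first_xml_error(data: bytes):
    """Return (message, line, column) for the first well-formedness
    error in data, or None if the parser accepts the whole document."""
    parser = xml.parsers.expat.ParserCreate()
    try:
        parser.Parse(data, True)  # True: this is the final chunk
    except xml.parsers.expat.ExpatError as err:
        message = xml.parsers.expat.ErrorString(err.code)
        return (message, err.lineno, err.offset)
    return None


# Hypothetical sample, not the real feed: a description containing an
# HTML entity reference that XML does not know about.
sample = b"<rss><description>feminists &acirc; including</description></rss>"

# Every byte passes the "ASCII 32..126" filter, yet the parse fails.
assert all(32 <= byte <= 126 for byte in sample)
print(first_xml_error(sample))
```

Running it on the real feed would mean downloading the file first and passing its bytes to the function.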
 
lkrubner

Malcolm Dew-Jones wrote:
: First I would feed it through an xml validator. It should tell you where
: the xml goes wrong.
: If it fails then you know what's wrong. If it passes - well, worry about
: that after the first test.

That was a very good idea. I got a very large number of errors. You can
see them if you go here:

http://www.stg.brown.edu/service/xmlvalid/

and type in this address to the URI validation field:

http://www.whatisliberalism.com/pdsFiles/page2533.xml


I was left wondering what some of the errors meant. What does " error
(1103): end tag uses GI for an undeclared element: title " mean?

And what does " error (1012): reference to undeclared entity:
&acirc; " mean?

I'm confused by the last error. I don't know much about XML, but I
didn't think that an HTML entity reference was invalid in XML. Why
would it be? What's the easiest way to sanitize HTML entity references
so that XML won't choke on them?
 
Johannes Koch

lkrubner wrote:
: And what does " error (1012): reference to undeclared entity:
: &acirc; " mean?

: I'm confused by the last error. I don't know much about XML, but I
: didn't think that an HTML entity reference was invalid in XML. Why
: would it be?

Because nobody defined them for the XML-based language that you use.

: What's the easiest way to sanitize HTML entity references
: so that XML won't choke on them?

Define them.
 
lkrubner

I don't know how to define entity references for XML, nor am I aware if
I'm allowed to add new definitions to RSS. XML is one of those things
I've been hoping to study for awhile but have not yet had the chance.

I'm wondering if there is a quick fix that will hold me till I have
time to look at the issue in depth. If I write a little PHP script to
strip out all HTML entity references, then the feed will work?
 
Malcolm Dew-Jones

(e-mail address removed) wrote:
: I don't know how to define entity references for XML, nor am I aware if
: I'm allowed to add new definitions to RSS. XML is one of those things
: I've been hoping to study for awhile but have not yet had the chance.

: I'm wondering if there is a quick fix that will hold me till I have
: time to look at the issue in depth. If I write a little PHP script to
: strip out all HTML entity references, then the feed will work?

The quick fix for unrecognized entities is to escape them, so

&circ; should be escaped to become
&amp;circ;

The escaped data "&amp;circ;" will be unescaped back to the original
"&circ;" if an xml program extracts the data from the feed.

Whether the "&circ;" will _display_ correctly will depend on the program
that extracts and/or displays the data. I.e. if you use an xml program to
extract the description data into a file, and then use a browser to view
the file, then the browser will display the correct symbol. On the other
hand if the browser itself is reading the rss feed directly then it may or
may not display the desired symbol - it might display the word "&circ;"
instead.
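A rough sketch of that quick fix, again in Python for illustration only. The function name escape_unknown_entities is made up, and the regular expression is a simplification that only recognizes references made of ASCII letters and digits. References to the five entities XML predefines are left alone, and numeric references such as &#226; are not touched because they do not match the pattern.

```python
import re

# The five entity names every XML parser understands without a DTD.
PREDEFINED = {"amp", "lt", "gt", "apos", "quot"}

# Simplified pattern for a named entity reference such as &acirc;
ENTITY_REF = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def escape_unknown_entities(text: str) -> str:
    """Turn &name; into &amp;name; unless name is predefined in XML."""

    def fix(match):
        name = match.group(1)
        if name in PREDEFINED:
            return match.group(0)  # already valid XML, keep as-is
        return "&amp;" + name + ";"

    return ENTITY_REF.sub(fix, text)


print(escape_unknown_entities("feminists &acirc; &amp; others"))
# prints: feminists &amp;acirc; &amp; others
```

The result parses as XML, but as noted above, whether a reader then shows the intended symbol or the literal text "&acirc;" depends on the program displaying it.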

As for the "GI" error, I am not familiar with that, and I'm sorry but I
haven't examined your file to figure it out.
 
Peter Flynn

lkrubner wrote:
: That was a very good idea. I got a very large number of errors. You can
: see them if you go here:

: http://www.stg.brown.edu/service/xmlvalid/

: and type in this address to the URI validation field:

: http://www.whatisliberalism.com/pdsFiles/page2533.xml


: I was left wondering what some of the errors meant. What does " error
: (1103): end tag uses GI for an undeclared element: title " mean?

It means title was never declared in the DTD or Schema. (GI is the
generic identifier, i.e. the element name.)

: And what does " error (1012): reference to undeclared entity:
: &acirc; " mean?

It means acirc was never declared in the DTD.

: I'm confused by the last error. I don't know much about XML, but I
: didn't think that an HTML entity reference was invalid in XML.

It is if you haven't declared it (with the exception of the five
predefined entities, amp, lt, gt, apos and quot, which every XML
processor recognizes whether or not they are declared).

: Why would it be?

Because that's what the rules say.

: What's the easiest way to sanitize HTML entity references
: so that XML won't choke on them?

Convert them to actual characters (eg â for acirc) using the
declared character set of the document.

///Peter
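One way to sketch that conversion, shown in Python rather than PHP purely as an illustration. The function name entities_to_chars is made up, and the example assumes the feed declares UTF-8. The lookup table is Python's standard html.entities.name2codepoint, which covers the HTML 4 entity names. References to the five XML-predefined entities are kept, and names missing from the table are left untouched, so they would still need separate handling.

```python
import re
from html.entities import name2codepoint

# Entity names XML itself predefines; these must stay as references.
PREDEFINED = {"amp", "lt", "gt", "apos", "quot"}

# Simplified pattern for a named entity reference such as &acirc;
ENTITY_REF = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def entities_to_chars(text: str) -> str:
    """Replace known HTML named entities with the actual characters."""

    def fix(match):
        name = match.group(1)
        if name in PREDEFINED or name not in name2codepoint:
            return match.group(0)  # keep XML's own, skip unknown names
        return chr(name2codepoint[name])

    return ENTITY_REF.sub(fix, text)


fixed = entities_to_chars("feminists &acirc; &amp; others")
# Encode with the character set the feed declares (assumed UTF-8 here).
feed_bytes = fixed.encode("utf-8")
print(feed_bytes)
```

The encoding step matters: the bytes written out have to match the encoding named in the XML declaration, otherwise the parser will fail on the converted characters instead.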
 
Johannes Koch

lkrubner wrote:
: I don't know how to define entity references for XML, nor am I aware if
: I'm allowed to add new definitions to RSS. XML is one of those things
: I've been hoping to study for awhile but have not yet had the chance.

: I'm wondering if there is a quick fix that will hold me till I have
: time to look at the issue in depth. If I write a little PHP script to
: strip out all HTML entity references, then the feed will work?

If you can change the feed, you could define the entities in a document
type declaration:

<!DOCTYPE rss [
<!ENTITY acirc "â">
]>
<rss>
....
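A small check that a declaration like this is enough for a parser, sketched in Python with the standard ElementTree module. The feed content is a made-up stand-in, and the entity is defined with the character reference &#226; instead of a literal â so that the example does not depend on the file's encoding; that substitution is my assumption, not part of the declaration above.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature feed with the entity declared in the
# internal DTD subset; &#226; is the character reference for â.
feed = """<!DOCTYPE rss [
<!ENTITY acirc "&#226;">
]>
<rss><channel><description>feminists &acirc; including</description></channel></rss>"""

root = ET.fromstring(feed)
description = root.find("channel/description").text
print(description)
```

The parser expands &acirc; itself, so the description text comes back with the real character in it. Whether every feed reader accepts a DOCTYPE in an RSS feed is a separate question.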
 
lkrubner

Peter said:
: Convert them to actual characters (eg â for acirc) using the
: declared character set of the document.

I see. So if I say that the character encoding for the feed is UTF-8, I
look up what the equivalent of acirc is for UTF-8. That sounds like the
right long-term goal for me to aim for. Should be simple enough to look
up all the entity references on w3c and translate them all to UTF-8,
yes?
 
