an idiot question about a disallowed entity

Discussion in 'XML' started by lkrubner@geocities.com, Oct 12, 2005.

  1. Guest

    Can't get this RSS feed clean:

    http://www.whatisliberalism.com/pdsFiles/page2533.xml


    Why is it dying?

    Some users write posts in Microsoft Word, then copy and paste their
    post to the web browser and paste it in and hit submit and create a
    weblog entry. This is what I just did myself.

    I've written a PHP function that I thought would clean this feed, it
    goes through the whole feed one byte at a time, and makes sure every
    byte has an ascii value between 32 and 126. I thought that might give
    me some garbage characters but they'd all be safe for RSS.

    No. The feed is still dying. How do I find out what entity is killing
    it?
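
For reference, a byte filter like the one described can be sketched as follows. This is an illustration in Python rather than the poster's PHP, and the name `ascii_filter` is an assumption. It shows one thing worth noticing: an entity reference such as `&acirc;` is made up entirely of printable ASCII bytes, so a filter of this kind passes it through untouched.

```python
def ascii_filter(data: bytes) -> bytes:
    """Keep only bytes whose value is between 32 and 126 inclusive."""
    return bytes(b for b in data if 32 <= b <= 126)


# A Word-style dash (three bytes in UTF-8) is dropped,
# but the entity reference survives unchanged.
raw = "a &acirc;\u2014 b".encode("utf-8")
print(ascii_filter(raw))
```

Note that a filter like this also drops tabs and newlines (bytes 9, 10 and 13), which XML does allow.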
     
    , Oct 12, 2005
    #1

  2. wrote:

    : Can't get this RSS feed clean:

    : http://www.whatisliberalism.com/pdsFiles/page2533.xml


    : Why is it dying?

    : Some users write posts in Microsoft Word, then copy and paste their
    : post to the web browser and paste it in and hit submit and create a
    : weblog entry. This is what I just did myself.

    : I've written a PHP function that I thought would clean this feed, it
    : goes through the whole feed one byte at a time, and makes sure every
    : byte has an ascii value between 32 and 126. I thought that might give
    : me some garbage characters but they'd all be safe for RSS.

    : No. The feed is still dying. How do I find out what entity is killing
    : it?

    First I would feed it through an xml validator. It should tell you where
    the xml goes wrong.

    If it fails then you know what's wrong. If it passes - well, worry about
    that after the first test.



    --

    This programmer available for rent.
     
    Malcolm Dew-Jones, Oct 12, 2005
    #2

  3. Malcolm Dew-Jones () wrote:
    : wrote:

    : : Can't get this RSS feed clean:

    : : http://www.whatisliberalism.com/pdsFiles/page2533.xml


    : : Why is it dying?

    : : Some users write posts in Microsoft Word, then copy and paste their
    : : post to the web browser and paste it in and hit submit and create a
    : : weblog entry. This is what I just did myself.

    : : I've written a PHP function that I thought would clean this feed, it
    : : goes through the whole feed one byte at a time, and makes sure every
    : : byte has an ascii value between 32 and 126. I thought that might give
    : : me some garbage characters but they'd all be safe for RSS.

    : : No. The feed is still dying. How do I find out what entity is killing
    : : it?

    : First I would feed it through an xml validator. It should tell you where
    : the xml goes wrong.

    : If it fails then you know what's wrong. If it passes - well, worry about
    : that after the first test.

    In fact I realized I had a validator in "easy reach" so I used it on the
    above url. I got

    XML error: undefined entity, at line 22, column 23535

    Using my handy dandy editor, I have cut and pasted some text from around
    the offending section.

    <description>I've ...

    that our activities as feminists &acirc;'' including the
    ^^^^^^^
    ERROR

    ... of new ideas.</description>


    You can see which entity is causing a problem. It fails on the first
    error, so there could be other errors after that.
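
The same kind of check can be scripted. The sketch below uses Python's standard `xml.etree.ElementTree` parser and is not the validator used above; the helper name `first_xml_error` and the sample string are assumptions. It reports only the first well-formedness error, so like the check above it stops at the first problem, and it does not validate against a DTD.

```python
import xml.etree.ElementTree as ET


def first_xml_error(xml_text):
    """Return (message, line, column) for the first parse error, or None."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        # ParseError carries the location as a (line, column) tuple.
        line, column = err.position
        return (str(err), line, column)
    return None


sample = "<description>feminists &acirc; including the</description>"
print(first_xml_error(sample))
```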


    --

    This programmer available for rent.
     
    Malcolm Dew-Jones, Oct 12, 2005
    #3
  4. Guest

    >First I would feed it through an xml validator. It should tell you where
    >the xml goes wrong.
    >If it fails then you know what's wrong. If it passes - well, worry about
    >that after the first test.


    That was a very good idea. I got a very large number of errors. You can
    see them if you go here:

    http://www.stg.brown.edu/service/xmlvalid/

    and type in this address to the URI validation field:

    http://www.whatisliberalism.com/pdsFiles/page2533.xml


    I was left wondering what some of the errors meant. What does " error
    (1103): end tag uses GI for an undeclared element: title " mean?

    And what does " error (1012): reference to undeclared entity:
    &acirc; " mean?

    I'm confused by the last error. I don't know much about XML, but I
    didn't think that an HTML entity reference was invalid in XML. Why
    would it be? What's the easiest way to sanitize HTML entity references
    so that XML won't choke on them?
     
    , Oct 12, 2005
    #4
  5. wrote:
    > And what does " error (1012): reference to undeclared entity:
    > &acirc; " mean?
    >
    > I'm confused by the last error. I don't know much about XML, but I
    > didn't think that an HTML entity reference was invalid in XML. Why
    > would it be?


    Because nobody defined them for the XML-based language that you use.

    > What's the easiest way to sanitize HTML entity references
    > so that XML won't choke on them?


    Define them.
    --
    Johannes Koch
    In te domine speravi; non confundar in aeternum.
    (Te Deum, 4th cent.)
     
    Johannes Koch, Oct 12, 2005
    #5
  6. Guest

    I don't know how to define entity references for XML, nor am I aware if
    I'm allowed to add new definitions to RSS. XML is one of those things
    I've been hoping to study for awhile but have not yet had the chance.

    I'm wondering if there is a quick fix that will hold me till I have
    time to look at the issue in depth. If I write a little PHP script to
    strip out all HTML entity references, then the feed will work?
     
    , Oct 12, 2005
    #6
  7. wrote:
    : I don't know how to define entity references for XML, nor am I aware if
    : I'm allowed to add new definitions to RSS. XML is one of those things
    : I've been hoping to study for awhile but have not yet had the chance.

    : I'm wondering if there is a quick fix that will hold me till I have
    : time to look at the issue in depth. If I write a little PHP script to
    : strip out all HTML entity references, then the feed will work?

    The quick fix for unrecognized entities is to escape them, so

    &circ; should be escaped to become
    &amp;circ;

    The escaped data "&amp;circ;" will be unescaped back to the original
    "&circ;" if an xml program extracts the data from the feed.

    Whether the "&circ;" will _display_ correctly will depend on the program
    that extracts and/or displays the data. I.e. if you use an xml program to
    extract the description data into a file, and then use a browser to view
    the file, then the browser will display the correct symbol. On the other
    hand if the browser itself is reading the rss feed directly then it may or
    may not display the desired symbol - it might display the word "&circ;"
    instead.
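
A minimal sketch of that quick fix, in Python rather than PHP: it escapes the ampersand of every named entity reference except the five that XML predefines. The function name `escape_unknown_entities` and the regular expression are assumptions, not part of any feed library. Numeric references such as `&#226;` are left alone because XML accepts them as they are.

```python
import re

# The five entity names every XML parser understands without a DTD.
PREDEFINED = {"amp", "lt", "gt", "apos", "quot"}

# Matches named references such as &acirc; but not numeric ones like &#226;
_NAMED_ENTITY = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def escape_unknown_entities(text):
    """Turn &name; into &amp;name; unless name is predefined in XML."""
    def repl(match):
        if match.group(1) in PREDEFINED:
            return match.group(0)
        return "&amp;" + match.group(0)[1:]
    return _NAMED_ENTITY.sub(repl, text)


print(escape_unknown_entities("feminists &acirc; including &amp; more"))
```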

    As for the "GI" error, I am not familiar with that, and I'm sorry but I
    haven't examined your file to figure it out.

    --

    This programmer available for rent.
     
    Malcolm Dew-Jones, Oct 13, 2005
    #7
  8. Peter Flynn Guest

    wrote:

    >>First I would feed it through an xml validator. It should tell you where
    >>the xml goes wrong.
    >>If it fails then you know what's wrong. If it passes - well, worry about
    >>that after the first test.

    >
    > That was a very good idea. I got a very large number of errors. You can
    > see them if you go here:
    >
    > http://www.stg.brown.edu/service/xmlvalid/
    >
    > and type in this address to the URI validation field:
    >
    > http://www.whatisliberalism.com/pdsFiles/page2533.xml
    >
    >
    > I was left wondering what some of the errors meant. What does " error
    > (1103): end tag uses GI for an undeclared element: title " mean?


    It means title was never declared in the DTD or Schema.

    > And what does " error (1012): reference to undeclared entity:
    > &acirc; " mean?


    It means acirc was never declared in the DTD.

    > I'm confused by the last error. I don't know much about XML, but I
    > didn't think that an HTML entity reference was invalid in XML.


    It is if you haven't declared it (with the exception of the five
    predefined entities, amp, lt, gt, apos and quot, which every XML
    processor recognizes whether or not they are declared).

    > Why would it be?


    Because that's what the rules say.

    > What's the easiest way to sanitize HTML entity references
    > so that XML won't choke on them?


    Convert them to actual characters (eg â for acirc) using the
    declared character set of the document.

    ///Peter
    --
    XML FAQ: http://xml.silmaril.ie/
     
    Peter Flynn, Oct 13, 2005
    #8
  9. wrote:

    > I don't know how to define entity references for XML, nor am I aware if
    > I'm allowed to add new definitions to RSS. XML is one of those things
    > I've been hoping to study for awhile but have not yet had the chance.
    >
    > I'm wondering if there is a quick fix that will hold me till I have
    > time to look at the issue in depth. If I write a little PHP script to
    > strip out all HTML entity references, then the feed will work?


    If you can change the feed, you could define the entities in a document
    type declaration:

    <!DOCTYPE rss [
    <!ENTITY acirc "â">
    ]>
    <rss>
    ....
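
A quick way to confirm that a declaration like this is enough: the sketch below parses a tiny stand-in document with Python's standard `xml.etree.ElementTree`. The document content is made up for illustration and is not the real feed.

```python
import xml.etree.ElementTree as ET

# A stand-in document with acirc declared in the internal DTD subset.
doc = """<!DOCTYPE rss [
<!ENTITY acirc "â">
]>
<rss><description>feminists &acirc; including</description></rss>"""

root = ET.fromstring(doc)
# The parser expands the declared entity to its replacement text.
description = root.find("description").text
print(description)
```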
    --
    Johannes Koch
    In te domine speravi; non confundar in aeternum.
    (Te Deum, 4th cent.)
     
    Johannes Koch, Oct 13, 2005
    #9
  10. Guest

    Peter Flynn wrote:
    > wrote:
    > > What's the easiest way to sanitize HTML entity references
    > > so that XML won't choke on them?

    >
    > Convert them to actual characters (eg â for acirc) using the
    > declared character set of the document.


    I see. So if I say that the character encoding for the feed is UTF-8, I
    look up what the equivalent of acirc is for UTF-8. That sounds like the
    right long-term goal for me to aim for. Should be simple enough to look
    up all the entity references on w3c and translate them all to UTF-8,
    yes?
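
The lookup table does not have to be built by hand. As a sketch in Python (the function name `entities_to_chars` is an assumption), the standard `html.entities.name2codepoint` mapping already lists the HTML 4 entity names. The five XML predefined entities are deliberately left alone, because turning `&amp;` or `&lt;` into raw characters would break the feed. The resulting string still has to be written out in the encoding the feed declares, for example UTF-8.

```python
import re
from html.entities import name2codepoint

# Entities XML already understands; these must stay escaped.
PREDEFINED = {"amp", "lt", "gt", "apos", "quot"}

_NAMED_ENTITY = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def entities_to_chars(text):
    """Replace known HTML entity references with the actual characters."""
    def repl(match):
        name = match.group(1)
        if name in PREDEFINED or name not in name2codepoint:
            return match.group(0)  # leave untouched
        return chr(name2codepoint[name])
    return _NAMED_ENTITY.sub(repl, text)


converted = entities_to_chars("feminists &acirc; &amp; &bogus;")
print(converted.encode("utf-8"))
```

Names the table does not know, such as the made-up `&bogus;` here, are left in place, so they would still need to be escaped or removed.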
     
    , Oct 31, 2005
    #10
