XML entity parsing question

Discussion in 'XML' started by Tuomas Rannikko, May 30, 2006.

  1. Hello,

    I'm currently writing an XML processor for the fun of it. There is
    something I don't understand in the spec though. I'm obviously missing
    something important.

    The spec states that both Internal General and Character references are
    included when referenced in content. And "included" means:

    <quote>
    4.4.2 Included

    [Definition: An entity is included when its replacement text is
    retrieved and processed, in place of the reference itself, as though it
    were part of the document at the location the reference was recognized.]
    The replacement text MAY contain both character data and (except for
    parameter entities) markup, which MUST be recognized in the usual way.
    (The string "AT&amp;T;" expands to "AT&T;" and the remaining ampersand
    is not recognized as an entity-reference delimiter.) A character
    reference is included when the indicated character is processed in place
    of the reference itself.
    </quote>

    If I understand correctly the specification contradicts itself when it
    says the replacement text is processed in place of the reference itself
    and markup MUST be recognized. Shouldn't the "&T;" in "AT&T;" then
    actually BE recognized? I understand that if it were recognized then
    the character '&' could not be expressed in XML (nor '<' for that
    matter). The question is then: when should markup in the replacement
    text be recognized, and when should it not?

    Thank you in advance for your reply.

    - Tuomas
    Tuomas Rannikko, May 30, 2006
    #1

  2. Tuomas Rannikko wrote:
    >
    > If I understand correctly the specification contradicts itself when it
    > says the replacement text is processed in place of the reference itself
    > and markup MUST be recognized. Shouldn't the "&T;" in "AT&T;" then
    > actually BE recognized? I understand that if it were recognized then
    > the character '&' could not be expressed in XML (nor '<' for that
    > matter). The question is then: when should markup in the replacement
    > text be recognized, and when should it not?
    >
    > Thank you in advance for your reply.
    >
    > - Tuomas


    hi,

    read more here:
    http://www.w3.org/TR/2004/REC-xml-20040204/#sec-predefined-ent
    --
    Regards,

    ///
    (. .)
    --------ooO--(_)--Ooo--------
    | Philippe Poulard |
    -----------------------------
    http://reflex.gforge.inria.fr/
    Have the RefleX !
    Philippe Poulard, May 30, 2006
    #2

  3. Philippe Poulard wrote:
    >
    > hi,
    >
    > read more here:
    > http://www.w3.org/TR/2004/REC-xml-20040204/#sec-predefined-ent


    Ah, yes.

    But I still think the spec contradicts itself, or is at least somewhat
    ambiguous on what the "Character" column means in the table in
    http://www.w3.org/TR/2004/REC-xml-20040204/#entproc

    I thought it meant character references:

    Here is the definition for character reference
    http://www.w3.org/TR/2004/REC-xml-20040204/#dt-charref
    which is of course a numeric character reference.

    And then, in the link you sent, it says character references are meant
    to be considered character data, rather than being included as I thought
    while looking at the table.

    Actually, what does the Character column mean in the table?


    - Tuomas
    Tuomas Rannikko, May 30, 2006
    #3
  4. Tuomas Rannikko wrote:
    >
    > Ah, yes.
    >
    > But I still think the spec contradicts itself,


    The parser works like this:

    "AT&amp;T;"
    &amp; is an entity reference: replace it with its replacement text
    "AT&#38;T;"
    the spec says we must process the replacement text
    &#38; is a character reference: replace it
    "AT&T;"
    the character has been substituted, but not yet processed
    "AT&T;"
    now the character counts as "included": stop processing it
    this & does not start an entity reference
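
    The steps above can be sketched as a toy expander. This is only a rough
    model under stated assumptions, not how any real parser is implemented:
    the ENTITIES table and the expand() function are hypothetical names, the
    table holds replacement texts for just two of the predefined entities,
    and no well-formedness checking is done. What it tries to show is that
    the replacement text of an entity is scanned again, while the character
    produced by a character reference goes straight to the output and is
    never rescanned.

```python
import re

# Hypothetical entity table: name -> replacement text (already in the form
# it has after the declaration was read).
ENTITIES = {"amp": "&#38;", "lt": "&#60;"}

# Matches a decimal character reference, a hex character reference,
# or a general-entity reference.
REF = re.compile(r"&(#[0-9]+|#x[0-9a-fA-F]+|[A-Za-z_][\w.-]*);")


def expand(text, entities=ENTITIES):
    """Return the character data that `text` stands for (toy model)."""
    out = []
    pos = 0
    while True:
        m = REF.search(text, pos)
        if m is None:
            out.append(text[pos:])  # no more references in the input
            break
        out.append(text[pos:m.start()])
        name = m.group(1)
        if name.startswith("#x"):
            # Character reference: the character is emitted as data.
            out.append(chr(int(name[2:], 16)))
        elif name.startswith("#"):
            out.append(chr(int(name[1:])))
        else:
            # Entity reference: the replacement text is itself processed.
            # An undeclared name raises KeyError in this sketch.
            out.append(expand(entities[name], entities))
        pos = m.end()
    return "".join(out)


print(expand("AT&amp;T;"))    # AT&T;
print(expand("&#60;b&#62;"))  # <b>
```

    The scan only ever walks over input text and replacement text, never
    over the output list, which is why the "&T;" in the result is not
    looked at again.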

    Philippe Poulard, May 30, 2006
    #4
  5. Tuomas Rannikko wrote:

    >But I still think the spec contradicts itself, or is at least somewhat
    >ambiguous on what the "Character" column means in the table in
    >http://www.w3.org/TR/2004/REC-xml-20040204/#entproc
    >
    >I thought it meant character references:


    It does.

    >And then, in the link you sent, it says character references are meant
    >to be considered character data, rather than being included as I thought
    >while looking at the table.


    I think the definition of "Included" in 4.4.2 is unclear; it says

    A character reference is included when the indicated character is
    processed in place of the reference itself.

    and "processed" does not mean that it is reparsed as is the case when
    the replacement text of an entity is "processed". It's just, well,
    included. "Processed as character data" might be better I suppose.
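
    The difference between the two kinds of "processed" can be checked
    against a real parser. The sketch below assumes Python's standard
    xml.etree.ElementTree, which expands entities declared in the internal
    subset; the entity name "markup" and the helper parse_demo() are made up
    for this example. The character references for '<' and '>' should end
    up as plain text, while the tags inside the entity's replacement text
    should be recognized as an element.

```python
import xml.etree.ElementTree as ET

# Hypothetical test document: one entity whose replacement text contains
# markup, referenced next to character references for '<' and '>'.
DOC = """<!DOCTYPE r [
  <!ENTITY markup "<b>bold</b>">
]>
<r>&#60;i&#62; &markup;</r>"""


def parse_demo():
    """Parse DOC and return the root element."""
    return ET.fromstring(DOC)


root = parse_demo()
print(repr(root.text))                # text that precedes the first child
print([child.tag for child in root])  # elements found inside <r>
print(root[0].text)                   # content of the element from &markup;
```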

    -- Richard
    Richard Tobin, May 30, 2006
    #5
  6. Philippe Poulard wrote:
    > Tuomas Rannikko wrote:
    >>
    >> Ah, yes.
    >>
    >> But I still think the spec contradicts itself,

    >
    > The parser works like this:
    >
    > "AT&amp;T;"
    > &amp; is an entity reference: replace it with its replacement text
    > "AT&#38;T;"
    > the spec says we must process the replacement text
    > &#38; is a character reference: replace it
    > "AT&T;"
    > the character has been substituted, but not yet processed
    > "AT&T;"
    > now the character counts as "included": stop processing it
    > this & does not start an entity reference
    >


    Thanks for the answer, but this doesn't answer the question of what the
    Character column means in the table.

    I'm sorry for pushing on with this, but I can't get the meaning of the
    column...

    The spec says entities such as &amp; should be declared like this:

    <!ENTITY amp "&#38;#38;">

    Once this declaration is read, the first "&#38;" is recognized, and the
    replacement text of &amp; therefore becomes "&#38;", not "&#38;#38;".

    The process you put forward is then slightly simpler:

    "AT&amp;T;" in content --> "AT&#38;T;" --> "AT&T;"

    The problem is, however, determining when to stop re-parsing the data,
    and the same applies to the actual entity declaration: once "&#38;#38;"
    is parsed to be "&#38;", if the '&' is "included" (as I read from the
    table) then it is recognized as markup and "&#38;" becomes '&', which is
    in turn recognized as markup...
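
    The declaration-time step can be sketched in a few lines. This assumes
    the XML 1.0 rule that character references inside a literal entity value
    are expanded once when the declaration is read, while general-entity
    references are left alone. replacement_text() is a hypothetical helper,
    and parameter-entity references are ignored here.

```python
import re

# Decimal or hexadecimal character reference.
CHAR_REF = re.compile(r"&#(?:x([0-9a-fA-F]+)|([0-9]+));")


def replacement_text(literal_value):
    """Expand character references in a literal entity value, once."""

    def decode(match):
        hex_digits, dec_digits = match.groups()
        if hex_digits:
            return chr(int(hex_digits, 16))
        return chr(int(dec_digits))

    # re.sub makes a single left-to-right pass over the input; the text it
    # substitutes in is not scanned again.
    return CHAR_REF.sub(decode, literal_value)


print(replacement_text("&#38;#38;"))  # &#38;
print(replacement_text("&#38;"))      # &
```

    With the doubly escaped value the replacement text is still a character
    reference, so including it in content yields '&' as data. With the
    singly escaped value the replacement text is a bare '&', which would be
    read as markup when the entity is included.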

    How I see it, character references are indeed supposed to be expanded
    and then considered character data, not markup. Then if character
    references are NOT to be "included", rather expanded and then "bypassed"
    why doesn't the spec say so?

    I quote the same bit of the spec again:

    <quote>
    4.4.2 Included

    [Definition: An entity is included when its replacement text is
    retrieved and processed, in place of the reference itself, as though it
    were part of the document at the location the reference was recognized.]
    The replacement text MAY contain both character data and (except for
    parameter entities) markup, which MUST be recognized in the usual way.
    (The string "AT&amp;T;" expands to "AT&T;" and the remaining ampersand
    is not recognized as an entity-reference delimiter.) A character
    reference is included when the indicated character is processed in place
    of the reference itself.
    </quote>

    If nothing else is wrong with the spec, then the word "processed" has
    multiple meanings within the same paragraph. Character references
    are not to be "processed" in the same way as entity references, because
    markup in an entity reference's replacement text MUST be recognized and
    parsed, tags, references and all.

    "A character reference is included when the indicated character is
    processed in place of the reference itself"... Now if I process the
    indicated character, then in the case of "&#38;", it "indicates" the
    character '&', which IS markup IF processed!?! The spec is in error when
    stating that the "character is processed in place of the reference
    itself." The character is expanded and then bypassed, not processed.

    It is obvious the "included" rule, or the "processed" part of the rule,
    does not apply to character references, otherwise escaping '&' and '<'
    characters would be impossible.

    The table still baffles me. Either the Character column means something
    other than character references (which is unlikely), or the spec is
    plainly in error, or it is just too damn ambiguous for my "taste".

    - Tuomas
    Tuomas Rannikko, May 30, 2006
    #6
  7. Richard Tobin wrote:
    > I think the definition of "Included" in 4.4.2 is unclear; it says
    >
    > A character reference is included when the indicated character is
    > processed in place of the reference itself.
    >
    > and "processed" does not mean that it is reparsed as is the case when
    > the replacement text of an entity is "processed". It's just, well,
    > included. "Processed as character data" might be better I suppose.
    >


    I agree. I put it in eh, a few, more words in my reply to Philippe.
    Thanks for confirming I'm not missing the point. I started to get a bit
    worried about my logic there :)

    --

    - Tuomas
    Tuomas Rannikko, May 30, 2006
    #7