XML entity parsing question

Tuomas Rannikko

Hello,

I'm currently writing an XML processor for the fun of it. There is
something I don't understand in the spec though. I'm obviously missing
something important.

The spec states that both Internal General and Character references are
included when referenced in content. And "included" means:

<quote>
4.4.2 Included

[Definition: An entity is included when its replacement text is
retrieved and processed, in place of the reference itself, as though it
were part of the document at the location the reference was recognized.]
The replacement text MAY contain both character data and (except for
parameter entities) markup, which MUST be recognized in the usual way.
(The string "AT&amp;T;" expands to "AT&T;" and the remaining ampersand
is not recognized as an entity-reference delimiter.) A character
reference is included when the indicated character is processed in place
of the reference itself.
</quote>

If I understand correctly, the specification contradicts itself when it
says the replacement text is processed in place of the reference itself
and markup MUST be recognized. Shouldn't the "&T;" in "AT&T;" then
actually BE recognized? I understand that if it were recognized, the
character '&' could not be expressed in XML (nor '<' for that matter).
The question then is: when should markup in the replacement text be
recognized, and when shouldn't it?

Thank you in advance for your reply.

- Tuomas
 
Philippe Poulard

hi,

read more here :
http://www.w3.org/TR/2004/REC-xml-20040204/#sec-predefined-ent
--
Regards,

///
(. .)
--------ooO--(_)--Ooo--------
| Philippe Poulard |
 
Tuomas Rannikko

Philippe said:
hi,

read more here :
http://www.w3.org/TR/2004/REC-xml-20040204/#sec-predefined-ent

Ah, yes.

But I still think the spec contradicts itself, or is at least somewhat
ambiguous on what the "Character" column means in the table in
http://www.w3.org/TR/2004/REC-xml-20040204/#entproc

I thought it meant character references:

Here is the definition for character reference
http://www.w3.org/TR/2004/REC-xml-20040204/#dt-charref
which is of course a numeric character reference.

And then, in the link you sent, it says character references are meant
to be considered character data, rather than being included as I thought
while looking at the table.

Actually, what does the Character column mean in the table?


- Tuomas
 
Philippe Poulard

Tuomas said:
Ah, yes.

But I still think the spec contradicts itself,

the parser works like this:

"AT&amp;T;"
&amp; is an entity reference: let's replace it
"AT&#38;T;"
the spec says that we must process the replacement text
&#38; is a character reference: let's replace it
"AT&T;"
the character has been replaced, but not yet processed
"AT&T;"
now the character is said to be "included": stop processing it
& doesn't stand for an entity reference
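
The steps above can be sketched as a toy expander. This is only an illustration, not how any real parser is written: the names REF, char_ref, literal_to_replacement and expand are made up for this example, the name pattern is simplified to ASCII, and there is no guard against recursive entities. It assumes the declaration <!ENTITY amp "&#38;#38;"> from section 4.6 of the spec.

```python
import re

# Matches &#xHEX;, &#DEC; or &name; (hypothetical, ASCII-only name pattern).
REF = re.compile(r"&(#x[0-9A-Fa-f]+|#[0-9]+|[A-Za-z_][A-Za-z0-9_.\-]*);")


def char_ref(body):
    """Return the character indicated by a body such as '#38' or '#x26'."""
    return chr(int(body[2:], 16)) if body.startswith("#x") else chr(int(body[1:]))


def literal_to_replacement(literal):
    """Declaration time: expand character references once, leave entity references alone."""
    def one(match):
        body = match.group(1)
        return char_ref(body) if body.startswith("#") else match.group(0)
    return REF.sub(one, literal)


def expand(text, entities):
    """Reference time: expand the references found in content."""
    out, pos = [], 0
    for match in REF.finditer(text):
        out.append(text[pos:match.start()])
        body = match.group(1)
        if body.startswith("#"):
            # Character reference: the character is emitted as data, never rescanned.
            out.append(char_ref(body))
        else:
            # Entity reference: the replacement text is itself scanned for references.
            out.append(expand(entities[body], entities))
        pos = match.end()
    out.append(text[pos:])
    return "".join(out)


# Assumed declaration: <!ENTITY amp "&#38;#38;">
entities = {"amp": literal_to_replacement("&#38;#38;")}
result = expand("AT&amp;T;", entities)
```

With these definitions, entities["amp"] holds "&#38;" and result is "AT&T;". The '&' produced by the character reference goes straight to the output and is never looked at again, which is why "&T;" is not treated as a reference.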

 
Richard Tobin

Tuomas Rannikko said:
But I still think the spec contradicts itself, or is at least somewhat
ambiguous on what the "Character" column means in the table in
http://www.w3.org/TR/2004/REC-xml-20040204/#entproc

I thought it meant character references:

It does.

And then, in the link you sent, it says character references are meant
to be considered character data, rather than being included as I thought
while looking at the table.

I think the definition of "Included" in 4.4.2 is unclear; it says

A character reference is included when the indicated character is
processed in place of the reference itself.

and "processed" does not mean that it is reparsed as is the case when
the replacement text of an entity is "processed". It's just, well,
included. "Processed as character data" might be better I suppose.

-- Richard
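
The difference between the two senses of "processed" can be checked against a real parser. The following is a small sketch using the expat-backed parser in Python's standard library, assuming it follows the spec on this point; the element and entity names (r, direct, entity, co, e) are made up for the example.

```python
import xml.etree.ElementTree as ET

DOC = """<!DOCTYPE r [
  <!ENTITY e "<b>x</b>">
  <!ENTITY co "AT&#38;#38;T;">
]>
<r><direct>&#60;b&#62;</direct><entity>&e;</entity><co>&co;</co></r>"""

root = ET.fromstring(DOC)

# Character references in content: the characters become data, not markup.
direct = root.find("direct")

# Entity reference: the replacement text is reparsed, so <b> becomes an element.
entity = root.find("entity")

# Entity whose replacement text is "AT&#38;T;": the '&' ends up as plain data.
co = root.find("co")
```

Here direct.text is the three-character string "<b>" and direct has no child elements, whereas entity has one child element named b. The text of co is "AT&T;", matching the example in 4.4.2.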
 
Tuomas Rannikko

Philippe said:
the parser works like this:

"AT&amp;T;"
&amp; is an entity reference: let's replace it
"AT&#38;T;"
the spec says that we must process the replacement text
&#38; is a character reference: let's replace it
"AT&T;"
the character has been replaced, but not yet processed
"AT&T;"
now the character is said to be "included": stop processing it
& doesn't stand for an entity reference

Thanks for the answer, but this doesn't answer the question of what the
Character column means in the table.

I'm sorry for pushing on with this, but I can't get the meaning of the
column...

The spec says entities such as &amp; should be declared like this:

<!ENTITY amp "&#38;">

Once this declaration is read, the "&#38;" is recognized and the
replacement text of &amp; therefore becomes "&", not "&#38;".

The process you put forward is then slightly simpler:

"AT&amp;T;" in content --> "AT&T" --> "AT&T;"

The problem is, however, determining when to stop re-parsing the data,
and the same applies to the actual entity declaration; once "&#38;"
is parsed to be "&", if the '&' is "included" (as I read from the
table) then it is recognized as markup and "&" becomes '&', which is
in turn recognized as markup...

As I see it, character references are indeed supposed to be expanded
and then considered character data, not markup. But if character
references are NOT to be "included", but rather expanded and then
"bypassed", why doesn't the spec say so?

I quote the same bit of the spec again:

<quote>
4.4.2 Included

[Definition: An entity is included when its replacement text is
retrieved and processed, in place of the reference itself, as though it
were part of the document at the location the reference was recognized.]
The replacement text MAY contain both character data and (except for
parameter entities) markup, which MUST be recognized in the usual way.
(The string "AT&amp;T;" expands to "AT&T;" and the remaining ampersand
is not recognized as an entity-reference delimiter.) A character
reference is included when the indicated character is processed in place
of the reference itself.
</quote>

If nothing else is wrong with the spec, then the word "processed" has
multiple meanings within the same paragraph. The character references
are not to be "processed" in the same way as entity references, because
markup in the entity references' replacement text MUST be recognized and
parsed, tags, references and all.

"A character reference is included when the indicated character is
processed in place of the reference itself"... Now if I process the
indicated character, then in the case of "&#38;", it "indicates" the
character '&', which IS markup IF processed!?! The spec is in error when
stating that the "character is processed in place of the reference
itself." The character is expanded and then bypassed, not processed.

It is obvious the "included" rule, or the "processed" part of the rule,
does not apply to character references, otherwise escaping '&' and '<'
characters would be impossible.

The table still baffles me. Either the Character column means something
other than character references (which is unlikely), the spec is plainly
in error, or it is just too damn ambiguous for my "taste".

- Tuomas
 
Tuomas Rannikko

Richard said:
I think the definition of "Included" in 4.4.2 is unclear; it says

A character reference is included when the indicated character is
processed in place of the reference itself.

and "processed" does not mean that it is reparsed as is the case when
the replacement text of an entity is "processed". It's just, well,
included. "Processed as character data" might be better I suppose.

I agree. I put it in eh, a few, more words in my reply to Philippe.
Thanks for confirming I'm not missing the point. I started to get a bit
worried about my logic there :)
 
