Re: Putting a "<" in an attribute value (was about CDATA sections)

Discussion in 'XML' started by Jon Noring, Nov 15, 2005.

  1. Jon Noring

    Jon Noring Guest

    This is an addendum to my prior message, where I asked whether there
    is an absolute ban on using the "<" character in attribute values (for
    well-formed XML documents) no matter how the "<" is represented.

    Googling around at various "authorities" on this topic I get different
    answers. I suppose this is to be expected. <laugh/>

    To summarize, there are four mechanisms by which the "<" character may
    be included in an attribute value, some or all of which are illegal
    per XML well-formedness rules:

    1) <foo bar="is x < y ?">

    2) <foo bar="is x &lt; y ?">

    3) <foo bar="is x &#60; y ?">

    4) <foo bar="is x &lessthan; y ?">

    a) where in the DTD we have <!ENTITY lessthan "<">

    b) where in the DTD we have <!ENTITY lessthan "&lt;">

    c) where in the DTD we have <!ENTITY lessthan "&#60;">


    From the latest XML spec (section 3.1, rule 41 and associated WFC),
    see http://www.w3.org/TR/REC-xml/#NT-AttValue , it says

    "No < in Attribute Values.
    "The replacement text of any entity referred to directly or
    indirectly in an attribute value MUST NOT contain a <."

    So it is clear from this that #1 and #4a are illegal. But the others
    are ambiguous (section 2.4 essentially says numeric character
    references are equivalent to the escape strings.) It partly seems to
    hinge around the definition of an "entity".
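The cases listed above can also be checked empirically. The following is a minimal sketch, not an authoritative test: it uses Python's standard `xml.parsers.expat` binding, assumes that cases 3 and 4c use the numeric character reference `&#60;`, and closes each `foo` element as an empty-element tag so that every sample is a complete document. The helper name `bar_value` is made up for illustration.

```python
import xml.parsers.expat

# Internal DTD subset template; the entity name "lessthan" follows the post.
DOCTYPE = '<!DOCTYPE foo [<!ENTITY lessthan "{value}">]>'
REFERENCE = '<foo bar="is x &lessthan; y ?"/>'

CASES = {
    "1": '<foo bar="is x < y ?"/>',
    "2": '<foo bar="is x &lt; y ?"/>',
    "3": '<foo bar="is x &#60; y ?"/>',  # assumed numeric character reference
    "4a": DOCTYPE.format(value="<") + REFERENCE,
    "4b": DOCTYPE.format(value="&lt;") + REFERENCE,
    "4c": DOCTYPE.format(value="&#60;") + REFERENCE,
}


def bar_value(document):
    """Return the parsed value of the bar attribute, or None if the
    parser reports a well-formedness error."""
    seen = {}
    parser = xml.parsers.expat.ParserCreate()
    # Collect the attributes of every start tag (there is only one here).
    parser.StartElementHandler = lambda name, attrs: seen.update(attrs)
    try:
        parser.Parse(document, True)
    except xml.parsers.expat.ExpatError:
        return None
    return seen.get("bar")


results = {label: bar_value(doc) for label, doc in CASES.items()}
for label, value in results.items():
    print(label, "not well-formed" if value is None else repr(value))
```

Under the well-formedness constraint quoted above, a conforming parser should reject cases 1, 4a and 4c and accept 2, 3 and 4b, each of which yields the attribute value `is x < y ?`.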

    The plot thickens when looking at the 1998 first edition of the XML
    spec, http://www.w3.org/TR/1998/REC-xml-19980210.html#sec-starttags .
    It says:

    "No < in Attribute Values.
    "The replacement text of any entity referred to directly or
    indirectly in an attribute value (other than "&lt;") must not
    contain a <."


    The difference between the current XML spec and the first 1998 spec
    is that in the 1998 spec it clearly says "&lt;" may be used to
    represent the literal "<" character in an attribute value (and I
    would assume, by extension in section 2.4, so would be &#x003C; or
    &#60;). So in the 1998 spec, #2 and #4b appear legal, and likely #3
    and #4c.

    So what does the removal of the phrase '(other than "&lt;")' mean
    in the current XML spec edition? Was it removed because it is
    superfluous (that is, &lt; and &#60; are not considered "any
    entity" -- this is supported in that in section 2.4 XML calls &lt; a
    "string", not an "entity".) Or was it a change to have a total,
    absolute ban on using that character no matter how it is represented?

    An inquiring mind wants to know.

    Jon
     
    Jon Noring, Nov 15, 2005
    #1

  2. Peter Flynn

    Peter Flynn Guest

    Jon Noring wrote:

    > As an addendum to my prior message where I asked if there is an
    > absolute ban on using the "<" character in attribute values (for
    > well-formed XML documents) no matter how the "<" is represented.
    >
    > Googling around at various "authorities" on this topic I get different
    > answers. I suppose this is to be expected. <laugh/>


    Yes. Google is a fine thing, but the pages it indexes are not subjected
    to any form of authority.

    > To summarize, there are four mechanisms by which the "<" character may
    > be included in an attribute value, some or all of which are illegal
    > per XML well-formedness rules:
    >
    > 1) <foo bar="is x < y ?">


    No.

    > 2) <foo bar="is x &lt; y ?">


    Yes.

    > 3) <foo bar="is x &#60; y ?">


    Yes.

    > 4) <foo bar="is x &lessthan; y ?">


    That is well-formed.

    > a) where in the DTD we have <!ENTITY lessthan "<">


    No, that's an invalid declaration.

    > b) where in the DTD we have <!ENTITY lessthan "&lt;">


    That's OK.

    > c) where in the DTD we have <!ENTITY lessthan "&#60;">


    So is that.

    > From the latest XML spec (section 3.1, rule 41 and associated WFC),
    > see http://www.w3.org/TR/REC-xml/#NT-AttValue , it says
    >
    > "No < in Attribute Values.
    > "The replacement text of any entity referred to directly or
    > indirectly in an attribute value MUST NOT contain a <."
    >
    > So it is clear from this that #1 and #4a are illegal. But the others
    > are ambiguous (section 2.4 essentially says numeric character
    > references are equivalent to the escape strings.) It partly seems to
    > hinge around the definition of an "entity".


    All these terms have their formal definition in SGML (ISO 8879:1986).
    You may want to borrow a copy of Goldfarb, C, "The SGML Handbook" (OUP)
    to check them out, but beware the formal standards-ese language (Charles
    is a lawyer :) XML has inherited these definitions with very few
    changes.

    To understand what happens may help: validity attaches to the state of
    the characters making up the file at the time of parsing, without any
    form of interpretation (ie no substitution of entity values for entity
    references...yet). So a < in a CDATA attribute value is invalid, but
    a &lt; or &#60; is valid because neither of them contains a literal <
    character. Once validity is established, an application will receive
    a data representation of the document from the parser, which includes
    both the structural information (where the markup nodes were) and the
    character data content information (where the document text is). This
    is variously known in assorted circles as "the grove", "the
    post-schema-validation infoset" and other terms. How it is presented
    to the application varies, but at this stage all physical markup has
    disappeared (or rather, been turned into pointers of some kind) and
    all entity references and character references have been resolved.

    One way to get a handle on this (and to solve any other questions of
    validity or invalidity) is to install a validating parser like onsgmls
    or rxp which runs from the command-line. onsgmls in particular is
    useful (despite its now having some small areas of non-conformance)
    in that it can output a format called ESIS, which is a line-by-line
    echo of the markup interpretation. As an example, here is your XML
    file:

    <?xml version="1.0" encoding="ISO-8859-1"?>
    <!DOCTYPE header [
    <!ELEMENT header (#PCDATA)>
    <!ATTLIST header title CDATA #REQUIRED>
    <!ENTITY lessthan "&lt;">
    ]>
    <header title="Is A &lessthan; B?"> ... </header>

    and here is onsgmls's unsuppressed output (there's a -s option to turn
    this off and simply report validity or not):

    $ onsgmls -wxml /usr/share/sgml/xml.dcl test.xml
    onsgmls:/usr/share/sgml/xml.dcl:1:W: SGML declaration was not implied
    ?xml version="1.0" encoding="ISO-8859-1"
    Atitle CDATA Is A < B?
    (header
    - ...
    )header
    C
    $

    Ignore the warning about the SGML declaration for the moment. The ESIS
    output clearly shows the data and markup being dissected and exposed
    for processing. Lines beginning with A are attribute values, ( is the
    start-tag of an element type, ) is the end-tag, - is character data,
    and C is the end.
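For readers without onsgmls to hand, roughly the same view can be approximated with Python's standard `xml.parsers.expat` binding. This is only a sketch: expat is a non-validating parser and does not produce real ESIS, so the line prefixes below (`A`, `(`, `-`, `)`) are imitated by hand, and the attribute's declared type (CDATA) is not reported.

```python
import xml.parsers.expat

DOCUMENT = """<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE header [
<!ELEMENT header (#PCDATA)>
<!ATTLIST header title CDATA #REQUIRED>
<!ENTITY lessthan "&lt;">
]>
<header title="Is A &lessthan; B?"> ... </header>
"""

events = []


def start(name, attrs):
    # Attribute values arrive with all references already resolved.
    for attr_name, value in attrs.items():
        events.append("A" + attr_name + " " + value)
    events.append("(" + name)


def chars(data):
    events.append("-" + data)


def end(name):
    events.append(")" + name)


parser = xml.parsers.expat.ParserCreate()
parser.buffer_text = True  # deliver adjacent character data in one call
parser.StartElementHandler = start
parser.CharacterDataHandler = chars
parser.EndElementHandler = end
# Pass bytes so the encoding declaration in the document is honoured.
parser.Parse(DOCUMENT.encode("iso-8859-1"), True)

print("\n".join(events))
```

The point it illustrates is the same as the onsgmls run: by the time the application sees the `title` attribute, the reference `&lessthan;` has been replaced by a literal `<`.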

    > The plot thickens when looking at the 1998 first edition of the XML
    > spec, http://www.w3.org/TR/1998/REC-xml-19980210.html#sec-starttags .
    > It says:
    >
    > "No < in Attribute Values.
    > "The replacement text of any entity referred to directly or
    > indirectly in an attribute value (other than "&lt;") must not
    > contain a <."


    Right. That means it mustn't contain a literal < sign like 4(a)
    above. It may well resolve to a < sign at the end of the day, but
    for the purposes of document validity we're only concerned with
    the actual characters in the file, not what they represent.

    > The difference between the current XML spec and the first 1998 spec
    > is that in the 1998 spec it clearly says "&lt;" may be used to
    > represent the literal "<" character in an attribute value (and I
    > would assume, by extension in section 2.4, so would be &#x003C; or
    > &#60;). So in the 1998 spec, #2 and #4b appear legal, and likely #3
    > and #4c.


    Yes, exactly correct.

    > So what does the removal of the phrase '(other than "&lt;")' mean
    > in the current XML spec edition? Was it removed because it is
    > superfluous


    Yes.

    > (that is, &lt; and &#60; are not considered "any
    > entity" -- this is supported in that in section 2.4 XML calls &lt; a
    > "string", not an "entity".) Or was it a change to have a total,
    > absolute ban on using that character no matter how it is represented?


    It was just to avoid clouding the issue, so far as I know.

    ///Peter
    --
    XML FAQ: http://xml.silmaril.ie/
     
    Peter Flynn, Nov 15, 2005
    #2

  3. Peter Flynn <> wrote:

    >> a) where in the DTD we have <!ENTITY lessthan "<">

    >
    > No, that's an invalid declaration.


    It's not. The specification says this though:

    Although the EntityValue production allows the definition of a
    general entity consisting of a single explicit < in the literal
    (e.g., <!ENTITY mylt "<">), it is strongly advised to avoid this
    practice since any reference to that entity will cause a well-
    formedness error.

    -- http://www.w3.org/TR/REC-xml/#IDA2S1S
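The distinction drawn in the quoted passage (the declaration is allowed, a reference to it is not) can be sketched with Python's standard `xml.parsers.expat` binding. The entity name `mylt` comes from the spec's example; the element `foo`, the attribute `bar` and the helper `is_well_formed` are assumptions made for illustration.

```python
import xml.parsers.expat

DECLARATION = '<!DOCTYPE foo [<!ENTITY mylt "<">]>'

# The entity is declared but never referenced.
DECLARED_ONLY = DECLARATION + '<foo bar="no reference here"/>'
# The entity is declared and referenced inside an attribute value.
REFERENCED = DECLARATION + '<foo bar="is x &mylt; y ?"/>'


def is_well_formed(document):
    """Return True if expat parses the document without error."""
    parser = xml.parsers.expat.ParserCreate()
    try:
        parser.Parse(document, True)
    except xml.parsers.expat.ExpatError:
        return False
    return True


print("declared only:", is_well_formed(DECLARED_ONLY))
print("referenced:   ", is_well_formed(REFERENCED))
```

The first document should parse cleanly and the second should be rejected, which matches the spec's wording that it is the reference, not the declaration, that causes the well-formedness error.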

    --
    David Håsäther
     
    David Håsäther, Nov 15, 2005
    #3
  4. In article <>,
    Peter Flynn <> wrote:

    >> c) where in the DTD we have <!ENTITY lessthan "&#60;">

    >
    >So is that.


    No, see my other message, and also the comment in the spec about the
    definition of the built-in entities. Both amp and lt need double
    escaping in their definitions.

    -- Richard
     
    Richard Tobin, Nov 16, 2005
    #4
  5. In article <>,
    Jon Noring <> wrote:

    >So what does the removal of the phrase '(other than "&lt;")' mean
    >in the current XML spec edition?


    In early drafts of the XML spec, the built-in entities - in particular
    amp and lt - were "magic". It was pointed out that they could be
    defined non-magically by use of double escaping, and the final spec
    used this. The phrase you quote was probably a hangover from the
    earlier version that was removed in an erratum when it was noticed
    that it wasn't needed any more.
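To make the double escaping concrete: the XML 1.0 spec declares the built-in entity as `<!ENTITY lt "&#38;#60;">`. The `&#38;` is expanded to `&` when the declaration is read, so the replacement text is the five characters `&#60;`, which contains no literal `<`. The sketch below applies the same trick to the thread's `lessthan` entity using Python's standard `xml.parsers.expat` binding; the element `foo` and attribute `bar` follow the earlier examples, and the closing `/>` is added so the sample is a complete document.

```python
import xml.parsers.expat

# Double-escaped declaration, modelled on the spec's definition of lt.
DOCUMENT = (
    '<!DOCTYPE foo [<!ENTITY lessthan "&#38;#60;">]>'
    '<foo bar="is x &lessthan; y ?"/>'
)

attributes = {}
parser = xml.parsers.expat.ParserCreate()
# Record the attributes of the single start tag in the document.
parser.StartElementHandler = lambda name, attrs: attributes.update(attrs)
parser.Parse(DOCUMENT, True)

print(attributes["bar"])
```

Here the reference is well-formed and the application receives `is x < y ?`, whereas a singly escaped `"&#60;"` declaration would put a literal `<` into the replacement text.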

    -- Richard
     
    Richard Tobin, Nov 16, 2005
    #5
  6. Peter Flynn

    Peter Flynn Guest

    David Håsäther wrote:

    > Peter Flynn <> wrote:
    >
    >>> a) where in the DTD we have <!ENTITY lessthan "<">

    >>
    >> No, that's an invalid declaration.

    >
    > It's not.


    I'm sorry, you're quite right. I'm not sure what my brain was doing when
    I wrote that. Making a reference to &lessthan; certainly would cause an
    error, as you point out.

    ///Peter
     
    Peter Flynn, Nov 16, 2005
    #6