Re: XML declaration in DTD System ID? ...

Discussion in 'XML' started by Martin Honnen, Jul 7, 2009.

  1. Russell Potter wrote:
    > I'm a bit confused on this point after reading
    > the spec:
    >
    > The spec. says that the entity resolved by the
    > URI given in the system identifier of the DOCTYPE
    > declaration should be an external parsed
    > entity, which means, to my mind, that it
    > should start with an XML declaration.


    http://www.w3.org/TR/xml/#sec-TextDecl says
    "External parsed entities SHOULD each begin with a text declaration."
    so it says "should" and it says "text declaration", not "XML
    declaration". That way it is possible to not have a text declaration as
    long as the encoding is UTF-8 or UTF-16 and the version is 1.0 I think.




    --

    Martin Honnen
    http://msmvps.com/blogs/martin_honnen/
    Martin Honnen, Jul 7, 2009
    #1
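The difference Martin points out between a text declaration and an XML declaration can be sketched with two regular expressions built from the XMLDecl and TextDecl productions in the XML 1.0 grammar. This is only an illustrative sketch: the helper names `is_xml_decl` and `is_text_decl` are made up here, and a real processor would apply these rules after it has worked out the encoding.

```python
import re

# Building blocks taken from the XML 1.0 grammar
# (S, Eq, VersionInfo, EncodingDecl, SDDecl).
S = r"[ \t\r\n]+"
EQ = r"[ \t\r\n]*=[ \t\r\n]*"
VERSION = S + "version" + EQ + r"""(?:"1\.[0-9]+"|'1\.[0-9]+')"""
ENC_NAME = r"[A-Za-z][A-Za-z0-9._-]*"
ENCODING = S + "encoding" + EQ + "(?:\"" + ENC_NAME + "\"|'" + ENC_NAME + "')"
STANDALONE = S + "standalone" + EQ + r"""(?:"(?:yes|no)"|'(?:yes|no)')"""
END = r"[ \t\r\n]*\?>"

# XMLDecl: version is required, encoding and standalone are optional.
XML_DECL = re.compile(
    r"<\?xml" + VERSION + "(?:" + ENCODING + ")?" + "(?:" + STANDALONE + ")?" + END
)
# TextDecl: version is optional, encoding is required, standalone is not allowed.
TEXT_DECL = re.compile(r"<\?xml" + "(?:" + VERSION + ")?" + ENCODING + END)


def is_xml_decl(head):
    """True if head starts with something matching the XMLDecl production."""
    return XML_DECL.match(head) is not None


def is_text_decl(head):
    """True if head starts with something matching the TextDecl production."""
    return TEXT_DECL.match(head) is not None


for sample in (
    '<?xml version="1.0" encoding="UTF-8"?>',
    '<?xml encoding="UTF-8"?>',
    '<?xml version="1.0" standalone="yes"?>',
):
    print(sample, is_xml_decl(sample), is_text_decl(sample))
```

A declaration carrying both a version and an encoding satisfies both productions, which is probably why the two are so easy to confuse.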

  2. (In case Martin's point wasn't clear: SHOULD in W3C specs always means
    "not required but highly recommended.")
    Joe Kesselman, Jul 7, 2009
    #2

  3. > probably need to use some kind of heuristic algorithm
    > to figure out its encoding ...


    Exactly the same situation as if there was no encoding declaration on
    the main document.

    http://www.w3.org/TR/REC-xml/#sec-guessing
    Joe Kesselman, Jul 8, 2009
    #3
  4. > probably need to use some kind of heuristic algorithm
    > to figure out its encoding ...


    Exactly the same situation as if there was no encoding declaration on
    the main document.

    http://www.w3.org/TR/REC-xml/#sec-guessing
    Joe Kesselman, Jul 8, 2009
    #4
  5. We heard you the first time <grin/><sigh/>
    Joe Kesselman, Jul 8, 2009
    #5
  6. Note that if you use any of the off-the-shelf XML parsers they should
    handle this level of detail for you.
    Joe Kesselman, Jul 8, 2009
    #6
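As a rough illustration of Joe's point that an off-the-shelf parser takes care of this, the sketch below feeds the same small document to Python's standard `xml.etree.ElementTree` in three encodings. The document content and the `encoded_document` helper are invented for the example; each version carries a matching encoding declaration.

```python
import xml.etree.ElementTree as ET

BODY = "<café>crème</café>"


def encoded_document(encoding):
    """Return BODY with a matching XML declaration, encoded as bytes."""
    text = '<?xml version="1.0" encoding="' + encoding + '"?>' + BODY
    return text.encode(encoding)


# Three different byte sequences for what is logically the same document.
documents = {
    enc: encoded_document(enc) for enc in ("UTF-8", "UTF-16", "ISO-8859-1")
}

# The parser works out the encoding of each byte string on its own.
roots = {enc: ET.fromstring(data) for enc, data in documents.items()}

for enc, root in roots.items():
    print(enc, len(documents[enc]), "bytes ->", root.tag, root.text)
```

The caller never has to look at the byte order mark or the declaration; that work happens inside the parser.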
  7. Been there, done that... <smile/>
    Joe Kesselman, Jul 8, 2009
    #7
  8. Russell Potter wrote:
    > ... But I also need to understand an XML
    > processor's behaviour in enough detail to be able
    > to implement one when I get to that stage.


    If you need this level of understanding, I highly recommend the
    "Annotated XML Recommendation" -- a version of the XML 1.0 spec which
    Tim Bray reworked with hypertext links to explanations of what some of
    the obscure wording is intended to mean, why those decisions were made,
    and so on.

    Note that to be useful for modern applications, an XML processor must
    _at_least_ implement not only XML, but XML Namespaces, and depending on
    what you're doing you may need other specs as well (XML Schema, for
    example). And there are issues like encodings (as you pointed out),
    serializers (going from in-memory representations back to XML syntax),
    etc. Plus subtleties in implementing the common XML APIs, if you aren't
    reinventing those from scratch.

    If you're working in an environment for which a suitable processor isn't
    already available, you may not have any choice about implementing your
    own. Or you may have special needs -- sometimes applications of wheels
    are different enough to require reinventing; gears are different from
    tires or circular saw blades.

    But in most cases it really is a lot easier to take advantage of the
    efforts the community has already put into making all of this work and
    tuning it for performance than to code it all de novo. Not least because
    that leaves you free to focus on implementing your actual application.
    Joe Kesselman, Jul 8, 2009
    #8
  9. Russell Potter wrote:
    > I can
    > get a lot more out of writing my own processor, the IP
    > of which I own


    Perhaps. You'd be competing against things like the Apache Xerces
    parser, which has a pretty darned good open-source license (basically,
    just give them credit if you use some or all of their code). That's
    already a "pre-written building block", with zero effort up front as
    well as zero effort to reuse. Of course it's only available in Java and
    C++ (and C?), but there are freely-remixable parsers available in some
    other languages.

    If there's something you need that these don't give you and can't be
    adapted to give you, by all means make the investment. I just wanted to
    make sure you'd checked your options before doing so.

    (The project I'm currently working on -- the WebSphere XML Feature Pack
    -- also involves a certain amount of reinvention. But we've got specific
    goals and justifications for the new code.)
    Joe Kesselman, Jul 8, 2009
    #9
  10. Russell Potter wrote:
    > Martin,
    >
    >> http://www.w3.org/TR/xml/#sec-TextDecl says
    >> "External parsed entities SHOULD each begin with a text declaration."
    >> so it says "should" and it says "text declaration", not "XML
    >> declaration". That way it is possible to not have a text declaration
    >> as long as the encoding is UTF-8 or UTF-16 and the version is 1.0 I
    >> think.

    >
    > Thanks for that very helpful info ... :)
    >
    > If, however, the entity *didn't* begin with an XML declaration,
    > how would a processor figure out whether the encoding was UTF-8
    > or UTF-16? Some type of heuristic algorithm?


    I'd sniff the first few bytes. If we're talking about a DTD here, then
    the first two non-white-space characters are pretty much going to have
    to be an MDO (<!), which means 0x3C 0x21 in UTF-8, or 0x00 0x3C 0x00
    0x21 in UTF-16 (or the other way round, depending on the byte order).

    As a general comment, a typical free-standing DTD file should IMHE *not*
    start with an XML Declaration, despite what the Spec says, as this seems
    to make some parsers gag, although it's been so long since I saw a DTD
    file start with an XML Declaration that perhaps the parser-writers have
    fixed this.

    ///Peter
    --
    XML FAQ: http://xml.silmaril.ie/
    Peter Flynn, Jul 8, 2009
    #10
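Peter's byte-sniffing idea might look something like the sketch below. It assumes a DTD with no byte order mark and no text declaration whose first character is in the ASCII range (white space or `<`), so the position of the zero byte gives away UTF-16 and its byte order. The function names are hypothetical, and a DTD that opens with a processing instruction or a parameter entity reference would not pass the `<!` check.

```python
def sniff_dtd_encoding(head):
    """Guess the encoding of a BOM-less DTD from its first two bytes.

    Assumes the first character is ASCII: in UTF-16 it then carries a
    zero byte, before the ASCII byte (big-endian) or after it
    (little-endian). Anything else is treated as UTF-8.
    """
    if len(head) >= 2:
        if head[0] == 0x00 and head[1] != 0x00:
            return "utf-16-be"
        if head[0] != 0x00 and head[1] == 0x00:
            return "utf-16-le"
    return "utf-8"


def starts_with_mdo(head):
    """Check that the first non-white-space characters are '<!'."""
    text = head.decode(sniff_dtd_encoding(head), errors="replace")
    return text.lstrip(" \t\r\n").startswith("<!")


dtd = "\n<!ELEMENT greeting (#PCDATA)>"
for codec in ("utf-8", "utf-16-be", "utf-16-le"):
    raw = dtd.encode(codec)
    print(codec, raw[:4].hex(), sniff_dtd_encoding(raw), starts_with_mdo(raw))
```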
  11. Russell Potter wrote:
    > Perhaps, then, I should have a look at that code (assuming,
    > that is, that this code, like all open-source code, is
    > available online) to see how it deals with the questions I'm
    > asking ...


    The source is indeed available.
    Joe Kesselman, Jul 9, 2009
    #11
  12. Not necessarily true. Element names, for example, can contain any
    Unicode name character, and some of those may be represented differently
    in different encodings. The seven-bit ASCII range is pretty stable (at
    least, unless you think someone might throw an EBCDIC file at you), but
    if you want to do a full implementation you can't completely ignore
    encodings.

    However, as the spec points out: If not otherwise specified, XML is
    normally assumed to default to UTF-8 or UTF-16, and distinguishing those
    is straightforward. If someone uses another encoding and doesn't tell
    you that they're doing so by providing the declaration, any breakage is
    their fault.
    Joe Kesselman, Jul 9, 2009
    #12
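To make Joe's point about encodings concrete, the sketch below prints the bytes that a non-ASCII name character and the ASCII pair `<!` turn into under a few encodings. The `encoded_forms` helper is made up for the example, and `cp037` is used only as one representative EBCDIC code page; which EBCDIC variant someone might actually throw at you is an assumption.

```python
def encoded_forms(text, encodings):
    """Map each encoding name to the hex bytes it produces for text."""
    return {enc: text.encode(enc).hex() for enc in encodings}


# A non-ASCII name character differs in every encoding tried here.
name_char = encoded_forms("é", ["utf-8", "utf-16-be", "iso-8859-1"])

# The seven-bit ASCII range agrees across ASCII-compatible encodings,
# but not with EBCDIC.
markup = encoded_forms("<!", ["utf-8", "iso-8859-1", "cp037"])

for label, forms in (("é", name_char), ("<!", markup)):
    for enc, hexbytes in forms.items():
        print(label, enc, hexbytes)
```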
  13. Russell Potter wrote:
    > Joe Kesselman wrote:
    >> Not necessarily true. Element names, for example, can contain any
    >> Unicode name character ...

    >
    > In that posting I was restricting my comments to
    > external DTDs which (to my knowledge) only contain
    > element, attribute list and parameter entity
    > declarations and comments, which all lie within the
    > ASCII range, but no actual mark-up such as element
    > names, which, as you say, could contain any Unicode
    > character.


    How do you write an element declaration for an element type with a name
    containing non-ASCII characters?

    --
    Johannes Koch
    In te domine speravi; non confundar in aeternum.
    (Te Deum, 4th cent.)
    Johannes Koch, Jul 10, 2009
    #13
  14. > the Unicode characters to which Joe was referring are
    > those appearing in element tag names, rather than DTD
    > element declarations.


    No, I was referring to both.

    > Wouldn't you just use a character entity?


    You could. But you don't need to, and generally you wouldn't if you were
    working in an encoding which supported that character. That's the whole
    point of being able to specify the encoding, after all -- to be able to
    enter characters directly rather than via awkward workarounds.

    Of course Unicode covers *all* those characters (well, all those in any
    natural language plus some symbols plus -- unofficially -- some
    artificial languages like Klingon) and can express them directly. Which
    is why the default encoding, if you don't specify otherwise, is assumed
    to be one of the Unicode encodings (UTF-8 or UTF-16).

    Your processor, if you're still determined to write one, doesn't have to
    support all possible encodings -- but it really should support UTF-8 and
    UTF-16 if you want to claim you correctly implement the XML
    Recommendation. If you're working in Java, that's relatively easy; Java
    defaults to UTF-8 files and UTF-16 internally. In other languages it may
    take more work, and/or tracking down support libraries.

    That's a good example of the kind of nitpicky detail that makes
    implementing a complete XML parser/serializer less than completely
    trivial. Still want to write your own?
    Joe Kesselman, Jul 11, 2009
    #14
  15. > But, actually, looking through the XML grammar
    > in the spec., the only mandatory start of an XML
    > document is the start of the document's root
    > element ('<'), so since we know that the absence
    > of an XML declaration means the encoding must be
    > either UTF-8, UTF-16LE or UTF-16BE, then the first
    > two bytes must be {'<', non-0}, {'<', 0} and
    > {0, '<'}, respectively (I think :), to detect the
    > encoding.



    Almost right. The one thing which may come before the XML Declaration
    or Text Declaration is a byte order mark.

    If you look at the appendix to the XML Recommendation which deals
    specifically with this -- and which I believe I posted a URL for -- they
    describe exactly how to handle this.
    Joe Kesselman, Jul 11, 2009
    #15
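The procedure in the appendix Joe mentions (Appendix F of the XML 1.0 Recommendation, "Autodetection of Character Encodings") can be sketched roughly as follows: look for a byte order mark first, then fall back to the byte patterns that `<?xml` or `<` would produce. This is a partial sketch, not a full implementation: the function name is invented, the return values are Python codec names chosen for convenience (UCS-4 is mapped to the UTF-32 codecs, EBCDIC to `cp037`, both assumptions), and the unusual UCS-4 byte orders listed in the appendix are left out.

```python
# Byte order marks. The four-byte UCS-4 marks must be tested before the
# two-byte UTF-16 marks, because FF FE 00 00 also starts with FF FE.
BOMS = [
    (b"\x00\x00\xfe\xff", "utf-32-be"),
    (b"\xff\xfe\x00\x00", "utf-32-le"),
    (b"\xef\xbb\xbf", "utf-8"),
    (b"\xfe\xff", "utf-16-be"),
    (b"\xff\xfe", "utf-16-le"),
]

# Patterns for an entity without a byte order mark.
SIGNATURES = [
    (b"\x00\x00\x00\x3c", "utf-32-be"),  # '<' in big-endian UCS-4
    (b"\x3c\x00\x00\x00", "utf-32-le"),  # '<' in little-endian UCS-4
    (b"\x00\x3c\x00\x3f", "utf-16-be"),  # '<?' in big-endian UTF-16
    (b"\x3c\x00\x3f\x00", "utf-16-le"),  # '<?' in little-endian UTF-16
    (b"\x3c\x3f\x78\x6d", "utf-8"),      # '<?xm': ASCII-compatible, so this is
                                         # provisional until the declaration is read
    (b"\x4c\x6f\xa7\x94", "cp037"),      # '<?xm' in some flavour of EBCDIC
]


def detect_encoding(head):
    """Return (codec_name, bom_length) guessed from the first bytes."""
    for bom, codec in BOMS:
        if head.startswith(bom):
            return codec, len(bom)
    for signature, codec in SIGNATURES:
        if head.startswith(signature):
            return codec, 0
    # No mark and no recognisable declaration: treat as UTF-8.
    return "utf-8", 0


for raw in (
    b"<root/>",
    b"\xef\xbb\xbf<root/>",
    b"\xff\xfe" + "<root/>".encode("utf-16-le"),
    '<?xml version="1.0" encoding="UTF-16"?>'.encode("utf-16-be"),
):
    print(raw[:4].hex(), detect_encoding(raw))
```

Note that a UTF-16 entity with neither a byte order mark nor a declaration falls through to UTF-8 here. The Recommendation requires UTF-16 entities to begin with a byte order mark, so the {'<', 0} test from the quoted post only matters for non-conforming input.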
