Expanded Entities Not In Document Encoding - Shouldn't This Be A Parse Error?

Discussion in 'XML' started by MaggotChild, Jan 28, 2010.

  1. Parsers usually raise an error if the input contains a byte that is
    not valid in the document's stated encoding.
    I have an ISO-8859-1 document that contains entities representing
    several non-8859-1 characters. When these entities are expanded, the
    document is no longer in the given encoding, but there is no error
    from the parser. Is this in accordance with the XML spec?

    The parser is libxml2 (via Perl interface).
     
    MaggotChild, Jan 28, 2010
    #1

  2. Once the entities are expanded, the document isn't in an encoding at
    all. It's just Unicode characters.

    An XML document can contain any (legal) Unicode character, regardless
    of the encoding it's written in. One of the main purposes of character
    references is to let you use characters that your encoding cannot
    represent directly.
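
    As a minimal sketch of this point, the sample below parses a document
    declared as ISO-8859-1 that carries a euro sign as a numeric character
    reference. It uses Python's standard-library xml.etree.ElementTree
    rather than the libxml2 Perl binding from the question, so it only
    illustrates the general behaviour; the `doc` bytes and variable names
    are made up for the example.

```python
import xml.etree.ElementTree as ET

# A document declared as ISO-8859-1. The byte 0xE9 is "é" in that encoding;
# the euro sign (U+20AC) has no ISO-8859-1 byte, so it is written as a
# numeric character reference instead.
doc = b'<?xml version="1.0" encoding="ISO-8859-1"?><p>caf\xe9 costs &#x20AC;5</p>'

root = ET.fromstring(doc)   # parses without error
text = root.text            # a Unicode string, no longer tied to ISO-8859-1

# The parsed text cannot be encoded back to ISO-8859-1 byte-for-byte,
# because the expanded reference lies outside that character set.
try:
    text.encode("iso-8859-1")
    fits_latin1 = True
except UnicodeEncodeError:
    fits_latin1 = False

print(repr(text), fits_latin1)
```

    The declared encoding only describes the bytes of the source file. If
    the text were serialized back to ISO-8859-1, the euro sign would have
    to be written as a character reference again.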

    -- Richard
     
    Richard Tobin, Jan 28, 2010
    #2

  3. So, in general, this means that if my language does not support wide
    chars and I want to check for such a char in the parse tree, I need
    to look for the Unicode code point?

    Thanks
     
    MaggotChild, Jan 29, 2010
    #3
  4. That's correct. The XML APIs (DOM, SAX, and as far as I know all the
    others) use Unicode internally, usually as strings of UTF-16 code units.

    Conceptually, the first thing the parser does is convert from your
    actual source encoding to Unicode. Then, as it scans through the Unicode
    representation of your document, it handles any <![CDATA[]]> sections,
    Entity References and Numeric Character References. (The latter simply
    turn into the corresponding Unicode character, of course.)
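
    Checking the parse tree by code point could look roughly like the
    sketch below. It again assumes Python's standard-library
    xml.etree.ElementTree instead of libxml2; the helper name
    `non_latin1_code_points` and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET

def non_latin1_code_points(xml_bytes):
    """Return the sorted code points above U+00FF found in the parsed tree.

    Looks at element text, tail text, and attribute values, all of which
    are Unicode strings after parsing.
    """
    root = ET.fromstring(xml_bytes)
    found = set()
    for elem in root.iter():
        for s in (elem.text, elem.tail, *elem.attrib.values()):
            if s:
                found.update(ord(ch) for ch in s if ord(ch) > 0xFF)
    return sorted(found)

# Declared as ISO-8859-1, but three references expand to characters
# outside it: U+03A9 (omega), U+20AC (euro sign), U+263A (smiley).
doc = (b'<?xml version="1.0" encoding="ISO-8859-1"?>'
       b'<r a="&#x3A9;">x&#8364;<b/>y&#x263A;</r>')

points = non_latin1_code_points(doc)
print([f"U+{cp:04X}" for cp in points])
```

    The comparison is made on code point values, so it does not depend on
    how the source file was encoded.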

    --
    Joe Kesselman,
    http://www.love-song-productions.com/people/keshlam/index.html

    {} ASCII Ribbon Campaign | "may'ron DaroQbe'chugh vaj bIrIQbej" --
    /\ Stamp out HTML mail! | "Put down the squeezebox & nobody gets hurt."
     
    Joe Kesselman, Jan 29, 2010
    #4