How do XML parsers handle encoding?

Discussion in 'XML' started by billsahiker@yahoo.com, Apr 30, 2008.

  1. Guest

    If an XML file specifies an encoding, e.g., UTF-16, do XML browsers and
    XML editors read and verify each character in the file to make sure it
    is UTF-16, and throw an error if it is not? Or do they automatically
    filter/convert to UTF-16, or do they do something else?

    Do they default to UTF-8 if the XML file does not specify an encoding?

    Bill
    , Apr 30, 2008
    #1

  2. wrote:
    > If an XML file specifies an encoding, e.g., UTF-16, do XML browsers and
    > XML editors read and verify each character in the file to make sure it
    > is UTF-16, and throw an error if it is not? Or do they automatically
    > filter/convert to UTF-16, or do they do something else?
    >
    > Do they default to UTF-8 if the XML file does not specify an encoding?


    If there is no XML declaration declaring an encoding, an XML parser
    checks for a BOM (byte order mark) to find out whether the document is
    UTF-8 or UTF-16. With neither a declaration nor a BOM, it has to assume
    UTF-8.
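
    A minimal Python sketch of that BOM check, for illustration only. The
    function name sniff_encoding is made up here, it ignores UTF-32 and any
    encoding declaration, and real parsers do considerably more:

```python
import codecs

# BOM signatures considered in this sketch (UTF-8 and UTF-16 only).
_BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]


def sniff_encoding(data: bytes) -> str:
    """Guess the encoding of XML bytes that carry no encoding declaration.

    Hypothetical helper: it looks only at the leading bytes.
    """
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    # No BOM and no declaration: UTF-8 is the fallback.
    return "utf-8"


with_bom = codecs.BOM_UTF16_LE + "<root/>".encode("utf-16-le")
print(sniff_encoding(with_bom))
print(sniff_encoding(b"<root/>"))
```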

    And XML parsers are required to check that documents are properly
    encoded. However, browsers like Firefox or Opera might not, I think,
    report any such violation. For instance, I saved an XML document as UTF-8
    but with an XML declaration saying encoding="UTF-16" and then loaded it
    with Firefox 2.0 and Opera 9. Neither reported an error; both instead
    treated the document as UTF-8. IE 6 reported an error.
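
    The same experiment can be tried outside a browser. Below is a rough
    Python sketch using the standard library's ElementTree (which sits on
    expat); the helper name parses_cleanly is invented for this example. I
    would expect a strict parser to reject the mislabelled bytes and accept
    the correctly labelled ones, but other parsers and versions may differ:

```python
import xml.etree.ElementTree as ET

# UTF-8 bytes (no BOM) whose declaration wrongly claims UTF-16.
mislabelled = '<?xml version="1.0" encoding="UTF-16"?><root>text</root>'.encode("utf-8")
# The same document with a truthful declaration.
honest = mislabelled.replace(b"UTF-16", b"UTF-8")


def parses_cleanly(data: bytes) -> bool:
    """Return True if ElementTree accepts the bytes, False on a parse error."""
    try:
        ET.fromstring(data)
        return True
    except ET.ParseError:
        return False


mislabelled_ok = parses_cleanly(mislabelled)
honest_ok = parses_cleanly(honest)
print("mislabelled accepted:", mislabelled_ok)
print("honest accepted:", honest_ok)
```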



    --

    Martin Honnen
    http://JavaScript.FAQTs.com/
    Martin Honnen, Apr 30, 2008
    #2

  3. Martin Honnen wrote:

    > And XML parsers are required to check that documents are properly
    > encoded. However, browsers like Firefox or Opera might not, I think,
    > report any such violation. For instance, I saved an XML document as UTF-8
    > but with an XML declaration saying encoding="UTF-16" and then loaded it
    > with Firefox 2.0 and Opera 9. Neither reported an error; both instead
    > treated the document as UTF-8. IE 6 reported an error.


    For Mozilla, the FAQ
    http://developer.mozilla.org/en/doc...rom_the_treatment_of_text.2Fhtml_documents.3F
    says:
    "Most well-formedness constraints are enforced. (Currently Mozilla
    does not catch character encoding errors, because the document is
    re-encoded using a lenient encoding converter before the document
    reaches the XML parser. This is a bug.)"
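
    To illustrate why a lenient converter in front of the parser hides
    encoding errors, here is a small Python sketch. It is an analogy only,
    not Mozilla's code: errors="replace" stands in for the lenient
    converter, and the name strict_parse_fails is made up for the example.

```python
import xml.etree.ElementTree as ET

# The byte 0xFF can never occur in well-formed UTF-8.
raw = b"<root>caf\xff</root>"


def strict_parse_fails(data: bytes) -> bool:
    """Return True if the parser rejects the raw bytes."""
    try:
        ET.fromstring(data)
    except ET.ParseError:
        return True
    return False


# Path 1: the parser sees the raw bytes and can notice the bad sequence.
rejected = strict_parse_fails(raw)

# Path 2: a lenient decode runs first and swaps the bad byte for U+FFFD,
# so the parser only ever sees clean text and has nothing to complain about.
lenient_text = raw.decode("utf-8", errors="replace")
root = ET.fromstring(lenient_text)

print("raw bytes rejected:", rejected)
print("text after lenient decode:", repr(root.text))
```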



    --

    Martin Honnen
    http://JavaScript.FAQTs.com/
    Martin Honnen, Apr 30, 2008
    #3
  4. The rules for how they're *supposed* to handle it are spelled out in the
    XML Recommendation. Not all parsers are in strict compliance with all
    parts of the Recommendation, alas. Bugs happen.

    If you're asking whether you can get away with cheating: the brief
    answer is that it's extremely bad practice to try. If you're asking
    whether you can be certain a particular parser will or won't let
    something through, you can ask its development/user community... but be
    aware that the next release may fix this, and it's a very bad idea to
    write code that depends on bugs in specific versions.
    Joseph J. Kesselman, Apr 30, 2008
    #4
  5. Guest

    On Apr 30, 8:20 am, Martin Honnen wrote:
    > Martin Honnen wrote:
    > > And XML parsers are required to check that documents are properly
    > > encoded.


    So how do they do that? Do they check every character, or do they just
    convert? If the encoding attribute is UTF-8 and the file has a
    character that is not UTF-8, does the browser raise an error, convert
    it, or what? For example, if a Korean character is in a file that says
    it is UTF-8.

    Bill
    , Apr 30, 2008
    #5
  6. wrote:

    >> > And XML parsers are required to check that documents are properly
    >> > encoded.


    >So how do they do that? do they check every character?


    Yes.

    >Like if a Korean character is in a file that says it is utf8.


    UTF-8 covers all of Unicode, so it includes Korean characters.

    A parser has to check two things: that the data is legal for the
    encoding (for example, some sequences of bytes are not legal in
    UTF-8), and that the character it encodes is allowed in XML.
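
    Those two checks could be sketched in Python roughly as follows. The
    function names are invented for illustration, and the ranges are the
    Char production from XML 1.0 (XML 1.1 differs):

```python
def is_xml_char(cp: int) -> bool:
    """True if the code point matches the XML 1.0 Char production."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def check(data: bytes) -> str:
    """Run both checks on bytes that claim to be UTF-8."""
    # Check 1: are the bytes legal for the encoding?
    try:
        text = data.decode("utf-8")  # strict error handling is the default
    except UnicodeDecodeError:
        return "illegal UTF-8"
    # Check 2: is every decoded character allowed in XML?
    for ch in text:
        if not is_xml_char(ord(ch)):
            return "character not allowed in XML"
    return "ok"


print(check("한국어".encode("utf-8")))  # Korean text encoded as UTF-8
print(check(b"\xc3\x28"))               # lead byte without a continuation byte
print(check(b"a\x00b"))                 # NUL decodes fine but is not an XML Char
```

    So a Korean character properly encoded as UTF-8 passes both checks; it
    only fails if the bytes were really written in some other encoding.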

    -- Richard
    --
    :wq
    Richard Tobin, Apr 30, 2008
    #6
  7. Guest

    On Apr 30, 9:49 am, (Richard Tobin) wrote:
    > In article <>,
    >
    >  <> wrote:
    > >> > And XML parsers are required to check that documents are properly
    > >> > encoded.

    > >So how do they do that? do they check every character?

    >
    > Yes.
    >
    > >Like if a Korean character is in a file that says it is utf8.

    >
    > UTF-8 covers all of Unicode, so it includes Korean characters.
    >
    > A parser has to check two things: that the data is legal for the
    > encoding (for example, some sequences of bytes are not legal in
    > UTF-8), and that the character it encodes is allowed in XML.
    >
    > -- Richard
    > --
    > :wq


    OK. I don't know if you are a .NET programmer or not (Martin is, so maybe
    he can respond to this too), but if I use StreamReader to read an XML
    file with the encoding specified as UTF-8, and I set the StreamReader's
    encoding to UTF-8, will StreamReader throw an exception if a character
    is not UTF-8, or do I have to parse every character and check its value
    to see if it is in the UTF-8 range?

    Bill
    , Apr 30, 2008
    #7
  8. wrote:

    > OK. I don't know if you are a .NET programmer or not (Martin is, so maybe
    > he can respond to this too), but if I use StreamReader to read an XML
    > file with the encoding specified as UTF-8, and I set the StreamReader's
    > encoding to UTF-8, will StreamReader throw an exception if a character
    > is not UTF-8, or do I have to parse every character and check its value
    > to see if it is in the UTF-8 range?


    As far as I know, StreamReader does not throw an exception by default.


    --

    Martin Honnen
    http://JavaScript.FAQTs.com/
    Martin Honnen, Apr 30, 2008
    #8
  9. wrote:
    > So how do they do that? do they check every character? or do they just
    > convert?


    Most hand it off to an appropriate encoding-aware stream reader library
    and let that code do the work. Why build a wheel when you can buy one?
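
    As a rough illustration in Python (not .NET), an encoding-aware stream
    reader can be told either to raise on bad bytes or to paper over them.
    The helper name read_text is made up for this sketch. Note that
    Python's default is strict, which is the opposite of the StreamReader
    behaviour described above:

```python
import io


def read_text(stream, errors="strict"):
    """Decode a binary stream as UTF-8 via an encoding-aware text wrapper."""
    reader = io.TextIOWrapper(stream, encoding="utf-8", errors=errors)
    return reader.read()


bad = b"ok \xff"  # 0xFF is never valid in UTF-8

# Strict mode: the reader itself raises, so no per-character checking is needed.
try:
    read_text(io.BytesIO(bad))
    strict_raised = False
except UnicodeDecodeError:
    strict_raised = True

# Lenient mode: the bad byte is silently replaced with U+FFFD.
lenient = read_text(io.BytesIO(bad), errors="replace")

print("strict mode raised:", strict_raised)
print("lenient result:", repr(lenient))
```

    Whichever library does the decoding, the point is to pick the mode
    that reports errors if you want invalid input rejected rather than
    quietly repaired.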
    Joseph J. Kesselman, Apr 30, 2008
    #9
