XML not well formed and UTF-8 encoding

Discussion in 'XML' started by Philou59, Jan 19, 2007.

  1. Philou59

    Philou59 Guest

    Hi
    I am pretty new to XML and I am struggling to validate this XML:
    <?xml version="1.0" encoding="UTF-8"?>
    <XML>×</XML>

    This is a (very limited) extract of a huge file we receive from one of
    our partners.

    My validation tool (and parser!) says this is not well-formed XML
    because of the multiplication sign, hex D7.
    As it is encoded in UTF-8, I can't see why some characters would be
    refused... I must say I am really confused...

    Any help?
    Many thanks in advance.
    Philou59, Jan 19, 2007
    #1

  2. * Philou59 wrote in comp.text.xml:
    >I am pretty new to XML and I am struggling validating this XML:
    ><?xml version="1.0" encoding="UTF-8"?>
    ><XML>×</XML>
    >
    >This is a (very limited) extract of a huge file we receive from one of
    >our partners.
    >
    >My validation tool (and parser!) says this is not a well formed XML
    >because of the multiplication sign - hexa D7.
    >As it is encoded in UTF-8, I can't see why some characters would be
    >refused... I must say I am really confused...


    The lone sequence 0xD7 cannot occur in a UTF-8 encoded document. If it
    really does occur like that, then the document is not UTF-8 encoded and
    therefore not well-formed XML. You should contact whoever sends you the
    document like this, and have them fix it.
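
    A minimal sketch of how the offending byte could be located before
    going back to the sender, assuming the file is available as raw bytes.
    The helper name first_invalid_utf8_offset and the sample bytes are
    illustrative only; they are not part of any tool mentioned in this
    thread.

```python
def first_invalid_utf8_offset(data: bytes):
    """Return the offset of the first byte that breaks UTF-8 decoding, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        # err.start is the index of the first byte of the undecodable sequence.
        return err.start
    return None


# Hypothetical sample: the extract from the question with a lone 0xD7 byte.
sample = b'<?xml version="1.0" encoding="UTF-8"?>\n<XML>\xd7</XML>'
offset = first_invalid_utf8_offset(sample)
print(offset, hex(sample[offset]))
```

    Reporting the byte offset and value gives the sender something
    concrete to fix.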
    --
    Björn Höhrmann · mailto: · http://bjoern.hoehrmann.de
    Weinh. Str. 22 · Telefon: +49(0)621/4309674 · http://www.bjoernsworld.de
    68309 Mannheim · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/
    Bjoern Hoehrmann, Jan 19, 2007
    #2

  3. Philou59 schrieb:
    > Hi
    > I am pretty new to XML and I am struggling validating this XML:
    > <?xml version="1.0" encoding="UTF-8"?>
    > <XML>×</XML>


    May not be related to your problem, but...
    <http://www.w3.org/TR/REC-xml/#dt-name>:
    [Definition: A Name is a token beginning with a letter or one of a few
    punctuation characters, and continuing with letters, digits, hyphens,
    underscores, colons, or full stops, together known as name characters.]
    Names beginning with the string "xml", or with any string which would
    match (('X'|'x') ('M'|'m') ('L'|'l')), are reserved for standardization
    in this or future versions of this specification.
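
    The reserved-prefix rule quoted above can be sketched as a simple
    check. The function name is_reserved_name is hypothetical, and this
    only tests the "xml" prefix, not the full Name production.

```python
import re

# Matches names whose first three characters are x, m, l in any case,
# mirroring (('X'|'x') ('M'|'m') ('L'|'l')) from the spec text.
RESERVED_PREFIX = re.compile(r"xml", re.IGNORECASE)


def is_reserved_name(name: str) -> bool:
    """True if the name starts with 'xml' in any letter case."""
    return RESERVED_PREFIX.match(name) is not None


print(is_reserved_name("XML"), is_reserved_name("data"))
```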

    --
    Johannes Koch
    In te domine speravi; non confundar in aeternum.
    (Te Deum, 4th cent.)
    Johannes Koch, Jan 19, 2007
    #3
  4. Philou59 wrote:
    > Hi
    > I am pretty new to XML and I am struggling validating this XML:
    > <?xml version="1.0" encoding="UTF-8"?>
    > <XML>×</XML>
    >
    > This is a (very limited) extract of a huge file we receive from one of
    > our partners.
    >
    > My validation tool (and parser!) says this is not a well formed XML
    > because of the multiplication sign - hexa D7.
    > As it is encoded in UTF-8, I can't see why some characters would be
    > refused... I must say I am really confused...
    >
    > Any help ?
    > many thanks in advance
    >


    http://unicode.org/unicode/faq/utf_bom.html#15

    --
    Cordialement,

    ///
    (. .)
    --------ooO--(_)--Ooo--------
    | Philippe Poulard |
    -----------------------------
    http://reflex.gforge.inria.fr/
    Have the RefleX !
    Philippe Poulard, Jan 19, 2007
    #4
  5. Andy Dingley

    Andy Dingley Guest

    Bjoern Hoehrmann wrote:

    > You should contact whoever sends you the
    > document like this, and have them fix it.


    It's a mistake to validate invalid documents, even when they're
    exchanged between businesses. Take this issue up with the provider of
    the document and have them fix it, where it ought to be fixed. If it's
    not XML, it's not XML. If it's unpredictably case-insensitive,
    unbalanced, misses quotes, insists on particular quote characters, uses
    HTML entities or is otherwise broken, then they're just not doing XML,
    no matter how much their pointy-headed conslutant claims they are.

    I know this approach is unpopular, especially with your bosses' boss.
    If you start applying work-arounds though, the whole thing becomes
    increasingly unmaintainable. Just don't go there, whatever the
    commercial problem in sorting it out properly.

    I can't remember how many times I've gone down this route 8-(
    Andy Dingley, Jan 19, 2007
    #5
  6. > <?xml version="1.0" encoding="UTF-8"?>
    > <XML>×</XML>


    If you're going to use XML, you have to honor XML's rules. If you don't,
    it isn't XML, period.

    As others have said, first thing to do is to rename that element to
    something other than XML.

    As far as the character being illegal: Not all characters are permitted
    in XML; see the spec, available from the W3C's website. XML 1.1 permits
    many characters that XML 1.0 didn't, but that requires that the document
    be marked as being 1.1 (yours explicitly says 1.0) and requires that
    everyone working with the document use tools that support 1.1.

    Even in 1.1, I believe some characters are reserved. The traditional
    workaround if you really need unconstrained binary data is the same one
    used in e-mail: encode the data (typically as base-64) and make
    converting it between the encoded form and the actual form the
    application's responsibility.
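
    A rough sketch of that base-64 workaround, assuming Python's standard
    library. The element name "payload" is made up for illustration; the
    point is only that arbitrary bytes survive the round trip once they are
    encoded as plain ASCII text.

```python
import base64
import xml.etree.ElementTree as ET

# Arbitrary binary data, including byte values that are not legal XML characters.
raw = bytes(range(256))

# Encode to base-64 text before putting it into the document.
elem = ET.Element("payload")  # hypothetical element name
elem.text = base64.b64encode(raw).decode("ascii")
serialized = ET.tostring(elem, encoding="utf-8")

# The receiving application parses the XML, then decodes the text itself.
decoded = base64.b64decode(ET.fromstring(serialized).text)
print(decoded == raw)
```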





    >
    > May not be related to your problem, but...
    > <http://www.w3.org/TR/REC-xml/#dt-name>:
    > [Definition: A Name is a token beginning with a letter or one of a few
    > punctuation characters, and continuing with letters, digits, hyphens,
    > underscores, colons, or full stops, together known as name characters.]
    > Names beginning with the string "xml", or with any string which would
    > match (('X'|'x') ('M'|'m') ('L'|'l')), are reserved for standardization
    > in this or future versions of this specification.
    >



    --
    () ASCII Ribbon Campaign | Joe Kesselman
    /\ Stamp out HTML e-mail! | System architexture and kinetic poetry
    Joe Kesselman, Jan 19, 2007
    #6
  7. (And, as others have suggested, you should probably also double-check the
    UTF-8 encoding rules to make sure you're expressing that character
    correctly. Even if it's legal XML, UTF-8 may require that it be encoded
    as multiple bytes.)
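
    For illustration, a short Python sketch of how the multiplication sign
    (U+00D7) is encoded; the variable names are arbitrary.

```python
times = "\u00d7"  # MULTIPLICATION SIGN

utf8 = times.encode("utf-8")          # two bytes in UTF-8
latin1 = times.encode("iso-8859-1")   # one byte in ISO-8859-1

print(utf8.hex(), latin1.hex())
```

    So a single 0xD7 byte matches the ISO-8859-1 form, not the UTF-8 one.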

    --
    Joe Kesselman / Beware the fury of a patient man. -- John Dryden
    Joseph Kesselman, Jan 19, 2007
    #7
  8. Bjoern Hoehrmann wrote:

    >The lone sequence 0xD7 cannot occur in a UTF-8 encoded document. If it
    >really does occur like that, then the document is not UTF-8 encoded and
    >therefore not well-formed XML. You should contact whoever sends you the
    >document like this, and have them fix it.


    Most likely, they have made a mistake with the labelling. Perhaps the
    data is in fact in ISO Latin-1. Did your supplier put the declaration
    at the top or did you?
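
    One way to test the mislabelling theory, sketched with Python's
    xml.etree.ElementTree: parse the same bytes under both labels and see
    which one is accepted. The helper parse_with_label is hypothetical, and
    this is a diagnostic to show the supplier, not a suggested permanent
    workaround.

```python
import xml.etree.ElementTree as ET

# The element from the question, with the multiplication sign as a single 0xD7 byte.
body = b"<XML>\xd7</XML>"


def parse_with_label(encoding: str):
    """Parse body under the given declared encoding; return the text, or None on error."""
    declaration = b'<?xml version="1.0" encoding="' + encoding.encode("ascii") + b'"?>\n'
    try:
        return ET.fromstring(declaration + body).text
    except ET.ParseError:
        return None


as_utf8 = parse_with_label("UTF-8")         # rejected: 0xD7 alone is not valid UTF-8
as_latin1 = parse_with_label("ISO-8859-1")  # accepted: 0xD7 is the multiplication sign
print(as_utf8, as_latin1)
```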

    -- Richard
    --
    "Consideration shall be given to the need for as many as 32 characters
    in some alphabets" - X3.4, 1963.
    Richard Tobin, Jan 19, 2007
    #8
  9. Peter Flynn

    Peter Flynn Guest

    Joe Kesselman wrote:
    > Philou59 wrote in comp.text.xml:
    >>I am pretty new to XML


    So, I think, are your partners.

    >> <?xml version="1.0" encoding="UTF-8"?>
    >> <XML>×</XML>


    0xD7 is a multiplication sign in ISO-8859-1 (and a few others). In UTF-8
    it's the two-byte sequence 0xC3 0x97. Either your partners' system is
    falsifying the encoding of the document, or their software is crocked.

    > If you're going to use XML, you have to honor XML's rules. If you don't,
    > it isn't XML, period.


    Yep. If people send you invalid documents, send them straight back and
    ask for valid ones. Sometimes you have to hit them with a lart first.
    This includes partners.

    > Even in 1.1, I believe some characters are reserved.


    Not reserved, forbidden :)

    ///Peter
    --
    XML FAQ: http://xml.silmaril.ie
    Peter Flynn, Jan 20, 2007
    #9