XML parse validation

Discussion in 'XML' started by sarosh.shirazi@gmail.com, Jan 3, 2008.

  1. Guest

    Hi,

    I'm facing an illegal character problem when I read an XML file. The
    code below was used to do the reading.

    XmlReaderSettings settings = new XmlReaderSettings();
    settings.CheckCharacters = false;

    string fXmlFileName = _FilePath;
    XmlReader reader = XmlReader.Create(fXmlFileName, settings);
    XML = new XPathDocument(reader);

    The exception comes on the constructor of XPathDocument. I want to
    read the file overlooking the characters like (UTF-8 encoding). A
    solution pointed out to me was to parse it manually by reading it in
    ASCII and replacing the characters, but this hurts performance, so I
    want to avoid it. Any suggestions would be most welcome. How can I
    avoid validation?
     
    , Jan 3, 2008
    #1

  2. wrote:
    > The exception comes on the constructor of XPathDocument. I want to
    > read the file overlooking the characters like (UTF-8 encoding).


    This isn't a validation issue, but a well-formedness issue. That
    character is not legal in XML; if it is present, your file is simply not
    an XML file.

    Change the code which is generating the XML to avoid putting forbidden
    characters into the document in the first place (if you really need to
    express random binary data, the usual workaround is to encode it as
    something like base-64 before putting it into the XML and decode it
    before using it).
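A minimal C# sketch of the base-64 workaround mentioned above, for the case where the producer of the file can be changed. The class name `Base64Payload` and the element name passed in are hypothetical; `Convert.ToBase64String`, `Convert.FromBase64String` and `XElement` are standard .NET.

```csharp
using System;
using System.Xml.Linq;

// Hypothetical helper: carries arbitrary bytes through XML as base-64 text.
static class Base64Payload
{
    // Base-64 output uses only letters, digits, '+', '/' and '=',
    // all of which are legal XML characters.
    public static XElement Encode(string elementName, byte[] data)
    {
        return new XElement(elementName, Convert.ToBase64String(data));
    }

    // Recovers the original bytes from the element's text content.
    public static byte[] Decode(XElement element)
    {
        return Convert.FromBase64String(element.Value);
    }
}
```

The consumer then sees a well-formed document and decodes the payload only where it needs the raw bytes.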

    The alternative, as you pointed out, is to prefilter the data before it
    gets to the XML parser. I don't know enough about C# to give you any
    advice, but in Java setting up a filtered input stream is quite
    straightforward.
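For C#, the prefiltering idea above could be sketched roughly as follows. `XmlCharFilterReader` and `FilteredXml` are hypothetical names, not part of .NET. The sketch assumes the file decodes correctly with `StreamReader`'s default UTF-8 handling, and that silently dropping the forbidden characters is acceptable.

```csharp
using System.IO;
using System.Xml;
using System.Xml.XPath;

// Hypothetical helper (not part of .NET): a TextReader wrapper that drops
// characters XML 1.0 does not allow, so the parser never sees them.
sealed class XmlCharFilterReader : TextReader
{
    private readonly TextReader _inner;

    public XmlCharFilterReader(TextReader inner)
    {
        _inner = inner;
    }

    // XML 1.0 allows tab, LF, CR and U+0020 upward, except U+FFFE and U+FFFF.
    // Surrogate halves pass through on the assumption that they are paired.
    private static bool IsAllowed(char ch)
    {
        return ch == '\t' || ch == '\n' || ch == '\r'
            || (ch >= 0x20 && ch != 0xFFFE && ch != 0xFFFF);
    }

    public override int Peek()
    {
        int c;
        while ((c = _inner.Peek()) != -1 && !IsAllowed((char)c))
        {
            _inner.Read(); // discard the illegal character
        }
        return c;
    }

    public override int Read()
    {
        int c;
        do
        {
            c = _inner.Read();
        } while (c != -1 && !IsAllowed((char)c));
        return c;
    }

    public override int Read(char[] buffer, int index, int count)
    {
        int kept = 0;
        while (kept == 0)
        {
            int n = _inner.Read(buffer, index, count);
            if (n <= 0) return 0; // end of input
            // Compact the legal characters toward the start of the chunk.
            for (int i = 0; i < n; i++)
            {
                char ch = buffer[index + i];
                if (IsAllowed(ch)) buffer[index + kept++] = ch;
            }
        }
        return kept;
    }

    protected override void Dispose(bool disposing)
    {
        if (disposing) _inner.Dispose();
        base.Dispose(disposing);
    }
}

static class FilteredXml
{
    // Reads the file through the filter and builds the XPathDocument.
    public static XPathDocument Load(string path)
    {
        using (var file = new StreamReader(path))
        using (var filtered = new XmlCharFilterReader(file))
        using (var reader = XmlReader.Create(filtered))
        {
            return new XPathDocument(reader);
        }
    }
}
```

Because the filter works chunk by chunk as the parser pulls data, the file is still read in a single pass rather than being rewritten first.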

    --
    Joe Kesselman / Beware the fury of a patient man. -- John Dryden
     
    Joseph Kesselman, Jan 3, 2008
    #2

  3. (Note: I'm assuming you're not working in Java because you spelled
    "string" with a lowercase S. If that was just a typo, and you are using
    Java, then a filter would do the job. But the real question remains: Why
    are you generating broken XML in the first place, and shouldn't you fix
    that rather than trying to work around it?)

    --
    Joe Kesselman / Beware the fury of a patient man. -- John Dryden
     
    Joseph Kesselman, Jan 3, 2008
    #3
  4. wrote:

    > I'm facing an illegal character problem when I read an XML file. The
    > code below was used to do the reading.
    >
    > XmlReaderSettings settings = new XmlReaderSettings();
    > settings.CheckCharacters = false;
    >
    > string fXmlFileName = _FilePath;
    > XmlReader reader = XmlReader.Create(fXmlFileName, settings);
    > XML = new XPathDocument(reader);
    >
    > The exception comes on the constructor of XPathDocument. I want to
    > read the file overlooking the characters like (UTF-8 encoding). A
    > solution pointed out to me was to parse it manually by reading it in
    > ASCII and replacing the characters, but this hurts performance, so I
    > want to avoid it. Any suggestions would be most welcome. How can I
    > avoid validation?


    If you set CheckCharacters to false then the XmlReader allows character
    references to such characters, so I am not sure why you get a parse
    error. Are you sure you have character references rather than the
    characters themselves? If you have such characters literally in the
    document then setting CheckCharacters to false does not help. In that
    case the XML APIs do not help at all; you need to preprocess the
    document to get rid of those characters.
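The distinction drawn above can be checked with a small experiment. This is only a sketch: `CheckCharactersDemo` is a hypothetical name, and U+0001 is used as an arbitrary example of a character that XML 1.0 forbids.

```csharp
using System.IO;
using System.Xml;

// Hypothetical helper for trying out XmlReaderSettings.CheckCharacters.
static class CheckCharactersDemo
{
    // Returns true if the whole document can be read without an XmlException.
    public static bool Parses(string xml, bool checkCharacters)
    {
        var settings = new XmlReaderSettings { CheckCharacters = checkCharacters };
        try
        {
            using (var reader = XmlReader.Create(new StringReader(xml), settings))
            {
                while (reader.Read()) { } // pull every node through the parser
            }
            return true;
        }
        catch (XmlException)
        {
            return false;
        }
    }
}
```

With the character reference `<a>&#x1;</a>`, `Parses` should return false when `checkCharacters` is true and true when it is false. With the literal character, as in `"<a>\u0001</a>"`, it should return false either way, which matches the behaviour described in this post.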

    --

    Martin Honnen
    http://JavaScript.FAQTs.com/
     
    Martin Honnen, Jan 3, 2008
    #4
  5. Guest

    To Joseph: It's part of the requirement that such characters would
    come up...so I'll have to bear the heck :)
    To Martin: Yeah, these characters are coming up literally in the
    file...
    Is there any way other than ASCII preprocessing or preparsing? I know
    the tags which will have these chars. Can I somehow have these
    particular tags and their data simply ignored in XML?
     
    , Jan 9, 2008
    #5
  6. Andy Dingley Guest

    On 9 Jan, 06:29, wrote:

    > To Joseph: It's part of the requirement that such characters would
    > come up.


    I doubt this very much. The _character_ / codepoint "&x00" is a
    different concept to the byte or octet "&x00". Although Unicode
    encodings may well involve such a byte value at the level of the raw
    wire protocol, they certainly don't allow it as a valid character
    (sic, codepoint).

    XML, at the level you describe it, is a character stream. In XML the
    entity &#x00; is a reference to this possible (albeit forbidden) 00
    value as a _character_, not just a raw byte.

    It sounds as if your problem here is an encoding problem (i.e. a
    Unicode problem, not an XML problem), even before it gets as far as
    being an XML well-formedness issue. Raw bytes 0f 00 are just bytes
    (which might have some correct place in the encoding you're using) but
    they're not intended to encode a resultant _character_ of 00, nor are
    they the same thing as a numeric entity of &#x00;.
     
    Andy Dingley, Jan 9, 2008
    #6
  7. Peter Flynn Guest

    On Tue, 08 Jan 2008 22:29:35 -0800, sarosh.shirazi wrote:

    [snip]
    > To Joseph: It's part of the requirement that such characters would come
    > up...so i'll have to bear the heck :)


    Then as Joseph said, your file is not an XML file, so you must use
    non-XML software to process it.

    > Is there any way other than ascii preprocessing or preparsing.


    Not as far as I am aware.

    > I know
    > the tags which shall have these chars. Can i somehow have these
    > particular tags and their data simply ignored in XML?


    No, because (as already explained) your file is not an XML file.
    You cannot use XML software and methods on non-XML files in this
    way (apart from the method Martin suggested).

    If you can fix it by exchanging the invalid characters on a 1:1 basis,
    then just use a simple inline filter like tr, which is extremely fast.

    Alternatively, change all the invalid characters to some form of markup,
    e.g. <junk char="0"/>, so that they can be transformed back again after
    processing. A stream editor like sed is very fast for this kind of thing.
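The markup substitution described above could also be done in C# rather than sed. The following is a rough sketch with hypothetical names (`JunkMarkup`, `Escape`). It assumes the forbidden characters occur only in element content, not inside tags, attribute values, comments or CDATA sections, where an inserted element would itself break well-formedness.

```csharp
using System.Globalization;
using System.Text;

// Hypothetical helper: replaces each XML-illegal character with a
// <junk char="N"/> placeholder, where N is the decimal code unit value.
static class JunkMarkup
{
    // Same rule as XML 1.0's Char production for 16-bit code units;
    // surrogate halves pass through on the assumption that they are paired.
    private static bool IsAllowed(char ch)
    {
        return ch == '\t' || ch == '\n' || ch == '\r'
            || (ch >= 0x20 && ch != 0xFFFE && ch != 0xFFFF);
    }

    public static string Escape(string text)
    {
        var sb = new StringBuilder(text.Length);
        foreach (char ch in text)
        {
            if (IsAllowed(ch))
            {
                sb.Append(ch);
            }
            else
            {
                sb.Append("<junk char=\"")
                  .Append(((int)ch).ToString(CultureInfo.InvariantCulture))
                  .Append("\"/>");
            }
        }
        return sb.ToString();
    }
}
```

After parsing, the placeholder elements mark exactly where the original characters were, so a later step can restore or discard them as needed.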

    And tell your data source that their data will process more easily if
    they generate well-formed XML. A "requirement" like the one you mention
    is simply evidence of bad planning on their part.

    ///Peter
    --
    XML FAQ: http://xml.silmaril.ie/
     
    Peter Flynn, Jan 12, 2008
    #7
