this code: &#x3, an invalid XML character error.

Discussion in 'XML' started by Kaidi, Sep 27, 2004.

  1. Kaidi

    Kaidi Guest

    Hello guys,
    I get the "an invalid XML character" error when using Xerces to parse
    an XML file. I know that XML maps &, <, >, and " to special
    strings like "&gt;" and "&lt;". However, what if the XML file really
    needs to contain some text like "&#x3;" (as the
    content of a tag)?

    The story is:
    I am writing a program to parse some XML files produced by another program.
    That program grabs webpages and saves each page's URL and
    content into an XML file, something like (for each webpage):

    <pageurl>http://www.cs.waikato.ac.nz/~ml/weka/agridatasets.jar</pageurl>
    <pagecontent> the_page_HTML_content </pagecontent>

    This works fine since that program will replace &, <, > etc with &lt;
    etc.

    However, some web urls point to files: .zip, .pdf file, etc. The
    program just "prints" the .pdf content as text and puts it in the XML
    file. In this case, the content of <pagecontent> will look like:

    PKÈR&lt;+&#
    ......
    (Just think what you will see if you open a .pdf file in notepad!)

    In this way, when I use an XML parser (Xerces) to parse it, I get
    errors like:

    FATAL: line 5079: Character reference "&#x3" is an invalid XML character.
    org.xml.sax.SAXParseException: Character reference "&#x3" is an invalid XML character.
    at org.apache.xerces.util.ErrorHandlerWrapper.createSAXParseException(Unknown Source)
    at org.apache.xerces.util.ErrorHandlerWrapper.fatalError(Unknown Source)
    at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
    at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
    at org.apache.xerces.impl.XMLScanner.reportFatalError(Unknown Source)
    at org.apache.xerces.impl.XMLScanner.scanCharReferenceValue(Unknown Source)
    at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanCharReference(Unknown Source)
    at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
    at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
    at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
    at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
    at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
    at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)

    So, any idea how I can make it work?
    How can I tell the xerces parser to ignore the "&xx;" pairs (except
    those for <,>,", etc) and parse them just as plain text?

    Thanks a lot.
     
    Kaidi, Sep 27, 2004
    #1

  2. Kaidi wrote:

    % I get the "an invalid XML character" error when using Xerces to parse
    % an XML file. I know that XML maps &, <, >, and " to special
    % strings like "&gt;" and "&lt;". However, what if the XML file really
    % needs to contain some text like "&#x3;" (as the
    % content of a tag)?

    The only valid characters in an XML 1.0 file are the non-control code points
    from Unicode, plus tab, carriage return, and line feed. Even if you enter
    them as numeric character references, other control characters (such as
    &#x3;) are not allowed. I suggest encoding binary data using one of
    the schemes recognised in MIME, such as quoted-printable (for text with
    the odd control character) or base64.
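
    The rule above can be written down as a small check. This is only a
    sketch of the XML 1.0 "Char" production; the class and method names
    are made up for illustration and are not part of Xerces.

```java
/** Sketch of the XML 1.0 "Char" production. Names are hypothetical. */
final class XmlChars {
    private XmlChars() {}

    /** Returns true if the code point may appear in an XML 1.0 document. */
    static boolean isValidXml10Char(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD      // tab, line feed, carriage return
            || (cp >= 0x20 && cp <= 0xD7FF)             // BMP below the surrogate range
            || (cp >= 0xE000 && cp <= 0xFFFD)           // BMP above the surrogates, minus FFFE/FFFF
            || (cp >= 0x10000 && cp <= 0x10FFFF);       // supplementary planes
    }

    public static void main(String[] args) {
        System.out.println(isValidXml10Char(0x3));  // false: the character behind &#x3;
        System.out.println(isValidXml10Char(0x9));  // true: tab
        System.out.println(isValidXml10Char('A'));  // true
    }
}
```

    A character reference is subject to the same check as a literal
    character, which is why writing the control character as &#x3; does
    not help.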

    % However, some web urls point to files: .zip, .pdf file, etc. The
    % program just "prints" the .pdf content as text and puts it in the XML
    % file. In this case, the content of <pagecontent> will look like:

    For these, use base64.
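
    As a rough illustration of the base64 suggestion, the program that
    writes the XML could encode the raw bytes before storing them in
    <pagecontent>, and the reader could decode them again. This sketch
    assumes Java 8 or later (java.util.Base64); the class name and the
    sample bytes are invented for the example.

```java
import java.util.Arrays;
import java.util.Base64;

/** Sketch: carrying binary data as base64 text. Names are hypothetical. */
final class Base64Payload {
    private Base64Payload() {}

    /** Encodes raw bytes into text that is safe to place in element content. */
    static String encode(byte[] raw) {
        return Base64.getEncoder().encodeToString(raw);
    }

    /** Restores the original bytes from the base64 text. */
    static byte[] decode(String text) {
        return Base64.getDecoder().decode(text);
    }

    public static void main(String[] args) {
        // First four bytes of a typical .zip/.jar file, including control characters.
        byte[] raw = {'P', 'K', 0x03, 0x04};
        String text = encode(raw);
        System.out.println("<pagecontent>" + text + "</pagecontent>");
        System.out.println(Arrays.equals(raw, decode(text))); // true
    }
}
```

    The base64 alphabet uses only letters, digits, "+", "/" and "=", so
    the encoded text needs no further escaping in element content.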

    --

    Patrick TJ McPhee
    East York Canada
     
    Patrick TJ McPhee, Sep 27, 2004
    #2

  3. Kaidi wrote:
    > The
    > program just "prints" the .pdf content as text and puts it in the XML
    > file. In this case, the content of <pagecontent> will look like:
    >
    > PKÈR&lt;+&#
    > ......
    > (Just think what you will see if you open a .pdf file in notepad!)
    >
    > In this way, when I use an XML parser (Xerces) to parse it,


    Why do you want to parse PDF with an XML parser? When downloading the
    resources, you could store the content type and make XML parsing dependent
    on it.
    --
    Johannes Koch
    In te domine speravi; non confundar in aeternum.
    (Te Deum, 4th cent.)
     
    Johannes Koch, Sep 27, 2004
    #3
  4. Kaidi

    Kaidi Guest

    Johannes Koch wrote:
    > Kaidi wrote:
    > > The
    > > program just "prints" the .pdf content as text and puts it in the XML
    > > file. In this case, the content of <pagecontent> will look like:
    > >
    > > PKÈR&lt;+&#
    > > ......
    > > (Just think what you will see if you open a .pdf file in notepad!)
    > >
    > > In this way, when I use an XML parser (Xerces) to parse it,

    >
    > Why do you want to parse PDF with an XML parser? When downloading the
    > resources, you could store the content type and make XML parsing dependent
    > on it.


    Yes, if I were writing the whole program, I would do it that way. The
    problem is that the existing program (which I cannot change) already
    works this way: it just puts .jar, .pdf, etc. content into one XML file,
    and I need to process this XML file. :-(
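
    When the producer cannot be changed, one possible workaround is to
    clean the text before it reaches the parser. The sketch below drops
    numeric character references that XML 1.0 does not allow and leaves
    everything else (including &lt;, &gt; and valid references) alone.
    It is an assumption-laden illustration, not a Xerces feature: the
    class and method names are invented, and it assumes the whole file
    fits in a String. Literal control characters in the file would need a
    similar pass, and the binary content is lost either way.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch: remove character references that are invalid in XML 1.0. Names are hypothetical. */
final class CharRefFilter {
    private CharRefFilter() {}

    // Matches numeric character references such as &#3; or &#x3;
    private static final Pattern CHAR_REF =
            Pattern.compile("&#(?:x([0-9a-fA-F]+)|([0-9]+));");

    /** XML 1.0 "Char" production. */
    static boolean isValidXml10Char(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Returns the input with every invalid numeric character reference removed. */
    static String stripInvalidCharRefs(String xml) {
        Matcher m = CHAR_REF.matcher(xml);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            boolean keep;
            try {
                int cp = m.group(1) != null
                        ? Integer.parseInt(m.group(1), 16)
                        : Integer.parseInt(m.group(2), 10);
                keep = isValidXml10Char(cp);
            } catch (NumberFormatException e) {
                keep = false; // value too large to be a code point
            }
            // Keep valid references verbatim; replace invalid ones with nothing.
            m.appendReplacement(out, keep ? Matcher.quoteReplacement(m.group()) : "");
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String dirty = "<pagecontent>PK&#x3;&#4;ok &#x41; &lt;</pagecontent>";
        System.out.println(stripInvalidCharRefs(dirty));
    }
}
```

    The cleaned string can then be handed to the SAX parser through
    new InputSource(new StringReader(cleaned)) instead of the original
    file.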
     
    Kaidi, Sep 27, 2004
    #4
