Interested in System ID only, not the whole parsing ...

Discussion in 'XML' started by Dhurandhar Bhatvadekar, Mar 3, 2007.

  1. I am not sure if this is a naive question, but I have an arbitrarily
    long document in which I know a DOCTYPE declaration exists. I am not
    interested in "parsing" the whole document. All I am interested in is
    finding out what the system ID and public ID of the document are.

    One way I can think of is to write an entity resolver and arrange for
    its resolveEntity() implementation to return an appropriate InputSource
    and record the system ID, since the system and public IDs are passed to
    that method.

    If that's the only way to achieve it, my question is:
    - will doing it this way carry a performance overhead, given that I
    still have to call the parse() method?

    If there are other ways of achieving this (again, noting that I am
    only interested in the declaration part), please
    let me know.

    Thank you!
    Dhurandhar Bhatvadekar, Mar 3, 2007
    #1

  2. Dhurandhar Bhatvadekar wrote:
    > I am not sure if this is a naive question. But I have an arbitrarily
    > long document where I know that a DOCTYPE
    > declaration exists. I am not interested in "parsing" the document. All
    > I am interested is in finding out what the
    > System id and Public id of the document is.


    Outside of writing a parser yourself for that much of the document...

    Run a SAX parser, and as soon as you've gotten that information have
    your handler throw an exception to crash the parser. (Obviously the code
    that calls the parser will want to catch and recognize this particular
    exception as a "normal abnormal exit.")

    However, when I proposed that to one manager, he held his nose and
    insisted that I let the parser finish spinning instead. And I can't
    _entirely_ disagree with him.
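
A minimal Java sketch of this "stop as soon as you have it" idea, under some assumptions that are not from the thread: it uses the SAX2 LexicalHandler callback startDTD(), which reports the DOCTYPE name, public ID, and system ID, instead of an entity resolver, and the class and method names (DoctypeProbe, readDoctypeIds) are hypothetical. The handler sets a flag before throwing, so the caller can tell the deliberate abort apart from a real parse error. Note that startDTD() reports the system ID as declared, whereas resolveEntity() receives it already resolved to an absolute URI when it is a URL.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.DefaultHandler2;

// Hypothetical helper; the names are illustrative and not from the thread.
class DoctypeProbe {

    /** Records the DOCTYPE identifiers, then aborts the parse on purpose. */
    private static final class Probe extends DefaultHandler2 {
        boolean stopped;     // true when we aborted deliberately
        boolean sawDoctype;  // true when a DOCTYPE declaration was reported
        String publicId;
        String systemId;

        @Override
        public void startDTD(String name, String publicId, String systemId)
                throws SAXException {
            this.sawDoctype = true;
            this.publicId = publicId;
            this.systemId = systemId;
            stop("DOCTYPE seen");
        }

        @Override
        public void startElement(String uri, String localName, String qName,
                                 Attributes atts) throws SAXException {
            // The DOCTYPE must precede the root element, so there is none.
            stop("root element reached without a DOCTYPE");
        }

        private void stop(String why) throws SAXException {
            stopped = true;
            throw new SAXException(why);
        }
    }

    /** Returns {publicId, systemId}, or null if the document has no DOCTYPE. */
    static String[] readDoctypeIds(Reader xml)
            throws IOException, SAXException, ParserConfigurationException {
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        Probe probe = new Probe();
        reader.setContentHandler(probe);
        reader.setProperty("http://xml.org/sax/properties/lexical-handler", probe);
        try {
            reader.parse(new InputSource(xml));
        } catch (SAXException e) {
            if (!probe.stopped) {
                throw e; // a genuine parse error, not our deliberate abort
            }
        }
        return probe.sawDoctype
                ? new String[] { probe.publicId, probe.systemId }
                : null;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<?xml version=\"1.0\"?>\n"
                + "<!DOCTYPE book PUBLIC \"-//Example//DTD Book//EN\" \"book.dtd\">\n"
                + "<book/>";
        String[] ids = readDoctypeIds(new StringReader(xml));
        System.out.println("public: " + ids[0]);
        System.out.println("system: " + ids[1]);
    }
}
```

Because the parse is abandoned inside startDTD(), the parser should not go on to fetch the external DTD or read the rest of the document. Either ID may be null if it was not declared.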


    --
    () ASCII Ribbon Campaign | Joe Kesselman
    /\ Stamp out HTML e-mail! | System architexture and kinetic poetry
    Joe Kesselman, Mar 3, 2007
    #2

  3. Hi Joe,

    Thanks for your reply. So, here is some code-review time for you. Can
    you please let me know if the following
    will work? With my preliminary tests it appears to work. But I want to
    be sure.

    ----------------------------------------------
    import java.io.BufferedInputStream;
    import java.io.FileInputStream;
    import java.io.IOException;
    import org.xml.sax.EntityResolver;
    import org.xml.sax.InputSource;
    import org.xml.sax.SAXException;
    import org.xml.sax.XMLReader;
    import org.xml.sax.helpers.XMLReaderFactory;

    private String getSystemIdFromDtd() {
        // Uses a streaming (SAX) parser; any failure is rethrown as a RuntimeException
        BufferedInputStream bis = null;
        try {
            bis = new BufferedInputStream(new FileInputStream(xml)); // xml is defined elsewhere
            final XMLReader xr = XMLReaderFactory.createXMLReader();
            final InputSource is = new InputSource(bis);
            xr.setEntityResolver(new EntityResolver() {
                public InputSource resolveEntity(final String pid, final String sid)
                        throws SAXException, IOException {
                    if (sid != null) {
                        mSystemId = sid.trim(); // mSystemId is defined elsewhere
                        // resolve the entities locally somehow and
                        // return a meaningful InputSource instance
                    }
                    return ( null ); // null requests default resolution
                }
            });
            xr.parse(is);
            return ( mSystemId );
        } catch (final Exception ioe) {
            throw new RuntimeException(ioe);
        } finally {
            try {
                if (bis != null)
                    bis.close();
            } catch (Exception ee) {
                // squelching ee on purpose
            }
        }
    }
    ------------------------------------------------------------------

    Thanks again!
    Dhurandhar Bhatvadekar, Mar 3, 2007
    #3
  4. Sorry, but code review goes beyond what you get for free.
    Joe Kesselman, Mar 3, 2007
    #4
  5. Dhurandhar Bhatvadekar wrote:
    > I am not sure if this is a naive question. But I have an arbitrarily
    > long document where I know that a DOCTYPE
    > declaration exists. I am not interested in "parsing" the document. All
    > I am interested is in finding out what the
    > System id and Public id of the document is.


    All XML tools conduct a formal parse, either for well-formedness or for
    validity as well. This implies they read to the end of the file. Most
    XML tools don't provide for fragmentary reading, so the penalty when you
    "just" want something from the top of the file is enormous unless you do
    the "crash me when I find it" trick.

    If you can guarantee that the entire Document Type Declaration will be
    contained in the first nn lines of the file, and that the double quote
    has been used to delimit the identifiers, then the following Unix
    commands will do the job, returning two lines. For a SYSTEM-only
    declaration the first line is the SYSTEM identifier and the second is
    empty; for a PUBLIC declaration the first line is the FPI and the
    second is the SYSTEM identifier:

    head -nn yourfile.xml | tr '\012\015<' '\040\040\012' | \
      grep -m 1 '^!DOCTYPE' | awk -F\" '{print $2 "\n" $4}'

    The commands head, tr, grep, and awk are also available for Windows.
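
For readers staying in Java, the same "only look at the top of the file" idea can be sketched roughly as below. This is an illustration under assumptions that are not from the thread: the class and method names (DoctypeScan, scan, scanFile) are hypothetical, the file is assumed to be UTF-8, and the regular expression only recognises a plain DOCTYPE declaration with quoted identifiers. It is a text scan, not a parse, so a DOCTYPE that appears inside a comment would fool it.

```java
import java.io.IOException;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; the names are illustrative and not from the thread.
class DoctypeScan {

    // A quoted literal, delimited by either double or single quotes.
    private static final String LITERAL = "(\"[^\"]*\"|'[^']*')";

    // <!DOCTYPE name PUBLIC "fpi" "sysid"   or   <!DOCTYPE name SYSTEM "sysid"
    private static final Pattern DOCTYPE = Pattern.compile(
            "<!DOCTYPE\\s+[^\\s\\[>]+\\s+(?:PUBLIC\\s+" + LITERAL + "\\s+" + LITERAL
            + "|SYSTEM\\s+" + LITERAL + ")");

    /** Returns {publicId, systemId}, or null if no external identifier is found. */
    static String[] scan(CharSequence head) {
        Matcher m = DOCTYPE.matcher(head);
        if (!m.find()) {
            return null;
        }
        if (m.group(3) != null) {
            // SYSTEM form: there is no public identifier.
            return new String[] { null, unquote(m.group(3)) };
        }
        return new String[] { unquote(m.group(1)), unquote(m.group(2)) };
    }

    private static String unquote(String literal) {
        return literal.substring(1, literal.length() - 1);
    }

    /** Reads at most maxChars characters from the start of the file and scans them. */
    static String[] scanFile(Path file, int maxChars) throws IOException {
        char[] buf = new char[maxChars];
        int len = 0;
        try (Reader r = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            int n;
            while (len < maxChars && (n = r.read(buf, len, maxChars - len)) > 0) {
                len += n;
            }
        }
        return scan(new String(buf, 0, len));
    }
}
```

As with the shell pipeline, this relies on the whole declaration falling inside the portion of the file that is read; here the limit is a character count (maxChars) rather than a line count.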

    ///Peter
    --
    XML FAQ: http://xml.silmaril.ie/
    Peter Flynn, Mar 4, 2007
    #5
  6. Joe Kesselman wrote:
    > Dhurandhar Bhatvadekar wrote:
    >> I am not sure if this is a naive question. But I have an arbitrarily
    >> long document where I know that a DOCTYPE
    >> declaration exists. I am not interested in "parsing" the document. All
    >> I am interested is in finding out what the
    >> System id and Public id of the document is.

    >
    > Outside of writing a parser yourself for that much of the document...
    >
    > Run a SAX parser, and as soon as you've gotten that information have
    > your handler throw an exception to crash the parser. (Obviously the code


    Following Joe's idea (and assuming there always
    _is_ a DOCTYPE declaration in your file), I
    implemented this in XMLgawk:

    XMLSTARTDOCT {
    print XMLATTR["PUBLIC"], XMLATTR["SYSTEM"]
    exit
    }

    The "exit" statement ensures that the XML data
    will only be read up to the point where the
    DOCTYPE declaration is. Immediately after this,
    parsing will be terminated. I described such an
    approach in the XMLgawk doc:

    http://home.vrweb.de/~juergen.kahrs/gawk/XML/xmlgawk.html#Dealing-with-DTDs
    Jürgen Kahrs, Mar 4, 2007
    #6
