How to speed up XML reading

Discussion in 'XML' started by Ramon F Herrera, Sep 11, 2012.

  1. My application makes a large number of XPath() retrievals, and that
    code accounts for most of the run time. The rest of the tasks take a
    negligible amount of CPU and disk. In short, all the app does is read
    XML variables and write them to a PDF file.

    See a previous, very related post below.

    -Ramon

    =============================================
    Very true. (Though some DOM parsers/loaders bypass SAX for greater
    efficiency; I believe Xerces actually uses lower-level events to drive
    its DOM construction.)

    SAX does require that you manage all the state information, which may
    or may not include building something like the DOM for part or all of
    the document. How fast or slow that will be depends entirely on the
    problem at hand and how good your code is.

    If you've got time, doing it all via SAX may be worth trying. But it
    isn't always going to be a magic bullet.

    As I said in my other post, the first thing to do is to find out
    whether this is even a significant part of your application's
    processing time.
     
    Ramon F Herrera, Sep 11, 2012
    #1

  2. A related thread is: "Why is SAX faster than DOM?"

    -RFH
     
    Ramon F Herrera, Sep 11, 2012
    #2

  3. Tools used:
    C++
    Xerces-C
    XQilla
    Developed under Linux, ported to Windows


    A very important lesson that I learned follows. Xerces implements a
    reasonably fast XPath retrieval, BUT it does so at the expense of
    flexibility. The only type of XPath retrieval supported by Xerces is
    the MINIMAL one:

    string neededVariable = XPath("/this/is/the/variable/that/i/need");

    If the path contains any character like "[", "@", "=", etc. I must
    resort to XQilla, which is wonderful (a LOT easier to code than pure
    Xerces), but as slow as molasses in cold weather:

    string someOtherVar = XPath("/table/joint/ancestor::table/@titledetail");

    After running some benchmarks I have concluded that my best option is
    to use a combination of the two XPath engines: Xerces for the "easy"
    paths and XQilla for the more complex ones.
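
One way to picture that two-engine split is a small router that sends plain child-step paths to the fast engine and everything else to the full one. This is only a sketch: `XPathRouter`, `is_simple_path` and the two engine callbacks are hypothetical names, and no real Xerces-C or XQilla call is made. The set of characters treated as "complex" is an assumption and is deliberately conservative.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <utility>

// An engine is anything that maps an XPath expression to a string result.
// Both engines are placeholders supplied by the caller (hypothetical).
using XPathEngine = std::function<std::string(const std::string&)>;

// Assumption: a path is "simple" when it is an absolute chain of child
// element names such as /a/b/c. Any character that hints at predicates,
// attributes, axes, functions, wildcards or unions makes it "complex".
// Names containing '.' or ':' are also sent to the full engine, which is
// slower but still correct.
bool is_simple_path(const std::string& xpath) {
    if (xpath.empty() || xpath.front() != '/' || xpath.back() == '/') {
        return false;
    }
    if (xpath.find("//") != std::string::npos) {
        return false;
    }
    return xpath.find_first_of("[]@=:()*|. ") == std::string::npos;
}

class XPathRouter {
public:
    XPathRouter(XPathEngine fast, XPathEngine full)
        : fast_(std::move(fast)), full_(std::move(full)) {
        assert(fast_ && full_);  // both engines must be callable
    }

    // Simple paths go to the fast engine, everything else to the full one.
    std::string evaluate(const std::string& xpath) const {
        return is_simple_path(xpath) ? fast_(xpath) : full_(xpath);
    }

private:
    XPathEngine fast_;
    XPathEngine full_;
};
```

In real use the two callbacks would wrap whatever retrieval code already exists for each library, so the rest of the application only ever calls `evaluate`.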

    -Ramon
     
    Ramon F Herrera, Sep 11, 2012
    #3
  4. [...]
    ... would have the same effect as ancestor::table since the query starts
    at document root.
    XPath may require DOM if you use funny axes, e.g., preceding-sibling::*
    and, maybe, ancestor.

    However, for the request you show above, a hand-coded SAX parser keeping
    a simple stack (with @titledetail cached where appropriate) can extract what
    you want. XPath, and any generic query language for that matter, is far
    more powerful, and will therefore most likely be slower.

    (Generating the SAX handler for any given XPath query is left as an
    exercise for the reader. :)
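
A rough sketch of the stack idea for this one request, with heavy assumptions: `TitleDetailHandler` is a hypothetical class shaped like a SAX content handler but not tied to Xerces or any other parser, and its callbacks would have to be driven by a real parser's start/end element events. For every `joint` element it reports the cached `titledetail` of each enclosing `table`. Unlike a real XPath node-set, the results are not de-duplicated.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

using Attributes = std::map<std::string, std::string>;

class TitleDetailHandler {
public:
    // Called when an element opens.
    void start_element(const std::string& name, const Attributes& attrs) {
        if (name == "joint") {
            // Every "table" currently on the stack is an ancestor of
            // this joint; report its cached attribute, outermost first.
            for (const Frame& f : stack_) {
                if (f.name == "table" && f.has_title) {
                    found_.push_back(f.title);
                }
            }
        }
        Frame frame{name, false, ""};
        auto it = attrs.find("titledetail");
        if (name == "table" && it != attrs.end()) {
            frame.has_title = true;   // cache @titledetail for later joints
            frame.title = it->second;
        }
        stack_.push_back(frame);
    }

    // Called when an element closes; events must be balanced.
    void end_element(const std::string& /*name*/) {
        assert(!stack_.empty());
        stack_.pop_back();
    }

    const std::vector<std::string>& found() const { return found_; }

private:
    struct Frame {
        std::string name;
        bool has_title;
        std::string title;
    };
    std::vector<Frame> stack_;        // currently open elements
    std::vector<std::string> found_;  // collected titledetail values
};
```

The memory cost is proportional to the nesting depth rather than to the document size, which is where the advantage over building a full DOM would come from.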

    -- Alain.
     
    Alain Ketterlin, Sep 12, 2012
    #4
  5. Merci, Alain.

    Actually, I think that the solution to my performance problem is to
    implement (via SAX?) the reading of the whole XML file and insert the
    variables in my own data structures. That must speed up the variable
    retrieval substantially BUT an XML guru is required, which I am not.
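
As a very rough illustration of "my own data structures", here is a sketch of a store that is filled in one pass and then answers each lookup from a hash table. `VariableStore` and its method names are hypothetical; a real SAX parser would have to call them from its element and character callbacks. It assumes each path occurs only once in the document, since repeated paths would have their text concatenated.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

class VariableStore {
public:
    // Element opened: extend the current path, e.g. "/a" -> "/a/b".
    void start_element(const std::string& name) {
        path_ += '/';
        path_ += name;
        lengths_.push_back(name.size() + 1);
    }

    // Text may arrive in several chunks, so append rather than assign.
    void characters(const std::string& text) {
        values_[path_] += text;
    }

    // Element closed: drop the last path step; events must be balanced.
    void end_element() {
        assert(!lengths_.empty());
        path_.erase(path_.size() - lengths_.back());
        lengths_.pop_back();
    }

    // Lookup after parsing; returns an empty string for unknown paths.
    std::string get(const std::string& path) const {
        auto it = values_.find(path);
        return it == values_.end() ? std::string() : it->second;
    }

private:
    std::string path_;                  // path of the open element
    std::vector<std::size_t> lengths_;  // length of each appended step
    std::unordered_map<std::string, std::string> values_;
};
```

After the single pass, every retrieval is an ordinary hash lookup and the XML document itself is no longer needed.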

    In the meantime, I downloaded libxml and will see how well it
    performs. Perhaps that is the solution to my problem. Being written in
    C, it should be faster than Xerces-C++.

    -Ramon
     
    Ramon F Herrera, Sep 12, 2012
    #5
  6. On 12/09/2012 14:52, Ramon F Herrera wrote:
    You could try Expat, written in C.
     
    Manuel Collado, Sep 12, 2012
    #6
  7. (Answer: It isn't always. Depends on the patterns of access to the data.)


    --
    Joe Kesselman,
    http://www.love-song-productions.com/people/keshlam/index.html

    {} ASCII Ribbon Campaign | "may'ron DaroQbe'chugh vaj bIrIQbej" --
    /\ Stamp out HTML mail! | "Put down the squeezebox & nobody gets hurt."
     
    Joe Kesselman, Sep 14, 2012
    #7
  8. You might want to look at Xalan. There was a fair amount of work put
    into Xalan performance; I don't know how XQilla compares to that.

    Or, if you're using IBM's Java environment, you might want to look at
    the XML support that ships with that JRE, which is another design
    iteration past Xalan. Or, in WebSphere, the WebSphere XML feature, which
    supports XPath 2.0, XSLT 2.0, and XQuery and is yet another design
    iteration.

    With all of these, remember that the JAXP/TrAX APIs allow precompiling a
    path or query. And remember that the performance can be improved if the
    document is cached in memory in the appropriate internal representation.
    (The Xerces implementation is single-pass, I believe; if you want to run
    more than one path the advantage goes away quickly because you have to
    reparse the input document.)
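
The two ideas above (compile the path once, keep the parsed document in memory) can be pictured with a toy model. This is not JAXP and not any real library: `Node`, `add_child` and `CompiledPath` are hypothetical, "compiling" just means splitting the path into steps once, and only plain child steps are handled, taking the first matching child at each level with no backtracking.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <vector>

// Toy in-memory document: parsed once, queried many times.
struct Node {
    std::string name;
    std::string text;
    std::vector<std::unique_ptr<Node>> children;
};

// Helper to build the toy tree by hand (stands in for a parser).
Node& add_child(Node& parent, const std::string& name,
                const std::string& text = "") {
    parent.children.push_back(std::unique_ptr<Node>(new Node()));
    Node& child = *parent.children.back();
    child.name = name;
    child.text = text;
    return child;
}

class CompiledPath {
public:
    // The splitting work happens once, here, not on every evaluation.
    explicit CompiledPath(const std::string& path) {
        assert(!path.empty() && path.front() == '/');
        std::string step;
        for (char c : path) {
            if (c == '/') {
                if (!step.empty()) steps_.push_back(step);
                step.clear();
            } else {
                step += c;
            }
        }
        if (!step.empty()) steps_.push_back(step);
    }

    // Walks the cached tree; returns the matched node or nullptr.
    const Node* evaluate(const Node& root) const {
        if (steps_.empty() || root.name != steps_[0]) return nullptr;
        const Node* cur = &root;
        for (std::size_t i = 1; i < steps_.size(); ++i) {
            const Node* next = nullptr;
            for (const auto& child : cur->children) {
                if (child->name == steps_[i]) {
                    next = child.get();
                    break;
                }
            }
            if (next == nullptr) return nullptr;
            cur = next;
        }
        return cur;
    }

private:
    std::vector<std::string> steps_;
};
```

Running many `CompiledPath` objects against the same `Node` tree costs neither a reparse of the input nor a re-split of the path, which is the effect the paragraph above describes.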


     
    Joe Kesselman, Sep 14, 2012
    #8
  9. "Actually, I think that the solution to my performance problem is to [...]"

    In many cases, yes, XML should be used as your "portability" level, and
    custom internal representations should be used within the application.
    Of course the downside is that you then have to implement a lot more of
    your own logic rather than being able to take advantage of the XML-level
    utilities.

    C++ isn't necessarily slower than C. That depends on the details of the
    code, both in coding style and in algorithms. Remember, an infinite
    speedup of something that accounts for only 1% of runtime is only a 1%
    real improvement.
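
That last point is Amdahl's law. A small sketch of the arithmetic, where the function names are made up for illustration and the fraction would have to come from profiling the real application:

```cpp
#include <cassert>

// fraction: share of total runtime spent in the part being optimised (0..1)
// factor:   how many times faster that part becomes (> 0)
// Returns the overall speedup of the whole program.
double overall_speedup(double fraction, double factor) {
    assert(fraction >= 0.0 && fraction <= 1.0 && factor > 0.0);
    return 1.0 / ((1.0 - fraction) + fraction / factor);
}

// Upper bound on the overall speedup: the optimised part takes no time
// at all (the "infinite speedup" case).
double max_speedup(double fraction) {
    assert(fraction >= 0.0 && fraction < 1.0);
    return 1.0 / (1.0 - fraction);
}
```

With `fraction = 0.01` the bound is 1/0.99, about 1.01, so the total runtime shrinks by 1% at best. If XML retrieval really is, say, 90% of the runtime, the bound is 10x, which is why measuring first matters.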

     
    Joe Kesselman, Sep 14, 2012
    #9
  10. Following your earlier advice, I looked into it. It seems that
    Xalan has reached a dead end. It won't even compile on a regular Linux
    box.

    What I discovered is that most of the action is in libxml. See my
    thread "Dramatic performance gains with Libxml" (I develop under
    C/C++).

    -Ramon
     
    Ramon F Herrera, Sep 16, 2012
    #10
  11. The C++ version of Xalan has lost most of its contributors, agreed. The
    Java version is still alive and kicking, though not as actively under
    development as it was when IBM was donating lots of manhours to it.

    I'm not sure whether there's a C++ version of Saxon; if so that would
    also be worth looking at.

     
    Joe Kesselman, Sep 20, 2012
    #11
  12. Have you looked at the Liquid XML C++ tool? (http://www.liquid-technologies.com/xmldatabinding/xml-schema-to-cpp.aspx)
     
    shivers.paul, Sep 21, 2012
    #12
