How to speed up XML reading

Discussion in 'XML' started by Ramon F Herrera, Sep 11, 2012.

  1. My application makes a large number of XPath() retrievals, and that
    code accounts for most of the running time. The rest of the tasks take
    a negligible amount of CPU and disk. In short, all the app does is
    read XML variables and write them to a PDF file.

    See a previous, very related post below.

    -Ramon

    =============================================

    > You can't compare SAX and DOM directly. SAX works at the parsing level,
    > whereas DOM is for manipulating an XML document. DOM is mostly built on
    > top of a SAX system. You can use it, or ignore it and build your own SAX
    > code. However, creating your own SAX handler is much more complex, and
    > the final result could be much slower than with pure DOM usage.


    Very true. (Though some DOM parsers/loaders bypass SAX for greater
    efficiency; I believe Xerces actually uses lower-level events to drive
    its DOM construction.)

    SAX does require that you manage all the state information, which may
    or may not include building something like the DOM for part or all of
    the document. How fast or slow that will be depends entirely on the
    problem at hand and how good your code is.

    If you've got time, doing it all via SAX may be worth trying. But it
    isn't always going to be a magic bullet.

    As I said in my other post, the first thing to do is to find out
    whether this is even a significant part of your application's
    processing time.

    --
    Joe Kesselman,
    Ramon F Herrera, Sep 11, 2012
    #1

  2. A related thread is: "Why is SAX faster than DOM?"

    -RFH
    Ramon F Herrera, Sep 11, 2012
    #2

  3. Tools used:
    C++
    Xerces-C
    XQilla
    Developed under Linux, ported to Windows


    A very important lesson that I learned follows. Xerces implements a
    reasonably/very fast XPath retrieval BUT it does so at the expense of
    flexibility. The only type of XPath retrieval supported by Xerces is
    the MINIMAL one:

    string neededVariable = XPath("/this/is/the/variable/that/i/need");

    If the path contains any character like "[", "@", "=", etc. I must
    resort to XQilla, which is wonderful (a LOT easier to code than pure
    Xerces), but as slow as molasses in cold weather:

    string someOtherVar = XPath("/table/joint/ancestor::table/@titledetail");

    After running some benchmarks I have concluded that my best option is
    to use a combination of the two XPath engines: Xerces for the "easy"
    stuff and XQilla for the more complex.
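
    A minimal sketch of such a dispatcher follows. It assumes that a
    "simple" path means an absolute path made only of plain element names,
    which is my reading of the limitation described above. The names
    isSimplePath, evaluate, fastEngine and fullEngine are hypothetical; the
    two callbacks stand in for wrappers around Xerces and XQilla, and neither
    library's real API is shown.

```cpp
#include <cctype>
#include <cstddef>
#include <functional>
#include <string>

// Returns true when `path` is an absolute path built only from plain element
// names separated by '/', e.g. "/this/is/the/variable". Predicates,
// attributes, axes, wildcards, "//", "." and ".." all count as complex.
bool isSimplePath(const std::string& path) {
    if (path.size() < 2 || path.front() != '/') return false;
    bool atStepStart = true;
    for (std::size_t i = 1; i < path.size(); ++i) {
        const unsigned char c = static_cast<unsigned char>(path[i]);
        if (c == '/') {
            if (atStepStart) return false;  // empty step, e.g. "//"
            atStepStart = true;
        } else if (atStepStart) {
            // A step must begin with a letter or underscore.
            if (!std::isalpha(c) && c != '_') return false;
            atStepStart = false;
        } else if (!std::isalnum(c) && c != '_' && c != '-' && c != '.') {
            return false;  // '[', '@', '=', ':', '*', whitespace, ...
        }
    }
    return !atStepStart;  // rejects a trailing '/'
}

// Placeholder type for "evaluate this XPath and return a string".
using Evaluator = std::function<std::string(const std::string&)>;

// Routes simple paths to the fast engine and everything else to the full one.
std::string evaluate(const std::string& path,
                     const Evaluator& fastEngine,
                     const Evaluator& fullEngine) {
    return isSimplePath(path) ? fastEngine(path) : fullEngine(path);
}
```

    The check is deliberately conservative: anything it does not recognize
    goes to the slower, more capable engine, so a misclassification costs
    time rather than correctness.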

    -Ramon
    Ramon F Herrera, Sep 11, 2012
    #3
  4. Ramon F Herrera writes:

    [...]
    > A very important lesson that I learned follows. Xerces implements a
    > reasonably/very fast XPath retrieval BUT it does so at the expense of
    > flexibility. The only type of XPath retrieval supported by Xerces is
    > the MINIMAL one:
    >
    > string neededVariable = XPath("/this/is/the/variable/that/i/need");
    >
    > If the path contains any character like "[", "@", "=", etc. I must
    > resort to XQilla, which is wonderful (a LOT easier to code than pure
    > Xerces), but as slow as molasses in cold weather:
    >
    > string someOtherVar = XPath("/table/joint/ancestor::table/@titledetail");


    ... would have the same effect as ancestor::table since the query starts
    at document root.

    > After running some benchmarks I have concluded that my best option is
    > to use a combination of the 2 XPath engines: Xerces for the "easy"
    > stuff and Xqilla for the more complex.


    XPath may require DOM if you use funny axes, e.g., preceding-sibling::*
    and, maybe, ancestor.

    However, for the request you show above, a hand-coded SAX parser keeping
    a simple stack (with @titledetail cached where appropriate) can extract what
    you want. XPath, and any generic query language for that matter, is far
    more powerful, and will therefore most likely be slower.

    (Generating the SAX handler for any given XPath query is left as an
    exercise for the reader. :)
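
    For the one query quoted above, the stack idea could look roughly like
    the sketch below. It is not tied to any real parser: TitleDetailHandler,
    its startElement/endElement methods and the Attributes map are
    hypothetical stand-ins for whatever callback interface the SAX parser
    provides, and runSampleEvents feeds it a made-up two-element document.
    It also simplifies the ancestor axis by reporting only the nearest
    enclosing table that has a titledetail attribute.

```cpp
#include <map>
#include <string>
#include <vector>

using Attributes = std::map<std::string, std::string>;

// Hypothetical handler: a real SAX parser would drive methods like these
// through its own callback interface.
class TitleDetailHandler {
public:
    void startElement(const std::string& name, const Attributes& attrs) {
        if (name == "joint") {
            // Walk from the innermost open element outward and report the
            // first enclosing <table> that carries a cached @titledetail.
            for (auto it = open_.rbegin(); it != open_.rend(); ++it) {
                if (it->name == "table" && it->hasTitle) {
                    results_.push_back(it->title);
                    break;
                }
            }
        }
        Frame frame{name, false, ""};
        if (name == "table") {
            const auto found = attrs.find("titledetail");
            if (found != attrs.end()) {
                frame.hasTitle = true;
                frame.title = found->second;  // cache the attribute value
            }
        }
        open_.push_back(frame);
    }

    void endElement(const std::string& /*name*/) {
        if (!open_.empty()) open_.pop_back();
    }

    const std::vector<std::string>& results() const { return results_; }

private:
    struct Frame {
        std::string name;
        bool hasTitle;
        std::string title;
    };
    std::vector<Frame> open_;           // stack of currently open elements
    std::vector<std::string> results_;  // one entry per matching <joint>
};

// Feeds the events for <table titledetail="Summary"><joint/></table>.
std::vector<std::string> runSampleEvents() {
    TitleDetailHandler handler;
    handler.startElement("table", {{"titledetail", "Summary"}});
    handler.startElement("joint", {});
    handler.endElement("joint");
    handler.endElement("table");
    return handler.results();
}
```

    The handler holds only the open-element stack and the results, so memory
    use depends on nesting depth rather than document size.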

    -- Alain.
    Alain Ketterlin, Sep 12, 2012
    #4
  5. On Sep 12, 4:32 am, Alain Ketterlin wrote:

    > (Generating the SAX handler for any given XPath query
    > is left as an exercise for the reader. :)
    >
    > -- Alain.


    Thanks, Alain.

    Actually, I think that the solution to my performance problem is to
    implement (via SAX?) the reading of the whole XML file and insert the
    variables in my own data structures. That must speed up the variable
    retrieval substantially BUT an XML guru is required, which I am not.

    In the meantime, I downloaded libxml and will see how well it
    performs. Perhaps that is the solution to my problem. Being written in
    C, it should be faster than Xerces-C++

    -Ramon
    Ramon F Herrera, Sep 12, 2012
    #5
  6. On 12/09/2012 14:52, Ramon F Herrera wrote:
    >...
    > Actually, I think that the solution to my performance problem is to
    > implement (via SAX?) the reading of the whole XML file and insert the
    > variables in my own data structures. That must speed up the variable
    > retrieval substantially BUT an XML guru is required, which I am not.
    >
    > In the meantime, I downloaded libxml and will see how well it
    > performs. Perhaps that is the solution to my problem. Being written in
    > C, it should be faster than Xerces-C++


    You could try Expat, written in C.

    --
    Manuel Collado - http://lml.ls.fi.upm.es/~mcollado
    Manuel Collado, Sep 12, 2012
    #6
  7. On 9/11/2012 1:52 PM, Ramon F Herrera wrote:
    > A related thread is: "Why is SAX faster than DOM?"


    (Answer: It isn't always. Depends on the patterns of access to the data.)


    --
    Joe Kesselman,
    http://www.love-song-productions.com/people/keshlam/index.html

    {} ASCII Ribbon Campaign | "may'ron DaroQbe'chugh vaj bIrIQbej" --
    /\ Stamp out HTML mail! | "Put down the squeezebox & nobody gets hurt."
    Joe Kesselman, Sep 14, 2012
    #7
  8. On 9/11/2012 2:20 PM, Ramon F Herrera wrote:
    > If the path contains any character like "[", "@", "=", etc. I must
    > resort to XQilla, which is wonderful (a LOT easier to code than pure
    > Xerces), but as slow as molasses in cold weather


    You might want to look at Xalan. There was a fair amount of work put
    into Xalan performance; I don't know how XQilla compares to that.

    Or, if you're using IBM's Java environment, you might want to look at
    the XML support that ships with that JRE, which is another design
    iteration past Xalan. Or, in WebSphere, the WebSphere XML feature, which
    supports XPath 2.0, XSLT 2.0, and XQuery and is yet another design
    iteration.

    With all of these, remember that the JAXP/TRAX APIs allow precompiling a
    path or query. And remember that the performance can be improved if the
    document is cached in memory in the appropriate internal representation.
    (The Xerces implementation is single-pass, I believe; if you want to run
    more than one path the advantage goes away quickly because you have to
    reparse the input document.)


    Joe Kesselman, Sep 14, 2012
    #8
  9. > Actually, I think that the solution to my performance problem is to
    > implement (via SAX?) the reading of the whole XML file and insert the
    > variables in my own data structures. That must speed up the variable
    > retrieval substantially BUT an XML guru is required, which I am not.


    In many cases, yes, XML should be used as your "portability" level, and
    custom internal representations should be used within the application.
    Of course the downside is that you then have to implement a lot more of
    your own logic rather than being able to take advantage of the XML-level
    utilities.
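
    As a rough illustration of a custom internal representation, the sketch
    below reads a stream of parse events once and builds a hash map from
    element path to text, so that later lookups never touch an XPath engine.
    PathIndexBuilder, lookup and buildSampleIndex are hypothetical names, the
    events are fed by hand rather than by a real parser, and the sample
    document is made up. It assumes each path of interest occurs once; on
    repeats it keeps the first occurrence.

```cpp
#include <string>
#include <unordered_map>
#include <vector>

using PathIndex = std::unordered_map<std::string, std::string>;

// Hypothetical event sink: a real SAX parser would call these methods.
class PathIndexBuilder {
public:
    void startElement(const std::string& name) {
        path_.push_back(name);
        text_.emplace_back();  // fresh text buffer for this element
    }

    void characters(const std::string& chunk) {
        if (!text_.empty()) text_.back() += chunk;
    }

    void endElement() {
        if (path_.empty()) return;
        std::string key;
        for (const auto& step : path_) key += "/" + step;
        index_.emplace(key, text_.back());  // emplace keeps the first occurrence
        path_.pop_back();
        text_.pop_back();
    }

    const PathIndex& index() const { return index_; }

private:
    std::vector<std::string> path_;  // names of the open elements
    std::vector<std::string> text_;  // direct text collected per open element
    PathIndex index_;
};

// Returns the stored text, or an empty string when the path is unknown.
std::string lookup(const PathIndex& index, const std::string& path) {
    const auto it = index.find(path);
    return it == index.end() ? std::string() : it->second;
}

// Feeds the events for <order><id>42</id><customer>Ada</customer></order>.
PathIndex buildSampleIndex() {
    PathIndexBuilder builder;
    builder.startElement("order");
    builder.startElement("id");
    builder.characters("42");
    builder.endElement();
    builder.startElement("customer");
    builder.characters("Ada");
    builder.endElement();
    builder.endElement();
    return builder.index();
}
```

    This also shows the downside mentioned above: predicates, attributes and
    repeated elements would each need extra logic that an XPath engine
    provides for free.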

    > In the meantime, I downloaded libxml and will see how well it
    > performs. Perhaps that is the solution to my problem. Being written in
    > C, it should be faster than Xerces-C++


    C++ isn't necessarily slower than C. That depends on the details of the
    code, both in coding style and in algorithms. Remember, an infinite
    speedup of something that accounts for only 1% of runtime is only a 1%
    real improvement.

    Joe Kesselman, Sep 14, 2012
    #9
  10. On Sep 14, 12:04 am, Joe Kesselman wrote:
    > On 9/11/2012 2:20 PM, Ramon F Herrera wrote:
    >
    > > If the path contains any character like "[", "@", "=", etc. I must
    > > resort to XQilla, which is wonderful (a LOT easier to code than pure
    > > Xerces), but as slow as molasses in cold weather
    >
    > You might want to look at Xalan. There was a fair amount
    > of work put into Xalan performance; I don't know how XQilla
    > compares to that.


    Following your earlier advice, I looked into it. It seems that
    Xalan has reached a dead end. It won't even compile on a regular Linux
    box.

    What I discovered is that most of the action is in libxml. See my
    thread "Dramatic performance gains with Libxml" (I develop in C/C++).

    -Ramon
    Ramon F Herrera, Sep 16, 2012
    #10
  11. On 9/16/2012 2:05 PM, Ramon F Herrera wrote:
    > Following a previous advice of yours, I looked into it. It seems that
    > Xalan has reached a dead end. It won't even compile on a regular Linux
    > box.


    The C++ version of Xalan has lost most of its contributors, agreed. The
    Java version is still alive and kicking, though not as actively under
    development as it was when IBM was donating lots of manhours to it.

    I'm not sure whether there's a C++ version of Saxon; if so that would
    also be worth looking at.

    Joe Kesselman, Sep 20, 2012
    #11
  12. On Tuesday, September 11, 2012 7:20:21 PM UTC+1, Ramon F Herrera wrote:
    > Tools used:
    > C++
    > Xerces-C
    > XQilla
    > Developed under Linux, ported to Windows
    >
    > A very important lesson that I learned follows. Xerces implements a
    > reasonably/very fast XPath retrieval BUT it does so at the expense of
    > flexibility. The only type of XPath retrieval supported by Xerces is
    > the MINIMAL one:
    >
    > string neededVariable = XPath("/this/is/the/variable/that/i/need");
    >
    > If the path contains any character like "[", "@", "=", etc. I must
    > resort to XQilla, which is wonderful (a LOT easier to code than pure
    > Xerces), but as slow as molasses in cold weather:
    >
    > string someOtherVar = XPath("/table/joint/ancestor::table/@titledetail");
    >
    > After running some benchmarks I have concluded that my best option is
    > to use a combination of the 2 XPath engines: Xerces for the "easy"
    > stuff and Xqilla for the more complex.
    >
    > -Ramon


    Have you looked at the Liquid XML C++ tool? (http://www.liquid-technologies.com/xmldatabinding/xml-schema-to-cpp.aspx)
    , Sep 21, 2012
    #12