Parsing large XML files FAST

Discussion in 'XML' started by PedroX, Jun 26, 2005.

  1. PedroX

    PedroX Guest

    Hello:

    I need to parse some large XML files and save the data in an Access DB. I
    was using MSXML 2 and ASP, but it turns out to be extremely slow when the
    XML documents are around 10 MB in size. It's taking over an hour to parse
    files of that size!?

    I don't really need to use ASP or a web server at all, because I am doing
    all the parsing on my own computer. Is there any executable that can do
    this parsing faster than the way I was doing it?

    Thanks in advance.
    PedroX, Jun 26, 2005
    #1

  2. PedroX

    PedroX Guest

    I wrote:

    > I need to parse some large XML files, and save the data in an Access DB. I
    > was using MSXML 2 and ASP, but it turns out to be extremely slow when the
    > XML documents are around 10 MB in size.

    I made a mistake. I am actually using MSXML 4.0.
    PedroX, Jun 26, 2005
    #2

  3. Brian Staff

    Brian Staff Guest

    > Is there any executable that can do this parsing
    > faster than the way I was doing it?


    >> objXMLDoc.selectNodes("//node_name")


    I am not an expert on parsing techniques, but if performance were a
    problem for me, I would try to use as much explicit node naming as
    possible. For instance, I would maybe recode the above statement to be
    something like this:

    objXMLDoc.selectNodes("rootNode/childNode/node_name")

    I know that if _I_ were the parser, I would be able to find those nodes in
    a 10 MB structure quicker using the second technique than using the first.
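
As a rough illustration of the difference, here is a small sketch in Python rather than VBScript/MSXML (chosen only because it runs standalone). The element names are the placeholders used above and the sample document is made up. ElementTree paths start at the root element, so the explicit path leaves out rootNode:

```python
import xml.etree.ElementTree as ET

# Hypothetical document; element names are the thread's placeholders.
XML = """<rootNode>
  <childNode><node_name>a</node_name><node_name>b</node_name></childNode>
  <otherNode><deep><node_name>c</node_name></deep></otherNode>
</rootNode>"""

root = ET.fromstring(XML)

# Descendant search: every element in the tree has to be checked.
anywhere = root.findall(".//node_name")

# Explicit path: only node_name children of rootNode's childNode children.
explicit = root.findall("childNode/node_name")

anywhere_text = [n.text for n in anywhere]
explicit_text = [n.text for n in explicit]
print(anywhere_text)  # ['a', 'b', 'c']
print(explicit_text)  # ['a', 'b']
```

The two queries are not interchangeable: the descendant search also returns the node_name nested under otherNode, so the explicit form only fits when the location of the nodes is known in advance.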

    JAT - Brian
    Brian Staff, Jun 27, 2005
    #3
  4. Jürgen Kahrs

    PedroX wrote:

    > I need to parse some large XML files, and save the data in an Access DB. I
    > was using MSXML 2 and ASP, but it turns out to be extremely slow when the
    > XML documents are like 10 MB in size. It's taking over an hour to parse such
    > sizes!?


    Andrew Schorr had a similar problem. He read
    XML files larger than a gigabyte and stored
    them in Postgres. He also had trouble finding
    the right tool for it. Eventually, he used an
    extension of the GNU Awk language:

    http://home.vrweb.de/~juergen.kahrs/gawk/XML/

    Use Google and you will find his explanations
    in comp.lang.awk.

    > I don't really need to use ASP or a web server at all because I am parsing
    > all in my own computer. Is there any executable that can do this parsing
    > faster than the way I was doing it?


    XML parsers can read large files fast only with
    the SAX approach (or similar event-driven models).
    The DOM model simply can't do this because it has
    to hold the complete XML tree in memory.
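
A minimal sketch of the event-driven idea, again in Python as a stand-in for whatever parser is actually used. It relies on the standard library's `ET.iterparse`, an incremental parser in the same spirit as SAX; the `count_records` helper and the sample data are made up for illustration:

```python
import io
import xml.etree.ElementTree as ET

def count_records(source, tag):
    """Count <tag> elements while parsing incrementally."""
    count = 0
    # An "end" event fires once an element has been read completely.
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == tag:
            count += 1
            elem.clear()  # drop the element's text and children once handled
    return count

# Hypothetical sample; a real run would pass a file name or an open file.
sample = io.BytesIO(
    b"<rows><row><v>1</v></row><row><v>2</v></row><row><v>3</v></row></rows>"
)
total = count_records(sample, "row")
print(total)  # 3
```

Clearing each element after use keeps memory roughly flat for flat, record-like files. The emptied elements stay attached to the root, so a fuller version would detach them as well.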
    Jürgen Kahrs, Jun 27, 2005
    #4
  5. PedroX

    PedroX Guest

    "Brian Staff" wrote

    > I am not an expert on techniques of parsing, but if performance were a
    > problem for me, I would try and use as much explicit node naming as
    > possible...for instance I would maybe recode the above statement to be
    > something like this:
    >
    > objXMLDoc.selectNodes("rootNode/childNode/node_name")


    WOW. That DID make a difference.
    What was taking over an hour before now takes about 2 minutes!
    Thank you !!!!!!!!!!!!!!!!
    PedroX, Jun 27, 2005
    #5
  6. Bryce K. Nielsen

    > WOW. That DID make a difference.
    > What was taking over an hour before now takes about 2 minutes!
    > Thank you !!!!!!!!!!!!!!!!


    This will make a huge difference. Remember that with XPath, the //node_name
    means that it will search *every* node in the entire document. If you make
    it more specific, it will be a lot faster.

    However, when dealing with 10 MB+ documents, you should really start using
    SAX and not DOM. I was unaware that VBScript couldn't do SAX; since MSXML's
    SAX parser is just a COM object, I figured you could (I've just never
    tried). If you can't implement the interface, you could always create a COM
    wrapper that does specifically what you need and call that from your ASP
    page. For example, using VB, create a COM object that takes an XML string
    and uses the SAX parser to do the inserts into Access, etc.
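
The shape of such a wrapper can be sketched as follows. This is not the VB/COM version described above: it uses Python's `xml.sax` in place of MSXML's SAX reader and an in-memory SQLite database in place of Access. The `RowInserter` class, the `items` table, and the `item` element name are all hypothetical:

```python
import sqlite3
import xml.sax

class RowInserter(xml.sax.ContentHandler):
    """Insert the text of each <item> element into a table as it is parsed."""

    def __init__(self, conn):
        super().__init__()
        self.conn = conn
        self.in_item = False
        self.chunks = []

    def startElement(self, name, attrs):
        if name == "item":
            self.in_item = True
            self.chunks = []

    def characters(self, content):
        # The parser may deliver one text node in several pieces.
        if self.in_item:
            self.chunks.append(content)

    def endElement(self, name):
        if name == "item":
            self.conn.execute(
                "INSERT INTO items (value) VALUES (?)",
                ("".join(self.chunks),),
            )
            self.in_item = False

conn = sqlite3.connect(":memory:")  # stand-in for the Access database
conn.execute("CREATE TABLE items (value TEXT)")
xml.sax.parseString(
    b"<items><item>alpha</item><item>beta</item></items>", RowInserter(conn)
)
conn.commit()
rows = [r[0] for r in conn.execute("SELECT value FROM items ORDER BY rowid")]
print(rows)  # ['alpha', 'beta']
```

Each record is written as soon as its end tag arrives, so nothing larger than one record is ever held in memory.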

    But the point is: at 10 MB+, stay away from DOM and use SAX.

    Bryce K. Nielsen
    SysOnyx, Inc. (www.sysonyx.com)
    Makers of xmlDig, the XML-SQL Extractor
    http://www.sysonyx.com/products/xmldig

    P.S. Why did you cross-post this? I typically find better results when I
    post messages to one board at a time...
    Bryce K. Nielsen, Jun 27, 2005
    #6
  7. Jürgen Kahrs

    PedroX wrote:

    > WOW. That DID make a difference.
    > What was taking over an hour before now takes about 2 minutes!


    Expat (an XML/SAX parser) needs about 2 seconds for 10 MB.
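
For anyone who wants to try Expat without writing C, Python ships a binding for it in `xml.parsers.expat`. A minimal sketch follows; the `count_elements` helper and the sample input are made up, and no timing is claimed for it:

```python
import xml.parsers.expat

def count_elements(data):
    """Return a dict mapping element name to number of start tags seen."""
    counts = {}

    def start(name, attrs):
        counts[name] = counts.get(name, 0) + 1

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start
    parser.Parse(data, True)  # True marks this as the final chunk of input
    return counts

result = count_elements(b"<a><b/><b/><c/></a>")
print(result)  # {'a': 1, 'b': 2, 'c': 1}
```

For a large file, `parser.ParseFile(f)` with a file opened in binary mode reads the input in chunks instead of requiring it all in memory.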
    Jürgen Kahrs, Jun 27, 2005
    #7
  8. Brian Staff

    Brian Staff Guest

    > WOW. That DID make a difference.
    > What was taking over an hour before now takes about 2 minutes!


    Well, it was a bit of a guess on my part <g>, but it is encouraging to know
    that explicit XPath naming really does make a difference.

    Brian
    Brian Staff, Jun 27, 2005
    #8
  9. PedroX

    PedroX Guest

    > But the point is: at 10 MB+, stay away from DOM and use SAX.

    I wanted to, but the whole thing (including the .NET alternative,
    XmlTextReader) is just beyond my comprehension. I found no tutorials that
    I could understand. I know VBScript and JavaScript / JScript, and that's
    pretty much it. No Java, no C, no Visual Basic per se (although it is
    similar to VBScript).
    PedroX, Jun 27, 2005
    #9
  10. Bryce K. Nielsen

    > Well, it was a bit of a guess on my part<g> - but it is encouraging to know
    > that explicit Xpath naming does really make a difference.


    Yeah, it will. The double-slash is like a wildcard: it searches *every*
    node in the document for a match. If you use an explicit path, the parser
    knows to look in only one area. Also, don't forget that the result set of a
    wildcard search could be large, whereas an explicit one will probably
    return only the one node.

    -BKN
    Bryce K. Nielsen, Jun 28, 2005
    #10