Parsing large XML files FAST

PedroX

Hello:

I need to parse some large XML files and save the data in an Access DB. I
was using MSXML 2 and ASP, but it turns out to be extremely slow when the
XML documents are around 10 MB in size. It's taking over an hour to parse
files of that size!

I don't really need to use ASP or a web server at all, because I am doing
all the parsing on my own computer. Is there any executable that can do this
parsing faster than the way I was doing it?

Thanks in advance.
 
PedroX

I said:
I need to parse some large XML files, and save the data in an Access DB. I
was using MSXML 2 and ASP, but it turns out to be extremely slow when the

I made a mistake. I am actually using MSXML 4.0.
 
Brian Staff

Is there any executable that can do this parsing
faster than the way I was doing it?

I am not an expert on techniques of parsing, but if performance were a
problem for me, I would try and use as much explicit node naming as
possible...for instance I would maybe recode the above statement to be
something like this:

objXMLDoc.selectNodes("rootNode/childNode/node_name")

I know that if _I_ were the parser, I would be able to find those nodes in a
10 MB structure quicker using the second technique than using the first.

JAT - Brian
 
Jürgen Kahrs

PedroX said:
I need to parse some large XML files, and save the data in an Access DB. I
was using MSXML 2 and ASP, but it turns out to be extremely slow when the
XML documents are like 10 mb in size. It's taking over an hour to parse such
sizes!?

Andrew Schorr had a similar problem. He read
XML files larger than a gigabyte and stored the data in
Postgres. He also had trouble finding the
right tool for it. Eventually, he used an extension
of the GNU Awk language:

http://home.vrweb.de/~juergen.kahrs/gawk/XML/

Use Google and you will find his explanations
in comp.lang.awk.

I don't really need to use ASP or a web server at all because I am parsing
all in my own computer. Is there any executable that can do this parsing
faster than the way I was doing it?

XML parsers can read large files fast only with
the SAX approach (or similar event-driven models).
The DOM model simply can't do this because it has
to hold the complete XML tree in memory.
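
To make the event-driven idea concrete, here is a minimal sketch in Python's standard xml.sax module (not MSXML or VBScript, which the thread is about). The class name TagCounter, the helper count_tags and the sample document are made up for illustration. The handler only sees one event at a time, so no tree is ever built in memory.

```python
import io
import xml.sax
from collections import Counter


class TagCounter(xml.sax.ContentHandler):
    """Counts start tags by element name while the document streams past."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def startElement(self, name, attrs):
        # Called once per opening tag; nothing else about the element is kept.
        self.counts[name] += 1


def count_tags(stream):
    """Parse a file-like object and return a Counter of element names."""
    handler = TagCounter()
    xml.sax.parse(stream, handler)
    return handler.counts


# Hypothetical sample; a real run would pass an open file instead.
sample = io.BytesIO(
    b"<rootNode><childNode>"
    b"<node_name>a</node_name><node_name>b</node_name>"
    b"</childNode></rootNode>"
)
counts = count_tags(sample)
print(dict(counts))
```

For a real file, something like count_tags(open("big.xml", "rb")) should work the same way, with memory use that stays roughly flat regardless of file size.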
 
PedroX

"Brian Staff" wrote
I am not an expert on techniques of parsing, but if performance were a
problem for me, I would try and use as much explicit node naming as
possible...for instance I would maybe recode the above statement to be
something like this:

objXMLDoc.selectNodes("rootNode/childNode/node_name")

WOW. That DID make a difference.
What was taking over an hour before now takes about 2 minutes!
Thank you !!!!!!!!!!!!!!!!
 
Bryce K. Nielsen

WOW. That DID make a difference.
What was taking over an hour before now takes about 2 minutes!
Thank you !!!!!!!!!!!!!!!!

This will make a huge difference. Remember that with XPath, the //node_name
means that it will search *every* node in the entire document. If you make
it more specific, it will be a lot faster.
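
A small sketch of the difference, using Python's xml.etree.ElementTree rather than MSXML (so the API is not the one used in this thread). The sample document is hypothetical; only the element names come from the path quoted above. Note that ElementTree paths are relative to the root element, so MSXML's "rootNode/childNode/node_name" becomes "childNode/node_name" here.

```python
import xml.etree.ElementTree as ET

# Hypothetical document with one <node_name> outside the expected location.
XML = """<rootNode>
  <childNode>
    <node_name>a</node_name>
    <node_name>b</node_name>
  </childNode>
  <otherNode>
    <deep><node_name>stray</node_name></deep>
  </otherNode>
</rootNode>"""

root = ET.fromstring(XML)  # root is the <rootNode> element

# Descendant search (the "//" style): every element under the root is visited.
everywhere = [e.text for e in root.findall(".//node_name")]

# Explicit path: only the children of <childNode> elements are examined.
explicit = [e.text for e in root.findall("childNode/node_name")]

print(everywhere)
print(explicit)
```

The descendant search has to walk the whole tree and also picks up the stray node under otherNode, while the explicit path only looks inside childNode.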

However, when dealing with 10 MB+ documents, you should really start using
SAX and not DOM. I was unaware that VBScript couldn't do SAX; since MSXML's
SAX parser is just a COM object, I figured you could (I've just never
tried). If you can't implement the interface, you could always create a COM
wrapper that does specifically what you need and call that from your ASP
page. I.e., using VB, create a COM object that takes an XML string and runs
the SAX parser to do the inserts into Access, etc.

But the point is: at 10 MB+, stay away from DOM and use SAX...
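
As a rough illustration of "SAX parser that does the inserts", here is a sketch in Python. Everything specific in it is an assumption: an in-memory SQLite database stands in for Access, and the item/name/price layout, the RowLoader class and the load function are invented for the example. A VB COM wrapper would follow the same shape with MSXML's SAX interfaces.

```python
import io
import sqlite3
import xml.sax


class RowLoader(xml.sax.ContentHandler):
    """Inserts one row per <item>; <name> and <price> are hypothetical fields."""

    FIELDS = ("name", "price")

    def __init__(self, conn):
        super().__init__()
        self.conn = conn
        self.row = None    # dict for the <item> being read, or None
        self.field = None  # field element currently open, or None

    def startElement(self, name, attrs):
        if name == "item":
            self.row = {f: "" for f in self.FIELDS}
        elif self.row is not None and name in self.FIELDS:
            self.field = name

    def characters(self, content):
        # Text can arrive in several chunks, so append rather than assign.
        if self.field is not None:
            self.row[self.field] += content

    def endElement(self, name):
        if name == self.field:
            self.field = None
        elif name == "item" and self.row is not None:
            self.conn.execute(
                "INSERT INTO items (name, price) VALUES (?, ?)",
                (self.row["name"], float(self.row["price"])),
            )
            self.row = None


def load(stream, conn):
    conn.execute("CREATE TABLE IF NOT EXISTS items (name TEXT, price REAL)")
    xml.sax.parse(stream, RowLoader(conn))
    conn.commit()  # one commit for the whole file, not one per row


conn = sqlite3.connect(":memory:")  # stand-in for the Access database
load(io.BytesIO(
    b"<items><item><name>bolt</name><price>0.25</price></item>"
    b"<item><name>nut</name><price>0.10</price></item></items>"
), conn)
rows = conn.execute("SELECT name, price FROM items ORDER BY rowid").fetchall()
print(rows)
```

Committing once at the end instead of after every row is a deliberate choice here; per-row commits are a common reason bulk loads of this kind run slowly.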

Bryce K. Nielsen
SysOnyx, Inc. (www.sysonyx.com)
Makers of xmlDig, the XML-SQL Extractor
http://www.sysonyx.com/products/xmldig

P.S. Why did you cross-post this? I typically find better results when I
post messages to one board at a time...
 
Jürgen Kahrs

PedroX said:
WOW. That DID make a difference.
What was taking over an hour before now takes about 2 minutes!

Expat (an XML/SAX parser) needs about 2 seconds for 10 MB.
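
For anyone who wants to try this, Python ships a binding to Expat as xml.parsers.expat. The sketch below builds a synthetic document of roughly 9.4 MB (the record layout and the function name count_elements are made up) and times a pass that just counts elements. The timing will depend on the machine, so treat whatever it prints as a rough indication only.

```python
import time
import xml.parsers.expat


def count_elements(data):
    """Run Expat over a bytes object and return the number of start tags."""
    count = 0

    def on_start(name, attrs):
        nonlocal count
        count += 1

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = on_start
    parser.Parse(data, True)  # True marks this as the final chunk of input
    return count


# Synthetic input: 200,000 hypothetical records of 47 bytes each.
record = b"<childNode><node_name>x</node_name></childNode>"
n = 200_000
data = b"<rootNode>" + record * n + b"</rootNode>"

start = time.perf_counter()
total = count_elements(data)
elapsed = time.perf_counter() - start
print(f"{len(data) / 1e6:.1f} MB, {total} elements, {elapsed:.2f} s")
```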
 
Brian Staff

WOW. That DID make a difference.
What was taking over an hour before now takes about 2 minutes!

Well, it was a bit of a guess on my part <g> - but it is encouraging to know
that explicit XPath naming really does make a difference.

Brian
 
PedroX

But the point is, 10mb+, stay away from DOM, use SAX...

I wanted to, but the whole thing (including the alternative, .NET's
XmlTextReader) is just beyond my comprehension. I found no tutorials that I
could understand.
I know VBScript, JavaScript / JScript, and that's pretty much it.
No Java, no C, no Visual Basic per se (although it is similar to VBScript).
 
Bryce K. Nielsen

Well said:
that explicit Xpath naming does really make a difference.

Yeah, it will. The double slash is like a wildcard: search *every* node for
this name. If you use an explicit path, it knows to look in only one area.
Also don't forget that the result set of a wildcard search could be large,
whereas an explicit one will probably return only the one node...

-BKN
 
