Processing a huge XML file

Tim Perrett · Jul 23, 2007

Hey guys

I was wondering what advice anyone could give me about
processing a huge XML file (in fact, it's an XSD file).

Overall, it's about 20,000 lines of XML to load. Even on my MacBook Pro
with 2 GB of RAM, libxml-ruby eats it up extremely quickly (along with
about 2.8 GB of virtual memory). This is obviously unacceptable, but I am
not sure that a workaround exists.

I wanted to load the schema in order to validate the messages and XML
I was generating. Does anyone have any ideas for a potential workaround?

Cheers

Tim

Lloyd Linklater · Jul 23, 2007

Tim said:
I was wondering what advice anyone could give me about
processing a huge XML file (in fact, it's an XSD file).

Overall, it's about 20,000 lines of XML to load. Even on my MacBook Pro
with 2 GB of RAM, libxml-ruby eats it up extremely quickly (along with
about 2.8 GB of virtual memory). This is obviously unacceptable, but I am
not sure that a workaround exists.

I wanted to load the schema in order to validate the messages and XML
I was generating. Does anyone have any ideas for a potential workaround?

Run it in Windows?

But seriously, 20k lines of XML should not take that much memory unless
the lines are HUGE. How about a simplistic approach? I know that this
is not very Ruby-specific, but it may help.

What if you were to open it in a browser? Browsers display XML files in
a formatted fashion, which means that they must parse them. You could then
search the resulting page for an error message. Just a text search for
"XML Parsing Error" should tell you whether it worked.
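
The same well-formedness check can be sketched in Ruby without a browser. This is only a rough illustration, not what the thread's participants ran: it uses REXML's stream parser (swapped in here for libxml-ruby because it is commonly available with Ruby), and the names NullListener and well_formedness_error are made up for the example. It only tells you whether the file parses, not whether it is a valid schema.

```ruby
require 'rexml/document'
require 'rexml/streamlistener'

# Hypothetical listener that ignores every parse event.
# We only care whether the parser raises an error.
class NullListener
  include REXML::StreamListener
end

# Returns nil when the source is well-formed XML,
# otherwise the parser's error message.
# `source` may be a String or an open IO (e.g. a File).
def well_formedness_error(source)
  REXML::Document.parse_stream(source, NullListener.new)
  nil
rescue REXML::ParseException => e
  e.message
end
```

For a file on disk, something like File.open(path) { |f| well_formedness_error(f) } should work, with path standing in for wherever the XSD lives.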

Trans · Jul 23, 2007

Hey guys

I was wondering what advice anyone could give me about
processing a huge XML file (in fact, it's an XSD file).

Overall, it's about 20,000 lines of XML to load. Even on my MacBook Pro
with 2 GB of RAM, libxml-ruby eats it up extremely quickly (along with
about 2.8 GB of virtual memory). This is obviously unacceptable, but I am
not sure that a workaround exists.

I wanted to load the schema in order to validate the messages and XML
I was generating. Does anyone have any ideas for a potential workaround?

libxml has some known issues, memory consumption especially. Hopefully
they will get fixed, but in the meantime one can only frown at the
irony -- <rubyXML> was one of the earliest Ruby web sites around, yet
Ruby's support for _fast_ XML processing is still sorely lacking.

T.

Robert Klemme · Jul 23, 2007

On 2007/7/23, Tim said:
Hey guys

I was wondering what advice anyone could give me about
processing a huge XML file (in fact, it's an XSD file).

Overall, it's about 20,000 lines of XML to load. Even on my MacBook Pro
with 2 GB of RAM, libxml-ruby eats it up extremely quickly (along with
about 2.8 GB of virtual memory). This is obviously unacceptable, but I am
not sure that a workaround exists.

I wanted to load the schema in order to validate the messages and XML
I was generating. Does anyone have any ideas for a potential workaround?

The generic answer would be: use an XML stream parser (as opposed to a
DOM parser). Even if you directly fill up a model that contains the
whole document, it's likely to be less resource-intensive than a DOM. Of
course, it's optimal (resource-wise) if you can do your validation on
the fly (i.e. while stream parsing).
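
As a rough illustration of the stream-parsing idea, here is a minimal sketch using REXML's stream listener interface. REXML is an assumption on my part (the thread is about libxml-ruby), and TallyListener and tally are hypothetical names. The listener keeps only a tag-name tally and a depth counter, so its memory use grows with the number of distinct tag names rather than with the size of the document.

```ruby
require 'rexml/document'
require 'rexml/streamlistener'

# Hypothetical listener: counts elements by name and tracks the
# deepest nesting level seen, without ever building a tree.
class TallyListener
  include REXML::StreamListener

  attr_reader :counts, :max_depth

  def initialize
    @counts = Hash.new(0)
    @depth = 0
    @max_depth = 0
  end

  # Called for every start tag (and for empty-element tags).
  def tag_start(name, _attrs)
    @counts[name] += 1
    @depth += 1
    @max_depth = @depth if @depth > @max_depth
  end

  # Called for every end tag (and after empty-element tags).
  def tag_end(_name)
    @depth -= 1
  end
end

# Streams `source` (a String or an open IO) through the listener.
def tally(source)
  listener = TallyListener.new
  REXML::Document.parse_stream(source, listener)
  listener
end
```

On-the-fly checks would go in the same callbacks, for example raising from tag_start when an unexpected element name shows up.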

Kind regards

robert

Tim Perrett · Jul 23, 2007

Lloyd said:
Run it in Windows?

But seriously, 20k lines of XML should not take that much memory unless
the lines are HUGE. How about a simplistic approach? I know that this
is not very Ruby-specific, but it may help.

What if you were to open it in a browser? Browsers display XML files in
a formatted fashion, which means that they must parse them. You could then
search the resulting page for an error message. Just a text search for
"XML Parsing Error" should tell you whether it worked.

That's a very fair point, actually: if it opens in the browser, it must be
parsable. It's actually 32,606 lines!
Firefox used 500 MB of RAM to open it, so in theory libxml-ruby should
be able to use less, I would have thought. Unless its DOM methodology is
just a lot more memory-intensive?

What are people's thoughts? Is it crazy to ask libxml to read that
much into memory?

Cheers

Tim

Lloyd Linklater · Jul 23, 2007

Tim said:
Firefox used 500 MB of RAM to open it, so in theory libxml-ruby should
be able to use less, I would have thought. Unless its DOM methodology is
just a lot more memory-intensive?

I am new to Ruby and, as much as I love the language syntax, I have yet
to see how to actually use it in real-world applications. I know that
is likely to get me into trouble, as everyone else seems to manage it,
but there it is.

That having been said, it can be seen that I do not know the inner
workings of Ruby well enough to dig that far inside. However, it cannot
be the DOM as the browser uses that to parse. There would have to be
some other thing that is making the difference and finding that goes
beyond my Ruby knowledge.

James Moore · Jul 23, 2007

I was wondering what advice anyone could give me about
processing a huge XML file (in fact, it's an XSD file).

Something's going wrong. 20k lines is a pretty small XML file; we're
sucking in files that are larger than that (50 MB or so, a little
less than a million lines long) many times a day using the Ruby libxml
bindings and not seeing a similar issue. It's possible that your
average line length is _much_ longer than ours, of course. Our normal
process size is about 400 MB, but a big chunk of that is the processing
we're doing on the data; I want to say that the size after loading
the XML is in the 200 MB range, but I haven't looked at that for a
while.

Are you doing stream processing? We never tried to load the whole
document at once, so there may be an issue with doing that.

- James Moore

Tim Perrett · Jul 23, 2007

Lloyd said:
That having been said, it can be seen that I do not know the inner
workings of Ruby well enough to dig that far inside. However, it cannot
be the DOM as the browser uses that to parse. There would have to be
some other thing that is making the difference and finding that goes
beyond my Ruby knowledge.

I wonder if it's something to do with the XSD includes and imports that it
doesn't like... I might have to ask the libxml core team.

Cheers

Tim

Raymond O'Connor · Jul 26, 2007

I wrote a Ruby script which parses a 25 GB XML file. I used the
XMLParser library from http://www.yoshidam.net/Ruby.html

So parsing a large amount of XML can definitely be done.

-Ray

Tim Perrett · Jul 26, 2007

Hey all

Thanks for your replies!

The file in question is actually an XSD file, so I think you're right:
XML::Schema.new() would use DOM parsing. Does libxml even support stream
parsing? I can't seem to find a great deal on it...

Has anyone ever had any experience with such a large XSD? I can't think
of a way to validate the instance XML without the XSD
being held in memory to check against.

How do things like Xerces manage it with Java?

I fear I might be wanting the impossible! lol

Cheers

-Tim

Robert Klemme · Jul 27, 2007

On 2007/7/27, Tim said:
The file in question is actually an XSD file, so I think you're right:
XML::Schema.new() would use DOM parsing. Does libxml even support stream
parsing? I can't seem to find a great deal on it...

Has anyone ever had any experience with such a large XSD? I can't think
of a way to validate the instance XML without the XSD
being held in memory to check against.

Yes and no: since the XML format (XSD in your case) is known, the parser
could store an optimized representation in memory (i.e. it does not need
the original DOM).

How do things like Xerces manage it with Java?

When a colleague tested JDOM a few years ago, it needed loads of memory.
But of course, that could have changed by now (and also, there are
64-bit JVMs).

I fear I might be wanting the impossible! lol

"Impossible is nothing - Ruby..."

Kind regards

robert

Tim Perrett · Jul 28, 2007

Good point, and thanks for the reply

When you say "known the parser could store an optimized representation
in memory" what exactly do you mean?

Cheers

TP

Robert Klemme · Jul 29, 2007

When you say "known the parser could store an optimized representation
in memory" what exactly do you mean?

XML is a generic format, so an XML DOM needs to be able to store all
variants. XSD is a specific format (as is every other format defined by
a DTD or even an XSD), so you can craft a specific model that represents
XSD's object model.

One example: since XML is markup you can have things like

<foo>text<bar>13</bar>blah</foo>

Any DOM implementation needs to be able to store "text" and "blah". But
often, when XML is used to represent data, there is either text in an
element *or* nested elements but not both. An OO implementation then
would only need to allow for one of the two. Hope that clears it up.
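
To make the contrast concrete, here is a small sketch (again using REXML as an assumed stand-in, with Item and to_item as hypothetical names). The first part shows what a generic DOM has to keep for the mixed-content example; the second shows a narrower model for a data-only format where each element holds either text or child elements. The sketch still goes through a DOM to build the model, so it illustrates the shape of the model rather than the memory saving.

```ruby
require 'rexml/document'

# Generic DOM: text and element children stay interleaved, in order.
doc = REXML::Document.new('<foo>text<bar>13</bar>blah</foo>')
children = doc.root.children.map do |node|
  node.is_a?(REXML::Text) ? [:text, node.value] : [:element, node.name]
end
# children == [[:text, "text"], [:element, "bar"], [:text, "blah"]]

# Hypothetical format-specific model: an element carries either a
# text value or a list of child items, never both.
Item = Struct.new(:name, :value, :children)

# Converts a data-only element into the narrower model.
# Assumes no mixed content; stray text next to child elements is dropped.
def to_item(element)
  kids = element.elements.to_a
  if kids.empty?
    Item.new(element.name, element.text, [])
  else
    Item.new(element.name, nil, kids.map { |kid| to_item(kid) })
  end
end
```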

Kind regards

robert
