Processing a huge XML file

Tim Perrett

Hey guys

I was wondering what advice anyone could possibly hand me about
processing a huge XML file (in fact it's an XSD file).

Overall, it's about 20,000 lines of XML to load. Even on my MacBook Pro
with 2GB of RAM, libxml-ruby eats it up extremely quickly (and about 2.8
GB of virtual memory). This is obviously unacceptable, but I am not sure
that a workaround exists.

I wanted to load the schema in order to validate the messages and XML
I was generating. Has anyone any ideas on a potential workaround?

Cheers

Tim
 
Lloyd Linklater

Tim said:
I was wondering what advice anyone could possibly hand me about
processing a huge XML file (in fact it's an XSD file).

Overall, it's about 20,000 lines of XML to load. Even on my MacBook Pro
with 2GB of RAM, libxml-ruby eats it up extremely quickly (and about 2.8
GB of virtual memory). This is obviously unacceptable, but I am not sure
that a workaround exists.

I wanted to load the schema in order to validate the messages and XML
I was generating. Has anyone any ideas on a potential workaround?

Run it in Windows? :)

But seriously, 20k lines of XML should not take that much memory unless
the lines are HUGE. How about a simplistic approach? I know that this
is not very Ruby-specific, but it may help.

What if you were to open it in a browser? Browsers display XML files in
a formatted fashion, which means that they must parse them. You could then
search the resulting page for an error message. Just a text search for
"XML Parsing Error" should tell you whether it worked.
 
Trans

Tim said:
Hey guys

I was wondering what advice anyone could possibly hand me about
processing a huge XML file (in fact it's an XSD file).

Overall, it's about 20,000 lines of XML to load. Even on my MacBook Pro
with 2GB of RAM, libxml-ruby eats it up extremely quickly (and about 2.8
GB of virtual memory). This is obviously unacceptable, but I am not sure
that a workaround exists.

I wanted to load the schema in order to validate the messages and XML
I was generating. Has anyone any ideas on a potential workaround?

libxml has some known issues, memory consumption especially. Hopefully
they will get fixed, but in the meantime one can only frown at the
irony -- <rubyXML> was one of the earliest Ruby web sites around, yet
Ruby's support for _fast_ XML processing is still sorely lacking.

T.
 
Robert Klemme

2007/7/23 said:
Hey guys

I was wondering what advice anyone could possibly hand me about
processing a huge XML file (in fact it's an XSD file).

Overall, it's about 20,000 lines of XML to load. Even on my MacBook Pro
with 2GB of RAM, libxml-ruby eats it up extremely quickly (and about 2.8
GB of virtual memory). This is obviously unacceptable, but I am not sure
that a workaround exists.

I wanted to load the schema in order to validate the messages and XML
I was generating. Has anyone any ideas on a potential workaround?

The generic answer would be: use an XML stream parser (as opposed to a
DOM parser). Even if you directly fill up a model that contains the
whole document, it's likely to be less resource-intensive than a DOM. Of
course, it's optimal (resource-wise) if you can do your validation on
the fly (i.e. while stream parsing).
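
To make the stream-parsing idea concrete, here is a minimal sketch. It
assumes REXML's stream API from Ruby's standard library rather than
libxml-ruby, and the ElementCounter class and count_elements helper are
hypothetical names made up for illustration. The listener only keeps
running totals, so no document tree is ever built in memory.

```ruby
require "rexml/document"
require "rexml/streamlistener"

# Hypothetical listener: counts elements by name and tracks the deepest
# nesting level seen, without building a tree of the document.
class ElementCounter
  include REXML::StreamListener

  attr_reader :counts, :max_depth

  def initialize
    @counts = Hash.new(0)
    @depth = 0
    @max_depth = 0
  end

  # Called once for every opening (or self-closing) tag.
  def tag_start(name, _attrs)
    @counts[name] += 1
    @depth += 1
    @max_depth = @depth if @depth > @max_depth
  end

  # Called once for every closing (or self-closing) tag.
  def tag_end(_name)
    @depth -= 1
  end
end

# source may be a String or an open IO object (e.g. File.open(path)).
def count_elements(source)
  listener = ElementCounter.new
  REXML::Document.parse_stream(source, listener)
  listener
end
```

Passing an open File instead of a String lets the parser read the input
incrementally. Any on-the-fly checks would go into the listener callbacks.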

Kind regards

robert
 
Tim Perrett

Lloyd said:
Run it in Windows? :)

But seriously, 20k lines of XML should not take that much memory unless
the lines are HUGE. How about a simplistic approach? I know that this
is not very Ruby-specific, but it may help.

What if you were to open it in a browser? Browsers display XML files in
a formatted fashion, which means that they must parse them. You could then
search the resulting page for an error message. Just a text search for
"XML Parsing Error" should tell you whether it worked.

That's a very fair point actually: if it opens in the browser, it must be
parsable. It's actually 32,606 lines!
Firefox used 500MB of RAM to open it, so in theory libxml-ruby should
be able to use less, I would have thought? Unless its DOM methodology is
just a lot more memory-intensive?

What are people's thoughts? Is it crazy to ask libxml to read that
much into memory?

Cheers

Tim
 
Lloyd Linklater

Tim said:
Firefox used 500MB of RAM to open it, so in theory libxml-ruby should
be able to use less, I would have thought? Unless its DOM methodology is
just a lot more memory-intensive?

I am new to Ruby and, as much as I love the language syntax, I have yet
to see how to actually use it in real-world applications. I know that
is likely to get me into trouble, as everyone else seems to do it, but
there it is.

That having been said, it can be seen that I do not know the inner
workings of Ruby well enough to dig that far inside. However, it cannot
be the DOM as the browser uses that to parse. There would have to be
some other thing that is making the difference and finding that goes
beyond my Ruby knowledge.
 
James Moore

Tim said:
I was wondering what advice anyone could possibly hand me about
processing a huge XML file (in fact it's an XSD file).

Something's going wrong. 20k lines is a pretty small XML file; we're
sucking in files that are larger than that (50 MB or so - a little
less than a million lines long) many times a day using the Ruby libxml
bindings and not seeing a similar issue. It's possible that your
average line length is _much_ longer than ours, of course. Our normal
process size is about 400 MB, but a big chunk of that is the processing
we're doing on the data; I want to say that the size after loading
the XML is in the 200 MB range, but I haven't looked at that for a
while.

Are you doing stream processing? We never tried to load the whole
document at once, so there may be an issue doing that.

- James Moore
 
Tim Perrett

Lloyd said:
That having been said, it can be seen that I do not know the inner
workings of Ruby well enough to dig that far inside. However, it cannot
be the DOM as the browser uses that to parse. There would have to be
some other thing that is making the difference and finding that goes
beyond my Ruby knowledge.

I wonder if it's something to do with the XSD includes and imports that it
doesn't like... I might have to ask the libxml core team.

Cheers

Tim
 
Tim Perrett

Hey all

Thanks for your replies!

The file in question is actually an XSD file, so I think you're right,
XML::Schema.new() would use DOM parsing. Does libxml even support stream
parsing? I can't seem to find a great deal on it...

Has anyone ever had any experience with such a large XSD? I can't think
there would be a way of validating the instance XML without the XSD
being held in memory to then check against?

How do things like Xerces manage it with Java?

I fear I might be wanting the impossible! lol

Cheers

-Tim
 
Robert Klemme

2007/7/27 said:
The file in question is actually an XSD file, so I think you're right,
XML::Schema.new() would use DOM parsing. Does libxml even support stream
parsing? I can't seem to find a great deal on it...

Has anyone ever had any experience with such a large XSD? I can't think
there would be a way of validating the instance XML without the XSD
being held in memory to then check against?

Yes and no: since the XML (XSD in your case) is known the parser could
store an optimized representation in memory (i.e. it does not need the
original DOM).

How do things like Xerces manage it with Java?

When a colleague tested JDOM a few years ago, it needed loads of memory.
But of course, that could have changed by now (and also, there are 64-bit
JVMs).

I fear I might be wanting the impossible! lol

"Impossible is nothing - Ruby..." :)

Kind regards

robert
 
Tim Perrett

Good point, and thanks for the reply :)

When you say "known the parser could store an optimized representation
in memory" what exactly do you mean?

Cheers

TP
 
Robert Klemme

Tim said:
When you say "known the parser could store an optimized representation
in memory" what exactly do you mean?

XML is a generic format, so an XML DOM needs to be able to store all
variants. XSD is a specific format (as is every other format defined by
a DTD or even an XSD), and so you can craft a specific model that represents
XSD's object model.

One example: since XML is markup you can have things like

<foo>text<bar>13</bar>blah</foo>

Any DOM implementation needs to be able to store "text" and "blah". But
often, when XML is used to represent data, there is either text in an
element *or* nested elements but not both. An OO implementation then
would only need to allow for one of the two. Hope that clears it up.
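
As a rough sketch of what such a specialised model could look like,
assuming REXML's stream API from Ruby's standard library: DataNode,
DataModelBuilder and build_data_model are hypothetical names made up for
illustration, not part of any library. Each node holds either text or
child nodes, and the builder rejects mixed content instead of storing it.

```ruby
require "rexml/document"
require "rexml/streamlistener"

# Hypothetical compact node: holds either text or children, never both.
DataNode = Struct.new(:name, :text, :children)

# Hypothetical builder that fills the compact model during stream parsing.
class DataModelBuilder
  include REXML::StreamListener

  attr_reader :root

  def initialize
    @stack = []
    @root = nil
  end

  def tag_start(name, _attrs)
    node = DataNode.new(name, nil, nil)
    if (parent = @stack.last)
      # A parent that already holds text may not also get child elements.
      raise ArgumentError, "mixed content in <#{parent.name}>" if parent.text
      (parent.children ||= []) << node
    else
      @root = node
    end
    @stack.push(node)
  end

  def text(data)
    return if data.strip.empty? # ignore indentation between elements
    node = @stack.last
    return if node.nil?
    # A node that already holds children may not also get text.
    raise ArgumentError, "mixed content in <#{node.name}>" if node.children
    node.text = node.text.to_s + data
  end

  def tag_end(_name)
    @stack.pop
  end
end

def build_data_model(source)
  builder = DataModelBuilder.new
  REXML::Document.parse_stream(source, builder)
  builder.root
end
```

With this sketch, a data-style document such as
<order><id>13</id></order> loads fine, while the mixed-content example
above is rejected with an error rather than stored.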

Kind regards

robert
 