Processing a huge XML file

Discussion in 'Ruby' started by Tim Perrett, Jul 23, 2007.

  1. Tim Perrett

    Tim Perrett Guest

    Hey guys

    I was wondering what advice anyone could give me about
    processing a huge XML file (in fact, it's an XSD file).

    Overall, it's about 20,000 lines of XML to load. Even on my MacBook Pro
    with 2 GB of RAM, libxml-ruby eats it up extremely quickly (and about 2.8
    GB of virtual memory). This is obviously unacceptable, but I am not sure
    whether a workaround exists.

    I wanted to load the schema in order to validate the messages and XML
    I was generating. Does anyone have any ideas for a potential workaround?

    Cheers

    Tim
    --
    Posted via http://www.ruby-forum.com/.
     
    Tim Perrett, Jul 23, 2007
    #1

  2. Tim Perrett wrote:

    > I was wondering what advice anyone could give me about
    > processing a huge XML file (in fact, it's an XSD file).
    >
    > Overall, it's about 20,000 lines of XML to load. Even on my MacBook Pro
    > with 2 GB of RAM, libxml-ruby eats it up extremely quickly (and about 2.8
    > GB of virtual memory). This is obviously unacceptable, but I am not sure
    > whether a workaround exists.
    >
    > I wanted to load the schema in order to validate the messages and XML
    > I was generating. Does anyone have any ideas for a potential workaround?


    Run it in Windows? :)

    But seriously, 20k lines of XML should not take that much memory unless
    the lines are HUGE. How about a simplistic approach? I know that this
    is not strictly Ruby, but it may help.

    What if you were to launch it in a browser? They display XML files in
    formatted fashion which means that they must parse them. You could then
    parse through the resulting page and see if there is an error message
    therein. Just a text search for "XML Parsing Error" and that should
    tell you if it worked.
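
    The same well-formedness check can be sketched in Ruby without a browser.
    This is only an illustration, assuming REXML (bundled with Ruby) rather
    than libxml-ruby; the method name check_well_formed is made up for the
    example. Note that it builds a full tree in memory and checks
    well-formedness only, not validity against a schema.

```ruby
require 'rexml/document'

# Hypothetical helper: returns [true, nil] when the XML text parses,
# or [false, message] when the parser reports an error.
# It checks well-formedness only; no schema validation is performed.
def check_well_formed(xml_text)
  REXML::Document.new(xml_text)
  [true, nil]
rescue REXML::ParseException => e
  [false, e.message]
end

good_ok, _ = check_well_formed('<schema><element name="a"/></schema>')
bad_ok, _bad_message = check_well_formed('<schema><element></schema>')

puts "first document well-formed?  #{good_ok}"
puts "second document well-formed? #{bad_ok}"
```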
     
    Lloyd Linklater, Jul 23, 2007
    #2

  3. Trans

    Trans Guest

    On Jul 23, 4:28 am, Tim Perrett <> wrote:
    > Hey guys
    >
    > I was wondering what advice anyone could give me about
    > processing a huge XML file (in fact, it's an XSD file).
    >
    > Overall, it's about 20,000 lines of XML to load. Even on my MacBook Pro
    > with 2 GB of RAM, libxml-ruby eats it up extremely quickly (and about 2.8
    > GB of virtual memory). This is obviously unacceptable, but I am not sure
    > whether a workaround exists.
    >
    > I wanted to load the schema in order to validate the messages and XML
    > I was generating. Does anyone have any ideas for a potential workaround?


    libxml has some known issues, memory consumption especially. Hopefully
    they will get fixed, but in the meantime one can only frown at the
    irony -- <rubyXML> was one of the earliest Ruby web sites around, yet
    Ruby's support for _fast_ XML processing is still sorely lacking.

    T.
     
    Trans, Jul 23, 2007
    #3
  4. 2007/7/23, Tim Perrett <>:
    > Hey guys
    >
    > I was wondering what advice anyone could give me about
    > processing a huge XML file (in fact, it's an XSD file).
    >
    > Overall, it's about 20,000 lines of XML to load. Even on my MacBook Pro
    > with 2 GB of RAM, libxml-ruby eats it up extremely quickly (and about 2.8
    > GB of virtual memory). This is obviously unacceptable, but I am not sure
    > whether a workaround exists.
    >
    > I wanted to load the schema in order to validate the messages and XML
    > I was generating. Does anyone have any ideas for a potential workaround?


    The generic answer would be: use an XML stream parser (as opposed to a
    DOM parser). Even if you directly fill up a model that contains the
    whole document, it's likely to be less resource-intensive than a DOM. Of
    course, it's optimal (resource-wise) if you can do your validation on
    the fly (i.e. while stream parsing).
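
    As a rough illustration of the stream-parsing idea, here is a minimal
    sketch. It assumes REXML's stream API (bundled with Ruby) instead of
    libxml-ruby, and the ElementCounter class and the sample document are
    invented for the example. The parser calls back for each start tag, so
    no document tree is built.

```ruby
require 'rexml/document'
require 'rexml/streamlistener'

# Hypothetical listener: counts start tags by element name as the
# parser streams through the input. No DOM tree is constructed.
class ElementCounter
  include REXML::StreamListener

  attr_reader :counts

  def initialize
    @counts = Hash.new(0)
  end

  # Invoked by the stream parser for every start tag it encounters.
  def tag_start(name, _attributes)
    @counts[name] += 1
  end
end

# Made-up sample input standing in for a much larger file.
xml = <<~XML
  <orders>
    <order id="1"><item>tea</item></order>
    <order id="2"><item>milk</item><item>jam</item></order>
  </orders>
XML

counter = ElementCounter.new
REXML::Document.parse_stream(xml, counter)
puts counter.counts.inspect
```

    For a real file, an open File object can be passed in place of the
    string. On-the-fly checks would go into the listener callbacks.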

    Kind regards

    robert
     
    Robert Klemme, Jul 23, 2007
    #4
  5. Tim Perrett

    Tim Perrett Guest

    Lloyd Linklater wrote:

    >
    > Run it in Windows? :)
    >
    > But seriously, 20k lines of XML should not take that much memory unless
    > the lines are HUGE. How about a simplistic approach? I know that this
    > is not strictly Ruby, but it may help.
    >
    > What if you were to launch it in a browser? They display XML files in
    > formatted fashion which means that they must parse them. You could then
    > parse through the resulting page and see if there is an error message
    > therein. Just a text search for "XML Parsing Error" and that should
    > tell you if it worked.


    That's a very fair point actually: if it opens in the browser, it must be
    parsable. It's actually 32,606 lines!
    Firefox used 500 MB of RAM to open it, so in theory libxml-ruby should
    be able to use less, I would have thought? Unless its DOM methodology is
    just a lot more memory-intensive?

    What are people's thoughts? Is it crazy to ask libxml to read that
    much into memory?

    Cheers

    Tim
     
    Tim Perrett, Jul 23, 2007
    #5
  6. Tim Perrett wrote:

    > Firefox used 500 MB of RAM to open it, so in theory libxml-ruby should
    > be able to use less, I would have thought? Unless its DOM methodology is
    > just a lot more memory-intensive?


    I am new to Ruby and, as much as I love the language syntax, I have yet
    to see how to actually use it in real-world applications. I know that
    is likely to get me into trouble, as everyone else seems to do it, but
    there it is.

    That having been said, it can be seen that I do not know the inner
    workings of Ruby well enough to dig that far inside. However, it cannot
    be the DOM as the browser uses that to parse. There would have to be
    some other thing that is making the difference and finding that goes
    beyond my Ruby knowledge.
     
    Lloyd Linklater, Jul 23, 2007
    #6
  7. James Moore

    James Moore Guest

    On 7/23/07, Tim Perrett <> wrote:
    > I was wondering what advice anyone could give me about
    > processing a huge XML file (in fact, it's an XSD file).


    Something's going wrong. 20k lines is a pretty small XML file; we're
    sucking in files that are larger than that (50 MB or so, a little
    less than a million lines long) many times a day using the Ruby libxml
    bindings and not seeing a similar issue. It's possible that your
    average line length is _much_ longer than ours, of course. Our normal
    process size is about 400 MB, but a big chunk of that is the processing
    we're doing on the data; I want to say that the size after loading
    the XML is in the 200 MB range, but I haven't looked at that for a
    while.

    Are you doing stream processing? We never tried to load the whole
    document at once, so there may be an issue doing that.

    - James Moore
     
    James Moore, Jul 23, 2007
    #7
  8. Tim Perrett

    Tim Perrett Guest

    Lloyd Linklater wrote:
    > That having been said, it can be seen that I do not know the inner
    > workings of Ruby well enough to dig that far inside. However, it cannot
    > be the DOM as the browser uses that to parse. There would have to be
    > some other thing that is making the difference and finding that goes
    > beyond my Ruby knowledge.


    I wonder if it's something to do with the XSD includes and imports that it
    doesn't like... I might have to ask the libxml core team.

    Cheers

    Tim
     
    Tim Perrett, Jul 23, 2007
    #8
  9. Raymond O'Connor, Jul 26, 2007
    #9
  10. Tim Perrett

    Tim Perrett Guest

    Hey all

    Thanks for your replies!

    The file in question is actually an XSD file, so I think you're right,
    XML::Schema.new() would use DOM parsing. Does libxml even support stream
    parsing? I can't seem to find a great deal on it...

    Has anyone ever had any experience with such a large XSD? I can't think
    there would be a way of validating the instance XML without the XSD
    being held in memory to then check against?

    How do things like Xerces manage it with Java?

    I fear I might be wanting the impossible! lol

    Cheers

    -Tim
     
    Tim Perrett, Jul 26, 2007
    #10
  11. 2007/7/27, Tim Perrett <>:
    > The file in question is actually an XSD file, so I think you're right,
    > XML::Schema.new() would use DOM parsing. Does libxml even support stream
    > parsing? I can't seem to find a great deal on it...
    >
    > Has anyone ever had any experience with such a large XSD? I can't think
    > there would be a way of validating the instance XML without the XSD
    > being held in memory to then check against?


    Yes and no: since the XML (XSD in your case) is known the parser could
    store an optimized representation in memory (i.e. does not need the
    original DOM).

    > How do things like Xerces manage it with Java?


    When a colleague tested JDOM a few years ago, it needed loads of memory.
    But of course, that could have changed by now (and also, there are
    64-bit JVMs).

    > I fear I might be wanting the impossible! lol


    "Impossible is nothing - Ruby..." :)

    Kind regards

    robert
     
    Robert Klemme, Jul 27, 2007
    #11
  12. Tim Perrett

    Tim Perrett Guest

    Good point, and thanks for the reply :)

    When you say "known the parser could store an optimized representation
    in memory" what exactly do you mean?

    Cheers

    TP
     
    Tim Perrett, Jul 28, 2007
    #12
  13. On 29.07.2007 00:55, Tim Perrett wrote:
    > When you say "known the parser could store an optimized representation
    > in memory" what exactly do you mean?


    XML is a generic format, so an XML DOM needs to be able to store all
    variants. XSD is a specific format (as is every other format defined by
    a DTD or even an XSD) and so you can craft a specific model that represents
    XSD's object model.

    One example: since XML is markup you can have things like

    <foo>text<bar>13</bar>blah</foo>

    Any DOM implementation needs to be able to store "text" and "blah". But
    often, when XML is used to represent data, there is either text in an
    element *or* nested elements but not both. An OO implementation then
    would only need to allow for one of the two. Hope that clears it up.
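
    The difference can be sketched with plain Ruby structs. This is only an
    illustration; the names GenericNode, TextElement and ContainerElement are
    invented here and do not come from any library.

```ruby
# Generic DOM-style node: text and child elements may be interleaved,
# so every node carries an ordered list of mixed children
# (Strings for text, GenericNodes for nested elements).
GenericNode = Struct.new(:name, :children)

# Specialised data model (hypothetical): an element holds EITHER text
# OR nested elements, never both, so each type stores only one of them.
TextElement = Struct.new(:name, :text)
ContainerElement = Struct.new(:name, :elements)

# <foo>text<bar>13</bar>blah</foo> needs the generic form:
mixed = GenericNode.new('foo', ['text', GenericNode.new('bar', ['13']), 'blah'])

# <foo><bar>13</bar></foo> fits the specialised form:
data = ContainerElement.new('foo', [TextElement.new('bar', '13')])

puts mixed.children.length
puts data.elements.first.text
```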

    Kind regards

    robert
     
    Robert Klemme, Jul 29, 2007
    #13