Removing elements from large XML documents

Discussion in 'Java' started by Jakub Moskal, Mar 28, 2007.

  1. Jakub Moskal

    Jakub Moskal Guest

    Hi,

    I need to remove certain elements from the XML document tree based on
    given parameters, e.g. I have a document with a structure as follows:

    <country>
      <city>
        <street name="streetName" />
      </city>
    </country>

    and I want to remove all <country> nodes for which the street name is
    "someName" (I know the example is lame, but it exposes my problem).

    Initially I used DOM: whenever I found a <street> element with a
    name attribute that I don't want, I removed its country using
    root.removeChild(node.getParent().getParent().getParent()).

    It worked just fine with small files, but problems occurred when I
    started dealing with docs that are 10-60MB in size. DOM loads the
    entire document tree into memory, so this solution doesn't scale
    at all: on most computers I run out of memory. I'd rather not just
    give the JVM more memory, because that isn't a universal solution.

    SAX parses the document serially, so I can't find a way to remove
    the great-grand-node of the current element with it. XSLT processing
    works similarly to DOM, and the same memory issues occur.

    Is there anything else out there that would help me solve this issue?
    Would chopping the file into smaller pieces be a good solution?

    Any help greatly appreciated,
    Jakub.
     
    Jakub Moskal, Mar 28, 2007

  2. Tom Hawtin

    Tom Hawtin Guest

    Jakub Moskal wrote:
    >
    > SAX parses the document serially, so I can't find a way to remove
    > the great-grand-node of the current element with it. XSLT processing
    > works similarly to DOM, and the same memory issues occur.


    (Strictly, whether XSLT uses a DOM is implementation-dependent. There was
    some talk of making Xalan work in a streaming mode several years ago,
    but XSLT isn't seen as sexy as it once was.)

    My suggestion is that when you hit a <country> element, you switch to a
    temporary stream (StringWriter, say). When you find a <street> element
    you don't want, you switch the output to a null stream. At the end of
    the </country> element (or before) write the temporary stream to the
    real output stream, and switch back.

    (I suggest not using RandomAccessFile to jump backwards, as it is
    excessively slow.)
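
    The buffering idea above could be sketched roughly as follows. This is
    only a minimal sketch under some assumptions: it uses the StAX event
    API (javax.xml.stream) rather than SAX, and it buffers parsed events in
    a list instead of writing text to a StringWriter. Clearing the buffer
    plays the role of the null stream. The names CountryFilter, filter and
    unwantedName are made up for illustration.

```java
import java.io.Reader;
import java.io.Writer;
import java.util.ArrayList;
import java.util.List;
import javax.xml.namespace.QName;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.Attribute;
import javax.xml.stream.events.StartElement;
import javax.xml.stream.events.XMLEvent;

// Hypothetical helper: copies XML from in to out, skipping every <country>
// that contains a <street> whose name attribute equals unwantedName.
public class CountryFilter {

    public static void filter(Reader in, Writer out, String unwantedName)
            throws XMLStreamException {
        XMLEventReader reader = XMLInputFactory.newInstance().createXMLEventReader(in);
        XMLEventWriter writer = XMLOutputFactory.newInstance().createXMLEventWriter(out);

        List<XMLEvent> buffer = null; // non-null only while inside a <country>
        boolean drop = false;         // true once an unwanted <street> was seen
        int depth = 0;                // element nesting depth within the <country>

        while (reader.hasNext()) {
            XMLEvent event = reader.nextEvent();

            if (buffer == null) {
                if (event.isStartElement() && "country".equals(
                        event.asStartElement().getName().getLocalPart())) {
                    // Entering a <country>: switch to the temporary buffer.
                    buffer = new ArrayList<>();
                    buffer.add(event);
                    drop = false;
                    depth = 1;
                } else {
                    // Outside any <country>: copy straight through.
                    writer.add(event);
                }
                continue;
            }

            if (event.isStartElement()) {
                depth++;
                StartElement start = event.asStartElement();
                if ("street".equals(start.getName().getLocalPart())) {
                    Attribute name = start.getAttributeByName(new QName("name"));
                    if (name != null && unwantedName.equals(name.getValue())) {
                        // Unwanted street: discard what was buffered so far
                        // and stop buffering the rest of this <country>.
                        drop = true;
                        buffer.clear();
                    }
                }
            } else if (event.isEndElement()) {
                depth--;
            }

            if (!drop) {
                buffer.add(event);
            }

            if (depth == 0) {
                // Reached the matching </country>: flush the buffer
                // (it is empty if the country was dropped) and switch back.
                for (XMLEvent buffered : buffer) {
                    writer.add(buffered);
                }
                buffer = null;
            }
        }
        writer.flush();
        writer.close();
        reader.close();
    }
}
```

    With this shape, memory use is bounded by the largest single <country>
    rather than by the whole document. It could be called with a FileReader
    and a FileWriter (or streams with an explicit encoding) for the real
    input and output files.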

    Tom Hawtin
     
    Tom Hawtin, Mar 28, 2007
