Update huge XML files without loading them into RAM

  • Thread starter Christian Hiller

Christian Hiller

Hi,

I have to update large XML files (> 100 MB). Sometimes some attributes
have to be changed. If I load a file with the SAX reader into a DOM
object, this requires a huge amount of RAM.

Does anybody know how to handle such huge XML files without loading
them into RAM? Is there a way to update XML files directly in the
file system?

Thanks
Christian
 
Skippy

I'm afraid you can't, because the meaning and interpretation of XML tags
depend on their parent tags, so the whole tree has to be in RAM.
 
Tim Ward

Ah, so it sounds like the "XML is the solution to all data representation
and storage problems" approach might have some teensy little drawbacks,
doesn't it? Like scalability, for example.

You'll have to restructure the data into a real database - SQL might sound
very old hat and boring to XML advocates, but RDBMSs work, have been working
for decades, and can handle tiddly little bits of data like 100 MB without
even stirring in their sleep. You can, of course, store smaller XML
fragments in the database if that turns out to be helpful.

Then updating a few attributes will take a small fraction of a second using
a few K of RAM.
 
Marco Schmidt

Directly in the file system: only if each new attribute value takes
exactly as many bytes as the old one. Then you can seek to the right
position (how you discover that is another question) and overwrite the
old value.
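
A minimal sketch of that seek-and-overwrite idea, assuming the byte offset
of the value is already known and the file is UTF-8. The class name
InPlacePatch and the method overwrite are made up for illustration; finding
the offset is not addressed here.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper, not part of any library.
class InPlacePatch {
    /**
     * Overwrites oldValue with newValue at the given byte offset.
     * Refuses the edit unless both values occupy the same number of bytes.
     */
    static void overwrite(String path, long offset, String oldValue, String newValue)
            throws IOException {
        byte[] oldBytes = oldValue.getBytes(StandardCharsets.UTF_8); // assumes a UTF-8 file
        byte[] newBytes = newValue.getBytes(StandardCharsets.UTF_8);
        if (oldBytes.length != newBytes.length) {
            throw new IllegalArgumentException("replacement must have the same byte length");
        }
        try (RandomAccessFile file = new RandomAccessFile(path, "rw")) {
            // Check that the expected old value really sits at the offset.
            byte[] current = new byte[oldBytes.length];
            file.seek(offset);
            file.readFully(current);
            if (!Arrays.equals(current, oldBytes)) {
                throw new IllegalStateException("unexpected content at offset " + offset);
            }
            // Seek back and overwrite in place; the file length does not change.
            file.seek(offset);
            file.write(newBytes);
        }
    }
}
```

Only the few bytes around the offset are touched, so memory use does not
depend on the file size. A shorter new value could be padded with spaces
only where the application tolerates the changed attribute value.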

Obviously, you could get a good programmer's text editor like
Ultraedit and edit the large file manually. May be slow, depending on
the system, but it works.

As for loading into RAM - you can extend
org.xml.sax.helpers.DefaultHandler and make it copy everything into
some output stream, only changing your attributes. This will require
reading and writing the whole input file, but you don't have to have
everything in memory at the same time.
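
A rough sketch of such a handler, under some assumptions: the class
AttributeRewriter, its rewrite method and the element/attribute names passed
to it are invented for illustration, the parser is the JDK default (not
namespace-aware), and comments, processing instructions, the DOCTYPE and the
XML declaration are simply not copied.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.io.Writer;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: copies SAX events to a Writer, changing one attribute.
class AttributeRewriter extends DefaultHandler {
    private final Writer out;
    private final String element;
    private final String attribute;
    private final String newValue;

    AttributeRewriter(Writer out, String element, String attribute, String newValue) {
        this.out = out;
        this.element = element;
        this.attribute = attribute;
        this.newValue = newValue;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        write("<" + qName);
        for (int i = 0; i < atts.getLength(); i++) {
            String value = atts.getValue(i);
            // Swap in the new value only for the targeted element/attribute pair.
            if (qName.equals(element) && atts.getQName(i).equals(attribute)) {
                value = newValue;
            }
            write(" " + atts.getQName(i) + "=\"" + escape(value) + "\"");
        }
        write(">");
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        write("</" + qName + ">");
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        write(escape(new String(ch, start, length)));
    }

    // The parser hands over unescaped text, so markup characters are re-escaped.
    private static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }

    private void write(String s) {
        try {
            out.write(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Streams the input through the handler; only the current event is in memory.
    static void rewrite(InputStream in, Writer out, String element,
                        String attribute, String newValue) throws Exception {
        SAXParserFactory.newInstance().newSAXParser()
                .parse(in, new AttributeRewriter(out, element, attribute, newValue));
        out.flush();
    }
}
```

For a real file one would pass a buffered FileInputStream and a buffered
Writer on a second file, e.g. rewrite(in, out, "item", "status", "new").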

If you have to change huge XML files on a regular basis, something is
flawed with the whole approach. Huge XML files are just not efficient
for editing the data. Maybe that data can be changed in a database
which is the basis for an XML export, or everything could be put into
several XML files instead of one, ...

Regards,
Marco
 
Johan Poppe

Skippy wrote:
I'm afraid you can't, because the meaning and interpretation of XML tags
depend on their parent tags, so the whole tree has to be in RAM.

That is not correct.

You don't need the whole tree in RAM to know the parent elements of
any given element, it's enough to have the stack of currently open
elements. Unless the tree consists of one very long branch, that's a
huge difference.

Depending on the DTD, and on what level of understanding you need for
the job you are doing on the file, you may not even need to know all
parent elements to understand a given element, it may be enough to
have a few flags to tell where you are in the file and you don't even
need the open elements stack.


So yes, you can update an XML file without ever storing all of it in
RAM: Read it in with a SAX parser. Push elements onto a stack in
startElement(), and pop them off the stack in endElement(). As you get
data in startElement(), endElement(), characters() and
ignorableWhitespace(), you write the updated version to a temp file.
Finally, delete the original file and rename the temp file.
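
The steps above could look roughly like this. It is a sketch under
assumptions, not a finished tool: the class StackRewriter, the element names
"item" and "archive", and the attribute "status" are invented, the document
is assumed to use no namespaces, and output goes through the JDK's
XMLStreamWriter so that escaping is handled by the library. Comments and
the DOCTYPE are not copied.

```java
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayDeque;
import java.util.Deque;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: sets status="closed" on <item> elements whose parent is <archive>.
class StackRewriter extends DefaultHandler {
    private final Deque<String> open = new ArrayDeque<>(); // currently open elements
    private final XMLStreamWriter out;

    StackRewriter(XMLStreamWriter out) {
        this.out = out;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        // Before the push, the top of the stack is the parent of this element.
        boolean target = "item".equals(qName) && "archive".equals(open.peek());
        open.push(qName);
        try {
            out.writeStartElement(qName);
            for (int i = 0; i < atts.getLength(); i++) {
                String name = atts.getQName(i);
                String value = target && "status".equals(name) ? "closed" : atts.getValue(i);
                out.writeAttribute(name, value);
            }
        } catch (XMLStreamException e) {
            throw new SAXException(e);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        open.pop();
        try {
            out.writeEndElement();
        } catch (XMLStreamException e) {
            throw new SAXException(e);
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        try {
            out.writeCharacters(ch, start, length);
        } catch (XMLStreamException e) {
            throw new SAXException(e);
        }
    }

    @Override
    public void ignorableWhitespace(char[] ch, int start, int length) throws SAXException {
        characters(ch, start, length);
    }

    // Writes the updated document to a temp file next to the original, then replaces it.
    static void update(Path file) throws Exception {
        Path temp = Files.createTempFile(file.toAbsolutePath().getParent(), "update", ".tmp");
        try (InputStream in = Files.newInputStream(file);
             OutputStream os = Files.newOutputStream(temp)) {
            XMLStreamWriter writer =
                    XMLOutputFactory.newInstance().createXMLStreamWriter(os, "UTF-8");
            writer.writeStartDocument();
            SAXParserFactory.newInstance().newSAXParser().parse(in, new StackRewriter(writer));
            writer.writeEndDocument();
            writer.flush();
            writer.close();
        }
        Files.move(temp, file, StandardCopyOption.REPLACE_EXISTING);
    }
}
```

Memory use grows with the nesting depth (the stack), not with the file
size. Creating the temp file in the same directory keeps the final move on
one file system.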
 
Grzegorz Glowaty

Hi. I cannot see any easy solution, but what about parsing the file with
a SAX parser (without loading it into a DOM) and just rewriting its
contents to a temporary file?
When a node that needs updating is reached, the update is performed and the
updated node is written out instead of the old one.
In the end the temporary file holds the new document. Just rename it to the
old name.
This should work well and should not consume much time or RAM.

greg
 
SPC

You are probably on a hiding to nothing.

It's probably going to be better to split the files up and work with
subsets of them.

It *might* be possible to use SAX and parse your way through the file
to the point you need to change, writing the parsed SAX output to an
output file. When you hit the point you need to modify, write your
changes to the file, then continue to parse. Repeat until end of file.
There's code on the Sun site that echoes an XML file using SAX; it's
part of their SAX/DOM XML tutorial. You could start with that, and
then graft on the code to recognise the bits you need to
change/delete/insert.

Mind you, I can't help feeling that 100 MB XML files are a bit huge.
I'd still be inclined to try and split up those files if at all
possible...

HTH

Steve
 
Andrew

You could use a filter to do this, both reading and writing the data.
See this article:
http://www-106.ibm.com/developerworks/xml/library/x-tipbigdoc.html

Note: There are 4 parts to it, but the important sections for you will
relate specifically to filtering the XML document before it reaches
memory (i.e. filter the _stream_).
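
As an illustration of the filter idea (not code from the linked article),
here is a sketch built on org.xml.sax.helpers.XMLFilterImpl with an identity
Transformer as the serializer. The class StatusFilter, the element "item"
and the attribute "status" are hypothetical. Whether the identity transform
really streams depends on the TransformerFactory implementation, so that
part is an assumption worth checking with a large test file.

```java
import java.io.InputStream;
import java.io.OutputStream;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.XMLFilterImpl;

// Hypothetical filter: sits between the parser and the serializer and
// rewrites the status attribute of every <item> element on the fly.
class StatusFilter extends XMLFilterImpl {
    StatusFilter(XMLReader parent) {
        super(parent);
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        // Unprefixed attributes live in the empty namespace.
        int index = "item".equals(localName) ? atts.getIndex("", "status") : -1;
        if (index >= 0) {
            AttributesImpl copy = new AttributesImpl(atts); // the original may be read-only
            copy.setValue(index, "closed");
            atts = copy;
        }
        super.startElement(uri, localName, qName, atts); // pass the event downstream
    }

    // Parses the input through the filter and serializes the filtered events.
    static void filter(InputStream in, OutputStream out) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        XMLReader reader = factory.newSAXParser().getXMLReader();
        TransformerFactory.newInstance().newTransformer() // no stylesheet: identity transform
                .transform(new SAXSource(new StatusFilter(reader), new InputSource(in)),
                           new StreamResult(out));
    }
}
```

Compared with writing the tags by hand, the serializer takes care of
escaping and the XML declaration, and all events the filter does not
override are passed through unchanged.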

 