Get "java.lang.OutOfMemoryError" when Parsing an XML useing DOM


NeoGeoSNK

Hello,
I just wrote an XML parsing tool using the Java DOM parser. It works fine
when parsing small XML files, but when I parse an XML file of over
500,000 lines, it throws a "java.lang.OutOfMemoryError" Exception at
line 4.

1: File f = new File(filename);
2: DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
3: DocumentBuilder builder = factory.newDocumentBuilder();
4: Document doc = builder.parse(f);

However, I don't want to use other XML parsers such as SAX because I
would have to rewrite most of my code :'( Below is the structure of the XML
file:

<journal>
  <record type="1" id="275">
    <header>
      <header_generic>
      </header_generic>
      <header_specific_user>
      </header_specific_user>
    </header>
    <body>
      <frame frame_id="200011">
        <attribute type="STRING">
          ...........
        </attribute>
      </frame>
      ............
    </body>
  </record>
</journal>


Could somebody give me some suggestions?

Thanks and Best Regards!
 

Andrew Thompson

Hello,
I just wrote an XML parsing tool using the Java DOM parser. It works fine
when parsing small XML files, but when I parse an XML file of over
500,000 lines, it throws a "java.lang.OutOfMemoryError" Exception ..

Note that as the quoted text clearly states,
this is an *Error*, not an *Exception*. This
is an important distinction if attempting to
catch the result.

Have you tried increasing the memory available
to the application?

Andrew T.
 

Mike Schilling

NeoGeoSNK said:
Hello,
I just wrote an XML parsing tool using the Java DOM parser. It works fine
when parsing small XML files, but when I parse an XML file of over
500,000 lines, it throws a "java.lang.OutOfMemoryError" Exception at
line 4.

1: File f = new File(filename);
2: DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
3: DocumentBuilder builder = factory.newDocumentBuilder();
4: Document doc = builder.parse(f);

However, I don't want to use other XML parsers such as SAX because I
would have to rewrite most of my code :'(

DOM creates an object for each feature (element, attribute, text, etc.) of
the XML document. A bigger document occupies more memory. If you're going
to construct DOMs for huge documents, you'll need to give the JVM more
memory.

If you don't need to keep the entire document in memory (say, if you process
each element and cease to need it after it's processed), then SAX or a pull
parser would be far better choices.
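To make the streaming idea concrete, a minimal SAX sketch might look like the
following. It only counts <record> elements as they stream past, standing in
for whatever per-record processing is actually needed; the class name and the
use of a command-line argument for the file are invented for illustration.

import java.io.File;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Sketch only: processes <record> elements one at a time, without
// ever building the whole document in memory.
public class RecordCounter extends DefaultHandler {
    private int records;

    @Override
    public void startElement(String uri, String localName,
                             String qName, Attributes attributes) {
        // the factory is namespace-unaware by default, so qName holds the tag name
        if ("record".equals(qName)) {
            records++;   // real code would extract whatever it needs here
        }
    }

    public static void main(String[] args) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        RecordCounter handler = new RecordCounter();
        parser.parse(new File(args[0]), handler);   // e.g. the large journal file
        System.out.println("records: " + handler.records);
    }
}

Because nothing outlives the current element unless the handler deliberately
keeps it, memory use stays flat regardless of the file size.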
 

NeoGeoSNK

DOM creates an object for each feature (element, attribute, text, etc.) of
the XML document. A bigger document occupies more memory. If you're going
to construct DOMs for huge documents, you'll need to give the JVM more
memory.

If you don't need to keep the entire document in memory (say, if you process
each element and cease to need it after it's processed), then SAX or a pull
parser would be far better choices.

Thanks very much,
I just used the java -Xmx1024m option to allocate 1 GB of memory to the JVM,
but 40 minutes later it still hasn't finished processing the XML file :'(
 

NeoGeoSNK

Note that as the quoted text clearly states,
this is an *Error*, not an *Exception*. This
is an important distinction if attempting to
catch the result.

Have you tried increasing the memory available
to the application?

Andrew T.

Thanks very much,
I just increased the memory available to 1 GB (java -Xmx1024m),
but it still hasn't finished the work. Do you know how to
estimate the time and memory it will consume?
 

NeoGeoSNK

Note that as the quoted text clearly states,
this is an *Error*, not an *Exception*. This
is an important distinction if attempting to
catch the result.
Thanks,
I remember hearing that exceptions are the only error-handling mechanism
in Java? Also, the error log on another PC, listed below, is different
from mine:

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
        at java.util.Arrays.copyOfRange(Unknown Source)
        at java.lang.String.<init>(Unknown Source)
        at com.sun.org.apache.xerces.internal.xni.XMLString.toString(Unknown Source)
        at com.sun.org.apache.xerces.internal.parsers.AbstractDOMParser.characters(Unknown Source)
        at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
        at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(Unknown Source)
        at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(Unknown Source)
        at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(Unknown Source)
        at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(Unknown Source)
        at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(Unknown Source)
        at javax.xml.parsers.DocumentBuilder.parse(Unknown Source)
        at ParsingLog.parsing(ParsingLog.java:21)
        at Log2XML.main(Log2XML.java:12)

BRs
Ning Yu.
 

Andrew Thompson

On Mar 23, 1:52 pm, "Mike Schilling" <[email protected]> ...
*
...
I just used the java -Xmx1024m option to allocate 1 GB of memory to the JVM,
but 40 minutes later it still hasn't finished processing the XML file

Another 20 minutes and it becomes an
'incomputable problem' according to
the definition as I vaguely recall..

* Sounds as though the task might be better
achieved using the optimal tools for the
job, rather than try to 'work around' the
problems of parsing the entire document
using DOM.

Andrew T.
 

NeoGeoSNK

Another 20 minutes and it becomes an
'incomputable problem' according to
the definition as I vaguely recall..

* Sounds as though the task might be better
achieved using the optimal tools for the
job, rather than try to 'work around' the
problems of parsing the entire document
using DOM.

Andrew T.

Thanks,
I can't wait any longer; the job has been running for nearly 2 hours and
still hasn't finished. I think I'll try the SAX API. Is there a faster API
for parsing XML in Java?
 

Jaakko Kangasharju

NeoGeoSNK said:
Thanks very much,
I just used the java -Xmx1024m option to allocate 1 GB of memory to the JVM,
but 40 minutes later it still hasn't finished processing the XML file :'(

Do you actually have 1 GB of memory on your computer? DOM parsing
isn't actually very much slower than SAX, and for an XML file of the
size you described, parsing should be measurable in seconds on a
reasonably modern computer. So the only reason I can think of for it
to be as slow as it is is that you don't have enough physical memory
and the JVM starts paging.

I would try lowering the -Xmx option to less than the actual memory
you have and try to find a value that lets you parse the file without
paging to disk. It's hard to say the exact value, but your XML file
seems pretty heavy on the structure, so a DOM representation is going
to take a lot of memory. I have here a 2 MB XML file about as heavily
structured, and it takes about 20 MB as a DOM tree, so you can perhaps
estimate from that.
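One rough way to get such an estimate yourself is to compare heap usage before
and after the parse. This is only a sketch: the numbers depend on GC timing and
the JVM, and the class name and command-line argument are placeholders.

import java.io.File;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public class DomFootprint {
    public static void main(String[] args) throws Exception {
        Runtime rt = Runtime.getRuntime();
        rt.gc();
        long before = rt.totalMemory() - rt.freeMemory();

        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().parse(new File(args[0]));

        rt.gc();
        long after = rt.totalMemory() - rt.freeMemory();
        // referencing doc here keeps the tree alive until after the measurement
        System.out.println(doc.getDocumentElement().getTagName()
                + " tree: roughly " + (after - before) / (1024 * 1024) + " MB");
    }
}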
 

Tom Hawtin

Jaakko said:
I would try lowering the -Xmx option to less than the actual memory
you have and try to find a value that lets you parse the file without
paging to disk. It's hard to say the exact value, but your XML file

It's also worth setting -Xms to the same value as -Xmx. There is no
point in doing lots of garbage collection if you could just allocate
some more memory.

Also, -server might speed things up a bit. And if in validating mode,
DocumentBuilderFactory.setIgnoringElementContentWhitespace might reduce
memory a bit.

Tom Hawtin
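A sketch of that combination: the whitespace setting lives on the factory and
only takes effect while the parser is validating (so it is only worthwhile if
the document actually declares a DTD or schema). The class name and the flags
shown in the comment are just the ones discussed above, not anything required.

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;

public class TunedBuilder {
    // Invocation with the suggested JVM flags might look like:
    //   java -server -Xms1024m -Xmx1024m Log2XML journal.xml
    static DocumentBuilder newBuilder() throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setValidating(true);                       // whitespace option applies only in validating mode
        factory.setIgnoringElementContentWhitespace(true); // skip text nodes that are pure whitespace
        return factory.newDocumentBuilder();
    }
}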
 

John W. Kennedy

NeoGeoSNK said:
Thanks,
I remember hearing that exceptions are the only error-handling mechanism
in Java? Also, the error log on another PC, listed below, is different
from mine:

Unfortunately, when speaking of Java, the word "exception" is used in
more than one way. A thing that can be thrown and caught is often called
an "exception", but the correct name is "Throwable". Throwables are
divided into two groups, the "Error" group and the "Exception" group.
The difference is that an Error normally represents a disaster, such as
running out of memory, that a program should not normally try to (or be
able to) recover from.

Therefore, your "it throws a 'java.lang.OutOfMemoryError' Exception" is
a whuzzat, like saying "a Pennsylvania Canadian".
 

John W. Kennedy

NeoGeoSNK said:
Thanks,
I can't wait any longer; the job has been running for nearly 2 hours and
still hasn't finished. I think I'll try the SAX API. Is there a faster API
for parsing XML in Java?

SAX won't necessarily be /faster/ -- it could be a lot slower. It
depends on what you're doing.

Are you page-thrashing? If so, then SAX is definitely a good idea.
--
John W. Kennedy
"...if you had to fall in love with someone who was evil, I can see why
it was her."
-- "Alias"
* TagZilla 0.066 * http://tagzilla.mozdev.org
 

Lew

John said:
SAX won't necessarily be /faster/ [than DOM] -- it could be a lot slower. It
depends on what you're doing.

Are you page-thrashing? If so, then SAX is definitely a good idea.

Another way SAX can really speed things up is that you can use it to handle an
entire XML document in a single pass without huge memory structures to build
and traverse. Back when Java 1.2 first came out I was on a project that used
Java and SAX to parse largish XML documents over the network and it ran like a
bat out of heck. With modern network tech (gigabit LAN, ...), today's
processors and the improvements in Java it would truly scream.

It sounds like the OP's DOM tree is too large to process efficiently. SAX,
correctly used, would almost certainly create a huge speed improvement - like
from 2 hours-infinite down to about a second or two, I would guess.

Like JWK said, it really depends on what you do with the parsed data.
Additional I/O (writing the parsed data to a file or DBMS), large auxiliary
memory structures and other factors could kill the speedup.

-- Lew
 

NeoGeoSNK

SAX won't necessarily be /faster/ -- it could be a lot slower. It
depends on what you're doing.

Are you page-thrashing? If so, then SAX is definitely a good idea.
--
John W. Kennedy
"...if you had to fall in love with someone who was evil, I can see why
it was her."
-- "Alias"
* TagZilla 0.066 * http://tagzilla.mozdev.org

what "...if you had to fall in love with someone who was evil, I can
see why
it was her." means?
I don't know how DOM works when it parsing a XML, I use DOM that is
because the XPath can quciky location some particular elements. I
think if the SAX only reports events but not store the whole structure
of XML like DOM does, It must be more efficient. What does "page-
thrashing" means ?
I paste the source of the code:)
FYI

public Set parsing(String filename) throws Exception {
    Set subset = new LinkedHashSet();
    File f = new File(filename);
    DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
    DocumentBuilder builder = factory.newDocumentBuilder();
    Document doc = builder.parse(f);
    Element root = doc.getDocumentElement();
    XPathFactory xpfactory = XPathFactory.newInstance();
    XPath path = xpfactory.newXPath();
    NodeList recordlist = (NodeList) path.evaluate("/journal/record",
            doc, XPathConstants.NODESET);
    // enumerate all records in the log
    for (int i = 0; i < recordlist.getLength(); i++) {
        Node record = recordlist.item(i);
        Element recordelement = (Element) record;
        // get the operation type
        String BEtype = (String) path.evaluate(
                "header/header_generic/domain/@value", recordelement);
        if (!BEtype.equals("SHLR::Subscription"))
            continue;
        SubInfo subscriber = new SubInfo();
        NodeList framelist = (NodeList) path.evaluate("body/frame",
                recordelement, XPathConstants.NODESET);
        // enumerate the frames in a record
        for (int j = 0; j < framelist.getLength(); j++) {
            NodeList attriblist = (NodeList) path.evaluate(
                    "attribute/attribute_value/string/@value",
                    framelist.item(j), XPathConstants.NODESET);
            for (int k = 0; k < attriblist.getLength(); k++) {
                Node attribute = attriblist.item(k);
                String value = attribute.getNodeValue();
                if (value.equals("create")) {
                    subscriber.setModifier("create");
                } else if (value.equals("modify")) {
                    subscriber.setModifier("modify");
                } else if (value.equals("delete")) {
                    subscriber.setModifier("delete");
                } else if (value.trim().matches("dirNumberId.*")) {
                    String dirnumber = value.substring(
                            value.indexOf("dirNumberId=") + 12,
                            value.indexOf(",sHLRSubsOrganizationId"));
                    String ndc = value.substring(
                            value.indexOf("nDCId=") + 6,
                            value.indexOf(",managedElementId=SHLR"));
                    subscriber.setNDCId(ndc);
                    subscriber.setdirNumberId(dirnumber);
                } else if (value.equals("calledList")) {
                    Node calledattr = attriblist.item(k + 1);
                    String calledvalue = calledattr.getNodeValue();
                    if (calledvalue.equals("NULL"))
                        subscriber.removeCalledList();
                    else
                        subscriber.addCalledList(calledvalue);
                } else if (value.equals("callingList")) {
                    Node callingattr = attriblist.item(k + 1);
                    String callingvalue = callingattr.getNodeValue();
                    if (callingvalue.equals("NULL"))
                        subscriber.removeCallingList();
                    else
                        subscriber.addCallingList(callingvalue);
                } else if (value.equals("lRNumberId")) {
                    Node lrnattr = attriblist.item(k + 1);
                    String lrnvalue = lrnattr.getNodeValue();
                    subscriber.setlrnNumberId(lrnvalue);
                }
            }
        }
        if (subscriber != null)
            subset.add(subscriber);
    }
    return subset;
}
 

NeoGeoSNK

John said:
SAX won't necessarily be /faster/ [than DOM] -- it could be a lot slower. It
depends on what you're doing.
Are you page-thrashing? If so, then SAX is definitely a good idea.

Another way SAX can really speed things up is that you can use it to handle an
entire XML document in a single pass without huge memory structures to build
and traverse. Back when Java 1.2 first came out I was on a project that used
Java and SAX to parse largish XML documents over the network and it ran like a
bat out of heck. With modern network tech (gigabit LAN, ...), today's
processors and the improvements in Java it would truly scream.

It sounds like the OP's DOM tree is too large to process efficiently. SAX,
correctly used, would almost certainly create a huge speed improvement - like
from 2 hours-infinite down to about a second or two, I would guess.

Like JWK said, it really depends on what you do with the parsed data.
Additional I/O (writing the parsed data to a file or DBMS), large auxiliary
memory structures and other factors could kill the speedup.

-- Lew

Thanks,
I just want to transform the original XML into another XML file. The
original is a log of subscribers; I extract and return a set of these
subscribers and build a new XML structure.
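For a log-to-log transformation like that, a pull parser (StAX, in
javax.xml.stream, available from Java 6) can read and write in a single
streaming pass. A rough sketch follows, assuming the input looks like the
journal format quoted earlier and that each <record> carries an id attribute;
the output element names and the class name are invented for illustration, and
real extraction logic would replace the comment in the loop.

import java.io.FileInputStream;
import java.io.FileOutputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;
import javax.xml.stream.XMLStreamWriter;

public class JournalTransform {
    public static void main(String[] args) throws Exception {
        XMLStreamReader in = XMLInputFactory.newInstance()
                .createXMLStreamReader(new FileInputStream(args[0]));
        XMLStreamWriter out = XMLOutputFactory.newInstance()
                .createXMLStreamWriter(new FileOutputStream(args[1]), "UTF-8");

        out.writeStartDocument("UTF-8", "1.0");
        out.writeStartElement("subscribers");
        while (in.hasNext()) {
            if (in.next() == XMLStreamConstants.START_ELEMENT
                    && "record".equals(in.getLocalName())) {
                // one output element per <record>; the real dirNumberId,
                // calledList, etc. extraction would go here
                out.writeStartElement("subscriber");
                String id = in.getAttributeValue(null, "id");
                if (id != null) {
                    out.writeAttribute("id", id);
                }
                out.writeEndElement();
            }
        }
        out.writeEndElement();
        out.writeEndDocument();
        out.close();
        in.close();
    }
}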
 

Andrew Thompson

....
what "...if you had to fall in love with someone who was evil, I can
see why
it was her." means?

Note that it was not connected to the technical
part of the conversation, it is just part of a
'sig.' or 'signature line'. Sig.s are often
intended to be humorous, or funny, and that is
just one such line. Other people's sig.'s might
push points of view that the person is particularly
fond of, or to simply add details of themselves,
or their own web sites, or links of interest to
them.

I generally prefer the 'funny' sig.s - most
other sig.s take themselves far too seriously.

(Note also that it is generally a good idea
to trim sig.s when replying, as the relevant
information ('who posted what') is still contained
in the 'Jim wrote: ..' attribution lines above the
text.)

Andrew T.
 

NeoGeoSNK

Unfortunately, when speaking of Java, the word "exception" is used in
more than one way. A thing that can be thrown and caught is often called
an "exception", but the correct name is "Throwable". Throwables are
divided into two groups, the "Error" group and the "Exception" group.
The difference is that an Error normally represents a disaster, such as
running out of memory, that a program should not normally try to (or be
able to) recover from.

Therefore, your "it throws an 'java.lang.OutOfMemoryError' Exception" is
a whuzzat, like saying "a Pennsylvania Canadian".

--
John W. Kennedy
A proud member of the reality-based community.
* TagZilla 0.066 * http://tagzilla.mozdev.org

I think the "Exception" your mean is the Excpetion class which extends
the java.lang.Throwable, but here I talk about is the Java error
handle mechanism ,so I think Exception is an excepiton, error is an
exception, and throwable is an exception too.
by the way,
"Exception in thread "main" java.lang.OutOfMemoryError: Java heap
space" is reported by the JVM, not I said:)

-- Ny
 

NeoGeoSNK

Note that it was not connected to the technical
part of the conversation, it is just part of a
'sig.' or 'signature line'. Sig.s are often
intended to be humorous, or funny, and that is
just one such line. Other people's sig.'s might
push points of view that the person is particularly
fond of, or to simply add details of themselves,
or their own web sites, or links of interest to
them.

I generally prefer the 'funny' sig.s - most
other sig.s take themselves far too seriously.

(Note also that it is generally a good idea
to trim sig.s when replying, as the relevant
information ('who posted what') is still contained
in the 'Jim wrote: ..' attribution lines above the
text.)

Andrew T.

Thanks, Andrew T.
I just used SAX to rewrite the code, and the performance increased a lot.
To my surprise, parsing the XML with DOM would take more than 6 hours, but
SAX takes only 6 seconds :) I don't think DOM can parse an XML file of more
than 100,000 lines without throwing a memory error, so I think there's no
argument about the speed of these two parsers. When using DOM, I must load
the whole XML into memory:
Document doc = builder.parse(file);
This becomes impossible when the file is too large.

I really can't understand the 'signature line' you explained? I think it's
more complex than the XML parser and Java :)
I guess the ".)" is a 'funny' sig. of yours?
 

Andrew Thompson

(big trim)
Thanks, Andrew T.

Well, ..for whatever I've done, 'you're welcome',
but most of the best suggestions in this thread
came from other people! AFAIR it was Mike S.
who first suggested the much better strategy
of using SAX.
I just used SAX to rewrite the code, and the performance increased a lot.
To my surprise, parsing the XML with DOM would take more than 6 hours, but
SAX takes only 6 seconds :)

Hmm... That is quite an impressive difference,
isn't it? Lew's estimate was not far off (I did
not comment at the time - but I really thought
his statement of '2 hour -> 1 to 2 seconds' was
unrealistic!).
I really can't understand the 'signature line' you explained? I think it's
more complex than the XML parser and Java :)

It is both more complicated, and far less
important, but I do not quite understand
what you mean - if you need further information,
please write your question a little differently
(I do not understand your *question*).

On the other hand, I recommend forgetting
the sig. - it is really not that important.

By the way - I am glad you solved the
technical problem. :)

Andrew T.
 

Patricia Shanahan

NeoGeoSNK wrote:
....
I don't know how DOM works when it parses XML; I use DOM because XPath can
quickly locate particular elements. I think that if SAX only reports events
and doesn't store the whole structure of the XML like DOM does, it must be
more efficient. What does "page-thrashing" mean?
....

Imagine working in an office, doing some complicated task, using a desk
with a limited area, and a file cabinet with far more paper in it than
can fit on the desk.

The desk top is usually full, so when you need to create a new document
or get something from the filing cabinet, you need to remove something
from the desk. The easiest way is to just get rid of a paper you have
not looked at recently.

There are two very different cases:

1. The pages you need more often than once every few minutes all fit on
the desk. You spend most of your time working, but sometimes have to get
another paper from the file cabinet.

2. The task you are doing needs far more papers than can fit on the
desk. Every time you need to follow up a reference, it points to a page
that is in the filing cabinet, and you cannot make progress until you
get it. But to put it on the desk, you have to remove something else,
and a few minutes later you need the page that you just removed...

The second condition is page thrashing.

desk top <-> computer's main memory
file cabinet <-> swap file
page of paper <-> virtual storage page

There are two cases when building the whole document in memory:

1. It fits. In that case there will be a heap size that is both big
enough to hold the document (no out of memory errors) and small enough
to fit on the desk (no page thrashing, the computer spends most of its
time doing useful work, not shuffling pages between disk and memory).
The obvious heap size to try is a bit smaller than the computer's
physical memory. If any size works, that one will.

2. It does not fit. Any memory size big enough to avoid OutOfMemoryError
is big enough to cause page thrashing.

Patricia
 
