How to extract info from a huge XML file?

sxshu02

Sorry, everyone, I'm new to Java.
I want to first search within the XML file to find the right
position, and then extract the corresponding XML item.
How can I do that? I may be short on memory, and the
performance also needs to be good.
I read that SAX might do the job, but I'm worried its efficiency
may not be good.

Thanks for your help!
 
Raghav

You cannot store the context with SAX. You need to use a DOM parser
and build the DOM in memory.
To get to the right node, you might want to find the XPath of that
node, which will be a String.
There are third-party libraries to get the node at an XPath.
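
For illustration, here is a minimal sketch of the DOM-plus-XPath approach using the javax.xml.xpath package that ships with the JDK, so no extra library is strictly needed. The class name DomXPathSketch and the helper textAt are made up for this example, and the sample document is the small <foo>/<person>/<name> one used elsewhere in this thread. Note that the whole document is parsed into memory first, which may be a problem for a very large file.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of any library.
class DomXPathSketch {
    /**
     * Builds the full DOM tree in memory, then returns the text content of
     * the first node matching the XPath expression, or null if none matches.
     */
    static String textAt(String xml, String xpathExpr) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        XPath xpath = XPathFactory.newInstance().newXPath();
        // XPathConstants.NODE asks for a single node (the first match).
        Node node = (Node) xpath.evaluate(xpathExpr, doc, XPathConstants.NODE);
        return node == null ? null : node.getTextContent();
    }
}
```

Calling DomXPathSketch.textAt("<foo><person><name>John Doe</name></person></foo>", "/foo/person/name") returns "John Doe".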

You can as well write your own traversal class using NodeIterator or
TreeWalker.
HTH.
 
Philipp Taprogge

Hi!

Thus spake (e-mail address removed) on 05/11/2007 07:03 PM:
I checked that the SAX might do it good , but the efficiency may be
not good.

An alternative approach could be StAX [JSR173]. It is a stream-based
XML API that is basically event-driven, allowing you to sort of
"react" to things like encountering a begin or end tag.
One implementation is Woodstox, which can be found at
http://woodstox.codehaus.org/
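
As a rough sketch of what StAX code can look like, the following uses the javax.xml.stream API from the JDK (Woodstox plugs in behind the same interfaces). The class name StaxSketch and the helper textsOf are invented for this example, and it assumes the elements of interest contain only text, because getElementText() throws if it meets a nested element.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper class, not part of any library.
class StaxSketch {
    /** Streams through the XML and collects the text of every element with the given name. */
    static List<String> textsOf(String xml, String elementName) throws Exception {
        List<String> found = new ArrayList<>();
        XMLStreamReader reader = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(xml));
        try {
            // Pull events one at a time; nothing but the current event is held in memory.
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && reader.getLocalName().equals(elementName)) {
                    // Assumes text-only content; leaves the cursor on the end tag.
                    found.add(reader.getElementText());
                }
            }
        } finally {
            reader.close();
        }
        return found;
    }
}
```

For a real file you would pass a FileInputStream or Reader over the file instead of a StringReader; the loop stays the same.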

HTH,

Phil
 
Lew

SAX is incredibly efficient.

Raghav said:
You cannot store the context with SAX. You need to use a DOM parser
and build the DOM in memory.

Not true. I've built many a SAX parser that kept track of context and didn't
need to keep everything in memory all at once.
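
One simple way to keep context in SAX, sketched here under assumptions of my own, is to hold the currently open element names on a stack. The class SaxContextSketch and its method personNames are made-up names, and the rule "keep <name> only when its parent is <person>" is just an illustrative context check.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: remembers where it is with a stack of open element names.
class SaxContextSketch extends DefaultHandler {
    private final Deque<String> path = new ArrayDeque<>();  // innermost open element on top
    private final StringBuilder text = new StringBuilder(); // text of the current element
    final List<String> names = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        path.push(qName);
        text.setLength(0);
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // May be called several times per element, so append rather than assign.
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        String closed = path.pop();
        // Context check: after the pop, the top of the stack is the parent element.
        if (closed.equals("name") && "person".equals(path.peek())) {
            names.add(text.toString().trim());
        }
        text.setLength(0);
    }

    /** Parses the XML in one pass and returns the <name> values found directly under <person>. */
    static List<String> personNames(String xml) throws Exception {
        SaxContextSketch handler = new SaxContextSketch();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.names;
    }
}
```

Memory use here is bounded by the nesting depth plus whatever the handler chooses to keep, not by the size of the document.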
 
Seashor

I have considered DOM, but I've been told that it "eats" memory.
Do you think it would work?
Raghav wrote:
 
Lew

Seashor said:
I have considered DOM, but I've been told that it "eats" memory.
Do you think it would work?

Please do not top-post (placement of reply above material quoted).

DOM does indeed "eat" memory, more so the larger the document. SAX is fast,
efficient and can be coded to be very parsimonious of memory.

And it does not have to lose context, despite misinformation provided earlier.

You should use SAX or StAX.

-- Lew
 
Raghav

Hi Seashor,
DOM is memory intensive because it builds the whole tree in memory.
You can use a SAX parser today, but if you intend to retrieve
multiple nodes in the future,
your program becomes clumsy and maintenance becomes an issue.

On the other hand, if you have a DOM, you can pass a collection of
XPaths and retrieve the corresponding nodes from it.
Looking at performance, it's better to use SAX, but if you can foresee
some changes in requirements in the future, it's safer to use DOM.
 
Seashor

Sorry, top-posting is a habit on Chinese forums.
I'm also new here.
 
Seashor

Thanks for the advice. I'm thinking of doing it via SAX.
Although it's a stream-processing method, I'll build some kind of
file index; I hope that works well.
 
Lew

Seashor said:
Thanks for the advice. I'm thinking of doing it via SAX.
Although it's a stream-processing method, I'll build some kind of
file index; I hope that works well.

File index? If you mean a numeric offset of character positions into the
file, that could make your solution much more complex.

Just chain together polymorphic implementations of tag Handlers that are
invoked on each tag entry. Have each one hold a reference to its
enclosing-tag handler so you can pop it back into "currentHandler" on the tag
exit.

XML is a strange bedfellow with file offsets. It's far, far better to stay
within XML semantics when doing XML processing.

Just to hint at the SAX way, which nowadays is a bit old-fashioned in favor of
StAX and things like the XMLStreamReader, you could use a ContentHandler for
each tag:

<foo>
<person>
<name>John Doe</name>
</person>
</foo>

You would declare an abstract AbstractFooHandler class that implements
ContentHandler (here by extending DefaultHandler), and has child classes for
each tag, "foo", "person", "name", etc.

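import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Base class for all per-tag handlers.
public abstract class AbstractFooHandler extends DefaultHandler
{
    // State shared by every handler in the chain.
    public static final class Context
    {
        XMLReader parser;
    }

    private Context context;
    public final Context getContext()
    {
        return context;
    }
    public final void setContext( Context ctx )
    {
        this.context = ctx;
    }

    // Handler of the enclosing tag, restored when this handler's tag closes.
    private AbstractFooHandler encloser;
    protected final AbstractFooHandler getEncloser()
    {
        return encloser;
    }
    protected final void setEncloser( AbstractFooHandler fh )
    {
        this.encloser = fh;
    }
}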

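import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLReaderFactory;

public class FooParser
{
    public static void main( String [] args ) throws Exception
    {
        // Namespace-aware by default, so handlers can match on localName.
        // (XMLReaderFactory is deprecated since Java 9.)
        XMLReader parser = XMLReaderFactory.createXMLReader();

        InputSource is = createInputSource( args ); // however you do it

        AbstractFooHandler.Context ctx = new AbstractFooHandler.Context();
        ctx.parser = parser;

        // Start with the handler for the outermost tag.
        AbstractFooHandler fh = new FooHandler();
        fh.setContext( ctx );

        parser.setContentHandler( fh );
        parser.parse( is );
    }
}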

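import org.xml.sax.Attributes;
import org.xml.sax.SAXException;

public class FooHandler extends AbstractFooHandler
{
    @Override
    public void startElement(String uri,
                             String localName,
                             String qName,
                             Attributes attributes)
        throws SAXException
    {
        if ( localName.equals( "foo" ))
        {
            // Root tag: this handler is already installed, nothing to do.
        }
        else if ( localName.equals( "person" ))
        {
            // Hand parsing over to the handler for the nested tag.
            AbstractFooHandler afh = new PersonHandler();
            afh.setContext( getContext() );
            afh.setEncloser( this );
            getContext().parser.setContentHandler( afh );
        }
        else
        {
            throw new SAXException( "Illegal tag \""+ localName +"\"." );
        }
    }
}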

Then the endElement() callback of PersonHandler would detect the closing
"person" tag and replace the current Handler with its own encloser.
endElement() at every level will emit the events that you want to happen in
response to the XML.
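
To make the hand-off and hand-back concrete, here is a compact, self-contained sketch of the same idea. It is not the code above: the names HandlerChainSketch, OuterHandler and parse are invented for this example, it matches on qName because SAXParserFactory is not namespace-aware by default, and it passes the parser and a shared result list through constructors instead of a Context object.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class HandlerChainSketch {
    /** Handles everything outside <person>; hands over to PersonHandler on entry. */
    static class OuterHandler extends DefaultHandler {
        private final XMLReader parser;
        final List<String> names = new ArrayList<>();

        OuterHandler(XMLReader parser) { this.parser = parser; }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            if (qName.equals("person")) {
                parser.setContentHandler(new PersonHandler(parser, this, names));
            }
        }
    }

    /** Active only inside <person>; restores its encloser on the closing tag. */
    static class PersonHandler extends DefaultHandler {
        private final XMLReader parser;
        private final ContentHandler encloser;
        private final List<String> names;
        private final StringBuilder text = new StringBuilder();

        PersonHandler(XMLReader parser, ContentHandler encloser, List<String> names) {
            this.parser = parser;
            this.encloser = encloser;
            this.names = names;
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            text.setLength(0);
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if (qName.equals("name")) {
                names.add(text.toString().trim());   // emit what this level collected
            } else if (qName.equals("person")) {
                parser.setContentHandler(encloser);  // pop back to the enclosing handler
            }
        }
    }

    /** Parses the XML in one pass and returns the names found inside <person> elements. */
    static List<String> parse(String xml) throws Exception {
        XMLReader parser = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        OuterHandler outer = new OuterHandler(parser);
        parser.setContentHandler(outer);
        parser.parse(new InputSource(new StringReader(xml)));
        return outer.names;
    }
}
```

Swapping the ContentHandler in the middle of a parse is allowed by SAX, and the parser starts using the new handler with the next event, which is what makes this pattern work.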

I hard-coded a few things in this example, which is a Bad Thing, but doing it
properly would have been too long for a newsgroup post. I'd keep a Map of
Handlers keyed by tags instead of hardcoding the tag and its handler. This is
most definitely not an SSCCE.

This will let you keep track of where you are and process your file in one
pass, keeping in memory only what each handler emits as necessary to keep in
memory. No file offsets, either.
 
Seashor

Thanks a lot! It helps me so much.
 
