Using SAX parser how to identify values for duplicate tag name.

Sanjeev · Jun 23, 2008

Hello Gurus,

I am using a SAX parser to read an XML file.
The code snippets are below.

Student.xml file

<?xml version="1.0" encoding="UTF-8"?>
<root>
  <student>
    <name>Sanjeev Atvankar</name>
    <class>Fourth Year</class>
    <subject>
      <subjectType>Science</subjectType>
      <subjectValue>Anatomy</subjectValue>
    </subject>
    <subject>
      <subjectType>Language</subjectType>
      <subjectValue>Hindi</subjectValue>
    </subject>
  </student>
  <student>
    . . . .
    . . . .
  </student>

StudentVO.java (Java Bean) with the following fields

private String name;
private String classRoom;
private String scienceSubject;
private String languageSubject;

StudentParser.java

. . . .
. . . .
public StudentParser() {
    studentCollectionVO = new StudentCollectionVO();
}
public StudentCollectionVO runExample(String xmlMessage) {
    parseDocument(xmlMessage);
    return studentCollectionVO;
}
private void parseDocument(String xmlMessage) {
    SAXParserFactory spf = SAXParserFactory.newInstance();
    try {
        SAXParser sp = spf.newSAXParser();
        sp.parse(new InputSource(new ByteArrayInputStream(xmlMessage.getBytes())), this);
    } catch (Exception e) {
    }
}
public void startElement(String uri, String localName, String qName,
        Attributes attributes) throws SAXException {
    tempVal = "";
    if (qName.equalsIgnoreCase("student")) {
        studentVO = new StudentVO();
    }
}
public void characters(char[] ch, int start, int length)
        throws SAXException {
    tempVal = new String(ch, start, length);
}
public void endElement(String uri, String localName, String qName)
        throws SAXException {
    if (qName.equalsIgnoreCase("student")) {
        studentCollectionVO.add(studentVO);
    } else if (qName.equalsIgnoreCase("name")) {
        studentVO.setName(tempVal);
    } else if (qName.equalsIgnoreCase("class")) {
        studentVO.setClassRoom(tempVal);
    } else if (qName.equalsIgnoreCase("subjectValue")) {
        studentVO.setScienceSubject(tempVal);
    } else if (qName.equalsIgnoreCase("subjectValue")) {
        studentVO.setLanguageSubject(tempVal);
    }
}
. . . .
. . . .

Since each subject is given in the following tag format:
<subject>
<subjectType></subjectType>
<subjectValue></subjectValue>
</subject>

how can I identify each individual subject?

In the above example, Anatomy belongs to Science (subjectType) and
Hindi belongs to Language (subjectType).

Can anybody help me?

Thanks in advance,
Sanjeev

Tom Anderson · Jun 23, 2008

Refactor. Have a separate parser class responsible for each tag, and
have each instance hold a reference to its parent. Thus, you'll have a
"StudentHandler", a "NameHandler", a "RoomHandler",
"ScienceSubjectHandler", etc. Plug the appropriate handler in at
startElement(), and pop to the prior one at the end of endElement().

Okay, i had to think about this for a bit, and write some code which was
entirely the wrong thing, but i think i get this idea, and it's pretty
cool. Here's an attempt (which i haven't tried to compile, and ignores
various details that would be required to do so):

// application-independent bits

import java.util.ArrayList ;
import java.util.List ;
import javax.xml.parsers.SAXParser ;
import javax.xml.parsers.SAXParserFactory ;
import org.xml.sax.Attributes ;
import org.xml.sax.InputSource ;
import org.xml.sax.helpers.DefaultHandler ;

interface ElementHandler {
    public ElementHandler handleChild(String tag) ;
    public void handleText(String text) ;
}

// extends DefaultHandler so that only the callbacks we care about need overriding
class ElementHandlingHandler extends DefaultHandler {
    private ElementHandler handler ;
    private List<ElementHandler> handlerStack = new ArrayList<ElementHandler>() ;
    private StringBuffer sbuf = new StringBuffer() ; // do buffering here, not in handlers

    public ElementHandlingHandler(String rootTag, ElementHandler rootHandler) {
        handler = new RootHandler(rootTag, rootHandler) ;
    }
    public void characters(char[] buf, int off, int len) {
        sbuf.append(buf, off, len) ;
    }
    public void startElement(String uri, String name, String qname, Attributes attrs) {
        flush() ;
        handlerStack.add(handler) ; // aka 'push'
        // use qname: the local name is empty unless the parser is namespace-aware
        handler = handler.handleChild(qname) ;
    }
    public void endElement(String uri, String name, String qname) {
        flush() ;
        if (!handlerStack.isEmpty()) {
            handler = handlerStack.remove(handlerStack.size() - 1) ; // aka 'pop'
        }
        else {
            // we're done - null things so we puke if more methods are called
            handler = null ;
            sbuf = null ;
        }
    }
    private void flush() {
        // skip whitespace-only runs (the line breaks and indentation between tags)
        if (sbuf.toString().trim().length() > 0) {
            handler.handleText(sbuf.toString()) ;
        }
        sbuf.setLength(0) ;
    }
}

// convenience class - override at least one of the methods to do anything useful!
abstract class ElementHandlerBase implements ElementHandler {
    public ElementHandler handleChild(String tag) {
        throw new IllegalStateException("element has no such child: " + tag) ;
    }
    public void handleText(String text) {
        throw new IllegalStateException("element has no text") ;
    }
}

// sort of weird adapter thing, see use above
class RootHandler extends ElementHandlerBase {
    private String rootTag ;
    private ElementHandler rootHandler ;

    public RootHandler(String rootTag, ElementHandler rootHandler) {
        this.rootTag = rootTag ;
        this.rootHandler = rootHandler ;
    }
    public ElementHandler handleChild(String tag) {
        if (!tag.equals(rootTag)) super.handleChild(tag) ;
        return rootHandler ;
    }
}

// application-specific bits

class Student {
    // etc
}

class StudentListHandler extends ElementHandlerBase {
    private List<Student> students = new ArrayList<Student>() ;

    public List<Student> getStudents() {
        return students ;
    }
    public ElementHandler handleChild(String tag) {
        // note that i use super.handleChild to signal an error, here and below
        if (!tag.equals("student")) super.handleChild(tag) ;
        Student student = new Student() ;
        students.add(student) ;
        return new StudentHandler(student) ;
    }
}

class StudentHandler extends ElementHandlerBase {
    private Student student ;

    public StudentHandler(Student student) {
        this.student = student ;
    }
    public ElementHandler handleChild(String tag) {
        if (tag.equals("name")) return new NameHandler(student) ;
        else if (tag.equals("class")) return new ClassHandler(student) ;
        else if (tag.equals("subject")) return new SubjectHandler(student) ;
        else return super.handleChild(tag) ; // always throws
    }
}

class NameHandler extends ElementHandlerBase {
    private Student student ;

    public NameHandler(Student student) {
        this.student = student ;
    }
    public void handleText(String text) {
        student.setName(text) ;
    }
}

class ClassHandler extends ElementHandlerBase {
    private Student student ;

    public ClassHandler(Student student) {
        this.student = student ;
    }
    public void handleText(String text) {
        student.setClass(text) ;
    }
}

class SubjectHandler extends ElementHandlerBase {
    private Student student ;
    private String subjectType = null ;

    public SubjectHandler(Student student) {
        this.student = student ;
    }
    public ElementHandler handleChild(String tag) {
        if (tag.equals("subjectType")) return new SubjectTypeHandler(this) ;
        else if (tag.equals("subjectValue")) return new SubjectValueHandler(this) ;
        else return super.handleChild(tag) ; // always throws
    }
    public void setSubjectType(String type) {
        if (subjectType != null) throw new IllegalStateException("subject type already set") ;
        subjectType = type ;
    }
    public void setSubjectValue(String value) {
        if (subjectType == null) throw new IllegalStateException("subject type not yet set") ;
        // the types must match the XML text exactly, capital letter included
        if (subjectType.equals("Language")) student.setLanguageSubject(value) ;
        else if (subjectType.equals("Science")) student.setScienceSubject(value) ;
        else throw new IllegalArgumentException("no such subject type: " + subjectType) ; // this should really be thrown in setSubjectType!
        subjectType = null ;
    }
}

class SubjectTypeHandler extends ElementHandlerBase {
    private SubjectHandler parent ;

    public SubjectTypeHandler(SubjectHandler parent) {
        this.parent = parent ;
    }
    public void handleText(String text) {
        parent.setSubjectType(text) ;
    }
}

class SubjectValueHandler extends ElementHandlerBase {
    private SubjectHandler parent ;

    public SubjectValueHandler(SubjectHandler parent) {
        this.parent = parent ;
    }
    public void handleText(String text) {
        parent.setSubjectValue(text) ;
    }
}

public List<Student> parse(InputSource xml) throws Exception {
    SAXParser parser = SAXParserFactory.newInstance().newSAXParser() ;
    StudentListHandler root = new StudentListHandler() ;
    parser.parse(xml, new ElementHandlingHandler("root", root)) ;
    // the root element should really be called studentList or something
    return root.getStudents() ;
}

Is that anything like what you meant?

I manage the handler stack externally rather than internally, but that's a
somewhat orthogonal choice. Looking at that code, it might be better to,
as you do, handle it internally: it would avoid duplication in the case of
the sub-handlers of SubjectHandler, and it would let me do some cleverness
where SubjectHandler returns itself to handle the sub-elements, but uses a
state variable (WAITING_FOR_TYPE, WAITING_FOR_VALUE) to decide what to do
when it gets some text.
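
For what it's worth, here is a minimal, self-contained sketch of that state-variable idea, collapsed into a single DefaultHandler rather than wired into the ElementHandler framework above. The class name SubjectStateSketch, the OUTSIDE state and the Map used to collect results are assumptions made for illustration; the Map stands in for the Student setters and only copes with one student.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class SubjectStateSketch extends DefaultHandler {
    // OUTSIDE is an extra state (not in the post) for text outside any <subject>
    private enum State { OUTSIDE, WAITING_FOR_TYPE, WAITING_FOR_VALUE }

    private State state = State.OUTSIDE;
    private String subjectType;
    private final StringBuilder text = new StringBuilder();
    // subjectType -> subjectValue; a hypothetical stand-in for the Student setters
    private final Map<String, String> subjects = new LinkedHashMap<String, String>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) {
        text.setLength(0); // drop whitespace seen between tags
        if (qName.equals("subject")) {
            state = State.WAITING_FOR_TYPE;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length); // text may arrive in several chunks
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (qName.equals("subjectType") && state == State.WAITING_FOR_TYPE) {
            subjectType = text.toString().trim();
            state = State.WAITING_FOR_VALUE;
        } else if (qName.equals("subjectValue") && state == State.WAITING_FOR_VALUE) {
            // the remembered type tells us which subject this value belongs to
            subjects.put(subjectType, text.toString().trim());
            state = State.OUTSIDE;
        } else if (qName.equals("subject")) {
            state = State.OUTSIDE;
        }
    }

    public static Map<String, String> parse(String xml) throws Exception {
        SubjectStateSketch handler = new SubjectStateSketch();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.subjects;
    }
}
```

Calling SubjectStateSketch.parse(...) on the Student.xml sample should pair Anatomy with Science and Hindi with Language, because the value is filed under whatever type was read just before it.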

Instead of "if ( qname.equalsIgnoreCase()

DIGRESSION:

Ignore case? Why do you do that? XML is case sensitive. Don't ignore case.

END DIGRESSION.

)" use a Map:
Handler handler = handlers.get( qName );

Hmm. My version doesn't use a map - there are several hard-coded switches
which could be done with maps instead. It wouldn't save any lines of code,
but would be less crufty.

If the handler's parent is a ScienceHandler you have one thing, if the parent
is a LanguageHandler you have another.

I've written a number of SAX parsers using this stratagem and it works
well. It also eliminates that long if-chain. Maps are easier to
configure - they don't require recompilation every time you change the
rules.

Provided you're getting the map from an external source, rather than
defining it in the code. But i don't quite understand how that would work
here. I don't think i've really grokked your design.

tom

Lew · Jun 23, 2008

Tom said:
Hmm. My version doesn't use a map - there are several hard-coded switches
which could be done with maps instead. It wouldn't save any lines of code,
but would be less crufty.

It would save lots of lines of code.

Instead of a long if-chain there's one call to the Map's get().

That's not the real advantage. The real advantage is the reduction of
the need to rebuild the source.

Provided you're getting the map from an external source, rather than
defining it in the code.

That is not necessary. You can get the Map reflectively, i.e., base
it on deployment descriptors, or you can have a module to build it
that does get recompiled, but other modules that use it need not be.

I never said a Map always eliminated *all* rebuilding, only that it
saves recompilation. The deployment-descriptor route does eliminate
all recompilation, but there are intermediate solutions that save only
some recompilation.

I derive all my handlers from a DefaultHandler subclass, and typically
do not maintain an explicit stack. It is enough that each handler has
a pointer to its parent. Naturally there are many ways to approach
this problem.
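
One way the parent-pointer arrangement might look is sketched below. This is a guess at the shape, not the code described above: it assumes the handlers swap themselves in and out with XMLReader.setContentHandler, it extends DefaultHandler directly rather than a shared subclass, and it needs Java 8 for the lambdas. All class names (ParentPointerSketch, TextHandler, DocumentHandler) are invented for the example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

public class ParentPointerSketch {

    /** Collects the text of one leaf element, then hands control back to its parent. */
    static class TextHandler extends DefaultHandler {
        private final XMLReader reader;
        private final ContentHandler parent;
        private final Consumer<String> sink;
        private final StringBuilder text = new StringBuilder();

        TextHandler(XMLReader reader, ContentHandler parent, Consumer<String> sink) {
            this.reader = reader;
            this.parent = parent;
            this.sink = sink;
        }
        @Override public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }
        @Override public void endElement(String uri, String localName, String qName) {
            sink.accept(text.toString().trim());
            reader.setContentHandler(parent); // "pop" via the parent pointer, no stack
        }
    }

    /** Handles one subject element and remembers its type until the value arrives. */
    static class SubjectHandler extends DefaultHandler {
        private final XMLReader reader;
        private final ContentHandler parent;
        private final List<String> results;
        private String type;

        SubjectHandler(XMLReader reader, ContentHandler parent, List<String> results) {
            this.reader = reader;
            this.parent = parent;
            this.results = results;
        }
        @Override public void startElement(String uri, String localName, String qName, Attributes attrs) {
            if (qName.equals("subjectType")) {
                reader.setContentHandler(new TextHandler(reader, this, t -> type = t));
            } else if (qName.equals("subjectValue")) {
                reader.setContentHandler(new TextHandler(reader, this, v -> results.add(type + "=" + v)));
            }
        }
        @Override public void endElement(String uri, String localName, String qName) {
            if (qName.equals("subject")) reader.setContentHandler(parent);
        }
    }

    /** Top-level handler: ignores everything except subject elements. */
    static class DocumentHandler extends DefaultHandler {
        private final XMLReader reader;
        private final List<String> results;

        DocumentHandler(XMLReader reader, List<String> results) {
            this.reader = reader;
            this.results = results;
        }
        @Override public void startElement(String uri, String localName, String qName, Attributes attrs) {
            if (qName.equals("subject")) {
                reader.setContentHandler(new SubjectHandler(reader, this, results));
            }
        }
    }

    public static List<String> parse(String xml) throws Exception {
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        List<String> results = new ArrayList<>();
        reader.setContentHandler(new DocumentHandler(reader, results));
        reader.parse(new InputSource(new StringReader(xml)));
        return results;
    }
}
```

Because a child handler consumes the end tag of its own element, the parent never sees it. That is why SubjectHandler.endElement only has to react to the closing subject tag.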

Tom Anderson · Jun 23, 2008

It would save lots of lines of code.

Instead of a long if-chain there's one call to the Map's get().

Yebbut, in my example at least, i'd also need lines like this (although
this wouldn't actually work - i'd need to refactor quite a bit):

private static final Map<String, ElementHandler> CHILD_HANDLERS = new HashMap() ;
static {
CHILD_HANDLERS.put("subjectType", new SubjectTypeHandler()) ;
CHILD_HANDLERS.put("subjectValue", new SubjectValueHandler()) ;
}

Which adds as many as you take away. Unless ...

That's not the real advantage. The real advantage is the reduction of
the need to rebuild the source.

That is not necessary. You can get the Map reflectively, i.e., base it
on deployment descriptors,

You do something like that. Then you need to write code to do the
generation, but it's O(1) rather than O(N) in terms of size. Although the
constant could be pretty large, i don't know. But then, for a complex XML
schema, N could be extremely large!

or you can have a module to build it that does get recompiled, but other
modules that use it need not be.

I never said a Map always eliminated *all* rebuilding, only that it
saves recompilation. The deployment-descriptor route does eliminate all
recompilation, but there are intermediate solutions that save only some
recompilation.

I'm slightly surprised by your interest in avoiding recompilation. An IDE
which does incremental compilation means that this is a non-issue when
developing. Even when doing everything by hand, you can just recompile the
relevant class. The only times you recompile everything are nightly builds
and the like, no? Or are you thinking about being able to change things at
runtime?

I derive all my handlers from a DefaultHandler subclass, and typically
do not maintain an explicit stack. It is enough that each handler has a
pointer to its parent. Naturally there are many ways to approach this
problem.

As ever!

tom

Tom Anderson · Jun 24, 2008

Why do you violate the naming conventions for the Map?

What naming conventions? For me 'final static' means a constant, and
constants get NAMES_LIKE_THIS. I assume yours are different.

And don't forget the generics in the 'new' expression.

Oops, my bad.

Which is one line per handler, whereas an if-chain would have several
lines per handler.

In the example i posted, it wasn't. Each line looked like:

if (tag.equals("subjectType")) return new SubjectTypeHandler() ;

Also, the lines there are simple and decoupled. In an if-chain they'd
be complicated and coupled - you couldn't abstract them away into a
loader class, nor load them reflectively from deployment descriptors.

The code to extract and use a handler becomes much simpler.

Quite true. I don't object to using a map - i think it's definitely much
cleaner, and i'd do that if i was doing this for real. I just don't think
it'll save actual lines of code.

public class Loader
{
private static final Map <String, Class<? extends ElementHandler>>
handlers = new HashMap <String, Class<? extends ElementHandler>> ();

static
{
handlers.put( "subjectType", SubjectTypeHandler.class );
...
}

and later, in startElement():

Class <? extends ElementHandler> clazz = handlers.get( tag );
if ( isOk( clazz ))
{
ElementHandler handler = clazz.newInstance();
handler.setParent( current.getHandler() );
current.setHandler( handler );
}

This assumes a holder class of which 'current' is an instance, and which
provides the handler to handle all tags until endElement() pops it to its
parent.

Looks good.

I am floored that you would be at all surprised. Recompilation means
new regression tests, and a new release and redeployment cycle. Yecch.

....

I guess we have different development practices, because i certainly don't
release every time i recompile. I thought we were Extreme, but Lew, you're
*hardcore*. Respect.

I recompile every minute or less (because i'm using Eclipse, which
recompiles automatically and incrementally), i run my tests every few
minutes (when i've added a feature or fixed a bug - or think i have, or
want to verify that i haven't introduced any new bugs), i integrate a few
times a day, and we release every iteration, which is a few weeks.

But ...

Spoken like someone who's never had to deploy nor maintain code. You
don't just recompile and throw something into production. Change
requires a disciplined test and release cycle.

.... okay, i think we're arguing at cross purposes. You're not talking
about avoiding recompilation, you're talking about avoiding having to send
new binaries to the client, ie avoiding release. Now, avoiding release is
something i can certainly get behind, because that's kind of a big deal.

However, you seem to be saying that by pushing the control into a
configuration file, you can make changes without having to go through the
release process. This rings alarm bells. Don't those changes also have to
go through the testing and release processes?

Developers are the least significant part of the software development chain,
folks. Stop acting important.

Yeah, cheers for the sermon, Lew. I'm sure none of us knew that.

Duhy.

I don't understand your process. How can you just recompile things
willy nilly? You have to control what you release, or you'll have a
maintenance nightmare.

Recompile != release.

I don't understand your process. Do you only recompile when you release?
Surely one needs to recompile almost constantly, every time one adds a
chunk of new code?

The advantage of not recompiling is - no additional testing required, no
re-release required,

Okay, but if you're releasing changes which alter program behaviour
without testing then ...

no version mismatch between installations, no carelessness, no sneaky
new bugs creeping in, no temptation to sneak in an unrelated new feature
when you are only recompiling to add one module - in other words, you
get less expensive, less error-prone, more reliable and controllable,
consistent deployments in the field.

.... you don't get any of those benefits.

If you think recompilation is cheap way to re-release code, then you
aren't testing, or indeed doing anything disciplined about your code
releases.

If i was recompiling and releasing straight away, then that would be
absolutely correct. I'm not, and i join you in condemning such practices.

tom

Lew · Jun 24, 2008

I guess we have different development practices, because i certainly don't
release every time i recompile. I thought we were Extreme, but Lew, you're
*hardcore*. Respect.

Not so much once we get our terminology in synch. See below.

I talk a good game, though. Actually, where I work there is such a
firm discipline with respect to releases (not recompilation: again,
see below), and it is not "Extreme" here at all - mostly waterfall.

I recompile every minute or less (because i'm using Eclipse, which
recompiles automatically and incrementally), i run my tests every few
minutes (when i've added a feature or fixed a bug - or think i have, or
want to verify that i haven't introduced any new bugs), i integrate a few
times a day, and we release every iteration, which is a few weeks.

Yeah, I do that, too. I was not actually talking about simple
recompilation but rebuilding for release.

But ...

... okay, i think we're arguing at cross purposes. You're not talking
about avoiding recompilation, you're talking about avoiding having to send
new binaries to the client, ie avoiding release. Now, avoiding release is
something i can certainly get behind, because that's kind of a big deal.

Yes, that's it. Now I agree with you about recompilation, but the
main thing I was talking about was rebuilding for release, not
incremental recompilation during the development phase.

However, you seem to be saying that by pushing the control into a
configuration file, you can make changes without having to go through the
release process. This rings alarm bells. Don't those changes also have to
go through the testing and release processes?

No. Deployment-time configuration is specific to an installation, and
not part of the release per se.

If deployment brings in independent modules of an application, those
would have been already tested before allowing them to be configured
at deployment time. Adding a new module then is a matter of testing
that module; the others are insulated from bad effects and don't
require retest.

If the nature of a change is that it would require retest and
re-release, then it's not a good candidate for deployment-time
configuration in the first place.

Yeah, cheers for the sermon, Lew. I'm sure none of us knew that.

I didn't used to. I thought we developers were top of the food
chain. Life amongst the ops folks for the last year or two has
radically lowered my opinion of myself.

Recompile != release.

I agree. I meant "rebuild for release" and shouldn't have said
"recompile".

I don't understand your process. Do you only recompile when you release?
Surely one needs to recompile almost constantly, every time one adds a
chunk of new code?

Ibid.

If i was recompiling and releasing straight away, then that would be
absolutely correct. I'm not, and i join you in condemning such practices.

Thank you for clarifying the terminology. I stand corrected. Re-evaluate
what I said in terms of "re-release" instead of "recompile"
and it should make more sense to you.

Tom Anderson · Jun 25, 2008

If it were constant I'd agree with you, but it isn't, so I don't.

My standards, in accord with
<http://java.sun.com/docs/codeconv/html/CodeConventions.doc8.html#367>
are to name static final constants with all uppercase, otherwise I use camel
case with an initial lower-case letter for variables. Since the Map cannot
be constant, I would not use upper-case naming.

Do you follow a different standard?

No. But i consider the Map a constant. The constancy of the reference is
assured by the 'final', but of course the map itself is not made constant.
That doesn't bother me in this situation, as it's private, and there's
little risk of it getting messed about.

If it were public, it might be wise to wrap it in a
Collections.unmodifiableMap; you could do this if you moved all the setup
to a static block, like this:

private static final Map<String, ElementHandler> CHILD_HANDLERS ;
static {
    Map<String, ElementHandler> childHandlers = new HashMap<String, ElementHandler>() ;
    loadChildHandlers(childHandlers) ;
    CHILD_HANDLERS = Collections.unmodifiableMap(childHandlers) ;
}

I wouldn't bother with this here.
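
To show what the wrapper buys you, here is a small self-contained sketch. The class name ConstantMapSketch is made up, and the String values are placeholders where the real map would hold handlers. The point is only that writes through the published reference fail.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

public class ConstantMapSketch {
    // tag -> handler name; plain Strings are placeholders for real handler objects
    public static final Map<String, String> CHILD_HANDLERS;
    static {
        Map<String, String> childHandlers = new HashMap<String, String>();
        childHandlers.put("subjectType", "SubjectTypeHandler");
        childHandlers.put("subjectValue", "SubjectValueHandler");
        // only the read-only view is published; the mutable map never escapes
        CHILD_HANDLERS = Collections.unmodifiableMap(childHandlers);
    }

    /** Returns true if the map rejected an attempted write. */
    public static boolean rejectsWrites() {
        try {
            CHILD_HANDLERS.put("name", "NameHandler");
            return false;
        } catch (UnsupportedOperationException expected) {
            return true;
        }
    }
}
```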

tom

Tom Anderson · Jun 25, 2008

Not so much once we get our terminology in synch.

Yup, i think we've got this one resolved. But!

No. Deployment-time configuration is specific to an installation, and
not part of the release per se.

If deployment brings in independent modules of an application, those
would have been already tested before allowing them to be configured
at deployment time. Adding a new module then is a matter of testing
that module; the others are insulated from bad effects and don't
require retest.

If the nature of a change is that it would require retest and
re-release, then it's not a good candidate for deployment-time
configuration in the first place.

We're scratching the surface of a deeper question here: what kinds of
changes need to go through the test-release process? Clearly, changes to
code do. Equally clearly, changes to user configuration (eg the background
colour of the pages) don't - they're in the user's domain. But there's a
gray area in between. You talk in terms of combining modules, and that
sounds like an excellent rule of thumb: you test the modules, and let the
user freely combine them, safe in the knowledge that any combination will
work. Fair enough.

But this just leads to the question of what a module is. Is the case we
were discussing a matter of modules? I was imagining a configuration file
that would look something like this:

name org.tom.students.NameHandler
subjectType org.tom.students.SubjectTypeHandler
subjectValue org.tom.students.SubjectValueHandler

Which is a file which composes individual classes within the parser. I
wouldn't say those were modules, because they're not arbitrarily or even
flexibly composable. This is internal detail of the app, not user
configuration.
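
A rough sketch of how a file in that format could be turned into a handler map follows. It assumes the file is read with java.util.Properties, which accepts whitespace between key and value, and that every handler has a public no-argument constructor. HandlerConfigSketch and its nested classes are stand-ins invented for the example, not the classes from the earlier listing.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

public class HandlerConfigSketch {
    /** Hypothetical stand-in for the thread's ElementHandler interface. */
    public interface ElementHandler { }
    public static class NameHandler implements ElementHandler { }
    public static class SubjectTypeHandler implements ElementHandler { }

    /** Reads "tag className" lines and resolves each class name reflectively. */
    public static Map<String, Class<? extends ElementHandler>> load(String config)
            throws IOException, ClassNotFoundException {
        Properties props = new Properties();
        props.load(new StringReader(config)); // whitespace separates key from value
        Map<String, Class<? extends ElementHandler>> handlers = new HashMap<>();
        for (String tag : props.stringPropertyNames()) {
            // asSubclass fails fast if the configured class is not an ElementHandler
            handlers.put(tag, Class.forName(props.getProperty(tag)).asSubclass(ElementHandler.class));
        }
        return handlers;
    }

    /** Creates a fresh handler for a tag via its public no-argument constructor. */
    public static ElementHandler create(Map<String, Class<? extends ElementHandler>> handlers, String tag)
            throws ReflectiveOperationException {
        Class<? extends ElementHandler> clazz = handlers.get(tag);
        if (clazz == null) {
            throw new IllegalStateException("no handler configured for: " + tag);
        }
        return clazz.getDeclaredConstructor().newInstance();
    }
}
```

A misspelt class name only surfaces when load() runs, not at compile time, which is one reason a file like this still deserves some testing of its own.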

Possibly, in the way you've used this pattern, they *are* modules, and you
can fiddle with them quite a bit.

I didn't used to. I thought we developers were top of the food chain.
Life amongst the ops folks for the last year or two has radically
lowered my opinion of myself.

Fair enough, and that is a very good experience. We should all do some of
that.

tom

Tom Anderson · Jun 25, 2008

That is at variance with the definition in the JLS.
<http://java.sun.com/docs/books/jls/third_edition/html/expressions.html#5313>

"We call a variable, of primitive type or type String, that is final and
initialized with a compile-time constant expression (15.28) a constant
variable. Whether a variable is a constant variable or not may have
implications with respect to class initialization (12.4.1), binary
compatibility (13.1, 13.4.9) and definite assignment (16)."

So you don't call anything that isn't a primitive or String a constant?

I prefer to use the term 'constant' in Java as defined by the language,
rather than create an idiolectic definition. You, of course, might
choose otherwise.

I do. As does Sun - see, for instance:

http://java.sun.com/javase/6/docs/api/java/nio/ByteOrder.html

Other examples abound.

tom

Lew · Jun 25, 2008

http://java.sun.com/javase/6/docs/api/java/nio/ByteOrder.html

"A typesafe enumeration for byte orders."

Enumeration constants were added to the language after the spec was
written, but you are right, enumeration constants get upper-cased
also.

Note that a Map will never be an enumeration constant.
