XML parsing with Java


vk02720

What is the standard/safe/optimal way of parsing XML from a Java
program? I have to use JDK 1.4 to begin with, but in a few months the
code should compile and work with 1.5 as well.
What has been confusing is that 1.4 includes an XML parser, but
apparently not Xerces?
How can I use Xerces with 1.4? Should I just add the Xerces jar file
to my project and use the JAXP API? In some previous projects I tried
to look for Xerces jars: some have xerces.jar and some have
xercesImpl.jar. What is the difference? Where can I download a recent,
correct one to use with 1.4/JAXP?

After I build my program, is there any way I can know which parser is
being used? Anything I can call in my program to print some info about
parser?

I am currently required to do at least the following:
- validate my input XML against a schema (in a separate xsd file)
- be able to use the basic DOM and SAX APIs.
- be able to use XPath.

Any insights/advice appreciated.

TIA
 
Lew

What is the standard/safe/optimal way of parsing XML from Java
program? I have to use JDK 1.4 to begin with now but in few months

Java 1.4 has been completely retired for a few weeks now, and obsolescent for
quite some time.
code should be compilable and workable with 1.5 as well.
What has been confusing is 1.4 including XML parser but not Xerces?

It is Xerces.
How can I use Xerces with 1.4? Should I just add the xerces jar file

Just use the libraries that come with Java.
to my project and use JAXP API? From some previous projects, I tried
to look for xerces jars - some have xerces.jar and some
xercesImpl.jar. What is the difference? Where can I download recent/
correct one to be able to use it with 1.4/JAXP?

Just use the libraries that come with Java.
After I build my program, is there any way I can know which parser is
being used? Anything I can call in my program to print some info about
parser?

What does knowing which parser you're using tell you? How is that knowledge
going to serve you?

I did a cursory review of the Sun Javadocs and didn't see any dynamic means to
identify the parser, although the Javadocs themselves tell us that Java uses
the org.xml.sax libraries for SAX, and the org.w3c.dom libraries for DOM
parsing. A similarly cursory review of the saxproject.org docs referenced
from Sun's Javadocs didn't find me what you're asking for either. I guess
you'll need to do some searching with our friend Google and its cousins for this.

However, it is likely that knowing that information will not tell you anything
that matters.
 
Arne Vajhøj

What is the standard/safe/optimal way of parsing XML from Java
program? I have to use JDK 1.4 to begin with now but in few months
code should be compilable and workable with 1.5 as well.
What has been confusing is 1.4 including XML parser but not Xerces?

Java contains an XML parser that follows the JAXP standard.

If you use only the JAXP standard, then it should not matter.

Implementation-wise, Java 1.4 used Crimson, and Java 1.5 and newer
use Xerces.
How can I use Xerces with 1.4? Should I just add the xerces jar file
to my project and use JAXP API? From some previous projects, I tried
to look for xerces jars - some have xerces.jar and some
xercesImpl.jar. What is the difference? Where can I download recent/
correct one to be able to use it with 1.4/JAXP?

Get Xerces in the classpath and use:

System.setProperty("javax.xml.parsers.DocumentBuilderFactory",
"org.apache.xerces.jaxp.DocumentBuilderFactoryImpl");
After I build my program, is there any way I can know which parser is
being used? Anything I can call in my program to print some info about
parser?

Sure - just print the concrete class of some object with:
somexmlobj.getClass().getName()
and you can see from the package name what it is.
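
A minimal sketch of that suggestion, applied to the JAXP factories themselves. The class and method names (WhichParser, domFactoryName, saxReaderName) are made up for illustration; only the javax.xml.parsers calls are standard API. The names printed depend entirely on the JDK and on what is on the classpath.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;

// Hypothetical helper class; the name is made up for this sketch.
class WhichParser {
    // Concrete class name behind the JAXP DOM factory.
    static String domFactoryName() {
        return DocumentBuilderFactory.newInstance().getClass().getName();
    }

    // Concrete class name of the SAX XMLReader actually in use.
    static String saxReaderName() throws Exception {
        return SAXParserFactory.newInstance().newSAXParser()
                .getXMLReader().getClass().getName();
    }

    public static void main(String[] args) throws Exception {
        // The package prefix shows which implementation was selected.
        System.out.println("DOM factory: " + domFactoryName());
        System.out.println("SAX reader:  " + saxReaderName());
    }
}
```
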
I am currently required to do at least the following:
- validate my input XML against a schema (in a separate xsd file)

Easy. Both with DOM and SAX.
- to be able to use DOM and SAX basic APIs.
Yep.
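
As a rough illustration of the two basic APIs, the sketch below counts elements with a given tag name, once via DOM and once via SAX, using only JAXP calls. The class and method names (DomAndSaxBasics, countWithDom, countWithSax) and the sample document are invented for this example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;
import org.w3c.dom.Document;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class; names are made up for this sketch.
class DomAndSaxBasics {
    // DOM: build the whole tree in memory, then query it.
    static int countWithDom(String xml, String tag) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        return doc.getElementsByTagName(tag).getLength();
    }

    // SAX: react to events while the parser streams through the input.
    static int countWithSax(String xml, final String tag) throws Exception {
        final int[] count = { 0 };
        DefaultHandler handler = new DefaultHandler() {
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                if (tag.equals(qName)) {
                    count[0]++;
                }
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return count[0];
    }

    public static void main(String[] args) throws Exception {
        String xml = "<order><item/><item/><note/></order>";
        System.out.println("DOM count: " + countWithDom(xml, "item"));
        System.out.println("SAX count: " + countWithSax(xml, "item"));
    }
}
```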

- be able to use XPath.

Requires DOM.

Arne
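
For the XPath requirement, a minimal sketch of evaluating an expression against a DOM tree. It assumes the javax.xml.xpath package, which arrived with Java 5, so on 1.4 a separate XPath library would be needed instead. XPathSketch, its evaluate method and the sample document are invented for this example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical example class; names are made up for this sketch.
class XPathSketch {
    // Parses the XML into a DOM tree and returns the string value
    // of the given XPath expression evaluated against it.
    static String evaluate(String xml, String expression) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        XPath xpath = XPathFactory.newInstance().newXPath();
        return xpath.evaluate(expression, doc);
    }

    public static void main(String[] args) throws Exception {
        String xml = "<order><item name='apple'/><item name='pear'/></order>";
        System.out.println(evaluate(xml, "/order/item[2]/@name"));
    }
}
```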
 
Arne Vajhøj

Spud said:
I'd consider using stax instead. It's built into jdk 1.6 and yields much
cleaner code.

StAX is a good alternative to SAX. But the Java 1.6 implementation does
not support validation (at least I get an exception when I set the
property).

Arne
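
For comparison with SAX, a minimal StAX pull-parsing sketch that collects element names in document order. It assumes the javax.xml.stream API is available (Java 6, or a separate StAX jar on older versions) and does not attempt validation. StaxSketch and elementNames are invented names.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

// Hypothetical example class; names are made up for this sketch.
class StaxSketch {
    // Pulls events from the reader and records each start tag's local name.
    static List<String> elementNames(String xml) throws Exception {
        List<String> names = new ArrayList<String>();
        XMLStreamReader reader = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(xml));
        try {
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                    names.add(reader.getLocalName());
                }
            }
        } finally {
            reader.close();
        }
        return names;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(elementNames("<order><item/><item/></order>"));
    }
}
```

Unlike SAX, the application drives the loop here, which is where the "cleaner code" impression usually comes from.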
 
Arne Vajhøj

Lew said:
Oh, I stand corrected. Thanks.

They probably should have chosen Xerces already for
1.4, because Xerces was already better than Crimson
at the time, but Crimson was said to be slightly
faster *and* Crimson was donated to Apache by Sun
while Xerces was donated to Apache by IBM (XML4J).

Arne
 
vk02720

They probably should have chosen Xerces already for
1.4, because Xerces was already better than Crimson
at the time, but Crimson was said to be slightly
faster *and* Crimson was donated to Apache by Sun
while Xerces was donated to Apache by IBM (XML4J).

Arne

Thanks.
Java 1.4 does use Crimson by default. There is an option to print some
debug info, -Djaxp.debug=1, which shows how it selects the
FactoryImpl.
This is what gets printed if the Xerces jar is not included:
JAXP: loaded from fallback value:
org.apache.crimson.jaxp.SAXParserFactoryImpl

1.4 with the Xerces jar included uses Xerces:
JAXP: found META-INF/services/javax.xml.parsers.SAXParserFactory
JAXP: loaded from services:
org.apache.xerces.jaxp.SAXParserFactoryImpl

1.5
JAXP: find factoryId =javax.xml.parsers.SAXParserFactory
JAXP: loaded from fallback value:
com.sun.org.apache.xerces.internal.jaxp.SAXParserFactoryImpl

1.5 with xerces jar (not really necessary I guess)
JAXP: find factoryId =javax.xml.parsers.SAXParserFactory
JAXP: found jar resource=META-INF/services/javax.xml.parsers.SAXParserFactory using ClassLoader: sun.misc.Launcher
[email protected]

1.4 without the Xerces jar could work for most purposes. However, the
differences in capabilities do begin to show. For example, for schema
validation, this did not work with Crimson:
factory.setFeature("http://apache.org/xml/features/validation/schema", true);
Got error:
org.xml.sax.SAXNotRecognizedException: Feature: http://apache.org/xml/features/validation/schema
at org.apache.crimson.parser.XMLReaderImpl.setFeature(Unknown Source)

Does Crimson not support schema validation?
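
One possible alternative to the Xerces-specific feature is the pair of JAXP 1.2 schema attributes, sketched below on a DocumentBuilderFactory. This assumes a parser that implements them, such as Xerces; whether Crimson accepts them has not been verified here. Jaxp12SchemaValidation, isValid and the sample schema are invented for this example.

```java
import java.io.ByteArrayInputStream;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.SAXParseException;

// Hypothetical example class; names are made up for this sketch.
class Jaxp12SchemaValidation {
    // Attribute names and value defined by JAXP 1.2.
    static final String SCHEMA_LANGUAGE =
            "http://java.sun.com/xml/jaxp/properties/schemaLanguage";
    static final String SCHEMA_SOURCE =
            "http://java.sun.com/xml/jaxp/properties/schemaSource";
    static final String W3C_XML_SCHEMA = "http://www.w3.org/2001/XMLSchema";

    // Returns true if parsing reports no validation errors.
    static boolean isValid(String xml, String xsd) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        factory.setValidating(true); // must be set before the attributes take effect
        factory.setAttribute(SCHEMA_LANGUAGE, W3C_XML_SCHEMA);
        factory.setAttribute(SCHEMA_SOURCE,
                new ByteArrayInputStream(xsd.getBytes("UTF-8")));

        final boolean[] ok = { true };
        DocumentBuilder builder = factory.newDocumentBuilder();
        builder.setErrorHandler(new ErrorHandler() {
            public void warning(SAXParseException e) { }
            public void error(SAXParseException e) { ok[0] = false; }
            public void fatalError(SAXParseException e) throws SAXParseException {
                throw e; // not well-formed: let the caller see it
            }
        });
        builder.parse(new ByteArrayInputStream(xml.getBytes("UTF-8")));
        return ok[0];
    }

    public static void main(String[] args) throws Exception {
        String xsd = "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
                + "<xs:element name='item' type='xs:int'/></xs:schema>";
        System.out.println(isValid("<item>42</item>", xsd));
        System.out.println(isValid("<item>abc</item>", xsd));
    }
}
```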
 
vk02720

I don't know. It may not. Crimson is from the age
of the DTD!

Arne

Well, in that case Java 1.4 without adding Xerces would have that
limitation. 1.5 has the validation API and Xerces as the default, so no
issues there.
Anything new in 1.6?

Also, how about dom4j or JDOM: do a lot of people use them? Are any of
these candidates for becoming a "standard" or making their way into the
JDK someday?
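
For reference, a minimal sketch of the validation API mentioned above (the javax.xml.validation package, Java 5 and later), checking a document against a schema without building a DOM. SchemaValidationSketch, isValid and the sample schema are invented for this example; in a real program the schema would more likely come from the separate xsd file.

```java
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.SAXException;

// Hypothetical example class; names are made up for this sketch.
class SchemaValidationSketch {
    // Returns true if the XML validates against the given XSD text.
    static boolean isValid(String xml, String xsd) throws Exception {
        SchemaFactory factory =
                SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
        Schema schema = factory.newSchema(new StreamSource(new StringReader(xsd)));
        Validator validator = schema.newValidator();
        try {
            validator.validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            // The default error handling throws on the first validation error.
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        String xsd = "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
                + "<xs:element name='item' type='xs:int'/></xs:schema>";
        System.out.println(isValid("<item>42</item>", xsd));
        System.out.println(isValid("<item>abc</item>", xsd));
    }
}
```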
 
Peter D.

Anyone ever use JAXB? I think it's fantastic.

http://java.sun.com/developer/technicalArticles/WebServices/jaxb/
 
Arne Vajhøj

Well, in that case Java 1.4 without adding xerces would have that
limitation. 1.5 has validation API and xerces as default so no issues
there.
Anything new in 1.6?

The StAX and JAXB APIs were added.
Also how about dom4j or JDOM - do lot of people use it? Any of these
candidates for becoming a "standard" or making their way in the JDK
someday?

I don't think they will ever be added to Java, since they are oriented
more toward user-friendliness than toward standards.

I have used JDOM a few times. It is simply easier to use than the
standard W3C DOM.

But the advantage of W3C DOM is that you can code the same way
in Java, C#, C, JS, VBS, etc.

I know that dom4j also is popular with some projects, but I have not
used it myself.

Arne
 
vk02720

JAXB is good.

But it is probably not suited to the original poster's problem
(at least not as described).

Arne

True. I was trying to look at a more basic, bare-bones XML/XPath API,
although binding frameworks like JAXB are a good option if you can use
them. Unfortunately, one of the systems I am interfacing with has a lot
of name/value pair type info (in XML), and they don't commit to
publishing the XSD, which I believe is a must for JAXB-type frameworks.
 
Mike Davis

Lew said:
Java 1.4 has been completely retired for a few weeks now, and
obsolescent for quite some time.

Ha! That may be true, but I am now working on a project where that is
the only version of the language allowed. We found this out after
writing a few thousand lines with generics, enums, and a few other 1.5
features.

--mad
 
Arne Vajhøj

Mike said:
Ha! That may be true, but I am now working on a project where that is
the only version of the language allowed. We found this out after
writing a few thousand lines with generics, enums, and a few other 1.5
features.

If I were to guess at the Java version usage distribution I would say:

1.2.2 - 5%
1.3.1 - 15%
1.4.2 - 25%
1.5.0 - 35%
1.6.0 - 20%

(please ignore the fact that it is really impossible to quantify
usage in a meaningful way)

Arne
 
Lew

At my day job half the Java infrastructure is just coming in to Java 1.4, the
other half to 1.5. The problem is widespread.
If I were to guess at the Java version usage distribution I would say:

1.2.2 - 5%
1.3.1 - 15%
1.4.2 - 25%
1.5.0 - 35%
1.6.0 - 20%

(please ignore the fact that it is really impossible to quantify
usage in a meaningful way)

I would guess that the usage is higher for 1.4 than your guess, and Java 6 is
much lower. But I'm in the position of trying to guess the shape of an
elephant knowing only the feel of its ears.

If I had the ears of the decision makers where I work, I'd suggest to them
that the risk of continuing with Java 1.4, with its insufficient concurrent
memory model and slower performance than modern versions, exceeds that of the
conversion to Java 5, especially in our environment which involves multiple
nodes with multiple processors running multiple JVMs with various forms of
communication between them processing high peak volumes of information per
unit of time under tight time constraints and rigorous availability requirements.

Some similarly high-demand production Java code I've seen runs about three
times faster under Java 5 and the associated Java EE (J2EE) servers than it
did with older platforms. Not just CPU-bound code, but all sorts of different
stuff involving messages and files and databases and the like. Obviously Java
by itself is only a piece of that improvement - the app-server vendors were
busy improving their stuff, too.

The fear of upgrade that I've witnessed was based on considerations of product
reliability on a new platform, cost of code conversions (rooting out misuse of
the 'enum' keyword and the like), and operations costs associated with
migration to and maintenance of the new enterprise platform. Decision makers
seemed utterly unimpressed with claims of performance improvement; only risk
mattered.

Lately I have been meditating on the balance of risks between those that arise
from conversion and those that arise from the failure to convert to Java 5 or
later. I posit that risk comparison will carry more meaning to decision
makers than benefit comparison.
 
John B. Matthews

Arne Vajhøj said:
If I were to guess at the Java version usage distribution I would say:

1.2.2 - 5%
1.3.1 - 15%
1.4.2 - 25%
1.5.0 - 35%
1.6.0 - 20%

(please ignore the fact that it is really impossible to quantify
usage in a meaningful way)

Google - millions of hits:

java 1.1 - 22.8
java 1.2 - 16.1
java 1.3 - 12.0
java 1.4 - 12.6
java 1.5 - 38.1
java 1.6 - 10.2
java 1.7 - 5.2

Bimodal!?
 
Lew

Google - millions of hits:

Is it your intention to claim that "Google - millions of hits" is a meaningful
metric of Java platform usage?
java 1.1 - 22.8
java 1.2 - 16.1
java 1.3 - 12.0
java 1.4 - 12.6
java 1.5 - 38.1
java 1.6 - 10.2
java 1.7 - 5.2

Bimodal!?

The number of hits for Java 1.7 clearly doesn't reflect usage, since Java 7
isn't even fully defined yet and is therefore not yet in use at all.

Your hit counts ignored the new version numbering scheme whereby the two most
recent versions are "Java 5" and "Java 6".

Hit counts represent how many documents exist for a particular search term
set, but one has to show how that correlates to usage, if it even does.

Hits are cumulative: the longer something is around, the more documents there
could be that pertain to it. That could explain the high count for Java 1.1.
The high count for 1.5 might reflect hits on newsgroups, which are multiply
republished on a host of hosts, but who knows, really? Maybe the types of
hits for 1.5 are those more likely to be duplicated on multiple nodes,
inflating the hit count. Maybe it was more contemporaneous with heavy Web use
than earlier versions. Maybe it is in wider use than other versions. Who
knows? I can't tell from these hit count numbers.
 
John B. Matthews

Google - millions of hits:

Is it your intention to claim that "Google - millions of hits" is a
meaningful metric of Java platform usage?

Egad, no! I should have reiterated Arne's caveat. OTOH, the result is
not entirely unexpected and parallels Arne's (considerable) experience.
The number of hits for Java 1.7 clearly doesn't reflect usage, since
Java 7 isn't even fully defined yet and is therefore not yet in use
at all.

Your hit counts ignored the new version numbering scheme whereby the
two most recent versions are "Java 5" and "Java 6".

Yes, this is confounding; I was sticking with the developer version
numbers:

Hit counts represent how many documents exist for a particular search
term set, but one has to show how that correlates to usage, if it
even does.

Hits are cumulative: the longer something is around, the more
documents there could be that pertain to it. That could explain the
high count for Java 1.1. The high count for 1.5 might reflect hits on
newsgroups, which are multiply republished on a host of hosts, but
who knows, really? Maybe the types of hits for 1.5 are those more
likely to be duplicated on multiple nodes, inflating the hit count.
Maybe it was more contemporaneous with heavy Web use than earlier
versions. Maybe it is in wider use than other versions. Who knows?
I can't tell from these hit count numbers.

Indeed, such numbers are almost meaningless, yet strangely fascinating.
Cf <http://www.google.com/intl/en/press/zeitgeist2008/index.html>

Here's a very rough measure of features/version from skimming the Java
1.5 API documentation (J2SE 5.0):

<code>
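#!/bin/bash
# Count Javadoc "Since" entries per version in a local copy of the API docs.
# bash rather than sh: the C-style for loop below is a bash extension.
DIR=/Developer/Documentation/Java/docs
ECHO=/bin/echo
for ((i=0; i<=6; i++)) ; do
  # Print the label without a newline so the count lands on the same line.
  ${ECHO} -n "Since 1.${i}: "
  # Each matching line is one <DD>1.x entry; wc -l totals them.
  grep -R "<DD>1.${i}" "$DIR"/* | wc -l
done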
</code>

<console>
$ ./since.sh
Since 1.0: 26
Since 1.1: 89
Since 1.2: 965
Since 1.3: 550
Since 1.4: 1384
Since 1.5: 1321
Since 1.6: 0
</console>
 
