Automatically retrieving XML

Tassilo Horn

Hi all,

I'm using a SAXParser created with a SAXParserFactory with a custom
ContentHandler for parsing XML files and creating some graph
representation of the files' contents.

Currently, I generate plain trees, but what I really want is to indicate
cross references as well, as given by element attributes of type
ID/IDREF.

However, Attributes.getType(int) reports only CDATA for all attributes.
So probably I need to create a validating parser. I tried that, but
still no luck, and SAXParserFactory.setValidating(boolean) says that
this is mostly for DTDs anyway and I should set a Schema using
setSchema().

However, the XML files I parse may be completely arbitrary, so I cannot
set some fixed schema. But at least I can presume that they all declare
their schemas properly (and in general there are mostly well-known
schemas), like

--8<---------------cut here---------------start------------->8---
<?xml version="1.0" encoding="UTF-8"?>
<uml:package xmi:version="2.1"
xmlns:xmi="http://schema.omg.org/spec/XMI/2.1"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:ecore="http://www.eclipse.org/emf/2002/Ecore"
xmlns:uml="http://schema.omg.org/spec/UML/2.1.1"
xsi:schemaLocation="http://schema.omg.org/spec/UML/2.1.1 http://www.eclipse.org/uml2/2.1.0/UML"
xmi:id="_FbjLzPlaEeCo98mycbxXgw" name="de.uko.nsd.NSDSchema">
--8<---------------cut here---------------end--------------->8---

Isn't there something that retrieves the needed schemas automatically by
the declarations given in the XML file?

(I remember that some years back I used some open-source parser [maybe
Xerces] that did exactly that when some option was set.)

Thanks for any pointers!

Bye,
Tassilo
 
Tassilo Horn

Oops,

the Subject should have been "Automatically retrieving XML *Schemas*"...

Bye,
Tassilo
 
Steven Simpson

> I'm using a SAXParser created with a SAXParserFactory with a custom
> ContentHandler for parsing XML files and creating some graph
> representation of the files' contents.
>
> Currently, I generate plain trees, but what I really want is to indicate
> cross references as well, as given by element attributes of type
> ID/IDREF.
>
> However, Attributes.getType(int) reports only CDATA for all attributes.
> So probably I need to create a validating parser. I tried that, but
> still no luck, and SAXParserFactory.setValidating(boolean) says that
> this is mostly for DTDs anyway and I should set a Schema using
> setSchema().
>
> However, the XML files I parse may be completely arbitrary, so I cannot
> set some fixed schema.

The example here shows a loaded Document subsequently being validated:

<http://download.oracle.com/javase/7/docs/api/javax/xml/validation/package-summary.html>

That method doesn't appear to give you much more information, but
there's another method that produces an augmented result:

<http://download.oracle.com/javase/7...transform.Source, javax.xml.transform.Result)>

I presume the augmentation includes recognition of ID/IDREF-type attributes.

For SAX, there's a ValidatorHandler which can take a ContentHandler to
receive the "augmented validation result":

<http://download.oracle.com/javase/7...setContentHandler(org.xml.sax.ContentHandler)>

Again, I'm guessing that it's what you're after.
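
A minimal sketch of that ValidatorHandler chaining, for illustration only. The toy schema, the class name IdAttributeSketch and the method findIdAttributes are made up for this example. It asks the handler's TypeInfoProvider whether each attribute is an ID instead of relying on Attributes.getType(int), because nothing in this thread establishes that getType changes under schema validation.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.TypeInfoProvider;
import javax.xml.validation.ValidatorHandler;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class IdAttributeSketch {
    // Hypothetical toy schema: <node id="..." label="..."/> where id is an xs:ID.
    static final String XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
      + "<xs:element name='node'><xs:complexType>"
      + "<xs:attribute name='id' type='xs:ID'/>"
      + "<xs:attribute name='label' type='xs:string'/>"
      + "</xs:complexType></xs:element></xs:schema>";

    // Returns "name=value" for every attribute the schema declares as an ID.
    static List<String> findIdAttributes(String xml) throws Exception {
        Schema schema = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
            .newSchema(new StreamSource(new StringReader(XSD)));
        final ValidatorHandler validator = schema.newValidatorHandler();
        final TypeInfoProvider types = validator.getTypeInfoProvider();
        final List<String> ids = new ArrayList<>();

        // The downstream handler receives the events after validation.
        validator.setContentHandler(new DefaultHandler() {
            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                // The TypeInfoProvider may only be queried during this callback.
                for (int i = 0; i < atts.getLength(); i++) {
                    if (types.isIdAttribute(i)) {
                        ids.add(atts.getLocalName(i) + "=" + atts.getValue(i));
                    }
                }
            }
        });

        SAXParserFactory spf = SAXParserFactory.newInstance();
        spf.setNamespaceAware(true); // schema validation needs namespace-aware events
        XMLReader reader = spf.newSAXParser().getXMLReader();
        reader.setContentHandler(validator); // parser -> validator -> our handler
        reader.parse(new InputSource(new StringReader(xml)));
        return ids;
    }
}
```

Calling IdAttributeSketch.findIdAttributes("<node id='n1' label='x'/>") should report only the id attribute. The same wiring would work with any other Schema object in place of the toy one.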

> Isn't there something that retrieves the needed schemas automatically by
> the declarations given in the XML file?

I can only think of EntityResolvers that fetch DTDs...
 
Tassilo Horn

Hi Steven,

> Again, I'm guessing that it's what you're after.

What I'm after is this: given an XML instance with a root element
declaring a namespace like

xmlns:xmi="http://schema.omg.org/spec/XMI/2.1"

including a schemaLocation like

xsi:schemaLocation="http://schema.omg.org/spec/UML/2.1.1 http://www.eclipse.org/uml2/2.1.0/UML"

is there a way to get a Schema object that I can give to
SAXParserFactory.setSchema() to create a new validating parser?

Currently, it seems that I need to use a non-validating parser first to
pick out the schema uri, then retrieve a Schema object using something
like

SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
.newSchema(new URL(mySchemaURL))

that I can use to generate another validating parser. Here, I don't
know what mySchemaURL should be. Is there a canonical translation for a
schemaLocation URL to the corresponding XSD?

For example, the XMI schema's XSD is

http://www.omg.org/spec/XMI/20071001/07-10-06.xsd

but how should I know?
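
The first step of that two-pass idea, picking the location hints out of the root element with a non-validating parser, might look roughly like this. The class and method names are invented for the sketch. Note that xsi:schemaLocation is only a hint made of namespace/location pairs, so there is no guarantee that the second token of a pair actually resolves to an XSD.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.XMLConstants;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class SchemaLocationSketch {
    // Maps each namespace URI to the location hint given on the root element.
    static Map<String, String> schemaLocations(String xml) throws Exception {
        final Map<String, String> hints = new LinkedHashMap<>();
        SAXParserFactory spf = SAXParserFactory.newInstance();
        spf.setNamespaceAware(true); // needed to look the attribute up by namespace
        spf.newSAXParser().parse(new InputSource(new StringReader(xml)), new DefaultHandler() {
            private boolean rootSeen = false;

            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                if (rootSeen) return; // only inspect the root element
                rootSeen = true;
                String value = atts.getValue(
                    XMLConstants.W3C_XML_SCHEMA_INSTANCE_NS_URI, "schemaLocation");
                if (value == null) return;
                // The value is a whitespace-separated list of (namespace, location) pairs.
                String[] tokens = value.trim().split("\\s+");
                for (int i = 0; i + 1 < tokens.length; i += 2) {
                    hints.put(tokens[i], tokens[i + 1]);
                }
            }
        });
        return hints;
    }
}
```

For the root element shown earlier this would map http://schema.omg.org/spec/UML/2.1.1 to http://www.eclipse.org/uml2/2.1.0/UML. It does not answer where the XMI XSD lives, since that namespace has no hint in the document.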

Oh, now I've found the parameter-less SchemaFactory.newSchema() method.
Its docs say:

For XML Schema, this method creates a Schema object that performs
validation by using location hints specified in documents.

And SAXParserFactory says in the docs of setSchema():

When a Schema is non-null, a parser will use a validator created from
it to validate documents before it passes information down to the
application.

Sounds exactly like what I need. However, if I provide such a Schema
object to my SAXParserFactory before creating the parser from it, the
parser does not validate. I explicitly added undefined elements and
attributes to the XML file to test that. :-(
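
For reference, a sketch of how that setup could be wired so that validation problems become visible. Two things in it are assumptions about the cause, not something confirmed in this thread: the factory is made namespace-aware (it is not by default), and the handler overrides error(), because DefaultHandler.error() does nothing and so silently drops validation errors. The class and method names are invented.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

class HintValidationSketch {
    // Parses xml with a hint-based Schema and returns the validation error messages.
    static List<String> validate(String xml) throws Exception {
        SAXParserFactory spf = SAXParserFactory.newInstance();
        spf.setNamespaceAware(true); // assumption: required for schema validation to work
        // Parameter-less newSchema(): schemas are located via hints in the document.
        spf.setSchema(SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI).newSchema());

        final List<String> problems = new ArrayList<>();
        spf.newSAXParser().parse(new InputSource(new StringReader(xml)), new DefaultHandler() {
            @Override
            public void error(SAXParseException e) {
                // Without this override, validation errors would be ignored.
                problems.add(e.getLineNumber() + ": " + e.getMessage());
            }
        });
        return problems;
    }
}
```

A document whose root element has no schema available at all, such as "<undeclared/>", should come back with at least one error. Documents with real location hints would make the parser fetch those URLs, so the result then depends on what is served there.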

Bye,
Tassilo
 
Tassilo Horn

Hi all,

as it turned out, the path I wanted to take, i.e., using validation to
get the correct attribute types, is not really practical at all, because
nearly all the XML files out there are invalid anyway. I spent one hour
searching the net for arbitrary XML files corresponding to some XSD (XMI
files, SVG, whatnot), but none passed the online schema validators. And
doing validation only to find out which attributes are ID/IDREF is not
really what I want, anyway.

Funnily enough, when I tested some XML with an embedded DTD with my tool,
it just worked right out of the box with no validation turned on.
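
A small sketch of that DTD case, assuming a made-up document with an internal DTD subset. The class name, method name and the element and attribute names are hypothetical. The parser is left non-validating and the handler simply records what Attributes.getType(int) reports.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class DtdAttributeTypeSketch {
    // Hypothetical sample: node/@id is declared ID, node/@ref is declared IDREF.
    static final String SAMPLE =
        "<!DOCTYPE graph ["
      + "<!ELEMENT graph (node*)>"
      + "<!ELEMENT node EMPTY>"
      + "<!ATTLIST node id ID #REQUIRED ref IDREF #IMPLIED label CDATA #IMPLIED>"
      + "]>"
      + "<graph><node id='a'/><node id='b' ref='a' label='x'/></graph>";

    // Maps "element/@attribute" to the type string reported by the SAX parser.
    static Map<String, String> attributeTypes(String xml) throws Exception {
        final Map<String, String> types = new LinkedHashMap<>();
        // Default factory settings: not validating, not namespace-aware.
        SAXParserFactory.newInstance().newSAXParser().parse(
            new InputSource(new StringReader(xml)),
            new DefaultHandler() {
                @Override
                public void startElement(String uri, String local, String qName, Attributes atts) {
                    for (int i = 0; i < atts.getLength(); i++) {
                        types.put(qName + "/@" + atts.getQName(i), atts.getType(i));
                    }
                }
            });
        return types;
    }
}
```

With the sample document, node/@id should be reported as ID and node/@ref as IDREF, which is enough to build the cross references without any validation step.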

Bye,
Tassilo
 
