Implementing a DTD-based XML validator

Tom Anderson

Afternoon all,

Call me mad, but i'm interested in writing an XML validator. Not as part
of a parser, but operating on DOM-like objects in a program. Basically, i
want to write a function createElement that looks a bit like:

Node a, b, c; // create these somehow
Element list = createElement("xhtml:p", new Node[] {a, b, c});

Where createElement is able to determine whether {a, b, c} is a valid
sequence of child elements for an xhtml:p element, and so throw an
exception or something if it isn't.

The idea would be to parse a DTD in order to create objects representing
the content model, then use those to validate the nodes.

The XML spec says:

More formally: a finite state automaton may be constructed from the
content model using the standard algorithms, e.g. algorithm 3.5 in
section 3.9 of Aho, Sethi, and Ullman [Aho/Ullman]. In many such
algorithms, a follow set is constructed for each position in the regular
expression (i.e., each leaf node in the syntax tree for the regular
expression); if any position has a follow set in which more than one
following position is labeled with the same element type name, then the
content model is in error and may be reported as an error.
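
For a feel of what that passage asks for, here is a minimal sketch of the follow-set check in Python. It is not the algorithm from the book verbatim, just the usual nullable/first/last/follow computation over a syntax tree. The tuple representation of the content model and the names analyse and clashes are assumptions made for this illustration. It also checks the set of possible first positions, since two same-named candidates there are equally ambiguous.

```python
from collections import defaultdict

# A content model is a nested tuple (a representation assumed for this sketch):
#   ("sym", name), ("seq", a, b, ...), ("alt", a, b, ...),
#   ("star", a), ("plus", a), ("opt", a)

def analyse(node, labels, follow):
    """Return (nullable, first, last) for node; leaves are numbered as positions."""
    kind = node[0]
    if kind == "sym":
        pos = len(labels)
        labels.append(node[1])
        return False, {pos}, {pos}
    if kind in ("star", "plus", "opt"):
        nullable, first, last = analyse(node[1], labels, follow)
        if kind != "opt":
            # A repetition can loop from any last position back to any first one.
            for p in last:
                follow[p] |= first
        return nullable or kind != "plus", first, last
    if kind not in ("alt", "seq"):
        raise ValueError("unknown node kind: %r" % (kind,))
    parts = [analyse(child, labels, follow) for child in node[1:]]
    if kind == "alt":
        return (any(n for n, _, _ in parts),
                set().union(*(f for _, f, _ in parts)),
                set().union(*(l for _, _, l in parts)))
    # Sequence: fold the parts left to right, tracking the prefix seen so far.
    nullable, first, last = True, set(), set()
    for n, f, l in parts:
        for p in last:
            follow[p] |= f
        if nullable:
            first = first | f
        last = (last | l) if n else set(l)
        nullable = nullable and n
    return nullable, first, last

def clashes(model):
    """List (source, name) pairs where two candidate next positions share a name."""
    labels, follow = [], defaultdict(set)
    _, first, _ = analyse(model, labels, follow)
    sources = [("(start)", first)]
    sources += [(labels[p], follow[p]) for p in range(len(labels))]
    found = []
    for source, targets in sources:
        names = [labels[t] for t in targets]
        for name in sorted(set(names)):
            if names.count(name) > 1:
                found.append((source, name))
    return found
```

With this, ((b, c) | (b, d)) reports a clash on b at the start, while the equivalent (b, (c | d)) reports none. The same first and follow sets are what a validator would step through when checking an actual child sequence.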

Firstly, roughly how hard is this? Expressed in, say,
milli-Dijkstra's-algorithms - 5000? 20 000? 100 000?

Secondly, i'm not keen to rush out and buy Aho et al's no doubt wonderful
book on compilers just so i can do this. Can anyone direct me to anything
i can read online where i can learn about this? That could be in English
or source code - presumably, there are numerous open-source projects which
have implemented XML validators, right?

It occurs to me that i could avoid having to write the validator myself by
using a grotesque hack - if i can map node types to strings, i can express
a node sequence as a string, and a content model as a regular expression,
and then just let a standard regexp library do the heavy lifting. In
python, operating on standard DOM objects:

import re

def validateAsParagraph(nodelist):
    # Encode the child sequence as a string such as "<#text><br><em>".
    nodeString = "".join("<" + node.nodeName + ">" for node in nodelist)
    # DOM text nodes report nodeName "#text", so that stands in for #PCDATA here.
    pPattern = re.compile("(?:<(?:#text|br|span|bdo|map|tt|i|b|big|small|em|strong|dfn|code|q|samp|kbd|var|cite|abbr|acronym|sub|sup|input|select|textarea|label|button|ins|del|script)>)*")
    # fullmatch succeeds only if the pattern covers the whole string.
    return pPattern.fullmatch(nodeString) is not None

I can't decide if this is brilliant or revolting, or both.
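
The same hack can be generalised so the pattern does not have to be written by hand. The sketch below translates a DTD content model string into a regular expression over the "<name>" encoding. The names content_model_to_regex and is_valid_sequence are made up for this illustration. It assumes the model is a parenthesised expression (the EMPTY and ANY keywords are not handled) and that the caller maps text nodes to the name #PCDATA.

```python
import re

# One token: an element name (or #PCDATA), or a single punctuation character.
TOKEN = re.compile(r"\s*(#?[A-Za-z_][\w.:-]*|[(),|*+?])")

def content_model_to_regex(model):
    """Translate a content model such as "(a, (b | c)*)" into a compiled regex."""
    model = model.strip()
    parts, pos = [], 0
    while pos < len(model):
        m = TOKEN.match(model, pos)
        if m is None:
            raise ValueError("unexpected text at offset %d in %r" % (pos, model))
        token, pos = m.group(1), m.end()
        if token == "(":
            parts.append("(?:")          # group without capturing
        elif token == ",":
            continue                      # a sequence is plain concatenation
        elif token in (")", "|", "*", "+", "?"):
            parts.append(token)           # same meaning in both notations
        else:
            parts.append("(?:<" + re.escape(token) + ">)")
    return re.compile("".join(parts))

def is_valid_sequence(pattern, names):
    """True if the list of child names matches the whole pattern."""
    encoded = "".join("<%s>" % name for name in names)
    return pattern.fullmatch(encoded) is not None

# Usage with a hypothetical content model:
section = content_model_to_regex("(title, (para | list)*, footer?)")
print(is_valid_sequence(section, ["title", "para", "list", "footer"]))
print(is_valid_sequence(section, ["para"]))
```

Because Python's re module backtracks, this accepts non-deterministic content models too, so it says nothing about whether the model itself is legal XML.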

tom
 
Stanimir Stamenkov

Fri, 29 May 2009 13:38:08 +0100, /Tom Anderson/:
> Call me mad, but i'm interested in writing an XML validator. Not as part
> of a parser, but operating on DOM-like objects in a program.

JAXP 1.3 provides a validation API, which is implemented [1] by Xerces2
and which can operate on an already parsed and built DOM.

> Can anyone direct me
> to anything i can read online where i can learn about this? That could
> be in English or source code - presumably, there are numerous
> open-source projects which have implemented XML validators, right?

You could read the Xerces2 Implementation API documentation [2] -
packages like org.apache.xerces.impl.dtd.models and
org.apache.xerces.impl.xs.models. You could browse the sources [3]
as well.

[1]
http://xerces.apache.org/xerces2-j/javadocs/api/javax/xml/validation/package-summary.html
[2] http://xerces.apache.org/xerces2-j/javadocs/xerces2/index.html
[3] http://xerces.apache.org/xerces2-j/source-repository.html
 
