Parsing multiple XML trees?

David Svoboda

I have a server program that takes commands and acts on them. The
server program can also take these commands from an input file or
standard input (mainly for testing purposes). As such, I often have
files full of input commands to feed to the server.

Right now the commands that the server takes are well-defined, but not
in XML. Since the commands are not self-delimiting, I have to prepend
each command with a 'length' number indicating how many chars the
command takes.

I would like to change the server to accept XML commands, and provide
a DTD (or Schema or RelaxNG or ...) to ensure that the server only
receives valid commands.

My question is this: Can I take the length number out of my input
files & network commands? Since XML is self-delimiting (tags must
balance) this should be possible. However, every time I try to run a
Xerces (Java) parser on a file full of XML commands (with no length
info), it silently discards all but the first command.

I guess what I want to know is, can Xerces take an input stream full
of multiple XML trees and give me each XML tree in turn w/o discarding
any of them? (I can use either SAX or DOM or SAX2 to accomplish this.)

Several friends have suggested that I wrap the entire input file in a
<root> element, which would turn the series of commands into one big
happy XML file. I suppose that could work, but it has two problems:
(1) it requires a different DTD to handle multiple commands than to
handle one command; (2) as a server, it precludes me from using DOM,
since I need to act on each command before the entire stream has been
parsed.

Maybe this is the wrong forum to ask, but it's not clear what the
right forum would be. Is this feature covered in SAX? DOM? Is it
specific to Xerces?

~David Svoboda
 
Martin Honnen

David Svoboda wrote:

However, every time I try to run a
Xerces (Java) parser on a file full of XML commands (with no length
info), it silently discards all but the first command.

Several friends have suggested that I wrap the entire input file in a
<root> element, which would turn the series of commands into one big
happy XML file. I suppose that could work, but it has two problems:
(1) it requires a different DTD to handle multiple commands than to
handle one command; (2) as a server, it precludes me from using DOM,
since I need to act on each command before the entire stream has been
parsed.

One of the requirements for markup to be called XML is a single root
element. So if you want to process some markup with XML tools, you
need to have a single root element, e.g.
<commands>
<command />
<command />
</commands>
If you have, e.g.,
<command />
<command />
then that is not an XML document, because it is not well-formed markup.
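
On objection (2) in the quoted post: a single root element does not by itself force the server to wait for the whole stream, because a SAX handler is told about each command's end tag when the parser reaches it. The following is only a minimal Java sketch of that idea, assuming the individual commands carry no XML declaration or DOCTYPE of their own. The names WrappedCommandStream, readCommands, and the <commands> wrapper element are made up for illustration, and the code uses the JDK's standard JAXP SAX parser rather than any Xerces-specific API.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.SequenceInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: surrounds a stream of sibling command elements
// with a synthetic root so that a normal XML parser accepts it.
final class WrappedCommandStream {

    /** Returns the element name of every top-level command, in order. */
    static List<String> readCommands(InputStream raw) throws Exception {
        InputStream open = new ByteArrayInputStream(
                "<commands>".getBytes(StandardCharsets.UTF_8));
        InputStream close = new ByteArrayInputStream(
                "</commands>".getBytes(StandardCharsets.UTF_8));
        // The parser sees: <commands> + original bytes + </commands>
        InputStream wrapped = new SequenceInputStream(
                Collections.enumeration(Arrays.asList(open, raw, close)));

        final List<String> seen = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            private int depth = 0; // 1 = inside the synthetic root

            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                depth++;
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                depth--;
                if (depth == 1) {
                    // One complete command has just been parsed.
                    // A real server would act on it here.
                    seen.add(qName);
                }
            }
        };
        SAXParserFactory.newInstance().newSAXParser().parse(wrapped, handler);
        return seen;
    }
}
```

Here the handler only collects command names; a real server would dispatch the command inside endElement. How promptly the callbacks fire on a live socket depends on the parser's read buffering, which this sketch does not test.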
 
David Svoboda

Martin said:

One of the requirements for markup to be called XML is a single root
element. So if you want to process some markup with XML tools, you
need to have a single root element, e.g.
<commands>
<command />
<command />
</commands>
If you have, e.g.,
<command />
<command />
then that is not an XML document, because it is not well-formed markup.

So does that mean if I'm running a server I can only send it one XML
command? That seems to mean that sending multiple XML commands is invalid.

What if a client sends two XML commands really quickly, and my server
'forgets' the second one? How does my server 'pop' exactly one XML
command off the socket?
~Dave
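
One way to "pop" exactly one command at a time is a pull parser, under the assumption that each client connection carries a single long-lived XML document whose child elements are the commands. The Java sketch below uses the standard StAX API (javax.xml.stream). The class name CommandPuller, the method nextCommand, and the <session> root used in the usage note are hypothetical, not something from this thread.

```java
import java.io.InputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper: reads one command (one child of the root element)
// per call, leaving the rest of the stream unread.
final class CommandPuller {
    private final XMLStreamReader reader;
    private int depth = 0; // 1 = inside the root element

    CommandPuller(InputStream in) throws XMLStreamException {
        reader = XMLInputFactory.newInstance().createXMLStreamReader(in);
    }

    /**
     * Consumes events up to the end tag of the next command and returns
     * that command's element name, or null once the root has been closed.
     */
    String nextCommand() throws XMLStreamException {
        String name = null;
        while (reader.hasNext()) {
            int event = reader.next();
            if (event == XMLStreamConstants.START_ELEMENT) {
                depth++;
                if (depth == 2) {
                    name = reader.getLocalName(); // a command starts here
                }
            } else if (event == XMLStreamConstants.END_ELEMENT) {
                depth--;
                if (depth == 1) {
                    return name; // exactly one command consumed
                }
                if (depth == 0) {
                    return null; // root closed: no more commands
                }
            }
        }
        return null;
    }
}
```

With the input <session><login user="a"/><move><x>1</x></move></session>, successive calls to nextCommand() return "login", then "move", then null. A second command that arrives quickly is not lost; it simply stays in the stream until the next call. Buffering behaviour on a real socket is not tested here.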
 
Andrew Schorr

David said:
Maybe this is the wrong forum to ask, but it's not clear what the
right forum would be. Is this feature covered in SAX? DOM? Is it
specific to Xerces?

I'm not sure this will be at all helpful, but we confronted this same
issue when designing an XML parsing extension to gawk. If XMLMODE is
positive, we allow only a single XML document to be parsed. But if
XMLMODE is negative, we parse a stream of concatenated documents
(issuing an "ENDDOCUMENT" event between documents).

We do this using the expat parser. The basic approach is to keep
parsing until an error is encountered. When we get a parse error, we
check to see whether the current parse depth is 0 and more than 0
elements have been parsed already. If so, we infer that we are done
parsing a single XML document, so we issue the "ENDDOCUMENT" event and
try to proceed with the next document. We do that by calling the
XML_GetCurrentByteIndex() function to determine where in the input the
error occurred. We use that offset value to identify where in the
input to attempt to start parsing a new document.

If that's of any interest, you can take a look at the code here:
http://sourceforge.net/projects/xmlgawk
This could be directly useful (if you want to use xgawk's XML
extension), or the code may serve as a guide for how to implement this
in your environment.

Regards,
Andy
 