Split file that contains multiple XML messages

Dominik Stadler

Hi,

We have a file that contains multiple XML-Messages in the form of:

<FIRSTMESSAGE>...</FIRSTMESSAGE><SECONDMESSAGE>...</SECONDMESSAGE>

I know this looks broken from the beginning, but we cannot change the
application that generates this kind of data, so we need a way to cope
with it.

How would I go about reading/splitting this? I don't think there is
functionality available to do this with Xerces, right?

I thought about using SAX to parse the complete text (we should get
an error at the start of the second message) and then reading the
char/line information from the error message, but this sounds like a
hack to me. Is there some other way?

Thanks... Dominik.
 
Andrew Schorr

Hi,

I have actually done exactly what you suggested using the Expat SAX
parser (and this feature is now included in the XML gawk extension
found at http://sourceforge.net/projects/xmlgawk).
The key is to call XML_Parse until it returns XML_STATUS_ERROR.
At that point, one calls XML_GetCurrentByteIndex to find the location
of the error. You can then close out the parsing of the previous
document, and then start parsing the new one that begins at the
returned error offset into the file. To see how this is done, you can
look in xml_puller.c in the sourceforge repository:

http://cvs.sourceforge.net/viewcvs.py/xmlgawk/xmlgawk/extension/xml_puller.c?rev=1.6&view=markup
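For readers who want to see the idea without reading the C source, here is a minimal sketch of the same loop. It is not the xml_puller.c code: it assumes Python's standard `xml.parsers.expat` binding instead of the C API, where the `ErrorByteIndex` attribute plays the role of XML_GetCurrentByteIndex. The function name `split_messages` is made up for illustration.

```python
from xml.parsers import expat
from xml.parsers.expat import errors

# Numeric code Expat reports when content follows the closing root tag.
JUNK_AFTER_ROOT = errors.codes[errors.XML_ERROR_JUNK_AFTER_DOC_ELEMENT]


def split_messages(data):
    """Split concatenated XML documents (bytes) into a list of byte strings.

    Hypothetical helper: parse from the current offset, and when Expat
    complains about junk after the document element, treat the error
    offset as the start of the next message.
    """
    messages = []
    start = 0
    # Stop once only whitespace (or nothing) is left.
    while data[start:].strip():
        parser = expat.ParserCreate()  # a fresh parser for each message
        try:
            parser.Parse(data[start:], True)
        except expat.ExpatError as err:
            if err.code != JUNK_AFTER_ROOT:
                raise  # a real syntax error inside a message
            # ErrorByteIndex is relative to the bytes given to this parser.
            end = start + parser.ErrorByteIndex
            messages.append(data[start:end])
            start = end
        else:
            # No error: the rest of the input is the last message.
            messages.append(data[start:])
            break
    return messages
```

Working on bytes rather than text matters here, because the offset Expat reports is a byte index. Any other parse error is re-raised, since it would indicate a malformed message and not a message boundary.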

Or you can just use xmlgawk and not worry about implementing this yourself.

Regards,
Andy
 
Richard Tobin

Dominik Stadler said:
We have a file that contains multiple XML-Messages in the form of:

<FIRSTMESSAGE>...</FIRSTMESSAGE><SECONDMESSAGE>...</SECONDMESSAGE>
How would I go about reading/splitting this?

Wrap an element around it so it is

<x><FIRSTMESSAGE>...</FIRSTMESSAGE><SECONDMESSAGE>...</SECONDMESSAGE></x>

and then just extract the children of that element.
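As a rough illustration of this suggestion, the sketch below wraps the text in a dummy element and serializes each child again. It assumes Python's standard `xml.etree.ElementTree` module and that the whole file fits in memory; the function name `split_by_wrapping` is made up for illustration.

```python
import xml.etree.ElementTree as ET


def split_by_wrapping(text):
    """Wrap concatenated messages in a dummy <x> root and return each child.

    Hypothetical helper. The children are re-serialized, so the output
    is equivalent XML but not necessarily byte-identical to the input.
    """
    root = ET.fromstring("<x>" + text + "</x>")
    return [ET.tostring(child, encoding="unicode") for child in root]
```

For example, `split_by_wrapping("<FIRSTMESSAGE>a</FIRSTMESSAGE><SECONDMESSAGE>b</SECONDMESSAGE>")` returns a list with one string per message.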

-- Richard
 
Andrew Schorr

It seems to me that this method will not work if any of the messages
contain XML declaration headers. When I try your technique by feeding
some concatenated messages, each of which contains an XML declaration,
into the Expat parser, I get this error message:

xml declaration not at start of external entity

But I expect your technique should work fine if there are no XML
declarations in the messages.
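The limitation can be reproduced with a short check. This is only a sketch, assuming Python's `xml.etree.ElementTree` (which uses Expat underneath); the helper name `wrapping_fails_on_declaration` is made up, and the exact wording of the error message may differ between Expat versions, so the code compares the numeric error code instead of the text.

```python
import xml.etree.ElementTree as ET
from xml.parsers.expat import errors

# Numeric code Expat uses for an XML declaration that is not at the start.
MISPLACED_DECL = errors.codes[errors.XML_ERROR_MISPLACED_XML_PI]


def wrapping_fails_on_declaration(text):
    """Return True if wrapping `text` in <x>...</x> fails because of a
    misplaced XML declaration, False if it parses or fails otherwise."""
    try:
        ET.fromstring("<x>" + text + "</x>")
    except ET.ParseError as err:
        return err.code == MISPLACED_DECL
    return False
```

With messages that each begin with `<?xml version="1.0"?>`, the check returns True; with the same messages minus their declarations, it returns False.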

Regards,
Andy
 
