Split file that contains multiple XML

Discussion in 'XML' started by Dominik Stadler, Jun 23, 2005.

  1. Hi,

    We have a file that contains multiple XML-Messages in the form of:

    <FIRSTMESSAGE>...</FIRSTMESSAGE><SECONDMESSAGE>...</SECONDMESSAGE>

    I know this looks broken from the beginning, but we cannot change the
    application that generates this kind of data, so we need a way to cope
    with it.

    How would I go about reading/splitting this? I don't think there is
    functionality available to do this with Xerces, right?

    I thought about using SAX to try to parse the complete text (we should get
    an error at the second message) and then read the char/line information
    from the error message, but this sounds like a hack to me. Is there some
    other way?

    Thanks... Dominik.
    Dominik Stadler, Jun 23, 2005
    #1

  2. Hi,

    I have actually done exactly what you suggested using the Expat SAX
    parser (and this feature is now included in the XML gawk extension
    found at http://sourceforge.net/projects/xmlgawk).
    The key is to call XML_Parse until it returns XML_STATUS_ERROR.
    At that point, one calls XML_GetCurrentByteIndex to find the location
    of the error. You can then close out the parsing of the previous
    document and start parsing the new one that begins at the returned
    error offset into the file. To see how this is done, you can look in
    xml_puller.c in the SourceForge repository:

    http://cvs.sourceforge.net/viewcvs.py/xmlgawk/xmlgawk/extension/xml_puller.c?rev=1.6&view=markup

    Or you can just use xmlgawk and not worry about implementing this
    yourself.

    Regards,
    Andy
    Andrew Schorr, Jun 24, 2005
    #2
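
The error-offset approach described in this post can be sketched in Python, whose standard library wraps Expat. This is only an illustration under some assumptions, not the xml_puller.c code: the function name `split_concatenated` is made up here, the whole input is assumed to fit in memory as bytes, and `ErrorByteIndex` is used as Python's view of the byte offset that `XML_GetCurrentByteIndex` reports in C.

```python
from xml.parsers.expat import ParserCreate, ExpatError, errors

# Numeric code Expat reports when content follows the end of the root element.
JUNK_AFTER_DOC = errors.codes[errors.XML_ERROR_JUNK_AFTER_DOC_ELEMENT]


def split_concatenated(data: bytes) -> list:
    """Split back-to-back XML documents at the offsets where Expat fails.

    Hypothetical helper: returns the raw bytes of each document.
    """
    docs = []
    start = 0
    while start < len(data):
        parser = ParserCreate()  # a fresh parser for every document
        try:
            parser.Parse(data[start:], True)
        except ExpatError as err:
            # Anything other than "junk after document element" is a real
            # well-formedness error in the current document, so re-raise it.
            if err.code != JUNK_AFTER_DOC or parser.ErrorByteIndex <= 0:
                raise
            offset = parser.ErrorByteIndex  # where the next document begins
            docs.append(data[start:start + offset])
            start += offset
        else:
            # The remainder parsed cleanly, so it is the last document.
            docs.append(data[start:])
            break
    return docs
```

Each returned chunk can then be handed to whatever parser the application already uses. Re-parsing the tail for every document is simple but quadratic in the worst case; a streaming version would feed the parser in blocks instead.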

  3. Dominik Stadler wrote:

    >We have a file that contains multiple XML-Messages in the form of:
    >
    ><FIRSTMESSAGE>...</FIRSTMESSAGE><SECONDMESSAGE>...</SECONDMESSAGE>


    >How would I go about reading/splitting this?


    Wrap an element around it so it is

    <x><FIRSTMESSAGE>...</FIRSTMESSAGE><SECONDMESSAGE>...</SECONDMESSAGE></x>

    and then just extract the children of that element.

    -- Richard
    Richard Tobin, Jun 24, 2005
    #3
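
The wrapping approach from this post is short enough to sketch with Python's standard `xml.etree.ElementTree`. The function name `split_by_wrapping` is hypothetical, and the wrapper name `x` is taken from the post; the sketch assumes the messages carry no XML declarations or DOCTYPEs of their own.

```python
import xml.etree.ElementTree as ET


def split_by_wrapping(text: str) -> list:
    """Wrap concatenated messages in a dummy root and return its children."""
    # The dummy <x> element turns the sequence into one well-formed document.
    root = ET.fromstring("<x>" + text + "</x>")
    # Each direct child of <x> is one of the original messages.
    return list(root)


messages = split_by_wrapping(
    "<FIRSTMESSAGE>a</FIRSTMESSAGE><SECONDMESSAGE>b</SECONDMESSAGE>"
)
for message in messages:
    print(message.tag, message.text)
```

If the original markup of each message is needed again, `ET.tostring(message)` serializes a child back to bytes.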
  4. It seems to me that this method will not work if any of the messages
    contains an XML declaration header. When I try your technique by
    feeding some concatenated messages, each of which contains an XML
    declaration, into the Expat parser, I get this error message:

    xml declaration not at start of external entity

    But I expect your technique should work fine if there are no XML
    declarations in the messages.

    Regards,
    Andy
    Andrew Schorr, Jun 27, 2005
    #4
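
One possible workaround for the declaration problem, which nobody in the thread proposed, is to strip the XML declarations before wrapping. The sketch below is an assumption-laden illustration: the name `split_with_declarations` and the regular expression are invented here, it assumes all messages use an encoding that survives decoding to one Python string, and the pattern would also match a declaration-like string inside a comment or CDATA section.

```python
import re
import xml.etree.ElementTree as ET

# Matches "<?xml ...?>" but not other processing instructions such as
# "<?xml-stylesheet ...?>", because whitespace must follow "xml".
XML_DECL = re.compile(r"<\?xml\s[^>]*\?>")


def split_with_declarations(text: str) -> list:
    """Remove per-message XML declarations, then wrap and split."""
    stripped = XML_DECL.sub("", text)
    # With the declarations gone, the wrapping technique applies again.
    root = ET.fromstring("<x>" + stripped + "</x>")
    return list(root)


sample = '<?xml version="1.0"?><a>1</a><?xml version="1.0"?><b>2</b>'
for message in split_with_declarations(sample):
    print(message.tag, message.text)
```

Without the stripping step, wrapping the same `sample` makes the parser raise an error, which matches the behaviour reported in this post.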