Hacks for parsing non-well-formed XML?

Discussion in 'XML' started by Andy Dingley, Mar 16, 2007.

  1. Andy Dingley

    Andy Dingley Guest

    Given this badly-formed fragment, any suggestions on how best to parse
    it?

    [...]
    <dc:title><Browse By Subject></dc:title>
    [...]

    The minimal problem is "unexpected < character at the beginning of
    character data"

    I don't know how it arises. I suspect that it's a character string
    with "<" in that isn't being encoded properly. Although it might be
    some crazy tag-name getting squirted into the wrong end of the XML
    generator. Anyway, it's the badly-formed output of a major bluechip
    dot-com and it's likely to stay that way. Our problem is how to chow
    down on it, despite its bad formation. 8-(

    It's not too important to preserve the content here. The good stuff is
    elsewhere in the document, this is just grit in the way.

    So, any suggestions on how best to abuse XML standards or tools and
    get it parsed with minimum work?

    I've wondered about hacking to recognise tag closure as being
    triggered by any whitespace, or by discarding starttags that aren't
    from a small known list. I don't much like either though. Most robust
    so far seems to be a parser where "<dc:title>" becomes part of the
    syntax itself and has special handling. Any better ideas?
    Andy Dingley, Mar 16, 2007
    #1

  2. Andy Dingley wrote:

    >Given this badly-formed fragment, any suggestions on how best to parse
    >it?


    ><dc:title><Browse By Subject></dc:title>


    [...]

    >I've wondered about hacking to recognise tag closure as being
    >triggered by any whitespace, or by discarding starttags that aren't
    >from a small known list.


    You could make a pass through to determine probably-legal element
    names, by looking for end tags. "</Browse" is much less likely to
    occur than "<Browse". Then escape less-thans that don't precede an
    element name for which you found a plausible end tag. Empty tags
    are less clear cut, but you could probably find a 99% solution.
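
The two-pass idea above could be sketched in Python roughly as follows. This is only an illustration under assumptions that are not in the thread: the function names `plausible_names` and `escape_implausible` are made up, the whole document is assumed to fit in memory as one string, and anything starting with `<?` or `<!` is passed through untouched. Empty-element tags such as `<foo/>` have no end tag, so this sketch would wrongly escape them, which is the "less clear cut" case mentioned above.

```python
import re

# An end tag such as </dc:title>; the captured group is the element name.
END_TAG = re.compile(r"</([A-Za-z_][\w:.-]*)\s*>")

# Any "<" that is not the start of <?...?> or <!...>, optionally followed
# by "/" and a name. The name group is None when no name follows.
LESS_THAN = re.compile(r"<(?![?!])/?([A-Za-z_][\w:.-]*)?")


def plausible_names(text):
    """First pass: collect every name that appears in an end tag."""
    return set(END_TAG.findall(text))


def escape_implausible(text):
    """Second pass: escape each "<" that does not introduce a plausible name."""
    names = plausible_names(text)

    def fix(match):
        if match.group(1) in names:
            return match.group(0)              # looks like real markup, keep it
        return "&lt;" + match.group(0)[1:]     # stray "<", escape it

    return LESS_THAN.sub(fix, text)


# Hypothetical input built around the fragment from the original post.
doc = "<rdf><dc:title><Browse By Subject></dc:title></rdf>"
fixed = escape_implausible(doc)
```

The leftover ">" after "Subject" is left alone because a bare ">" is legal in XML character data; only the "<" has to be escaped for a parser to accept the result.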

    -- Richard
    --
    "Consideration shall be given to the need for as many as 32 characters
    in some alphabets" - X3.4, 1963.
    Richard Tobin, Mar 16, 2007
    #2

  3. Andy Dingley wrote:
    > Given this badly-formed fragment, any suggestions on how best to parse
    > it?


    Best suggestions I've got are:

    1) XML tools won't touch this. Write a text-processing layer which finds
    and fixes these abuses before even thinking about it as XML. It's going
    to be messy, fragile, ad-hoc programming.

    2) Fix the code that generates it. Seriously. This is going to be an
    ongoing hassle, and cost, until you do.


    --
    () ASCII Ribbon Campaign | Joe Kesselman
    /\ Stamp out HTML e-mail! | System architexture and kinetic poetry
    Joe Kesselman, Mar 16, 2007
    #3
  4. Andy Dingley

    Andy Dingley Guest

    On 16 Mar, 12:49, Joe Kesselman wrote:

    > 2) Fix the code that generates it. Seriously. This is going to be an
    > ongoing hassle, and cost, until you do.


    It's! a! big! famous! dotcom! not! my! own! code!
    (Can you guess who it is yet?)

    Do You Snafu! :cool:
    Andy Dingley, Mar 16, 2007
    #4
  5. Simon Brooke

    Simon Brooke Guest

    Andy Dingley wrote:

    > Given this badly-formed fragment, any suggestions on how best to parse
    > it?
    >
    > [...]
    > <dc:title><Browse By Subject></dc:title>
    > [...]
    >
    > The minimal problem is "unexpected < character at the beginning of
    > character data"


    sed 's/<Browse By Subject>//'

    There's no particular reason why you shouldn't use old and proven text
    manipulation tools on XML.

    --
    (Simon Brooke) http://www.jasmine.org.uk/~simon/

    A message from our sponsor: This site is now in free fall
    Simon Brooke, Mar 16, 2007
    #5
  6. > It's! a! big! famous! dotcom! not! my! own! code!

    Talk! To! Them! About! It!.

    Though you may find that this is a deliberate poison-pill to prevent
    unauthorized folks mining their servers... in which case you should
    probably be talking to them about getting more official access, since
    they're probably changing the poison on a regular basis and anything you
    attempt to do to bypass it is likely to break again in a few weeks.

    --
    Joe Kesselman / Beware the fury of a patient man. -- John Dryden
    Joseph Kesselman, Mar 16, 2007
    #6
  7. Peter Flynn

    Peter Flynn Guest

    Andy Dingley wrote:
    > On 16 Mar, 12:49, Joe Kesselman wrote:
    >
    >> 2) Fix the code that generates it. Seriously. This is going to be an
    >> ongoing hassle, and cost, until you do.

    >
    > It's! a! big! famous! dotcom! not! my! own! code!
    > (Can you guess who it is yet?)
    >
    > Do You Snafu! :cool:


    Nevertheless, charge them extra and mark it on the invoice as overhead
    for manual handling of non-XML material. If they're that big, they'll
    pay, and if they're that stupid, they'll continue to pay you rather than
    fix the bug.

    ///Peter
    Peter Flynn, Mar 16, 2007
    #7
  8. Andy Dingley

    Andy Dingley Guest

    On 16 Mar, 17:35, Joseph Kesselman wrote:

    > Though you may find that this is a deliberate poison-pill to prevent
    > unauthorized folks mining their servers...


    Oh, I _wish_ they were that smart.

    Just to clarify, it's a public interface to their services that they
    encourage(sic) the use of. The likelihood of them fixing it is on the
    avian-pig scale. It's also not a static string, so any sed-ing would
    need a slightly more sophisticated regex to work on it, although it's
    entirely viable. Sadly it's also an embedded app, so Unix tools just
    aren't present. A similar pre-processor approach seems best though,
    rather than frobbing a parser.
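
A pre-processor of that kind might look something like the sketch below, written in Python purely for illustration (the thread does not say what language the embedded app uses). It assumes, without confirmation from the thread, that the bad content only ever appears inside `<dc:title>` elements, that those start tags carry no attributes, and that titles never nest. The names `TITLE`, `clean_titles` and the wrapper element `feed` are made up for the example.

```python
import re
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Capture the start tag, the raw content, and the end tag separately.
# Non-greedy so that each <dc:title> is handled on its own.
TITLE = re.compile(r"(<dc:title>)(.*?)(</dc:title>)", re.DOTALL)


def clean_titles(text):
    """Escape &, < and > inside every <dc:title> element before parsing."""
    def fix(match):
        return match.group(1) + escape(match.group(2)) + match.group(3)

    return TITLE.sub(fix, text)


# Hypothetical wrapper document around the fragment from the original post.
feed = ('<feed xmlns:dc="http://purl.org/dc/elements/1.1/">'
        '<dc:title><Browse By Subject></dc:title>'
        '</feed>')

root = ET.fromstring(clean_titles(feed))
title_text = root[0].text   # the title survives as plain character data
```

One caveat: a title that is already correctly escaped (for example one containing `&amp;`) would be escaped a second time. Since the title content is not the part that matters here, that may be acceptable; replacing the captured content with an empty string instead would avoid the question entirely.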

    Thanks for all your suggestions.
    Andy Dingley, Mar 19, 2007
    #8
