Parsing generic XML

Roedy Green

I have some XML, namely PAD files, for which I have no schema, though
I probably could cook one up in a day or two.

Similarly, I have some XHTML I want to screen-scrape, where I really
only care about the <table>, <tr> and <td> elements.

So what I am after is some sort of extremely relaxed schema that will
eat pretty well anything so long as the tags balance.

I tried parsing without any schema at all, and it choked on &nbsp;
entities.
 
Owen Jacobson

Entity references (&nbsp; and friends) only have meaning with respect
to a DTD which maps them to entities (e.g., the no-break space U+00A0,
also written &#160;, in the case of &nbsp;). Apart from the five
predefined ones (&amp;, &lt;, &gt;, &apos; and &quot;), XML documents
which contain entity references MUST contain or point to a definition
somewhere; there's not really any avoiding it.
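To illustrate what "a definition somewhere" can look like, here is a minimal sketch that declares nbsp in an internal DTD subset and parses the result with the JDK's built-in DOM parser. The class name, the sample document and the row/cell element names are made up for the example; only the javax.xml.parsers and org.w3c.dom calls are standard.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class InternalEntityDemo {
    // Hypothetical sample: the DOCTYPE's internal subset declares nbsp,
    // so no external DTD or schema is needed to resolve the reference.
    static final String SAMPLE =
            "<!DOCTYPE row [ <!ENTITY nbsp \"&#160;\"> ]>\n"
            + "<row><cell>a&nbsp;b</cell></row>";

    /** Parses the XML and returns the text content of the root element. */
    static String rootText(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent();
        } catch (SAXException e) {
            throw new IllegalArgumentException("not well-formed XML", e);
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String text = rootText(SAMPLE);
        // The entity reference should have been expanded to U+00A0.
        System.out.println(text.indexOf('\u00A0') >= 0);
    }
}
```

If you control how the file is read, prepending a DOCTYPE like this to a document that lacks one is one way to get it past the parser without touching the body.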

Fortunately, for XHTML that's easy; there's a published DTD.

In the case of PAD files you may have to replace the entity references
manually with the characters they stand for (or with numeric character
references such as &#160;), if you can't find a DTD that defines them.
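A rough sketch of that manual replacement, done as a text pass before the file ever reaches the parser. The class name and the two-entry mapping are assumptions; fill the table with whatever names your files really use. Anything not in the table, including the predefined &amp; and friends, is left alone.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EntityRewriter {
    // Assumed mapping from entity name to numeric character reference.
    private static final Map<String, String> REPLACEMENTS = Map.of(
            "nbsp", "&#160;",
            "copy", "&#169;");

    // Matches named references such as &nbsp; but not numeric ones such as &#160;.
    private static final Pattern ENTITY_REF =
            Pattern.compile("&([A-Za-z][A-Za-z0-9]*);");

    /** Rewrites known named entity references to numeric character references. */
    static String rewrite(String xml) {
        Matcher m = ENTITY_REF.matcher(xml);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            String replacement = REPLACEMENTS.get(m.group(1));
            // Names that are not in the table are copied through unchanged.
            m.appendReplacement(out, Matcher.quoteReplacement(
                    replacement != null ? replacement : m.group()));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(rewrite("<td>a&nbsp;b &amp; c</td>"));
    }
}
```

Being a plain text substitution, this also rewrites matches inside comments and CDATA sections, which is probably harmless for scraping but worth knowing.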

Any basic XML parser (JDOM, dom4j, SAX, W3C DOM, et multiple cetera)
should accept any well-formed document if you turn off validation.
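For instance, something along these lines uses the JDK's W3C DOM parser with validation off and pulls out the td text. It is only a sketch: the class and method names are invented, and it assumes the input has no unresolved entity references left in it.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class TableCellScraper {
    /** Returns the text of every td element; rejects input that is not well-formed. */
    static List<String> cellTexts(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(false); // already the default, stated for clarity
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            // No schema involved: any element names are fine as long as tags balance.
            NodeList cells = doc.getElementsByTagName("td");
            List<String> texts = new ArrayList<>();
            for (int i = 0; i < cells.getLength(); i++) {
                texts.add(cells.item(i).getTextContent());
            }
            return texts;
        } catch (SAXException e) {
            throw new IllegalArgumentException("not well-formed XML", e);
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(cellTexts(
                "<page><table><tr><td>1</td><td>two</td></tr></table></page>"));
    }
}
```

One thing to watch with real XHTML: even with validation off, the JDK parser will by default still try to fetch the external DTD named in the DOCTYPE, so you may want an EntityResolver that serves a local copy.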

-o
 
