Parsing generic XML

Roedy Green · Jun 11, 2008

I have some XML, namely PAD files, for which I have no schema, though
I probably could cook one up in a day or two.

Similarly, I have some XHTML I want to screen-scrape, where I really
only care about the <table>, <tr> and <td> elements.

So what I am after is some sort of extremely relaxed schema that will
eat pretty well anything so long as the tags balance.

I tried parsing without any schema at all, and it choked on &nbsp;
entities.

Owen Jacobson · Jun 11, 2008

Entity references (&nbsp; and friends) only have meaning with respect
to a DTD which maps them to entities (e.g., the no-break space
character, U+00A0, in the case of &nbsp;). Apart from the five
predefined ones (&lt;, &gt;, &amp;, &apos; and &quot;), XML documents
which contain entity references MUST contain a definition somewhere;
there's not really any avoiding it.
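A minimal sketch of what "a definition somewhere" can look like, assuming the JDK's built-in JAXP parser: the entity is declared in an internal DTD subset at the top of the document, so no external DTD or schema is needed. The class name, the <note> element and the sample strings are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class DeclaredEntityDemo {
    /** Parses an XML string without validation and returns the root element's text. */
    static String rootText(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setValidating(false); // well-formedness checks only
        DocumentBuilder builder = factory.newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xml)));
        return doc.getDocumentElement().getTextContent();
    }

    public static void main(String[] args) throws Exception {
        // The internal subset declares nbsp, so the reference resolves to U+00A0.
        String declared =
            "<!DOCTYPE note [ <!ENTITY nbsp \"&#160;\"> ]>"
            + "<note>a&nbsp;b</note>";
        System.out.println(rootText(declared).equals("a\u00A0b"));

        // Same content with no declaration: the parser reports a fatal error.
        try {
            rootText("<note>a&nbsp;b</note>");
        } catch (SAXException e) {
            System.out.println("undeclared entity rejected: " + e.getMessage());
        }
    }
}
```

This only helps if you control the document text or can prepend a DOCTYPE before handing it to the parser.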

Fortunately, for XHTML that's easy; there's a published DTD.

In the case of PAD files, you may have to replace the entity references
manually with the characters they stand for (or with numeric character
references such as &#160;), if you can't find a DTD that defines them.
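The manual replacement could be sketched roughly like this: rewrite known named references into numeric character references before the text reaches the parser. The class name and the three-entry table are hypothetical, and you would extend the table with whatever names actually show up in your files. Note that this is a plain text-level rewrite, so it would also touch matches inside comments or CDATA sections.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EntityRewriter {
    // Hypothetical, deliberately tiny table: entity name -> Unicode code point.
    private static final Map<String, Integer> CODE_POINTS =
        Map.of("nbsp", 160, "copy", 169, "eacute", 233);

    // Matches named references such as &nbsp; but not numeric ones such as &#160;.
    private static final Pattern NAMED_REF = Pattern.compile("&([A-Za-z][A-Za-z0-9]*);");

    /** Replaces known named references with numeric ones; unknown names are left untouched. */
    static String toNumericReferences(String xml) {
        Matcher m = NAMED_REF.matcher(xml);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            Integer codePoint = CODE_POINTS.get(m.group(1));
            String replacement = (codePoint == null) ? m.group() : "&#" + codePoint + ";";
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(toNumericReferences("<td>a&nbsp;b &amp; c</td>"));
    }
}
```

The predefined references (&amp;, &lt; and so on) are not in the table, so they pass through unchanged and the parser handles them as usual.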

Any basic XML parser (JDOM, dom4j, SAX, W3C DOM, et multiple cetera)
should accept any well-formed document if you turn off validation.
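As a rough illustration with the W3C DOM API that ships in the JDK, here is a sketch that parses a document of unknown vocabulary with validation off and pulls out only the <td> text. The class name and the sample markup are invented for the example; the input must still be well-formed and free of undeclared entity references.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class TableCellScraper {
    /** Returns the text of every <td> element, ignoring all other markup. */
    static List<String> cellTexts(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setValidating(false); // no schema or DTD validation
        Document doc = factory.newDocumentBuilder()
                              .parse(new InputSource(new StringReader(xml)));

        NodeList cells = doc.getElementsByTagName("td");
        List<String> texts = new ArrayList<>();
        for (int i = 0; i < cells.getLength(); i++) {
            texts.add(cells.item(i).getTextContent());
        }
        return texts;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<page><whatever attr='1'/>"
                   + "<table><tr><td>one</td><td>two</td></tr></table></page>";
        System.out.println(cellTexts(xml));
    }
}
```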

-o
