XML Parser VS HTML Parser

ZOCOR

Hi

Can an XML parser be used to parse an HTML document, even if it is not
well-formed?

If the answer is yes to both, can you recommend a Java XML parser class
(from the standard API)?

Cheers

ZOCOR
 
Sudsy

ZOCOR said:
Hi

Can an XML parser be used to parse an HTML document, even if it is not
well-formed?

No; an XML parser will balk at a lot of HTML, because most HTML is not
well-formed XML.

If the answer is yes to both, can you recommend a Java XML parser class
(from the standard API)?

Search the archives for alternate approaches.
 
[private]

ZOCOR said:
Can an XML parser be used to parse an HTML document, even if it is not
well-formed?

It can parse it as long as the HTML is well-formed. XML isn't as
relaxed as HTML, so any unclosed element will make the parser throw an
exception (probably org.xml.sax.SAXException, but I can't verify right now).
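
For what it's worth, here is a minimal sketch of that behaviour using the JAXP classes in the standard API. The class name, method name, and sample strings are made up for illustration. In practice the concrete exception is usually org.xml.sax.SAXParseException, which is a subclass of SAXException, so catching SAXException covers it.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not taken from any post in this thread.
final class WellFormedCheck {

    /** Returns true if the standard DOM parser accepts the markup as XML. */
    static boolean isWellFormed(String markup) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors rather than printing them.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(markup)));
            return true;
        } catch (SAXException e) {
            // Covers SAXParseException, raised for markup that is not well-formed.
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // HTML-style <br> is never closed, so this should be rejected.
        System.out.println(isWellFormed("<p>one<br>two</p>"));
        // XHTML-style <br/> is closed, so this should be accepted.
        System.out.println(isWellFormed("<p>one<br/>two</p>"));
    }
}
```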
 
Martin Honnen

ZOCOR wrote:

Can an XML parser be used to parse an HTML document, even if it is not
well-formed?

No, an XML parser can't parse HTML, unless of course it is XHTML. But
HTML 3.2 or HTML 4.01 cannot be parsed with an XML parser.
 
Darryl L. Pierce

ZOCOR said:
Can an XML parser be used to parse an HTML document, even if it is not
well-formed?

A SAX or DOM parser will throw exceptions on data that's not well-formed.
So, the answer is no, it cannot.

--
/**
* @author Darryl L. Pierce <[email protected]>
* @see The Infobahn Offramp <http://mcpierce.mypage.org>
* @quote "Lobby, lobby, lobby, lobby, lobby, lobby..." - Adrian Monk
*/
 
Tor Iver Wilhelmsen

It can parse it as long as the HTML is well-formed.

Except for XHTML, HTML cannot be assumed to be well-formed since HTML
does not "end" empty elements properly; they are only empty by
implication, like <br>.

Also, real-world HTML is packed full of implicit begin and end tags a
parser needs to be aware of.
 
CarlosRivera

You could use Tidy or a similar tool to turn the HTML into XHTML and
then use an XML parser.
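
A rough sketch of the second half of that idea, assuming the page has already been converted to well-formed XHTML (the Tidy step itself is not shown here). The class name, method name, and sample markup are invented for illustration. The sample deliberately has no DOCTYPE; converted output often carries one, and the parser may then try to load the referenced DTD, so that would need handling separately.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper: pulls the text out of every element with a given name.
final class XhtmlText {

    static List<String> textOf(String xhtml, String tagName) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xhtml)));
            NodeList nodes = doc.getElementsByTagName(tagName);
            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                // getTextContent() concatenates all text below the element.
                result.add(nodes.item(i).getTextContent());
            }
            return result;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("input was not well-formed XHTML", e);
        }
    }

    public static void main(String[] args) {
        // Made-up input standing in for already-converted XHTML.
        String xhtml = "<html><head><title>Demo</title></head>"
                + "<body><h1>First</h1><p>text<br/>more</p><h1>Second</h1></body></html>";
        System.out.println(textOf(xhtml, "h1"));
    }
}
```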
 
ZOCOR

Darryl L. Pierce said:
A SAX or DOM parser will throw exceptions on data that's not well-formed.
So, the answer is no, it cannot.

Well, can't I catch the exceptions so that processing can continue?

What's the problem?

ZOCOR
 
Tor Iver Wilhelmsen

ZOCOR said:
What's the problem?

<br> and the like, which are (implicitly) empty elements that a SAX
parser will not report an end element for, since they are start tags
for containing elements as far as the parser knows.

So you need to add a bunch of logic that handles optional start
elements, implicit end elements, and non-terminated empty elements.

But, hey, if you don't consider that a problem...
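
To make that concrete, here is a small sketch of what catching the exception buys you with the standard SAX parser. The class name, method name, and sample markup are invented for illustration. With the default configuration a well-formedness violation is a fatal error, so the exception can be caught but the parse itself is over; text that follows the unclosed <br> should never reach the handler.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class, not taken from any post in this thread.
final class SaxStopsAtError {

    /** Returns the character data the handler received before parsing ended. */
    static String textSeen(String markup) {
        StringBuilder seen = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                seen.append(ch, start, length);
            }
        };
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            parser.parse(new InputSource(new StringReader(markup)), handler);
        } catch (SAXException e) {
            // The error is caught, but the parser has already stopped;
            // nothing after the offending tag will be reported.
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
        return seen.toString();
    }

    public static void main(String[] args) {
        // The <br> is never closed, so </p> does not match the open element.
        String html = "<html><body><p>first<br></p><b>second</b></body></html>";
        System.out.println(textSeen(html));
    }
}
```

So even if the tags you care about are properly closed, any sloppy markup earlier in the page ends the parse before the parser reaches them.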
 
ZOCOR

What's the problem?
<br> and the like, which are (implicitly) empty elements that a SAX
parser will not report an end element for, since they are start tags
for containing elements as far as the parser knows.

So you need to add a bunch of logic that handles optional start
elements, implicit end elements, and non-terminated empty elements.

But, hey, if you don't consider that a problem...

Well, I'm only after specific text contained in certain tags, which
fortunately have end tags. As for the other tags, I couldn't give two
rats about them.


ZOCOR
 
