How to get Comments into SAX Parser?

Bernard

Hi,

I am trying to parse HTML documents.

After pulling my hair out with bugs in HTMLEditorKit.ParserCallback
(*), I am now trying to use org.xml.sax.helpers.DefaultHandler in the
JDK 1.4.

This seems to work when I extend the HTML data with XML tags e.g.:

<XML>
<HTML>
<HEAD>
<SCRIPT><!--
var test="&";//-->
</SCRIPT>
</HEAD>
</HTML>
</XML>

But there is a problem:

How do I get the content between the <SCRIPT></SCRIPT> tags?

I must enclose the script in comments, otherwise the parser parses the
script and crashes on "&" with "org.xml.sax.SAXParseException: Illegal
character or entity reference syntax."

If I use the comment tags, then neither ignorableWhitespace() nor
characters() is called.

I could of course pre-parse the file and store all comments but then
the parser does not give me character offsets in the callbacks that
would let me insert them back again. It would be a very messy
workaround.

Any ideas would be highly appreciated.

(*) HTML Parser bugs that I struggled with:
HTML Parser parses HTML between script tags
http://developer.java.sun.com/developer/bugParade/bugs/4912066.html
HTML Parser creates extraneous attribute endtag="true"
http://developer.java.sun.com/developer/bugParade/bugs/4912072.html
Parser implies <body> before <frameset>
http://developer.java.sun.com/developer/bugParade/bugs/4950344.html
HTML Parser ignores noscript tag
Duplicate of RFE:
http://developer.java.sun.com/developer/bugParade/bugs/4308782.html
duplicate of
http://developer.java.sun.com/developer/bugParade/bugs/4296022.html
HTML Parser ignores text between script tags placed in head section
http://developer.java.sun.com/developer/bugParade/bugs/4912066.html
Closed due to
http://developer.java.sun.com/developer/bugParade/bugs/4761273.html

Thanks,
Bernard
 
Sudsy

Bernard said:
Hi,

I am trying to parse HTML documents.

After pulling my hair out with bugs in HTMLEditorKit.ParserCallback
(*), I am now trying to use org.xml.sax.helpers.DefaultHandler in the
JDK 1.4.

This seems to work when I extend the HTML data with XML tags e.g.:

<XML>
<HTML>
<HEAD>
<SCRIPT><!--
var test="&";//-->
</SCRIPT>
</HEAD>
</HTML>
</XML>

But there is a problem:

How do I get the content between the <SCRIPT></SCRIPT> tags?


Can't you just use something like this?


<SCRIPT><![CDATA[var test="&";//]]></SCRIPT>
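
A minimal sketch of how the script text could be read back, assuming Java 5 or later for org.xml.sax.ext.DefaultHandler2 (on JDK 1.4 the same idea should work by implementing org.xml.sax.ext.LexicalHandler directly). The names ScriptTextDemo and Collector are made up for illustration. CDATA content arrives through the ordinary characters() callback, while comment text is only reported to a LexicalHandler registered through the standard SAX lexical-handler property:

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.ext.DefaultHandler2;

public class ScriptTextDemo {

    /** Hypothetical handler: collects character data and comment text. */
    public static class Collector extends DefaultHandler2 {
        public final StringBuilder text = new StringBuilder();
        public final StringBuilder comments = new StringBuilder();

        // Called for ordinary text and for the content of CDATA sections.
        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }

        // LexicalHandler callback; only invoked if the handler is registered
        // with the lexical-handler property below.
        @Override
        public void comment(char[] ch, int start, int length) {
            comments.append(ch, start, length);
        }
    }

    public static Collector parse(String xml) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        Collector collector = new Collector();
        // Standard SAX2 property id for registering a LexicalHandler.
        parser.setProperty("http://xml.org/sax/properties/lexical-handler", collector);
        parser.parse(new InputSource(new StringReader(xml)), collector);
        return collector;
    }

    public static void main(String[] args) throws Exception {
        Collector cdata = parse("<SCRIPT><![CDATA[var test=\"&\";//]]></SCRIPT>");
        System.out.println("characters(): " + cdata.text);

        Collector comment = parse("<SCRIPT><!--\nvar test=\"&\";//--></SCRIPT>");
        System.out.println("comment(): " + comment.comments);
    }
}
```

The parser may deliver the text in several chunks, which is why both callbacks append to a buffer rather than assume a single call. Support for the lexical-handler property is optional in SAX, so a parser other than the one bundled with the JDK might reject the setProperty() call.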
 
Bernard

Hi Sudsy,

Many thanks for the idea.
It's an acceptable workaround since I have to pre-process the file
anyway.
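
A rough sketch of what such a pre-processing step might look like, assuming a simple regular expression is good enough to find the script elements. The class ScriptCdataWrapper and the method wrapScripts are hypothetical names, and the pattern is naive: it would be fooled by a literal "</script>" inside a JavaScript string.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ScriptCdataWrapper {

    // Matches <script ...>body</script>, case-insensitively and across lines.
    private static final Pattern SCRIPT = Pattern.compile(
            "(<script\\b[^>]*>)(.*?)(</script\\s*>)",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /** Wraps the body of every script element in a CDATA section. */
    public static String wrapScripts(String html) {
        Matcher m = SCRIPT.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // "]]>" would end the CDATA section early, so split it
            // across two adjacent CDATA sections.
            String body = m.group(2).replace("]]>", "]]]]><![CDATA[>");
            String wrapped = m.group(1) + "<![CDATA[" + body + "]]>" + m.group(3);
            // quoteReplacement keeps '$' and '\' in the script from being
            // interpreted as group references.
            m.appendReplacement(out, Matcher.quoteReplacement(wrapped));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String html = "<HEAD><SCRIPT>var test=\"&\";</SCRIPT></HEAD>";
        System.out.println(wrapScripts(html));
    }
}
```

Any HTML comment markers already inside the script body are left alone; inside a CDATA section they are passed through as plain text.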

Bernard


 
