Bernard
Hi,
I am trying to parse HTML documents.
After pulling my hair out with bugs in HTMLEditorKit.ParserCallback
(*), I am now trying to use org.xml.sax.helpers.DefaultHandler in the
JDK 1.4.
This seems to work when I extend the HTML data with XML tags e.g.:
<XML>
<HTML>
<HEAD>
<SCRIPT><!--
var test="&";//-->
</SCRIPT>
</HEAD>
</HTML>
</XML>
But there is a problem:
How do I get the content between the <SCRIPT></SCRIPT> tags?
I must enclose the script in a comment; otherwise the parser parses the
script and fails on "&" with "org.xml.sax.SAXParseException: Illegal
character or entity reference syntax."
If I use the comment tags, then neither ignorableWhitespace() nor
characters() is called.
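To make the behaviour concrete, here is a minimal sketch of one possible direction, assuming the parser honours the standard SAX2 "lexical-handler" property. In SAX, comments are not character data, so they are reported through org.xml.sax.ext.LexicalHandler.comment() rather than characters(). The names ScriptCommentDemo, Collector, and SAMPLE are made up for illustration, and the code uses generics, so it needs a newer JDK than 1.4.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.LexicalHandler;
import org.xml.sax.helpers.DefaultHandler;

public class ScriptCommentDemo {

    // The wrapped document from the post, with the script inside a comment.
    static final String SAMPLE =
        "<XML><HTML><HEAD><SCRIPT><!--\n"
        + "var test=\"&\";//-->\n"
        + "</SCRIPT></HEAD></HTML></XML>";

    /** Hypothetical helper: keeps character data and comment text apart. */
    static class Collector extends DefaultHandler implements LexicalHandler {
        final StringBuilder text = new StringBuilder();
        final List<String> comments = new ArrayList<>();

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }

        // Comment bodies arrive here, not in characters().
        @Override
        public void comment(char[] ch, int start, int length) {
            comments.add(new String(ch, start, length));
        }

        // Remaining LexicalHandler callbacks are not needed for this sketch.
        @Override public void startDTD(String name, String publicId, String systemId) { }
        @Override public void endDTD() { }
        @Override public void startEntity(String name) { }
        @Override public void endEntity(String name) { }
        @Override public void startCDATA() { }
        @Override public void endCDATA() { }
    }

    static Collector parse(String xml) {
        try {
            XMLReader reader =
                SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            Collector collector = new Collector();
            reader.setContentHandler(collector);
            // Standard SAX2 property id; a parser may reject it if unsupported.
            reader.setProperty(
                "http://xml.org/sax/properties/lexical-handler", collector);
            reader.parse(new InputSource(new StringReader(xml)));
            return collector;
        } catch (Exception e) {
            // Wrapped only to keep the sketch short.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Collector c = parse(SAMPLE);
        System.out.println("comments: " + c.comments);
        System.out.println("text length: " + c.text.length());
    }
}
```

Because comment() is called in document order between the startElement() and endElement() calls for SCRIPT, the script text can be attributed to its element without character offsets.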
I could of course pre-parse the file and store all comments but then
the parser does not give me character offsets in the callbacks that
would let me insert them back again. It would be a very messy
workaround.
Any ideas would be highly appreciated.
(*) HTML Parser bugs that I struggled with:
HTML Parser parses HTML between script tags
http://developer.java.sun.com/developer/bugParade/bugs/4912066.html
HTML Parser creates extraneous attribute endtag="true"
http://developer.java.sun.com/developer/bugParade/bugs/4912072.html
Parser implies <body> before <frameset>
http://developer.java.sun.com/developer/bugParade/bugs/4950344.html
HTML Parser ignores noscript tag
Duplicate of RFE:
http://developer.java.sun.com/developer/bugParade/bugs/4308782.html
duplicate of
http://developer.java.sun.com/developer/bugParade/bugs/4296022.html
HTML Parser ignores text between script tags placed in head section
http://developer.java.sun.com/developer/bugParade/bugs/4912066.html
Closed due to
http://developer.java.sun.com/developer/bugParade/bugs/4761273.html
Thanks,
Bernard