HTML Processing in Java

Honza

Hello,

I would like to process HTML pages in Java. The very first task would
be to ignore unnecessary information such as comments (everything inside
<!-- -->) or images.
What would be the best starting point?
I have found JTidy and HTML Parser on SourceForge, but neither of them
seems able to ignore tags - or did I miss something?

Thank you for any clue
Honza
 
zero

I would be very surprised if either of those actually did anything with the
comments. If they do, why not just remove the code that handles them?
 
Oliver Wong

I haven't used the parsers you mention, but with any SAX-based parser
you'll just receive a series of "events", one for each "thing" it
discovers in the HTML document, and you can simply ignore the
"comment" events.

- Oliver
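
A minimal sketch of the event-based idea, using only the SAX parser that ships with the JDK. Assumptions: the class and method names are made up for illustration, and the input must be well-formed XHTML, because the JDK parser is an XML parser; real-world HTML would need an HTML-aware SAX parser instead. With the standard SAX API, comments are only reported to a LexicalHandler, so a plain DefaultHandler never sees them at all.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class; not part of any library mentioned in this thread.
public class SaxTextExtractor {

    /** Keeps character data and drops everything else. */
    static class TextHandler extends DefaultHandler {
        final StringBuilder text = new StringBuilder();
        int skippedImages = 0;

        @Override
        public void startElement(String uri, String localName,
                                 String qName, Attributes atts) {
            // Images arrive as ordinary element events; we only count them.
            if ("img".equalsIgnoreCase(qName)) {
                skippedImages++;
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }
        // No comment callback here: a ContentHandler is never told about comments.
    }

    /** Parses well-formed XHTML and returns its text with whitespace collapsed. */
    public static String extractText(String xhtml) throws Exception {
        TextHandler handler = new TextHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.text.toString().trim().replaceAll("\\s+", " ");
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><body><!-- hidden --><p>Hello "
                + "<img src=\"a.png\"/>world</p></body></html>";
        System.out.println(extractText(page)); // prints "Hello world"
    }
}
```

The same handler shape should carry over to an HTML-aware SAX parser, since only the parser that feeds the events changes.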
 
Abhijat Vatsyayan

Take a look at the classes ParserDelegator (in package
javax.swing.text.html.parser) and HTMLEditorKit.ParserCallback (in
package javax.swing.text.html).

You can subclass ParserCallback and pass your own callback to the
parse method of a ParserDelegator object. This is quite like using SAX
parsers for XML documents.

Abhijat
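
A minimal sketch of that approach, under a few assumptions: the class names SwingHtmlFilter and TextOnlyCallback are made up for illustration, and the sample markup is invented. The callback keeps text and does nothing when a comment or a simple tag such as an image is reported.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical example class built on the JDK's Swing HTML parser.
public class SwingHtmlFilter {

    /** Collects text; comments and simple tags (such as img) are ignored. */
    static class TextOnlyCallback extends HTMLEditorKit.ParserCallback {
        final StringBuilder text = new StringBuilder();

        @Override
        public void handleText(char[] data, int pos) {
            text.append(data).append(' ');
        }

        @Override
        public void handleComment(char[] data, int pos) {
            // Deliberately empty: comments are dropped.
        }

        @Override
        public void handleSimpleTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
            // Deliberately empty: tags without content, such as images, are dropped.
        }
    }

    /** Parses the HTML and returns the collected text with whitespace collapsed. */
    public static String extractText(String html) throws IOException {
        TextOnlyCallback callback = new TextOnlyCallback();
        // The third argument tells the parser to ignore charset declarations.
        new ParserDelegator().parse(new StringReader(html), callback, true);
        return callback.text.toString().trim().replaceAll("\\s+", " ");
    }

    public static void main(String[] args) throws IOException {
        String page = "<html><body><!-- hidden --><p>Hello "
                + "<img src=\"a.png\">world</p></body></html>";
        System.out.println(extractText(page));
    }
}
```

Overriding handleStartTag and handleEndTag in the same way would allow whole elements to be filtered, at the cost of tracking nesting by hand.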
 
Honza

Thank you guys, I will check the possibilities.

I have found another interesting application which could also be a
solution to my problem. Its name is Muffin - http://muffin.doit.org/
It is a highly customizable proxy written in Java that lets you filter
HTML content.
I am going to try it out tomorrow.

Thanks a lot
Honza
 
Honza

Hello Abhijat,

I tested HTMLEditorKit today. It is really very easy to use and it
would be appropriate for my purpose...

BUT: I tested it with "real world" HTML pages and I find it not
robust enough. The results are not accurate enough, and the number of
errors is too high when parsing any "badly written" HTML page.

I have found a nice page benchmarking "real world" SAX HTML parsers. I
think I will use one of them...

Link: http://www.portletbridge.org/saxbenchmark

Honza
 
