HTML Processing in Java

Honza · Nov 29, 2005

Hello,

I would like to process HTML pages in Java. The very first task would
be to ignore unnecessary information like comments (everything in <!-- ... -->) or images.
What would be the best starting point?
I have found JTidy and HTML Parser on SourceForge, but neither of them seems
able to ignore tags - or did I miss it?

Thank you for any clue
Honza

Roedy Green · Nov 29, 2005

See http://mindprod.com/products1.html#ENTITIES
to strip out the HTML and, optionally, convert the &xxx; entities back to
normal characters.

zero · Nov 29, 2005

I would be very surprised if either of those actually did anything with the
comments. If they do, why not just remove the code that handles them?

Oliver Wong · Nov 29, 2005

I haven't used the parsers you're talking about, but if you find any
SAX-based parser, you'll just receive a bunch of "events" representing the
discovery of "things" in an HTML document, and you can just ignore the
"comment" events.

- Oliver
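
The event-based idea can be sketched with the SAX parser that ships with the JDK. This is only an illustration under a strong assumption: the input must be well-formed XML (XHTML), because the JDK parser rejects ordinary tag-soup HTML. The class name SaxTextCollector and its collect method are made up for this example. A plain DefaultHandler is never told about comments at all (they are only reported to a separately registered LexicalHandler), so "ignoring" them means doing nothing, and images are skipped because the handler does not react to element events.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: collects only the character data of a well-formed document.
class SaxTextCollector extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();

    // Called for text between tags; may be called several times per text run.
    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    // startElement/endElement are left at their empty defaults, so <img/> and
    // every other tag is skipped. Comments never reach this handler.

    static String collect(String xhtml) {
        try {
            SaxTextCollector handler = new SaxTextCollector();
            SAXParserFactory.newInstance()
                    .newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
            return handler.text.toString();
        } catch (Exception e) {
            // Malformed input ends up here; real-world HTML often will.
            throw new IllegalStateException("Could not parse input", e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><body><!-- hidden --><p>Hello "
                + "<img src=\"a.png\"/>world</p></body></html>";
        System.out.println(collect(page)); // prints "Hello world"
    }
}
```

For pages that are not well-formed, the same handler could in principle be plugged into an HTML-tolerant SAX parser instead of the JDK one.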

Roedy Green · Nov 29, 2005

With a simple modification to the utility I linked above
(http://mindprod.com/products1.html#ENTITIES), you could strip just the
comments rather than all HTML tags.
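
As a rough illustration of stripping only comments (and, as the original question also asks, images), here is a regex-based sketch. It is not the mindprod code, and the class name CommentStripper is invented for this example. It assumes comments are properly terminated and that "<!--" or "<img" never appears inside scripts or attribute values, which real pages do not guarantee.

```java
import java.util.regex.Pattern;

// Hypothetical helper: removes HTML comments and <img> tags, leaves the rest alone.
class CommentStripper {
    // DOTALL lets '.' match newlines, so multi-line comments are covered;
    // the reluctant '*?' stops at the first closing "-->".
    private static final Pattern COMMENT =
            Pattern.compile("<!--.*?-->", Pattern.DOTALL);

    // Matches an <img ...> tag up to the next '>', case-insensitively.
    private static final Pattern IMG =
            Pattern.compile("<img\\b[^>]*>", Pattern.CASE_INSENSITIVE);

    static String strip(String html) {
        String withoutComments = COMMENT.matcher(html).replaceAll("");
        return IMG.matcher(withoutComments).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(strip("a<!-- x\ny -->b<img src='p.png'>c")); // prints "abc"
    }
}
```

A regex like this is fine for quick filtering, but a parser is the safer choice when the pages are messy.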

Abhijat Vatsyayan · Nov 29, 2005

Take a look at the classes ParserDelegator (in package
javax.swing.text.html.parser) and HTMLEditorKit.ParserCallback (in package
javax.swing.text.html).

You can subclass ParserCallback and pass your subclass to the
parse method of a ParserDelegator object. This is quite like using SAX
parsers for XML documents.

Abhijat
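
A minimal sketch of that approach might look like the following. The class name TextOnlyExtractor and the extract method are made up for the example; the Swing classes and the callback method are the real ones. Only handleText is overridden, so comments (handleComment) and images (handleSimpleTag) fall through to the empty default implementations and are dropped.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: keeps the text of a page, ignores comments and tags.
class TextOnlyExtractor extends HTMLEditorKit.ParserCallback {
    private final StringBuilder text = new StringBuilder();

    // Called for each run of text; pos is the offset in the source document.
    @Override
    public void handleText(char[] data, int pos) {
        text.append(data).append(' ');
    }

    // handleComment, handleStartTag, handleEndTag and handleSimpleTag are
    // deliberately not overridden: their defaults do nothing.

    static String extract(String html) {
        TextOnlyExtractor callback = new TextOnlyExtractor();
        try (StringReader reader = new StringReader(html)) {
            // true = ignore any charset declared inside the document
            new ParserDelegator().parse(reader, callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return callback.text.toString().trim();
    }

    public static void main(String[] args) {
        String page = "<html><body><!-- hidden --><p>Hello "
                + "<img src=\"a.png\"> world</p></body></html>";
        System.out.println(extract(page));
    }
}
```

To filter on specific tags instead of dropping all of them, override handleStartTag or handleSimpleTag and compare the HTML.Tag argument against constants such as HTML.Tag.IMG.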

Honza · Nov 29, 2005

Thank you guys, I will check the possibilities.

I have found another interesting application which could also be
a solution to my problem. Its name is Muffin - http://muffin.doit.org/
It is a highly customizable proxy written in Java that lets you filter HTML
content.
I am going to try it out tomorrow.

Thanks a lot
Honza

Honza · Nov 30, 2005

Hello Abhijat,

I have tested HTMLEditorKit today. It is really very easy to use and it
would be appropriate for my purpose...

BUT: I've tested it with "real world" HTML pages and I find it not
robust enough. The results are not accurate enough and the number of errors
is too high when parsing any "badly written" HTML page.

I have found a nice page benchmarking "real world" SAX HTML parsers. I
think I will use one of them...

Link: http://www.portletbridge.org/saxbenchmark

Honza
