Extracting bolds and italics from HTML

Ezee

Hi,

I am trying to build a topic-focused web crawler. For this, I have to
run some calculations on the contents of a URL before adding that URL
to my database.
I found a very useful word-count program on the Sun Java forum, but its
problem is that it also includes the HTML tags in the count.
Can anybody please tell me whether there is a Java API or online help
available for:

i) A program which counts words in an HTML file but doesn't include the
HTML tags.
ii) A program which counts only bolds and italics in an HTML file.

Thanks in anticipation :)
 
Harald

Ezee said:
i) A program which counts words in an HTML file but doesn't include the
HTML tags.

With http://www.ebi.ac.uk/~kirsch/monq-doc/monq/programs/Grep.html
you can do things like

java monq.programs.Grep '<[^>]+>' '' '[A-Za-z]+' '%0\n' <yourhtml.html

on the command line to fetch all words that do not belong to a
tag. The mechanism behind it is
http://www.ebi.ac.uk/~kirsch/monq-doc/monq/jfa/Nfa.html which you can
use programmatically.
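
For anyone who would rather stay within the standard library, here is a
minimal sketch of the same idea using java.util.regex instead of monq. It
is not the Nfa API from the link above; the class name HtmlWordCount and
the method countWords are made up for illustration. It reuses the two
patterns from the Grep command line and assumes that a "word" is a run of
ASCII letters and that every tag fits the simple `<...>` shape, so
comments, script contents and a `>' inside an attribute value are not
handled.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HtmlWordCount {
    // Same tag pattern as in the Grep command: '<', non-'>' characters, '>'.
    private static final Pattern TAG = Pattern.compile("<[^>]+>");
    // Same word pattern as in the Grep command: a run of ASCII letters.
    private static final Pattern WORD = Pattern.compile("[A-Za-z]+");

    /** Counts words in the HTML after replacing every tag with a space. */
    public static int countWords(String html) {
        // Replace with a space (not "") so "foo<b>bar</b>" stays two words.
        String text = TAG.matcher(html).replaceAll(" ");
        Matcher m = WORD.matcher(text);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String html = "<html><body><p>Hello <b>bold</b> world</p></body></html>";
        System.out.println(countWords(html)); // prints 3
    }
}
```

For real-world pages a proper HTML parser would likely be more robust
than regular expressions; this only shows the mechanism.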

ii) A program which counts only bolds and italics in an HTML file.

This would require looking for `<b>' and `<em>' tags, which can easily be
added as pattern/action pairs to the Nfa doing the word counting.
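
A rough sketch of the tag-counting part, again with plain java.util.regex
rather than monq pattern/action pairs. The class and method names
(EmphasisCount, countBold, countItalic) are hypothetical. Treating
`<strong>' as bold and `<i>' as italic alongside `<b>' and `<em>' is an
assumption on top of what is described above, as is matching tags
case-insensitively and allowing attributes.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EmphasisCount {
    // Opening <b> or <strong> tag, optionally with attributes.
    // The "\\s" requirement keeps <body> and <br> from matching.
    private static final Pattern BOLD =
        Pattern.compile("<(?:b|strong)(?:\\s[^>]*)?>", Pattern.CASE_INSENSITIVE);
    // Opening <i> or <em> tag; <img> and <embed> do not match.
    private static final Pattern ITALIC =
        Pattern.compile("<(?:i|em)(?:\\s[^>]*)?>", Pattern.CASE_INSENSITIVE);

    // Counts non-overlapping matches of the pattern in the input.
    private static int count(Pattern p, String html) {
        Matcher m = p.matcher(html);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    /** Number of opening bold tags in the HTML. */
    public static int countBold(String html) {
        return count(BOLD, html);
    }

    /** Number of opening italic tags in the HTML. */
    public static int countItalic(String html) {
        return count(ITALIC, html);
    }

    public static void main(String[] args) {
        String html = "<body><p><b>one</b> <i>two</i> <em>three</em><br></p></body>";
        System.out.println(countBold(html));   // prints 1
        System.out.println(countItalic(html)); // prints 2
    }
}
```

This counts opening tags only; counting the words inside bold or italic
elements would need the tag positions to be tracked as well.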

I am off to the pub now, otherwise I would've written the class, max
20 lines:) To download the software see signature.

Harald.
 
