HTML / text parsing - best module(s) to use?

Unknown Poster

Assuming I have used LWP to get an HTML document into a response
object, I'd like to know what module(s) to use for the following task.
I would like to disregard the content of HTML tags (that is, everything
in angled brackets) and then break the "real" text into words so that
I can do a frequency count, etc.

Would someone experienced in this sort of task recommend
a combination of the following (or others I may have missed)?

Text::WordParse
HTML::Parse
HTML::Parser
HTML::PullParser
HTML::TokeParser
 
A. Sinan Unur

(e-mail address removed) (Unknown Poster) wrote in
> Assuming I have used LWP to get an HTML document into a response
> object, I'd like to know what module(s) to use for the following task.
> I would like to disregard the content of HTML tags (that is, everything
> in angled brackets) and then break the "real" text into words so that
> I can do a frequency count, etc.

HTML::Parser will get you the plain text with little effort. See:

http://search.cpan.org/src/GAAS/HTML-Parser-3.28/eg/htext
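For the word-count half of the question, something along these lines might do. It is only a minimal sketch, not the htext script itself: the sub name word_frequencies, the sample HTML string, and the rule for what counts as a "word" (runs of ASCII letters and apostrophes, lowercased) are all my own assumptions, and it needs HTML::Parser installed from CPAN.

```perl
use strict;
use warnings;
use HTML::Parser ();

# Hypothetical helper: return a hash ref mapping each lowercased word
# in the visible text of $html to the number of times it occurs.
sub word_frequencies {
    my ($html) = @_;
    my $text = '';

    my $parser = HTML::Parser->new(
        api_version => 3,
        # Only text events reach this handler, so tags are dropped.
        # 'dtext' asks for the text with entities such as &amp; decoded.
        text_h => [ sub { $text .= shift() . ' ' }, 'dtext' ],
    );
    # Deliver each run of text in one piece rather than in fragments.
    $parser->unbroken_text(1);
    # Skip <script> and <style> contents, which are not visible text.
    $parser->ignore_elements(qw(script style));
    $parser->parse($html);
    $parser->eof;

    # Assumed definition of a word: letters and apostrophes only.
    my %count;
    $count{ lc $1 }++ while $text =~ /([A-Za-z']+)/g;
    return \%count;
}

# Made-up sample input, just to exercise the sub.
my $html = '<html><body><h1>The cat</h1>'
         . '<p>the <b>dog</b> &amp; the cat</p></body></html>';
my $freq = word_frequencies($html);

# Print most frequent words first, ties broken alphabetically.
printf "%-5s %d\n", $_, $freq->{$_}
    for sort { $freq->{$b} <=> $freq->{$a} || $a cmp $b } keys %$freq;
```

If the document came from LWP, passing $response->decoded_content in place of the sample string should work. Note that a space is appended after every text run, so a word split by inline markup (for example wo<b>rd</b>) would be counted as two words; adjust that if it matters for your data.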
 
