Extracting text content from a web page

kjhjhjhjadsasda

I'm trying to write a Perl script that extracts the text content
from a web page in a meaningful way. I've tried modules and regular
expressions but haven't found a good approach yet.

To avoid "crappy" text slipping through, is there a way of extracting
only sentences? For example (a rough sketch follows the list):

- clean the HTML of its tags
- extract sentences by counting the words between punctuation marks,
or something similar
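
Here is a rough sketch of those two steps, assuming the HTML comes
from a file named on the command line. The tag-stripping regex is
naive, and the four-word minimum is an arbitrary guess at "sentence
length":

#!/usr/bin/perl
use strict;
use warnings;

local $/;                       # slurp mode: read the whole file at once
my $html = <>;

# Step 1: crude tag removal. A regex is fragile on real-world HTML
# (a proper parser is safer; see the replies below), but it shows
# the idea.
$html =~ s{<(?:script|style)\b.*?</(?:script|style)>}{ }gis;
$html =~ s{<[^>]*>}{ }gs;
$html =~ s/\s+/ /g;

# Step 2: keep only chunks that end in sentence punctuation and
# have enough words to look like prose rather than menu labels.
for my $chunk ($html =~ /([^.!?]+[.!?])/g) {
    my @words = split ' ', $chunk;
    print "$chunk\n" if @words >= 4;
}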

Any other ideas on how to cleanly pick out the content text from a web page?

Thanks
M
 
Ron Savage

On Thu, 29 Sep 2005 06:06:03 +1000, (e-mail address removed) wrote:

Hi M

HTML::TokeParser is the one you want. The docs are excellent.

The author has also written a book - Perl & LWP - which I recommend.

Note: Download the list of misprints, though!
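
For anyone reading along, the module's basic pattern looks roughly
like this (the file name is a placeholder for whatever you fetched):

use strict;
use warnings;
use HTML::TokeParser;

my $p = HTML::TokeParser->new('index.html')
    or die "Can't open index.html: $!";

# Skip ahead to the <title> start tag, then read the text after it.
if ($p->get_tag('title')) {
    print 'Title: ', $p->get_trimmed_text, "\n";
}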
 
kjhjhjhjadsasda

Hi Ron

TokeParser is great. However, I still get a lot of "menu text", alt
text, etc. Is there a way to have it accept only "sentence length" text?
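
One way to approach it, sketched here: walk the token stream
yourself, skip everything inside <script>/<style>, glue the text
tokens together, then keep only runs that end in sentence
punctuation and reach a minimum word count. The file name and the
five-word threshold are assumptions, not anything TokeParser
provides:

#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser;
use HTML::Entities qw(decode_entities);

my $p = HTML::TokeParser->new('page.html')
    or die "Can't open page.html: $!";

my $skip = 0;    # nesting depth inside <script>/<style>
my $text = '';

while (my $t = $p->get_token) {
    my $type = $t->[0];
    if    ($type eq 'S' and $t->[1] =~ /^(?:script|style)$/) { $skip++ }
    elsif ($type eq 'E' and $t->[1] =~ /^(?:script|style)$/) { $skip-- if $skip }
    elsif ($type eq 'T' and not $skip) {
        $text .= ' ' . decode_entities($t->[1]);   # text tokens keep raw entities
    }
}

$text =~ s/\s+/ /g;

# Menu items and alt-style fragments rarely end in sentence
# punctuation or run to several words, so this drops most of them.
for my $s ($text =~ /([^.!?]+[.!?])/g) {
    my @words = split ' ', $s;
    print "$s\n" if @words >= 5;
}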

What do you mean by "download the list of misprints"?

Thanks!
M

Ron Savage wrote:
 
Sherm Pendley

(e-mail address removed) writes:

Note - upside-down quoting fixed. Please don't do that.
Ron Savage wrote:


What do you mean by "download the list of misprints"?

Errata. Typos and other errors in a book are often listed, along with
the corrections, of course, on the publisher's web site.

For this particular book, the publisher is O'Reilly, and the errata
are listed here:

<http://www.oreilly.com/catalog/perllwp/>

sherm--
 
