Retrieving only the text portion of a web page

googler

I want to fetch the content of specific web pages and do some processing
on it. I found that the LWP module can help with the first part. I
have never used LWP before, and I found some simple code like the example
below that returns a web page's content.

use LWP::Simple;
my $url = 'http://www.yahoo.com';
my $content = get $url;   # get returns undef if the fetch fails

I am interested only in the text part of the web page (that is,
without any tags, cross links, etc.). Is there an easy way to get this
(without having to search through the entire content and filter out
the parts that I don't need)?
 
Xicheng Jia

You don't have to use Perl. If you are on a Linux box and have
lynx installed, then:

lynx -dump -nolist http://www.yahoo.com

(You can certainly try a Win32 version of lynx.)
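
If you would rather stay in Perl, here is a minimal sketch of one way to do it. It assumes the HTML::TreeBuilder module (from the HTML-Tree distribution on CPAN) is installed alongside LWP; the helper name html_to_text is made up for this example. The as_text method returns the text under the parsed tree and leaves out anything inside script and style elements, so the output will not be laid out the way lynx -dump formats it.

```perl
use strict;
use warnings;
use LWP::Simple qw(get);
use HTML::TreeBuilder;

# html_to_text is a hypothetical helper name, not part of any module.
# It parses an HTML string and returns only its text content.
sub html_to_text {
    my ($html) = @_;
    my $tree = HTML::TreeBuilder->new_from_content($html);
    my $text = $tree->as_text;   # text nodes only; script/style are skipped
    $tree->delete;               # free the parse tree explicitly
    return $text;
}

# Fetch and print only when a URL is given on the command line.
if (my $url = shift @ARGV) {
    my $content = get($url);
    die "Could not fetch $url\n" unless defined $content;
    print html_to_text($content), "\n";
}
```

Saved as, say, page_text.pl (the file name is also just an assumption), it could be run as: perl page_text.pl http://www.yahoo.com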
Regards,
Xicheng
 
