Retrieving only the text portion of a web page

Discussion in 'Perl Misc' started by googler, May 9, 2007.

  1. googler

    googler Guest

    I want to fetch the content of specific web pages and do some
    processing on it. I found that the LWP modules can help with the
    first part. I have never used LWP before, but I found some simple
    code like the snippet below, which returns the content of a web page.

    use LWP::Simple;

    my $url     = 'http://www.yahoo.com';
    my $content = get($url);    # get() returns undef if the request fails

    I am interested only in the text part of the web page (that is,
    without any tags, links, etc.). Is there an easy way to get this,
    without having to search through the entire content and filter out
    the parts that I don't need?
     
    googler, May 9, 2007

  2. Xicheng Jia

    Xicheng Jia Guest

    On May 8, 10:18 pm, googler wrote:
    > I want to get the content of specific web pages and do some processing
    > on them. I found that the LWP class can help with the first part. I
    > have never used LWP before and found some simple code like the one
    > below that returns a web page content.
    >
    > my $url = 'http://www.yahoo.com';
    > use LWP::Simple;
    > my $content = get $url;
    >
    > I am interested in only the text part of the web page (that is,
    > without any tags, cross links etc). Is there an easy way to get this
    > (without having to search through the entire content and filtering out
    > the part that I don't need)?


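    You don't have to use Perl for this. If you are on a Linux box and
    have lynx installed, then:

    lynx -dump -nolist http://www.yahoo.com

    Here -dump writes the rendered page to standard output as plain text,
    and -nolist leaves out the list of links that -dump would otherwise
    append. (You can certainly try a Win32 build of lynx as well.)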
    Regards,
    Xicheng
     
    Xicheng Jia, May 9, 2007
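
For anyone who would rather stay in Perl, here is a minimal sketch of one possible approach, not something proposed in the thread itself. It assumes the HTML::Parser module from CPAN is installed (it normally comes along with LWP). The helper name html_to_text and the list of "block" tags are illustrative choices, and the result is much rougher than what lynx renders.

```perl
use strict;
use warnings;
use HTML::Parser ();

# Hypothetical helper: return the visible text of an HTML string.
sub html_to_text {
    my ($html) = @_;
    my $text = '';

    # Tags that usually separate blocks of text; a space is added around
    # them so that adjacent paragraphs do not run together.
    my %block = map { $_ => 1 }
        qw(p div br hr li tr td th h1 h2 h3 h4 h5 h6 title table ul ol);
    my $separate = sub { $text .= ' ' if $block{ $_[0] } };

    my $parser = HTML::Parser->new(
        api_version => 3,
        start_h     => [ $separate, 'tagname' ],
        end_h       => [ $separate, 'tagname' ],
        # 'dtext' is the text with entities such as &amp; decoded
        text_h      => [ sub { $text .= $_[0] }, 'dtext' ],
    );

    # Drop <script> and <style> elements together with their content.
    $parser->ignore_elements(qw(script style));

    $parser->parse($html);
    $parser->eof;

    # Collapse runs of whitespace and trim both ends.
    $text =~ s/\s+/ /g;
    $text =~ s/^\s+|\s+$//g;
    return $text;
}

my $sample = '<h1>Hello</h1><p>World &amp; <a href="/x">friends</a></p>'
           . '<script>var x = 1;</script>';
print html_to_text($sample), "\n";    # prints: Hello World & friends
```

To combine this with the code from the original post, pass the string returned by get($url) to html_to_text, after checking that it is defined.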