extracting text content from web page

Discussion in 'Perl Misc' started by kjhjhjhjadsasda@urbanhabit.com, Sep 28, 2005.

  1. Guest

    I'm trying to write a Perl script that extracts text content from a
    web page in a meaningful way. I've tried modules and regular
    expressions but haven't found a good way yet.

    To avoid "crappy" text slipping through, is there a way of extracting
    only sentences? For example:

    - strip the HTML tags
    - extract sentences by counting the number of words between
      punctuation marks, or something similar.

    Any other ideas on how to nicely pick out content text from a webpage?

    Thanks
    M
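
A minimal sketch of the second step described above, assuming the tags have already been stripped. The helper name `sentences_only` and the default threshold of five words are illustrative assumptions, not something proposed in the thread. It treats any run of text ending in `.`, `!` or `?` as a candidate sentence and keeps only candidates with enough words.

```perl
use strict;
use warnings;

# Hypothetical helper: keep only chunks of plain text that look like
# sentences. $min_words is an arbitrary threshold, not a recommended value.
sub sentences_only {
    my ($text, $min_words) = @_;
    $min_words = 5 unless defined $min_words;

    # Collapse whitespace so line breaks do not split a sentence.
    $text =~ s/\s+/ /g;

    # Candidate sentences: runs of text terminated by ., ! or ?
    my @candidates = $text =~ /([^.!?]+[.!?]+)/g;

    my @kept;
    for my $s (@candidates) {
        $s =~ s/^\s+|\s+$//g;
        # Count only tokens containing a word character, so separators
        # such as "|" in menu bars are not counted as words.
        my @words = grep { /\w/ } split ' ', $s;
        push @kept, $s if @words >= $min_words;
    }
    return @kept;
}

my $sample = "Home | News | Contact. This is a real sentence with enough words. Login";
print "$_\n" for sentences_only($sample);
# prints: This is a real sentence with enough words.
```

This is only a heuristic: abbreviations such as "e.g." or decimal numbers will split a sentence early, and the threshold probably needs tuning per site.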
     
    , Sep 28, 2005
    #1

  2. Ron Savage Guest

    On Thu, 29 Sep 2005 06:06:03 +1000, wrote:

    Hi M

    HTML::TokeParser is the one you want. The docs are excellent.

    The author has also written a book - Perl & LWP - which I recommend.

    Note: Download the list of misprints, though!
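
As a rough illustration of the HTML::TokeParser suggestion, the sketch below walks the token stream and collects text tokens while skipping anything inside `<script>` or `<style>`. The helper name `text_chunks` is an assumption made for this example; the module calls used (`new` with a reference to the document string, `get_token`) and `HTML::Entities::decode_entities` are part of the HTML-Parser distribution.

```perl
use strict;
use warnings;
use HTML::TokeParser;
use HTML::Entities qw(decode_entities);

# Hypothetical helper: return the visible text chunks of an HTML string.
sub text_chunks {
    my ($html) = @_;
    my $p = HTML::TokeParser->new(\$html)
        or die "Cannot parse HTML";

    my @chunks;
    my $skip = 0;    # non-zero while inside <script> or <style>
    while (my $tok = $p->get_token) {
        my $type = $tok->[0];
        if ($type eq 'S' && $tok->[1] =~ /^(?:script|style)$/) {
            $skip++;
        }
        elsif ($type eq 'E' && $tok->[1] =~ /^(?:script|style)$/) {
            $skip-- if $skip;
        }
        elsif ($type eq 'T' && !$skip) {
            my $text = $tok->[1];
            # get_token returns raw text; decode entities such as &amp;
            # unless the parser flagged the token as literal data.
            $text = decode_entities($text) unless $tok->[2];
            $text =~ s/\s+/ /g;
            $text =~ s/^ | $//g;
            push @chunks, $text if length $text;
        }
    }
    return @chunks;
}

print "$_\n" for text_chunks(
    '<p>Hello &amp; welcome.</p><script>var x = 1;</script>'
);
```

Each returned chunk is the text between two tags, so menu labels and link text still come through; filtering those out is a separate step.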
     
    Ron Savage, Sep 29, 2005
    #2

  3. Dr.Ruud Guest

    Dr.Ruud, Sep 29, 2005
    #3
  4. Guest

    Hi Ron

    TokeParser is great. However, I still get a lot of "menu text", alt
    text, etc. Is there a way to have it accept only "sentence-length" text?

    What do you mean by "download the list of misprints"?

    Thanks!
    M
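
One possible way to cut down on menu text and alt text, sketched under some assumptions: only text inside `<p>` elements is considered, and a paragraph is kept only if it has at least a given number of words. The helper name `paragraph_text` and the threshold of eight words are made up for this example. The alt text appears because `get_text` substitutes the `alt` attribute for `<img>` and `<applet>` by default; the module documents the `textify` hash as the way to change that.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical helper: text of <p> elements containing at least
# $min_words words. The default threshold is an arbitrary assumption.
sub paragraph_text {
    my ($html, $min_words) = @_;
    $min_words = 8 unless defined $min_words;

    my $p = HTML::TokeParser->new(\$html)
        or die "Cannot parse HTML";

    # An empty textify map stops get_text from inserting alt text.
    $p->{textify} = {};

    my @kept;
    while ($p->get_tag('p')) {
        # Read up to </p>, or up to the next <p> if this one was never closed.
        my $text  = $p->get_trimmed_text('/p', 'p');
        my @words = grep { /\w/ } split ' ', $text;
        push @kept, $text if @words >= $min_words;
    }
    return @kept;
}

my $html = '<ul><li><a href="/">Home</a></li><li><a href="/n">News</a></li></ul>'
         . '<p><img src="x.png" alt="Logo image"> The quick brown fox jumps '
         . 'over the lazy sleeping dog.</p><p>Too short.</p>';
print "$_\n" for paragraph_text($html);
```

Pages that put their content in `<div>` or table cells rather than `<p>` would need a different set of tags, so treat the tag choice as something to adapt per site.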

    Ron Savage wrote:

    > On Thu, 29 Sep 2005 06:06:03 +1000, wrote:
    >
    > Hi M
    >
    > HTML::TokeParser is the one you want. The docs are excellent.
    >
    > The author has also written a book - Perl & LWP - which I recommend.
    >
    > Note: Download the list of misprints, though!
     
    , Sep 30, 2005
    #4
  5. writes:

    Note - upside-down quoting fixed. Please don't do that.

    > Ron Savage wrote:
    >
    >> HTML::TokeParser is the one you want. The docs are excellent.
    >>
    >> The author has also written a book - Perl & LWP - which I recommend.
    >>
    >> Note: Download the list of misprints, though!
    >
    > What do you mean by download misprints?

    Errata. Typos and other errors in a book are often listed, along with the
    corrections of course, on a publisher's web site.

    For this particular book, the publisher is O'Reilly, and the errata are
    listed here:

    <http://www.oreilly.com/catalog/perllwp/>

    sherm--

    --
    Cocoa programming in Perl: http://camelbones.sourceforge.net
    Hire me! My resume: http://www.dot-app.org
     
    Sherm Pendley, Sep 30, 2005
    #5