html2text but preserving text links

Discussion in 'Perl Misc' started by Felix, Feb 21, 2004.

  1. Felix

    Felix Guest

    The script below tries to convert HTML to text while preserving
    textual links. However, with www.news.com it renders many lines
    bunched together instead of neatly separated, the way lynx renders
    the same page (see www.marcfest.com/qxi5/news.cgi to see what I
    mean). Does anybody know how I can improve the script? Using
    backticks with lynx is not an option, btw.

    Thank you very much.

    Marc

    SCRIPT:

    #!/usr/bin/perl
    use strict;

    use LWP::Simple;
    use HTML::TagFilter;

    # Fetch the page to convert.
    my $content = get('http://www.news.com')
        or die "could not fetch page\n";

    # Strip HTML comments and every tag not listed under allow; the
    # text inside stripped tags is left in place.
    my $tf = HTML::TagFilter->new(
        strip_comments => 1,
        allow          => {
            a      => { 'any' },
            br     => { 'any' },
            p      => { 'any' },
            script => { 'any' },
            style  => { 'any' },
        },
    );
    $content = $tf->filter($content);

    print $content;
     
    Felix, Feb 21, 2004
    #1

  2. On Sat, 21 Feb 2004 06:51:05 -0800, Felix wrote:

    > The script below tries to convert HTML to text while preserving
    > textual links. However, with www.news.com it renders many lines
    > bunched together instead of neatly separated, the way lynx renders
    > the same page (see www.marcfest.com/qxi5/news.cgi to see what I
    > mean). Does anybody know how I can improve the script? Using
    > backticks with lynx is not an option, btw.


    What exactly are you trying to do? I understand the output doesn't
    look the greatest, but I'm not seeing your point. Are you trying to
    extract just the links? Just the text?

    What command-line options are you using with lynx? Knowing that may
    help.

    --
    Jim

    Copyright notice: all code written by the author in this post is
    released under the GPL. http://www.gnu.org/licenses/gpl.txt
    for more information.

    a fortune quote ...
    World War Three can be averted by adherence to a strictly
    enforced dress code!
     
    James Willmore, Feb 21, 2004
    #2

  3. Bob Walton

    Bob Walton Guest

    Felix wrote:

    > The script below tries to convert HTML to text while preserving
    > textual links. However, with www.news.com it renders many lines
    > bunched together instead of neatly separated, the way lynx renders
    > the same page (see www.marcfest.com/qxi5/news.cgi to see what I
    > mean). Does anybody know how I can improve the script? Using
    > backticks with lynx is not an option, btw.

    ....

    Sounds like maybe you could:

    use Text::Wrap;

    to get more presentable text?
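
    For instance, a minimal sketch (assuming your filtered text is in
    $text; the 72-column width is just an example, Text::Wrap's
    default is 76):

    use strict;
    use warnings;
    use Text::Wrap qw(wrap);

    $Text::Wrap::columns = 72;   # wrap at 72 characters

    my $text = "one very long line of filtered output ...";
    # wrap(initial_indent, subsequent_indent, text)
    print wrap('', '', $text);

    The fill() function from the same module will also re-flow text
    that already contains line breaks, paragraph by paragraph.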

    --
    Bob Walton
    Email: http://bwalton.com/cgi-bin/emailbob.pl
     
    Bob Walton, Feb 23, 2004
    #3
  4. Felix

    Felix Guest

    I figured out a way to do this: keep the existing code, but convert
    the <div> tags into <br> tags first (roughly like the substitution
    sketched below).
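
    A sketch of that pre-pass (the exact regexes are an approximation),
    run on $content before the $tf->filter() call:

    # Turn <div> boundaries into <br> so the filtered text keeps its
    # line structure instead of running together.
    $content =~ s/<div\b[^>]*>/<br>/gi;   # opening <div ...> tags
    $content =~ s{</div\s*>}{<br>}gi;     # closing </div> tags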

    Thanks for getting back to me.

    Marc.


    > What command line options are you using with lynx? That may help.
    >
    > --
    > Jim
    >
    > Copyright notice: all code written by the author in this post is
    > released under the GPL. http://www.gnu.org/licenses/gpl.txt
    > for more information.
    >
    > a fortune quote ...
    > World War Three can be averted by adherence to a strictly
    > enforced dress code!
     
    Felix, Feb 23, 2004
    #4
