Cleaning HTML ;-)

Discussion in 'Perl Misc' started by Reinhard Glauber, Jan 21, 2006.

  1. Hi Perl-Gurus,

    I need to clean an HTML file so that I get plain text.
    So, now that I know that there is something called perldoc, I searched and found

    $html =~ s/<(?:[^>'"]*|(['"]).*?\1)*>//gs;
    $html =~ s/\t//gs;
    $html =~ s/\r//gs;

    This works great, BUT when I open the cleaned file in vi I get a lot of blue ^M signs.
    Also, there are way too many blanks in there. How do I get them out?

    I know this really sounds like a bad newbie question, and of course it is ;-)
    Hopefully it's not too bad.

    Screenshot: http://www.sabineschulte.de/perl.jpg
     
    Reinhard Glauber, Jan 21, 2006
    #1

  2. Xicheng

    Guest

    Reinhard Glauber wrote:
    > Hi Perl-Gurus,
    >
    > I need to clean a HTML file, so that I get plain text.
    > So, now that I know that there is something called perldoc I searched and found
    >
    > $html =~ s/<(?:[^>'"]*|(['"]).*?\1)*>//gs;
    > $html =~ s/\t//gs;
    > This works great,

    add:
    $html =~ s/\s*\cM/\n/g;   # "spaces^M" to "\n"
    $html =~ tr/\n//s;        # squeeze repeated \n; or: $html =~ s/\n+/\n/g;

    or use a command line on the textfile you already got:

    perl -0777pe 's/\s*\cM/\n/g;tr/\n//s' my_file

    Xicheng
    > BUT, when I open the cleaned file in vi I get a lot of blue ^M signs.
    > Also there are way too many blanks in there. How do I get them out?
    > I know this really sounds like a bad newbie question, and of course it is ;-)
    > Hopefully it's not too bad.
    > Screenshot: http://www.sabineschulte.de/perl.jpg
     
    Xicheng, Jan 21, 2006
    #2
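
The cleanup suggested in the reply above could be wrapped up roughly as follows. This is only a sketch, not the poster's code: the helper name `tidy_text` is hypothetical, and it assumes the goal is to normalise line endings, drop trailing blanks, and keep at most one empty line in a row.

```perl
use strict;
use warnings;

# Hypothetical helper: tidy text left over after the tags have been stripped.
sub tidy_text {
    my ($text) = @_;
    $text =~ s/\r\n?/\n/g;      # CRLF or a lone CR (shown as ^M in vi) becomes LF
    $text =~ s/[ \t]+$//mg;     # drop trailing blanks on every line
    $text =~ s/[ \t]{2,}/ /g;   # collapse runs of blanks into a single space
    $text =~ s/\n{3,}/\n\n/g;   # keep at most one empty line in a row
    return $text;
}

my $sample = "Hello   world  \r\n\r\n\r\n\r\nnext\tline\r\n";
print tidy_text($sample);
```

Converting `\r` to `\n` before squeezing, instead of deleting it, also copes with files that use a bare carriage return as the line ending.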

  3. "Reinhard Glauber" <> wrote in
    news:43d1fcbf$0$20788$-online.net:

    > I need to clean a HTML file, so that I get plain text.


    Use a parser to parse HTML, as the answer to the FAQ recommends:

    How do I remove HTML from a string?
    The most correct way (albeit not the fastest) is to use HTML::Parser
    from CPAN.

    See http://search.cpan.org/~gaas/HTML-Parser-3.48/, especially:

    http://search.cpan.org/src/GAAS/HTML-Parser-3.48/eg/htext

    Sinan
    --
    A. Sinan Unur <>
    (reverse each component and remove .invalid for email address)

    comp.lang.perl.misc guidelines on the WWW:
    http://mail.augustmail.com/~tadmc/clpmisc/clpmisc_guidelines.html
     
    A. Sinan Unur, Jan 21, 2006
    #3
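
For readers who want to try the parser route recommended above, a minimal sketch might look like this. It assumes HTML::Parser is installed from CPAN (it is not a core module); the helper name `html_to_text` is hypothetical, and this is not the `htext` example script linked in the post.

```perl
use strict;
use warnings;
use HTML::Parser ();   # CPAN module, not part of core Perl

# Hypothetical helper: collect the text content of an HTML string.
sub html_to_text {
    my ($html) = @_;
    my $text = '';
    my $parser = HTML::Parser->new(
        api_version => 3,
        # 'dtext' hands the handler text with entities already decoded
        text_h => [ sub { $text .= shift }, 'dtext' ],
    );
    $parser->ignore_elements(qw(script style));   # skip non-visible content
    $parser->parse($html);
    $parser->eof;                                 # signal end of input
    return $text;
}

print html_to_text('<p title="a > b">Fish &amp; chips</p>'), "\n";
```

Unlike a regex, the parser copes with a `>` inside a quoted attribute value and decodes entities such as `&amp;`.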
  4. Reinhard Glauber

    Guest

    A. Sinan Unur <> wrote:
    > "Reinhard Glauber" <> wrote in
    > news:43d1fcbf$0$20788$-online.net:


    >> I need to clean a HTML file, so that I get plain text.


    > Use a parser to parse HTML, as the answer to the FAQ recommends:


    > How do I remove HTML from a string?
    > The most correct way (albeit not the fastest) is to use HTML::Parser
    > from CPAN.


    I find the most efficient way to get plain text from an HTML
    file is to use 'lynx -dump'.

    Axel
     
    , Jan 21, 2006
    #4
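
If the `lynx -dump` route is to be driven from a Perl script, one possible sketch is shown below. It assumes the `lynx` binary is installed and on the PATH; the helper names `lynx_dump_command` and `lynx_dump` are hypothetical, and the `-nolist` option is an addition not mentioned in the post.

```perl
use strict;
use warnings;

# Hypothetical helper: argument list for rendering a local HTML file as text.
sub lynx_dump_command {
    my ($file) = @_;
    # -dump writes the rendered page to stdout; -nolist omits the link list
    return ('lynx', '-dump', '-nolist', $file);
}

# Hypothetical helper: run lynx and return its output as one string.
sub lynx_dump {
    my ($file) = @_;
    # The list form of a pipe open bypasses the shell, so no quoting is needed
    open(my $fh, '-|', lynx_dump_command($file))
        or die "cannot run lynx: $!";
    local $/;                  # slurp mode: read everything at once
    my $text = <$fh> // '';
    close($fh) or die "lynx exited with status $?";
    return $text;
}

# Only runs lynx when a file name is given on the command line.
print lynx_dump($ARGV[0]) if @ARGV;
```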
