Cleaning HTML ;-)

Reinhard Glauber

Hi Perl-Gurus,

I need to clean an HTML file, so that I get plain text.
So, now that I know that there is something called perldoc, I searched and found:

$html =~ s/<(?:[^>'"]*|(['"]).*?\1)*>//gs;
$html =~ s/\t//gs;
$html =~ s/\r//gs;

This works great, BUT when I open the cleaned file in vi I get a lot of blue ^M signs. Also, there are way too many blanks in there. How do I get them out?

I know this really sounds like a bad newbie question, and of course it is ;-) Hopefully it's not too bad.

Screenshot: http://www.sabineschulte.de/perl.jpg
 
Xicheng

Reinhard said:
Hi Perl-Gurus,

I need to clean an HTML file, so that I get plain text.
So, now that I know that there is something called perldoc, I searched and found:

$html =~ s/<(?:[^>'"]*|(['"]).*?\1)*>//gs;
$html =~ s/\t//gs;

This works great,

Add:

$html =~ s/\s*\cM/\n/g;   # "spaces^M" to "\n"
$html =~ tr/\n//s;        # squeeze repeated \n; same as $html =~ s/\n+/\n/g;

or use a command line on the textfile you already got:

perl -0777pe 's/\s*\cM/\n/g;tr/\n//s' my_file
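
For anyone who wants the same cleanup inside a script rather than as a one-liner, here is a minimal sketch. The sub name clean_text and the sample string are made up for illustration, and the two steps that collapse runs of blanks are my own assumption about what "too many blanks" means; they are not part of the substitutions above.

```perl
use strict;
use warnings;

# Hypothetical helper: tidy the text left over after the tags are stripped.
sub clean_text {
    my ($text) = @_;
    $text =~ s/\s*\cM/\n/g;    # trailing whitespace plus ^M (carriage return) -> "\n"
    $text =~ s/[ \t]+/ /g;     # assumption: collapse runs of blanks/tabs to one space
    $text =~ s/^ +| +$//mg;    # drop leading and trailing spaces on every line
    $text =~ tr/\n//s;         # squeeze repeated newlines into a single one
    return $text;
}

# Made-up sample with DOS line endings, empty lines and stray blanks.
my $sample = "Hello   world  \r\n\r\n\r\n  second\tline\r\n";
print clean_text($sample);
```

Run on the sample, this prints two lines, "Hello world" and "second line". Note that the tab is turned into a space here instead of being deleted, so neighbouring words do not run together.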

Xicheng
 
axel

Use a parser to parse HTML, as the answer to the FAQ recommends:
How do I remove HTML from a string?
The most correct way (albeit not the fastest) is to use HTML::Parser
from CPAN.

I find the most efficient way to get plain text from an HTML
file is to use 'lynx -dump'.
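
A minimal sketch of the HTML::Parser route, assuming the module is installed from CPAN (it is not part of core Perl). The sub name html_to_text and the sample markup are only for illustration; skipping script and style elements is my own choice, not something the FAQ prescribes.

```perl
use strict;
use warnings;
use HTML::Parser ();    # from CPAN, not part of core Perl

# Hypothetical helper: return only the text content of an HTML string.
sub html_to_text {
    my ($html) = @_;
    my $text = '';
    my $parser = HTML::Parser->new(
        api_version => 3,
        # 'dtext' passes the handler text with entities like &amp; decoded
        text_h => [ sub { $text .= shift }, 'dtext' ],
    );
    $parser->ignore_elements(qw(script style));    # skip JavaScript and CSS
    $parser->parse($html);
    $parser->eof;                                  # flush any buffered text
    return $text;
}

# Made-up sample: a '>' inside a quoted attribute, an entity, nested tags.
print html_to_text('<p title="a > b">Fish &amp; <b>chips</b></p>'), "\n";
```

The parser copes with a '>' inside quoted attributes and decodes entities along the way, which a tag-stripping regex does not do. Line-ending and whitespace cleanup still has to be done on the returned text.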

Axel
 
