Cleaning HTML ;-)

Reinhard Glauber · Jan 21, 2006

Hi Perl-Gurus,

I need to clean a HTML file, so that I get plain text.
So, now that I know that there is something called perldoc I searched and found

$html =~ s/<(?:[^>'"]*|(['"]).*?\1)*>//gs;
$html =~ s/\t//gs;
$html =~ s/\r//gs;

This works great, BUT when I open the cleaned file in vi
I get a lot of blue ^M signs.
Also, there are way too many blanks in there.
How do I get them out?

I know this really sounds like a bad newbie question, and
of course it is ;-) Hopefully it's not too bad.

Screenshot: http://www.sabineschulte.de/perl.jpg

Xicheng · Jan 21, 2006

Reinhard said:
Hi Perl-Gurus,

I need to clean a HTML file, so that I get plain text.
So, now that I know that there is something called perldoc I searched and found

$html =~ s/<(?:[^>'"]*|(['"]).*?\1)*>//gs;
$html =~ s/\t//gs;

This works great,

add:
$html =~ s/\s*\cM/\n/g;   # "spaces^M" to "\n"
$html =~ tr/\n//s;        # squeeze repeated \n
# or: $html =~ s/\n+/\n/g;  # squeeze

or use a command line on the textfile you already got:

perl -0777pe 's/\s*\cM/\n/g;tr/\n//s' my_file

Xicheng
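
For readers who prefer a reusable routine to a one-liner, the same kind of cleanup can be sketched as a small sub. This is only an illustration and not code from the thread: the name tidy_text is hypothetical, and the choices to collapse blanks to a single space and to keep at most one empty line are assumptions about what "too many blanks" means here.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): normalize line endings and
# squeeze surplus whitespace in text whose tags have already been removed.
sub tidy_text {
    my ($text) = @_;
    $text =~ s/\r\n?/\n/g;       # CRLF or lone CR (shown as ^M in vi) -> LF
    $text =~ s/[ \t]+$//mg;      # drop trailing blanks on every line
    $text =~ s/[ \t]{2,}/ /g;    # collapse runs of blanks to one space
    $text =~ s/\n{3,}/\n\n/g;    # keep at most one empty line in a row
    return $text;
}

print tidy_text("Hello   world  \r\n\r\n\r\n\r\nBye\t\tnow\r\n");
```

Unlike s/\t//gs, this keeps a single space where a tab separated two words, so neighbouring words do not run together.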

A. Sinan Unur · Jan 21, 2006

I need to clean a HTML file, so that I get plain text.

Use a parser to parse HTML, as the answer to the FAQ recommends:

How do I remove HTML from a string?
The most correct way (albeit not the fastest) is to use HTML::Parser
from CPAN.

See http://search.cpan.org/~gaas/HTML-Parser-3.48/, especially:

http://search.cpan.org/src/GAAS/HTML-Parser-3.48/eg/htext

Sinan
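
A minimal sketch of the parser-based approach, assuming HTML::Parser is installed from CPAN. It is not the htext example linked above. The helper name html_to_text is hypothetical, and skipping script and style content is an assumption about what counts as "plain text".

```perl
use strict;
use warnings;
use HTML::Parser;   # CPAN module, not part of core Perl

# Hypothetical helper (not from the thread): return the text content of an
# HTML string, skipping everything inside <script> and <style> elements.
sub html_to_text {
    my ($html) = @_;
    my $text = '';
    my $parser = HTML::Parser->new(
        api_version => 3,
        # 'dtext' hands over text with entities such as &amp; decoded
        text_h => [ sub { $text .= shift }, 'dtext' ],
    );
    $parser->ignore_elements(qw(script style));
    $parser->parse($html);
    $parser->eof;    # signal end of input so buffered text is flushed
    return $text;
}

print html_to_text('<p>Fish &amp; <b>chips</b></p><script>var x = 1;</script>'), "\n";
```

Because the parser tracks tags and quoting itself, this avoids the edge cases a single regex has to guess at, such as a ">" inside an attribute value or a comment.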

axel · Jan 21, 2006

Use a parser to parse HTML, as the answer to the FAQ recommends:

How do I remove HTML from a string?
The most correct way (albeit not the fastest) is to use HTML::Parser
from CPAN.

I find the most efficient way to get plain text from an HTML
file is to use 'lynx -dump'.

Axel
