Convert some files from html to plaintext

Luca Villa

I have many html files named like these:

c:\dir\femo-black.html
c:\dir\loren-white.html
c:\dir\spark-white.html
c:\dir\kim-black.html
c:\dir\paul-white.html

How can I convert only the files named "c:\dir\*-white.html" to
plaintext files named c:\dir\(original filename)-text.txt?

BTW do you know a better Perl module than HTML::FormatText (
http://search.cpan.org/~sburke/HTML-Format-2.04/lib/HTML/FormatText.pm)
to convert HTML to plaintext?
 
Jürgen Exner

Luca said:
> I have many html files named like these:
>
> c:\dir\femo-black.html
> c:\dir\loren-white.html
> c:\dir\spark-white.html
> c:\dir\kim-black.html
> c:\dir\paul-white.html
>
> How can I convert only the files named "c:\dir\*-white.html"

perldoc -f glob

> to plaintext files

Many ways, depending on what you consider the plaintext equivalent of an
HTML file. After all, HTML contains more information than plaintext and
therefore a lossless conversion is not possible. One way would be to use
lynx with the text-output option.
Another way is described in the Perl FAQ: "perldoc -q HTML",
"How do I remove HTML from a string?"

> named c:\dir\(original filename)-text.txt?

That depends on how you generate the target text, e.g. by redirecting the
output of lynx to that file, or by writing to that file, or ...

jue
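
A minimal sketch of the glob-plus-redirect approach described above, assuming lynx is on the PATH and that "(original filename)" means the name without its .html extension. The text_name helper is hypothetical, introduced only for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: map "loren-white.html" to "loren-white-text.txt".
# Assumption: the ".html" suffix is dropped before "-text.txt" is appended.
sub text_name {
    my ($html_path) = @_;
    (my $txt_path = $html_path) =~ s/\.html\z/-text.txt/i;
    return $txt_path;
}

# Forward slashes are accepted in Windows paths and sidestep
# backslash quoting issues in the glob pattern.
for my $html (glob 'c:/dir/*-white.html') {
    my $txt = text_name($html);

    # Let the shell redirect lynx's standard output into the target file.
    system(qq{lynx -dump "$html" > "$txt"}) == 0
        or warn "lynx failed for $html (exit status $?)\n";
}
```

If no file matches the pattern, the loop body simply never runs. File names containing double quotes would break the shell command, so this sketch assumes plain names like the ones listed in the question.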
 
Luca Villa

> e.g. by redirecting the
> output of lynx to that file or by writing to that file or ...

Isn't there an equivalent of the Lynx rendering engine for Perl?
I know that "lynx -dump" does a good conversion, but I fear that
calling an external program thousands of times is a waste of
resources...
 
Jürgen Exner

Luca said:
> Isn't there an equivalent of the Lynx rendering engine for Perl?

Why would Perl do HTML rendering? Anyway, which part of "perldoc -q HTML"

    "How do I remove HTML from a string?"

don't you understand?

jue
 
Luca Villa

Jürgen Exner

The problem is that converting HTML to a good plain-text equivalent
is not a simple matter of "removing HTML from a string".

Think, for example, of an HTML table with columns of different widths,
etc.
Text browsers like Lynx, Links, ELinks and w3m do a good job of
presenting HTML tables in plain text. I'm looking for something of
this quality...
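
For comparison, here is a sketch of the in-process route using HTML::FormatText, the module mentioned at the start of the thread, together with HTML::TreeBuilder (both from CPAN, so assumed to be installed). The HTML::FormatText documentation states that formatting of tables and forms is not implemented, so this illustrates the limitation described above rather than solving it. The html_file_to_text helper and the 72-column right margin are assumptions made for illustration; character encoding is not handled.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;
use HTML::FormatText;

# Hypothetical helper: render one HTML file to plain text without
# starting an external program.
sub html_file_to_text {
    my ($html_path) = @_;

    my $tree = HTML::TreeBuilder->new;
    $tree->parse_file($html_path)
        or die "Cannot parse $html_path: $!";

    # Margins are arbitrary choices for this sketch.
    my $formatter = HTML::FormatText->new(leftmargin => 0, rightmargin => 72);
    my $text = $formatter->format($tree);

    $tree->delete;    # release the parse tree explicitly
    return $text;
}

for my $html (glob 'c:/dir/*-white.html') {
    # Assumption: drop ".html" and append "-text.txt".
    (my $txt = $html) =~ s/\.html\z/-text.txt/i;

    open my $out, '>', $txt or die "Cannot write $txt: $!";
    print {$out} html_file_to_text($html);
    close $out or die "Cannot close $txt: $!";
}
```

This keeps everything in one Perl process, but a table comes out as its cell text run together, not as aligned columns.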
 
all mail refused

> Isn't there an equivalent of the Lynx rendering engine for Perl?
> I know that "Lynx -dump" does a good conversion but I fear that
> calling an external program thousands of times is a waste of
> resources...

But not a waste of your time. Use "lynx -dump" and develop
your laziness more fully.
 
