Convert some files from html to plaintext

Discussion in 'Perl Misc' started by Luca Villa, Nov 11, 2007.

  1. Luca Villa

    Luca Villa Guest

    I have many html files named like these:

    c:\dir\femo-black.html
    c:\dir\loren-white.html
    c:\dir\spark-white.html
    c:\dir\kim-black.html
    c:\dir\paul-white.html

    How can I convert only the files named "c:\dir\*-white.html" to
    plaintext files named c:\dir\(original filename)-text.txt?

    BTW do you know a better Perl module than HTML::FormatText (
    http://search.cpan.org/~sburke/HTML-Format-2.04/lib/HTML/FormatText.pm)
    to convert HTML to plaintext?
     
    Luca Villa, Nov 11, 2007
    #1

  2. Luca Villa wrote:
    > I have many html files named like these:
    >
    > c:\dir\femo-black.html
    > c:\dir\loren-white.html
    > c:\dir\spark-white.html
    > c:\dir\kim-black.html
    > c:\dir\paul-white.html
    >
    > How can I convert only the files named "c:\dir\*-white.html"


    perldoc -f glob
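A minimal sketch of the glob approach. It assumes forward slashes in the path and assumes that "(original filename)" means the name without its .html extension; the helper names text_name_for and white_html_files are made up for illustration:

```perl
use strict;
use warnings;
use File::Basename qw(fileparse);

# Map "c:/dir/loren-white.html" to "c:/dir/loren-white-text.txt".
# Assumption: the .html extension is dropped before "-text.txt" is appended.
sub text_name_for {
    my ($html_path) = @_;
    my ($name, $dir) = fileparse($html_path, qr/\.html\z/i);
    return $dir . $name . '-text.txt';
}

# Return the files in $dir whose names end in "-white.html".
# Note: glob splits its pattern on whitespace, so a directory name that
# contains spaces would need File::Glob's bsd_glob instead.
sub white_html_files {
    my ($dir) = @_;
    my @files = glob("$dir/*-white.html");
    return @files;
}

# Example use (commented out so nothing runs on load):
# print "$_ -> ", text_name_for($_), "\n" for white_html_files('c:/dir');
```

Forward slashes work in Windows paths from Perl and avoid having to escape backslashes inside the glob pattern.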

    > to plaintext files


    Many ways, depending on what you consider the plaintext equivalent of an
    HTML file. After all, HTML contains more information than plaintext and
    therefore a lossless conversion is not possible. One way would be to use
    lynx with the text-output option.
    Another way is described in the Perl FAQ: "perldoc -q HTML"
    "How do I remove HTML from a string?"

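One possible reading of the FAQ route, sketched with the HTML::FormatText module mentioned in the question together with HTML::TreeBuilder (both CPAN modules, assumed to be installed). The helper name html_to_text and the 72-column right margin are illustrative choices. HTML::FormatText's documentation says that tables and forms are not formatted, so this is text extraction rather than browser-quality rendering:

```perl
use strict;
use warnings;

# Render an HTML string as plain text with HTML::FormatText.
# Both modules are loaded on first call, so merely defining this
# sub does not require them to be installed.
sub html_to_text {
    my ($html) = @_;
    require HTML::TreeBuilder;
    require HTML::FormatText;

    my $tree      = HTML::TreeBuilder->new_from_content($html);
    my $formatter = HTML::FormatText->new(leftmargin => 0, rightmargin => 72);
    my $text      = $formatter->format($tree);
    $tree->delete;    # free the parse tree explicitly
    return $text;
}

# Example use (commented out so nothing runs on load):
# print html_to_text('<p>Hello <b>world</b></p>');
```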
    > named c:\dir\(original filename)-text.txt?


    That depends on how you generate the target text, e.g. by redirecting the
    output of lynx to that file, or by writing to that file, or ...
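A hedged sketch of the lynx variant, capturing the output of "lynx -dump" and writing it to the target file. It assumes lynx is on the PATH; the helper names lynx_command and dump_with_lynx are hypothetical, and the list form of pipe open needs Perl 5.22 or later on Windows:

```perl
use strict;
use warnings;

# Build the argument list for one lynx run. Passing a list to open
# bypasses the shell, so file names need no extra quoting.
sub lynx_command {
    my ($html_path) = @_;
    return ('lynx', '-dump', $html_path);
}

# Run lynx on $html_path and write its rendered text to $text_path.
sub dump_with_lynx {
    my ($html_path, $text_path) = @_;
    my @cmd = lynx_command($html_path);

    open(my $in, '-|', @cmd)
        or die "cannot start lynx: $!";
    open(my $out, '>', $text_path)
        or die "cannot write $text_path: $!";

    print {$out} $_ while <$in>;

    close($out) or die "cannot close $text_path: $!";
    close($in)  or die "lynx failed on $html_path (exit status $?)";
    return 1;
}
```

This starts one lynx process per file, so it would be called once for each matching file inside a loop.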

    jue
     
    Jürgen Exner, Nov 11, 2007
    #2

  3. Luca Villa

    Luca Villa Guest

    > e.g. by redirecting the
    > output of lynx to that file or by writing to that file or ...


    Isn't there an equivalent of the Lynx rendering engine for Perl?
    I know that "Lynx -dump" does a good conversion but I fear that
    calling an external program thousands of times is a waste of
    resources...
     
    Luca Villa, Nov 11, 2007
    #3
  4. Luca Villa wrote:
    >> e.g. by redirecting the
    >> output of lynx to that file or by writing to that file or ...

    >
    > Isn't there an equivalent of the Lynx rendering engine for Perl?


    Why would Perl do HTML rendering? Anyway, which part of "perldoc -q HTML"
    ("How do I remove HTML from a string?") don't you understand?

    jue
     
    Jürgen Exner, Nov 11, 2007
    #4
  5. Luca Villa

    Luca Villa Guest

    Jürgen Exner

    the problem is that converting html to a good equivalent in plain text
    is not a simple operation of "removing HTML from a string".

    Think, for example, of an HTML table with columns of different widths,
    etc.
    Text browsers like Lynx, Links, ELinks and w3m do a good job of
    presenting HTML tables in plain text. I'm looking for something of
    this quality...
     
    Luca Villa, Nov 11, 2007
    #5
  6. On 2007-11-11, Luca Villa wrote:
    >> e.g. by redirecting the
    >> output of lynx to that file or by writing to that file or ...

    >
    > Isn't there an equivalent of the Lynx rendering engine for Perl?
    > I know that "Lynx -dump" does a good conversion but I fear that
    > calling an external program thousands of times is a waste of
    > resources...


    But not a waste of your time - use "lynx -dump" and develop
    your laziness more fully.

    --
    Elvis Notargiacomo master AT barefaced DOT cheek
    http://www.notatla.org.uk/goen/
     
    all mail refused, Nov 12, 2007
    #6