Re: HTML to Text renderer

Discussion in 'Python' started by Ian Bicking, Nov 3, 2004.

  1. Ian Bicking

    Ian Bicking Guest

    Robert Brewer wrote:
    > Ian Bicking wrote:
    >
    >>Does anyone know of a module that can render HTML to text? Just a
    >>subset of HTML, really; I'd like to compose emails using <p> tags and
    >>whatnot, fill in all the values in the email template, then apply word
    >>wrapping and other formatting. Also, it'll make using Zope Page
    >>Templates with email easier.
    >>
    >>Even if all it supports is <p> and <br> that would be enough, but I'm
    >>hoping there's something even more complete out there. I don't need
    >>something as general as, say, Lynx; these templates would be written
    >>with a specific renderer in mind.

    >
    >
    > To clarify: you don't want the HTML tags merely stripped; you want to
    > replace e.g. br with a line break and p with, say, two line breaks?


    Right. And word wrapping too. Some other tags would also be
    interesting: <blockquote>, <pre>, <hr>, <table>, &nbsp;, and something
    to control alignment (e.g., <p align="">).

    --
    Ian Bicking / / http://blog.ianbicking.org
     
    Ian Bicking, Nov 3, 2004
    #1

  2. Roger Binns

    Roger Binns Guest

    Ian Bicking wrote:
    > Right. And word wrapping too. Some other tags would also be
    > interesting: <blockquote>, <pre>, <hr>, <table>, &nbsp;, and something
    > to control alignment (e.g., <p align="">).


    Usually I resort to one of the text-based browsers (e.g. lynx, links, or
    w3m), which all have a mode that dumps plain text formatted that way.

    Roger
     
    Roger Binns, Nov 3, 2004
    #2
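
A minimal sketch of the text-browser approach Roger describes, assuming Python 3.7+ and that `lynx` or `w3m` is installed and on the PATH. The names `dump_commands` and `html_to_text_via_browser` are hypothetical, not from the thread; `links` is left out because this sketch only covers browsers that are fed HTML on standard input.

```python
import shutil
import subprocess


def dump_commands(width):
    """Commands that read HTML on stdin and print rendered text on stdout."""
    return [
        ["lynx", "-dump", "-stdin", "-force_html", "-nolist",
         "-width=%d" % width],
        ["w3m", "-dump", "-T", "text/html", "-cols", str(width)],
    ]


def html_to_text_via_browser(html, width=72):
    """Return the rendered text, or None if no listed browser works."""
    for cmd in dump_commands(width):
        if shutil.which(cmd[0]) is None:
            continue  # browser not installed, try the next one
        try:
            result = subprocess.run(
                cmd, input=html, capture_output=True, text=True, timeout=30
            )
        except (OSError, subprocess.TimeoutExpired):
            continue
        if result.returncode == 0:
            return result.stdout
    return None
```

The exact layout (indentation, blank lines, table rendering) depends on which browser ends up being used, so a template written for one may look different under the other.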

  3. Marc Christiansen

    Ian Bicking wrote:
    > Robert Brewer wrote:
    >> To clarify: you don't want the HTML tags merely stripped; you want to
    >> replace e.g. br with a line break and p with, say, two line breaks?

    >
    > Right. And word wrapping too. Some other tags would also be
    > interesting: <blockquote>, <pre>, <hr>, <table>, &nbsp;, and something
    > to control alignment (e.g., <p align="">).


    Have a look at htmllib.HTMLParser and formatter in the standard Python
    lib (but also look at the source of htmllib). Maybe they provide what
    you need.

    HTH
    Marc
     
    Marc Christiansen, Nov 9, 2004
    #3
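
For readers on Python 3, where `htmllib` is not available and the `formatter` module has since been removed from the standard library as well, the parser-plus-formatting idea can be approximated with `html.parser` and `textwrap`. The sketch below is an assumption about what a minimal renderer could look like, not code from the thread: `ParagraphRenderer` and `html_to_text` are hypothetical names, and only `<p>` (paragraph break) and `<br>` (line break) are interpreted, with word wrapping applied to each line.

```python
import textwrap
from html.parser import HTMLParser


class ParagraphRenderer(HTMLParser):
    """Render <p> and <br> as wrapped plain text; other tags are ignored."""

    def __init__(self, width=72):
        super().__init__(convert_charrefs=True)  # decode &amp; and friends
        self.width = width
        self.paragraphs = []  # finished, already wrapped paragraphs
        self.lines = [""]     # raw text of each <br>-separated line so far

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._end_paragraph()
        elif tag == "br":
            self.lines.append("")  # hard line break inside the paragraph

    def handle_endtag(self, tag):
        if tag == "p":
            self._end_paragraph()

    def handle_data(self, data):
        self.lines[-1] += data

    def _end_paragraph(self):
        # Collapse whitespace, then wrap each hard line separately.
        wrapped = [
            textwrap.fill(" ".join(line.split()), self.width)
            for line in self.lines
            if line.strip()
        ]
        if wrapped:
            self.paragraphs.append("\n".join(wrapped))
        self.lines = [""]

    def render(self):
        self._end_paragraph()
        return "\n\n".join(self.paragraphs)


def html_to_text(html, width=72):
    parser = ParagraphRenderer(width)
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return parser.render()
```

Because whitespace is collapsed with `str.split()`, a non-breaking space is treated like any other space here; `<pre>`, `<blockquote>`, `<hr>`, tables and alignment would each need their own handling.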
  4. Ivo Woltring

    Ivo Woltring Guest

    Marc Christiansen wrote:
    > Ian Bicking wrote:
    > > Robert Brewer wrote:
    > >> To clarify: you don't want the HTML tags merely stripped; you want to
    > >> replace e.g. br with a line break and p with, say, two line breaks?

    > >
    > > Right. And word wrapping too. Some other tags would also be
    > > interesting: <blockquote>, <pre>, <hr>, <table>, &nbsp;, and something
    > > to control alignment (e.g., <p align="">).

    >
    > Have a look at htmllib.HTMLParser and formatter in the standard Python
    > lib (but also look at the source of htmllib). Maybe they provide what
    > you need.
    >
    > HTH
    > Marc


    Look at this code:

    ===CUT BELOW===
    # Python 2 code: sgmllib and the print statement do not exist in Python 3.
    from sgmllib import SGMLParser

    class html2txt(SGMLParser):
        """html2txt() --> collect the text content of an HTML document"""

        def reset(self):
            """reset() --> initialize the parser"""
            SGMLParser.reset(self)
            self.pieces = []

        def handle_data(self, text):
            """handle_data(text) --> appends the pieces to self.pieces
            handles all normal data not between brackets "<>"
            """
            self.pieces.append(text)

        def handle_entityref(self, ref):
            """called for each entity reference, e.g. for "&copy;", ref will be
            "copy"
            Only "&amp;" is translated (to "&"); other entities are dropped.
            """
            if ref == 'amp':
                self.pieces.append("&")

        def output(self):
            """Return processed HTML as a single string"""
            return " ".join(self.pieces)

    if __name__ == "__main__":
        html = """<h1>just a piece of html</h1>
    <div class="toc">
    <ul>
    <li><span class="section"><a
    href="index.html#install.choosing">1.1. Which Python is right for
    you?</a></span></li>
    <li><span class="section"><a href="windows.html">1.2. Python
    on Windows</a></span></li>
    <li><span class="section"><a href="macosx.html">1.3. Python
    on Mac OS X</a></span></li>
    <li><span class="section"><a href="macos9.html">1.4. Python
    on Mac OS 9</a></span></li>
    <li><span class="section"><a href="redhat.html">1.5. Python
    on RedHat Linux</a></span></li>
    <li><span class="section"><a href="debian.html">1.6. Python
    on Debian GNU/Linux</a></span></li>
    <li><span class="section"><a href="source.html">1.7. Python
    Installation from Source</a></span></li>
    <li><span class="section"><a href="shell.html">1.8. The
    Interactive Shell</a></span></li>
    <li><span class="section"><a href="summary.html">1.9.
    Summary</a></span></li>
    </ul>
    </div>
    """
        parser = html2txt()
        parser.reset()
        parser.feed(html)
        parser.close()
        print parser.output()
    === END CUT ===

    The html2txt class is of course extendable and changeable. For me it was
    important to convert HTML to text, but the behavior of the class can be
    adjusted so that tags do other things. Hope it helps.

    Ivo.
     
    Ivo Woltring, Nov 9, 2004
    #4
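
Since `sgmllib` was removed in Python 3, the following is a rough sketch of how the same idea might look on top of `html.parser`, together with one possible extension of the kind Ivo mentions. `Html2Txt` mirrors the posted class; `ListAwareHtml2Txt`, its `BLOCK_TAGS` set, and the `* ` bullet convention are hypothetical choices made for illustration, not part of the original post.

```python
import re
from html.parser import HTMLParser


class Html2Txt(HTMLParser):
    """Python 3 counterpart of the html2txt class: keep only the text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)  # decodes &amp;, &copy;, ...
        self.pieces = []

    def handle_data(self, text):
        self.pieces.append(text)

    def output(self):
        """Return the collected text as a single string."""
        return " ".join(self.pieces)


class ListAwareHtml2Txt(Html2Txt):
    """Hypothetical extension: block tags break lines, <li> gets a bullet."""

    BLOCK_TAGS = {"p", "div", "ul", "ol", "h1", "h2", "h3", "br"}

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self.pieces.append("\n* ")
        elif tag in self.BLOCK_TAGS:
            self.pieces.append("\n")

    def handle_endtag(self, tag):
        if tag in self.BLOCK_TAGS:
            self.pieces.append("\n")

    def handle_data(self, text):
        # Newlines in the HTML source are plain whitespace; only tags
        # are allowed to start a new output line.
        self.pieces.append(re.sub(r"\s+", " ", text))

    def output(self):
        lines = "".join(self.pieces).split("\n")
        cleaned = (" ".join(line.split()) for line in lines)
        return "\n".join(line for line in cleaned if line)
```

Usage follows the original: create the parser, call `feed(html)` and `close()`, then read `output()`. With `convert_charrefs=True` the parser decodes entity references itself, so no `handle_entityref` override is needed.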
