text representation of HTML

Discussion in 'Python' started by Ksenia Marasanova, Jul 19, 2006.

  1. Hi,

    I am looking for a library that will give me a very simple text
    representation of HTML.
    For example,
    <div><h1>Title</h1><p>This is a <br />test</p></div>

    will be transformed to:

    Title

    This is a
    test


    I want to send a plain-text alternative with an HTML email, and would
    prefer to generate it automatically from the HTML source.
    Any hints?

    Thanks!
    Ksenia.
    Ksenia Marasanova, Jul 19, 2006
    #1

  2. Ksenia Marasanova wrote:

    > Hi,
    >
    > I am looking for a library that will give me very simple text
    > representation of HTML.
    > For example
    > <div><h1>Title</h1><p>This is a <br />test</p></div>
    >
    > will be transformed to:
    >
    > Title
    >
    > This is a
    > test
    >
    >
    > i want to send plain text alternative of html email, and would prefer
    > to do it automatically from HTML source.
    > Any hints?


    html2text is a command-line tool. You can invoke it from Python using
    the subprocess module.

    Diez
    Diez B. Roggisch, Jul 19, 2006
    #2
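
For reference, the subprocess approach suggested above might look roughly like this in current Python 3. This is only a sketch: it assumes an `html2text` executable is on the PATH and that it reads HTML from standard input and writes text to standard output, which should be checked against whichever tool is installed, since several unrelated programs share that name. The helper name `run_filter` is made up for illustration.

```python
import subprocess


def run_filter(command, html):
    """Pipe `html` through an external command and return what it prints.

    `command` is an argument list such as ["html2text"]. That the tool reads
    stdin and writes stdout is an assumption of this sketch.
    """
    completed = subprocess.run(
        command,
        input=html,           # sent to the child's standard input
        capture_output=True,  # collect stdout and stderr
        text=True,            # work with str instead of bytes
        check=True,           # raise CalledProcessError on a non-zero exit
    )
    return completed.stdout
```

A call would then be something like `run_filter(["html2text"], page)`; passing the command as a list avoids going through a shell.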

  3. Hi,

    I guess stripogram would be more Pythonic:
    http://sourceforge.net/project/showfiles.php?group_id=1083

    Regards,

    Laurent

    Laurent Rahuel, Jul 19, 2006
    #3
  4. Ksenia Marasanova wrote:
    > Hi,
    >
    > I am looking for a library that will give me very simple text
    > representation of HTML.
    > For example
    > <div><h1>Title</h1><p>This is a <br />test</p></div>
    >
    > will be transformed to:
    >
    > Title
    >
    > This is a
    > test
    >
    >
    > i want to send plain text alternative of html email, and would prefer
    > to do it automatically from HTML source.


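    Something like this:

    import re
    text = '<div><h1>Title</h1><p>This is a <br />test</p></div>'
    # collapse runs of whitespace into a single space
    text = re.sub(r'\s+', ' ', text)
    # start a new line at <p>, <br>, <br /> and <h1>..<h6> (attributes allowed)
    text = re.sub(r'(?i)<(p|br|h[1-6])\b[^>]*>', '\n', text)
    # drop all remaining tags
    result = re.sub(r'<[^>]+>', '', text)
    print(result)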

    --
    -----------------------------------------------------------
    | Radovan Garabík http://kassiopeia.juls.savba.sk/~garabik/ |
    | __..--^^^--..__ garabik @ kassiopeia.juls.savba.sk |
    -----------------------------------------------------------
    Antivirus alert: file .signature infected by signature virus.
    Hi! I'm a signature virus! Copy me into your signature file to help me spread!
    Radovan Garabík, Jul 20, 2006
    #4
  5. Ksenia Marasanova wrote:

    > I am looking for a library that will give me very simple text
    > representation of HTML.
    > For example
    > <div><h1>Title</h1><p>This is a <br />test</p></div>
    >
    > will be transformed to:
    >
    > Title
    >
    > This is a
    > test
    >
    >
    > i want to send plain text alternative of html email, and would prefer
    > to do it automatically from HTML source.
    > Any hints?


    Use htmllib:

    >>> import htmllib, formatter, StringIO
    >>> def cleanup(s):
    ...     out = StringIO.StringIO()
    ...     # DumbWriter emits plain text; AbstractFormatter drives it
    ...     p = htmllib.HTMLParser(
    ...         formatter.AbstractFormatter(formatter.DumbWriter(out)))
    ...     p.feed(s)
    ...     p.close()
    ...     # append the collected link targets as numbered footnotes
    ...     if p.anchorlist:
    ...         print >>out
    ...         for idx, anchor in enumerate(p.anchorlist):
    ...             print >>out, "\n[%d]: %s" % (idx+1, anchor)
    ...     return out.getvalue()
    ...
    >>> print cleanup('''<div><h1>Title</h1><p>This is a <br
    ... />test</p></div>''')

    Title

    This is a
    test
    >>> print cleanup('''<div><h1>Title</h1><p>This is a <br />test with <a
    ... href="http://python.org">a link</a> to the Python homepage</p></div>''')

    Title

    This is a
    test with a link[1] to the Python homepage

    [1]: http://python.org
    Duncan Booth, Jul 20, 2006
    #5
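
The htmllib and StringIO modules used above exist only in Python 2. A rough Python 3 counterpart can be sketched with the standard library's html.parser. This is not a port of the code above, just an approximation of the same idea: block tags become paragraph breaks, `<br>` becomes a line break, and link targets are collected as numbered footnotes. The names `TextExtractor` and `BLOCK_TAGS`, and the choice of which tags count as blocks, are assumptions made for this sketch.

```python
import re
from html.parser import HTMLParser

# Tags treated as paragraph boundaries (a choice made for this sketch).
BLOCK_TAGS = {"p", "div", "h1", "h2", "h3", "h4", "h5", "h6", "li", "tr"}


class TextExtractor(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []            # output fragments, joined at the end
        self.anchors = []          # href values in order of appearance
        self._pending_href = None  # href of the <a> currently open, if any

    def handle_starttag(self, tag, attrs):
        if tag == "br":
            self.parts.append("\n")
        elif tag in BLOCK_TAGS:
            self.parts.append("\n\n")
        elif tag == "a":
            self._pending_href = dict(attrs).get("href")

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.parts.append("\n\n")
        elif tag == "a" and self._pending_href:
            # mark the link text with a footnote number
            self.anchors.append(self._pending_href)
            self.parts.append("[%d]" % len(self.anchors))
            self._pending_href = None

    def handle_data(self, data):
        # collapse whitespace inside text nodes
        self.parts.append(re.sub(r"\s+", " ", data))


def cleanup(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = "".join(parser.parts)
    text = re.sub(r" *\n *", "\n", text)            # trim spaces around breaks
    text = re.sub(r"\n{3,}", "\n\n", text).strip()  # at most one blank line
    if parser.anchors:
        notes = ["[%d]: %s" % (i, url) for i, url in enumerate(parser.anchors, 1)]
        text += "\n\n" + "\n".join(notes)
    return text
```

Like the original, this does nothing special for `<script>` or `<style>` content, and it makes no attempt to wrap long lines.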
  6. On 20 Jul 2006 15:12:27 GMT, Duncan Booth wrote:
    > Ksenia Marasanova wrote:
    > > i want to send plain text alternative of html email, and would prefer
    > > to do it automatically from HTML source.
    > > Any hints?
    >
    > Use htmllib:
    >
    > >>> import htmllib, formatter, StringIO
    > >>> def cleanup(s):
    > ...     out = StringIO.StringIO()
    > ...     p = htmllib.HTMLParser(
    > ...         formatter.AbstractFormatter(formatter.DumbWriter(out)))
    > ...     p.feed(s)
    > ...     p.close()
    > ...     if p.anchorlist:
    > ...         print >>out
    > ...         for idx, anchor in enumerate(p.anchorlist):
    > ...             print >>out, "\n[%d]: %s" % (idx+1, anchor)
    > ...     return out.getvalue()
    > ...
    > >>> print cleanup('''<div><h1>Title</h1><p>This is a <br
    > ... />test</p></div>''')
    >
    > Title
    >
    > This is a
    > test
    > >>> print cleanup('''<div><h1>Title</h1><p>This is a <br />test with <a
    > ... href="http://python.org">a link</a> to the Python homepage</p></div>''')
    >
    > Title
    >
    > This is a
    > test with a link[1] to the Python homepage
    >
    > [1]: http://python.org


    cleanup() doesn't handle scripts and styles too well. html2text does
    a much better job with these and gives more structured output
    (compatible with Markdown):

    http://www.aaronsw.com/2002/html2text/

    >>> import html2text
    >>> print html2text.html2text('''<div><h1>Title</h1><p>This is a <br
    ... />test with <a href="http://python.org">a link</a> to the Python
    ... homepage</p></div>''')
    # Title

    This is a
    test with [a link][1] to the Python homepage

    [1]: http://python.org


    HTH :)
    Tim Williams, Jul 20, 2006
    #6
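
The point about scripts and styles can be illustrated without any third-party package. The sketch below uses Python 3's html.parser and simply ignores text that sits inside `<script>` or `<style>` elements. It is not how html2text works internally; the names `VisibleText` and `visible_text` are invented for this example, and all layout is flattened to single spaces to keep it short.

```python
from html.parser import HTMLParser

# Elements whose text content should never reach the output.
SKIPPED = {"script", "style"}


class VisibleText(HTMLParser):
    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0  # greater than 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # keep text only when no <script> or <style> is open
        if not self._skip_depth:
            self.chunks.append(data)


def visible_text(html):
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    # join the fragments and normalise all whitespace to single spaces
    return " ".join("".join(parser.chunks).split())
```

The same skip counter could be added to any parser-based converter so that JavaScript and CSS never leak into the plain-text part of a message.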
  7. Sorry for the late reply... better late than never :)
    Thanks to all for the tips. Stripogram is the winner, since it is the
    most configurable and accepts a line-length parameter, which is handy
    for email...

    Ksenia.

    Ksenia Marasanova, Sep 21, 2006
    #7
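
For the email use case that started this thread, here is a small sketch of the last step: wrapping the converted text to a fixed line length and attaching it as the plain-text alternative of an HTML message. It uses only the Python 3 standard library (textwrap and email.message) rather than stripogram, whose API is not shown in this thread. The helper names and the default width of 72 columns are assumptions.

```python
import textwrap
from email.message import EmailMessage


def wrap_paragraphs(text, width=72):
    """Wrap each blank-line-separated paragraph to `width` columns.

    Note: line breaks inside a paragraph are re-flowed, not preserved.
    """
    paragraphs = text.split("\n\n")
    return "\n\n".join(textwrap.fill(p, width=width) for p in paragraphs)


def build_message(subject, plain_text, html, width=72):
    """Return a multipart/alternative message with text and HTML parts."""
    msg = EmailMessage()
    msg["Subject"] = subject
    msg.set_content(wrap_paragraphs(plain_text, width))  # text/plain part
    msg.add_alternative(html, subtype="html")            # text/html part
    return msg
```

Here `plain_text` would be the output of whichever converter is chosen. Mail clients that cannot render HTML then fall back to the wrapped text part.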
