How to convert markup text to plain text in Python?

Discussion in 'Python' started by geoffbache, Feb 1, 2008.

  1. geoffbache

    geoffbache Guest

    I have some marked up text and would like to convert it to plain text,
    by simply removing all the tags. Of course I can do it from first
    principles but I felt that among all Python's markup tools there must
    be something that would do this simply, without having to create an
    XML parser etc.

    I've looked around a bit but failed to find anything, any tips?

    (e.g. convert "<B>Today</B> is <U>Friday</U>" to "Today is Friday")

    Regards,
    Geoff
    geoffbache, Feb 1, 2008
    #1

  2. Tim Chase

    Tim Chase Guest

    > I have some marked up text and would like to convert it to plain text,
    > by simply removing all the tags. Of course I can do it from first
    > principles but I felt that among all Python's markup tools there must
    > be something that would do this simply, without having to create an
    > XML parser etc.
    >
    > I've looked around a bit but failed to find anything, any tips?
    >
    > (e.g. convert "<B>Today</B> is <U>Friday</U>" to "Today is Friday")



    Well, if all you want to do is remove everything from a "<" to a
    ">", you can use

    >>> s = "<B>Today</B> is <U>Friday</U>"
    >>> import re
    >>> r = re.compile('<[^>]*>')
    >>> print(r.sub('', s))
    Today is Friday

    it should even work for semi-pathological cases such as

    s = """You can find my <a
    href='http://example.com'>thesis</a
    > online"""


    where the tag contents are split across lines. There are more
    pathological cases where tags aren't well-formed, e.g.

    s = "This <tag>has a > sign in it and <odd<ly>-nested> tags"

    in which case you get what you deserve for making such
    pathological conditions ;-)
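
    To make the three cases above concrete, here is a minimal sketch in
    Python 3 syntax that runs the same pattern over each sample string.
    The helper name strip_tags and the variable names are only for
    illustration and are not part of the post.

```python
import re

# Same pattern as in the post: a '<', any run of non-'>' characters, then '>'.
TAG = re.compile(r'<[^>]*>')

def strip_tags(text):
    """Delete everything from a '<' up to the next '>'."""
    return TAG.sub('', text)

simple = "<B>Today</B> is <U>Friday</U>"
multiline = """You can find my <a
href='http://example.com'>thesis</a
> online"""
broken = "This <tag>has a > sign in it and <odd<ly>-nested> tags"

print(strip_tags(simple))     # Today is Friday
print(strip_tags(multiline))  # You can find my thesis online
print(strip_tags(broken))     # This has a > sign in it and -nested> tags
```

    The multi-line case works because [^>] also matches newlines. In the
    malformed case the pattern swallows "<odd<ly>" as a single tag and
    leaves the stray "-nested>" behind.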

    -tkc
    Tim Chase, Feb 1, 2008
    #2

  3. ph

    ph Guest

    On 01-Feb-2008, geoffbache wrote:
    > I have some marked up text and would like to convert it to plain text,
    > by simply removing all the tags. Of course I can do it from first
    > principles but I felt that among all Python's markup tools there must
    > be something that would do this simply, without having to create an
    > XML parser etc.
    >
    > I've looked around a bit but failed to find anything, any tips?
    >
    > (e.g. convert "<B>Today</B> is <U>Friday</U>" to "Today is Friday")


    Quick but very dirty way:

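    import urllib.request  # Python 3; this call was urllib.urlopen in Python 2
    data = urllib.request.urlopen('http://google.com').read().decode('utf-8', 'replace')
    data = ''.join([x.split('>', 1)[-1] for x in data.split('<')])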
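
    The split-based one-liner can be tried without fetching a page. The
    sketch below (Python 3; the function name strip_tags_split is
    illustrative, not from the post) applies it to literal strings and
    shows one way in which it is "dirty".

```python
def strip_tags_split(text):
    """Split on '<', then keep only what follows the first '>' in each piece."""
    return ''.join(piece.split('>', 1)[-1] for piece in text.split('<'))

print(strip_tags_split("<B>Today</B> is <U>Friday</U>"))  # Today is Friday

# Caveat: a bare '>' in ordinary text is treated as the end of a tag,
# so the text in front of it is lost.
print(strip_tags_split("1 > 0 is <b>true</b>"))  # prints " 0 is true"
```

    Each piece after a '<' starts with the tag body, so cutting at the
    first '>' drops the tag and keeps the text that follows it.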
    ph, Feb 1, 2008
    #3
  4. Steve Holden

    Steve Holden Guest

    Tim Chase wrote:
    >> I have some marked up text and would like to convert it to plain text,
    >> by simply removing all the tags. Of course I can do it from first
    >> principles but I felt that among all Python's markup tools there must
    >> be something that would do this simply, without having to create an
    >> XML parser etc.
    >>
    >> I've looked around a bit but failed to find anything, any tips?
    >>
    >> (e.g. convert "<B>Today</B> is <U>Friday</U>" to "Today is Friday")

    >
    >
    > Well, if all you want to do is remove everything from a "<" to a
    > ">", you can use
    >
    > >>> s = "<B>Today</B> is <U>Friday</U>"
    > >>> import re
    > >>> r = re.compile('<[^>]*>')
    > >>> print r.sub('', s)

    > Today is Friday
    >
    > it should even work for semi-pathological cases such as
    >
    > s = """You can find my <a
    > href='http://example.com'>thesis</a
    > > online"""

    >
    > where the tag contents are split across lines. There are more
    > pathological cases where tags aren't well-formed, e.g.
    >
    > s ="This <tag>has a > sign in it and <odd<ly>-nested> tags"
    >
    > in which case you get what you deserve for making such
    > pathological conditions ;-)
    >

    The real answer to this question is "learn how to use Beautiful Soup" --
    see http://www.crummy.com/software/BeautifulSoup/

    regards
    Steve
    --
    Steve Holden +1 571 484 6266 +1 800 494 3119
    Holden Web LLC http://www.holdenweb.com/
    Steve Holden, Feb 1, 2008
    #4
  5. Tim Chase

    Tim Chase Guest

    >> Well, if all you want to do is remove everything from a "<" to a
    >> ">", you can use
    >>
    >> >>> s = "<B>Today</B> is <U>Friday</U>"
    >> >>> import re
    >> >>> r = re.compile('<[^>]*>')
    >> >>> print r.sub('', s)

    >> Today is Friday
    >>

    [Tim's ramblings about pathological cases snipped]
    >
    > The real answer to this question is "learn how to use Beautiful Soup" --
    > see http://www.crummy.com/software/BeautifulSoup/


    Yes, for more pathological cases, BS does a great job of parsing
    junk :)

    However, as BS isn't batteries-included [Aside: BS and pyparsing
    are two common solutions to problems that would make great
    additions to the standard library], using a RE to make a
    best-effort guess is a good first approximation of a solution
    without needing to download extra packages--no matter how useful
    those extra packages may be.
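
    For readers who want something between a regex and a third-party
    package, the standard library's HTML parser can also collect text
    nodes. This is only a sketch under assumptions: it uses the Python 3
    module html.parser (the module was named HTMLParser in Python 2), and
    the names TextExtractor and html_to_text are made up for illustration.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect only the text nodes; tags and their attributes are dropped."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &gt; into characters.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def html_to_text(markup):
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    return ''.join(parser.parts)

print(html_to_text("<B>Today</B> is <U>Friday</U>"))  # Today is Friday
# A '>' inside a quoted attribute does not end the tag here.
print(html_to_text('<a href="x?a>b">5 &gt; 3</a>'))   # 5 > 3
```

    Note that this keeps the contents of script and style elements as
    text, so it is still a best-effort approach rather than a full
    HTML-to-text converter.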

    -tkc
    Tim Chase, Feb 1, 2008
    #5
  6. Paul McGuire

    Paul McGuire Guest

    On Feb 1, 10:54 am, Tim Chase wrote:
    > >> Well, if all you want to do is remove everything from a "<" to a
    > >> ">", you can use

    >
    > >>   >>> s = "<B>Today</B> is <U>Friday</U>"
    > >>   >>> import re
    > >>   >>> r = re.compile('<[^>]*>')
    > >>   >>> print r.sub('', s)
    > >>   Today is Friday

    >
    > [Tim's ramblings about pathological cases snipped]


    pyparsing includes an example script for stripping tags from HTML
    source. See it on the wiki at http://pyparsing.wikispaces.com/space/showimage/htmlStripper.py.

    -- Paul
    Paul McGuire, Feb 1, 2008
    #6
  7. Zentrader

    Zentrader Guest

    On Feb 1, 8:07 am, geoffbache wrote:
    > I have some marked up text and would like to convert it to plain text,


    If this is just a quick and dirty problem, you can also use one of the
    lynx/elinks/links2 browsers and dump the contents to a file. On Linux
    it would be

        lynx -dump http://www.etc > text.txt

    Lynx is also available for MS Windows, but I am not sure about the
    other two.
    Zentrader, Feb 2, 2008
    #7
  8. geoffbache wrote:
    > I have some marked up text and would like to convert it to plain text,
    > by simply removing all the tags. Of course I can do it from first
    > principles but I felt that among all Python's markup tools there must
    > be something that would do this simply, without having to create an
    > XML parser etc.
    >
    > I've looked around a bit but failed to find anything, any tips?
    >
    > (e.g. convert "<B>Today</B> is <U>Friday</U>" to "Today is Friday")


    >>> import lxml.etree as et
    >>> doc = et.HTML("<b>Today</b> is <u>Friday</u>")
    >>> et.tostring(doc, method='text', encoding='unicode')
    'Today is Friday'


    http://codespeak.net/lxml

    Stefan
    Stefan Behnel, Feb 3, 2008
    #8
  9. geoffbache wrote:
    > I have some marked up text and would like to convert it to plain text,
    > by simply removing all the tags. Of course I can do it from first
    > principles but I felt that among all Python's markup tools there must
    > be something that would do this simply, without having to create an
    > XML parser etc.
    >
    > I've looked around a bit but failed to find anything, any tips?
    >
    > (e.g. convert "<B>Today</B> is <U>Friday</U>" to "Today is Friday")


    This might be of interest:

    http://pypi.python.org/pypi/haufe.stripml

    Stefan
    Stefan Behnel, Feb 11, 2008
    #9