Re: Converting HTML to ASCII

Discussion in 'Python' started by Michael Spencer, Feb 25, 2005.

  1. gf gf wrote:
    > [wants to extract ASCII from badly-formed HTML and thinks BeautifulSoup is too complex]


    You haven't specified what you mean by "extracting" ASCII, but I'll assume that
    you want to start by eliminating html tags and comments, which is easy enough
    with a couple of regular expressions:

    >>> import re
    >>> comments = re.compile('<!--.*?-->', re.DOTALL)
    >>> tags = re.compile('<.*?>', re.DOTALL)
    >>> def striptags(text):
    ...     text = re.sub(comments,'', text)
    ...     text = re.sub(tags,'', text)
    ...     return text
    ...
    >>> def collapsenewlines(text):
    ...     return "\n".join(line for line in text.splitlines() if line)
    ...
    >>> import urllib2
    >>> f = urllib2.urlopen('http://www.python.org/')
    >>> source = f.read()
    >>> text = collapsenewlines(striptags(source))
    >>>


    This will of course fail if there is a "<" without a ">", probably in other
    cases too. But it is indifferent to whether the html is well-formed.

    This leaves you with the additional task of substituting the html escaped
    characters e.g., "&nbsp;", not all of which will have ASCII representations.
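A minimal sketch of that remaining step, assuming Python 3 and only the standard library. The function name `to_ascii` is hypothetical, and silently dropping characters that have no ASCII form is just one possible policy (substituting "?" or a hand-written mapping would be others):

```python
import html
import unicodedata

def to_ascii(text):
    # Decode entities such as &nbsp; and &eacute; into real characters.
    decoded = html.unescape(text)
    # NFKD splits accented letters into base letter + combining mark and
    # maps compatibility characters (e.g. the no-break space) to plain ones.
    decomposed = unicodedata.normalize("NFKD", decoded)
    # Whatever is still outside ASCII, combining marks included, is dropped.
    return decomposed.encode("ascii", "ignore").decode("ascii")

print(repr(to_ascii("caf&eacute;&nbsp;au&nbsp;lait")))  # 'cafe au lait'
print(repr(to_ascii("&copy; 2005")))                     # ' 2005'
```

The second call shows the case mentioned above: the copyright sign has no ASCII representation, so under this policy it simply disappears.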

    HTH

    Michael
    Michael Spencer, Feb 25, 2005
    #1

  2. Michael Spencer

    Mike Meyer Guest

    Michael Spencer <> writes:

    > gf gf wrote:
    >> [wants to extract ASCII from badly-formed HTML and thinks BeautifulSoup is too complex]

    >
    > You haven't specified what you mean by "extracting" ASCII, but I'll
    > assume that you want to start by eliminating html tags and comments,
    > which is easy enough with a couple of regular expressions:
    >
    > >>> import re
    > >>> comments = re.compile('<!--.*?-->', re.DOTALL)
    > >>> tags = re.compile('<.*?>', re.DOTALL)
    > >>> def striptags(text):
    > ...     text = re.sub(comments,'', text)
    > ...     text = re.sub(tags,'', text)
    > ...     return text
    > ...
    > >>> def collapsenewlines(text):
    > ...     return "\n".join(line for line in text.splitlines() if line)
    > ...
    > >>> import urllib2
    > >>> f = urllib2.urlopen('http://www.python.org/')
    > >>> source = f.read()
    > >>> text = collapsenewlines(striptags(source))
    > >>>

    >
    > This will of course fail if there is a "<" without a ">", probably in
    > other cases too. But it is indifferent to whether the html is
    > well-formed.


    It also fails on tags with a ">" in a string in the tag. That's
    well-formed but ill-used HTML.
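One way the tag pattern could be made aware of quoted attribute values is sketched below. This is not from the original post: the names `quote_aware_tags` and `striptags_quoted` are hypothetical, and the pattern is only an illustration of the idea, not a full HTML tokenizer.

```python
import re

# A tag is "<", then any mix of quoted strings and ordinary characters,
# then ">". The quoted alternatives are tried first, so a ">" inside
# quotes is consumed as part of the string instead of ending the tag.
quote_aware_tags = re.compile(r"""<(?:"[^"]*"|'[^']*'|[^>"'])*>""")

def striptags_quoted(text):
    return quote_aware_tags.sub("", text)

print(striptags_quoted("""<sometag attribute = '>'>the text</sometag>"""))
# the text
```

Comments would still need to be removed first, as in the original `striptags`, because an unbalanced quote inside a comment (for example an apostrophe) keeps this pattern from matching it. A "<" with no closing ">" is left in place rather than stripped.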

    <mike
    --
    Mike Meyer <> http://www.mired.org/home/mwm/
    Independent WWW/Perforce/FreeBSD/Unix consultant, email for more information.
    Mike Meyer, Feb 25, 2005
    #2

  3. Mike Meyer wrote:

    >
    > It also fails on tags with a ">" in a string in the tag. That's
    > well-formed but ill-used HTML.
    >
    > <mike

    True enough...however, it doesn't fail too horribly:
    >>> striptags("""<sometag attribute = '>'>the text</sometag>""")
    "'>the text"
    >>>

    and I think that case could be rectified rather easily, by stripping any content
    up to '>' in the result without breaking anything else.

    BTW, I took a first look at BeautifulSoup. As far as I could tell, there is no
    built-in way to extract text from its parse tree; however, adding one is trivial:

    >>> from bsoup import BeautifulSoup, Tag
    >>> def extracttext(obj):
    ...     if isinstance(obj,Tag):
    ...         return "".join(extracttext(c) for c in obj.contents)
    ...     else:
    ...         return str(obj)
    ...
    >>> def bsouptext(text):
    ...     souptree = BeautifulSoup(text)
    ...     bodytext = extracttext(souptree.first())
    ...     text = re.sub(comments,'', bodytext)
    ...     text = collapsenewlines(text)
    ...     return text
    ...
    >>> bsouptext("""<sometag attribute = '>'>the text</sometag>""")
    "'>the text"

    On one 'real world test' (nytimes.com), I find the regexp approach to be more
    accurate, but I won't load up this message with the output to prove it ;-)
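For comparison, the same "parse, then collect the text nodes" idea can be sketched with the standard library's `html.parser` in Python 3. This is an illustration rather than anything used in the posts above; the names `TextExtractor` and `parsertext` are hypothetical.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects character data and ignores tags and comments."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # Called for text between tags; entities are already decoded
        # because convert_charrefs is on.
        self.chunks.append(data)

def parsertext(source):
    parser = TextExtractor()
    parser.feed(source)
    parser.close()  # flush any text still buffered
    return "".join(parser.chunks)

print(parsertext("""<sometag attribute = '>'>the text</sometag>"""))
# the text
```

Note that the contents of `<script>` and `<style>` elements are also delivered through `handle_data`, so a real page would need extra handling to leave them out.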

    Michael
    Michael Spencer, Feb 25, 2005
    #3
  4. Michael Spencer

    Mike Meyer Guest

    Michael Spencer <> writes:

    > Mike Meyer wrote:
    >
    >> It also fails on tags with a ">" in a string in the tag. That's
    >> well-formed but ill-used HTML.
    >> <mike

    > True enough...however, it doesn't fail too horribly:
    > >>> striptags("""<sometag attribute = '>'>the text</sometag>""")
    > "'>the text"
    > >>>


    Depends on your example:

    <sometag attribute='>' otherattribute='otherstuff' moreattribute='yet
    more stuff'>

    and so on. Then again, early browsers actually did the same kind of
    parsing as you do, so this type of thing is discouraged.

    > and I think that case could be rectified rather easily, by stripping
    > any content up to '>' in the result without breaking anything else.


    Yes, but then what happens with:

    <sometag>>text</sometag>

    ?
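Both counterexamples can be checked with a short sketch. `striptags` is reproduced from the first post; `strip_to_gt` is a hypothetical reading of the proposed fix (discard everything up to and including the first ">" left in the result), not code from the thread.

```python
import re

comments = re.compile('<!--.*?-->', re.DOTALL)
tags = re.compile('<.*?>', re.DOTALL)

def striptags(text):
    text = re.sub(comments, '', text)
    return re.sub(tags, '', text)

def strip_to_gt(text):
    # Hypothetical fix: drop everything up to and including the first ">".
    head, sep, tail = text.partition('>')
    return tail if sep else text

# More attributes after the quoted ">" leave more debris behind.
messy = striptags("<sometag attribute='>' otherattribute='otherstuff'>text</sometag>")
print(repr(messy))               # "' otherattribute='otherstuff'>text"
print(repr(strip_to_gt(messy)))  # 'text'

# Here the ">" is legitimate content, and the fix throws it away.
legit = striptags("<sometag>>text</sometag>")
print(repr(legit))               # '>text'
print(repr(strip_to_gt(legit)))  # 'text'
```

Under this reading the fix repairs the first case but silently removes real text in the second, since the stripped output no longer says which ">" came from inside a tag.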

    <mike

    --
    Mike Meyer <> http://www.mired.org/home/mwm/
    Independent WWW/Perforce/FreeBSD/Unix consultant, email for more information.
    Mike Meyer, Feb 27, 2005
    #4
