Another BeautifulSoup crash on bad HTML

Discussion in 'Python' started by John Nagle, May 15, 2008.

    Can't really blame BeautifulSoup for this, but our crawler hit a page
    ("http://clagnut.com/privacy/") with an out of range character escape:



    in this text:

    If you provide a name, email address and/or website and choose ‘Remember
    me, these details will be stored as a cookie on your computer.

    The author clearly meant "&#8217;" (’), which is a single close quote.

    The traceback as BeautifulSoup aborts:

        SGMLParser.feed(self, markup or "")
      File "/usr/local/lib/python2.5/sgmllib.py", line 99, in feed
        self.goahead(0)
      File "/usr/local/lib/python2.5/sgmllib.py", line 181, in goahead
        self.handle_charref(name)
      File "/var/www/vhosts/sitetruth.com/cgi-bin/sitetruth/BeautifulSoup.py", line 1250, in handle_charref
        data = unichr(int(ref))
    ValueError: unichr() arg not in range(0x10000) (narrow Python build)

    Another item in our ongoing saga of "What happens when you parse real-world
    HTML".

    A try-block in handle_charref would be appropriate.
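
A minimal sketch of what such a try-block might look like, written for Python 3, where chr() takes the place of unichr(). The helper name safe_charref and the choice to fall back to the literal markup are assumptions made for illustration; this is not BeautifulSoup's actual code or its eventual fix.

```python
# Hypothetical helper, not part of BeautifulSoup: decode the body of a numeric
# character reference, keeping the original markup when it cannot be decoded.
def safe_charref(ref):
    """Decode a reference body such as '8217' or 'x2019' to one character."""
    try:
        if ref[:1] in ("x", "X"):
            codepoint = int(ref[1:], 16)  # hexadecimal form, e.g. &#x2019;
        else:
            codepoint = int(ref)          # decimal form, e.g. &#8217;
        return chr(codepoint)             # unichr() on Python 2
    except (ValueError, OverflowError):
        # Not a number, or outside the range chr() accepts:
        # pass the reference through as literal text instead of aborting.
        return "&#%s;" % ref


print(repr(safe_charref("8217")))      # a valid reference decodes normally
print(repr(safe_charref("99999999")))  # an out-of-range one is passed through
```

On a narrow Python 2 build the same except clause would also cover the ValueError shown in the traceback, since unichr() there rejects anything above 0xFFFF. Whether to keep the literal text, drop it, or substitute U+FFFD is a design choice; keeping the text loses nothing from the page.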

    John Nagle
    SiteTruth
    John Nagle, May 15, 2008