Making sgmllib more liberal

Discussion in 'Python' started by Jeff Bowden, Aug 26, 2004.

    I've written a simple class derived from sgmllib.SGMLParser to extract
    text from HTML pages. So far it has worked well, except for a few
    cases where the parser raises exceptions. I've managed to work around
    these problems by overriding parse_declaration.

    Since parse_declaration is preceded by the comment

    # Internal -- parse declaration (for use by subclasses).

    I'm worried that my workaround could stop working with future
    versions of sgmllib, so I'm looking for a more correct alternative.
    Any suggestions?

    Here's my code:

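    import re
    import sgmllib
    import StringIO

    _endTag = re.compile(r'>')

    class SGML2TextParser(sgmllib.SGMLParser):
        def __init__(self, f, ignoretags=('script',)):
            sgmllib.SGMLParser.__init__(self)
            self.f = f
            self.ignoretags = ignoretags
            self.tag = ''

        # SGMLParser calls unknown_starttag(tag, attrs) for tags that have
        # no start_<tag>/do_<tag> method; handle_starttag takes
        # (tag, method, attrs) and would never be reached here.
        def unknown_starttag(self, tag, attrs):
            self.tag = tag

        # Stop ignoring data once the current tag closes.
        def unknown_endtag(self, tag):
            self.tag = ''

        def handle_data(self, data):
            if self.tag not in self.ignoretags:
                self.f.write(data)

        def handle_charref(self, name):
            try:
                n = int(name)
                self.handle_data(unichr(n))
            except ValueError:
                pass

        # DANGER: overriding internal function
        def parse_declaration(self, i):
            try:
                return sgmllib.SGMLParser.parse_declaration(self, i)
            except Exception:
                # Malformed declaration: skip to just past the next '>'.
                match = _endTag.search(self.rawdata, i)
                if match:
                    return match.end(0)
                return -1

    def extractText(html_text):
        s = StringIO.StringIO()
        x = SGML2TextParser(s)
        x.feed(html_text)
        x.close()
        return s.getvalue()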
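
For comparison, here is a minimal sketch of the same text extractor built on html.parser.HTMLParser from Python 3, where sgmllib is no longer part of the standard library. It is not a drop-in replacement for the class above: the names HTML2TextParser, skip_depth and extract_text are made up for this illustration, and it relies on HTMLParser's own handling of unrecognised "<!...>" declarations, so parse_declaration is not overridden.

```python
import io
from html.parser import HTMLParser


class HTML2TextParser(HTMLParser):
    """Write the text content of an HTML document to a file-like object."""

    def __init__(self, f, ignoretags=("script",)):
        # convert_charrefs=True decodes references such as &amp; and &#65;
        # before the text reaches handle_data.
        super().__init__(convert_charrefs=True)
        self.f = f
        self.ignoretags = ignoretags
        self.skip_depth = 0  # greater than 0 while inside an ignored element

    def handle_starttag(self, tag, attrs):
        if tag in self.ignoretags:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.ignoretags and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.f.write(data)


def extract_text(html_text):
    s = io.StringIO()
    parser = HTML2TextParser(s)
    parser.feed(html_text)
    parser.close()  # flush any text still buffered by the parser
    return s.getvalue()
```

Calling extract_text on a page returns its text with the contents of any script elements dropped. Tracking a depth counter instead of the last tag seen means text after a closing script tag is written again without waiting for the next start tag.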
     
