HTMLParser and write

Discussion in 'Python' started by Kai I Hendry, Mar 5, 2004.

  1. Kai I Hendry

    Kai I Hendry Guest

    I am finding the example at
    http://www.python.org/doc/current/lib/htmlparser-example.html
    a little lacking.

    I want an example which parses and then writes out the same HTML file (a fine
    test case!). Does anyone know where I can find such an example? My initial
    attempt is proving tricky. For example, do I really need to do things like
    ' %s="%s" ' % (name, value) with the attributes? What happens if a tag does not
    need to be closed by handle_endtag? Why does my __init__ def not work? And what
    about the rest, from declarations to parsing entities?


    #!/usr/bin/python2.3
    import sys
    from HTMLParser import HTMLParser

    class MyHTMLParser(HTMLParser):

        #def __init__(self):
            #self.tagsoup = []

        def handle_starttag(self, tag, attrs):
            self.tagsoup.append(tag)
            sys.stdout.write('<%s' % tag)
            for attr in attrs:
                name, value = attr
                sys.stdout.write(' %s="%s" ' % (name, value))
            sys.stdout.write('>')


            #This is the whole tag
            #But, how do I know if it needs to be closed?
            #print self.get_starttag_text()


        def handle_data(self, data):
            sys.stdout.write(data)

        def handle_endtag(self, tag):
            self.tagsoup.remove(tag)
            sys.stdout.write('</%s>' % tag)

            #Something like this?
            #Or is there a better way?
            #print self.check_for_whole_start_tag

    if __name__ == "__main__":
        h = MyHTMLParser()

        # __init__ def results in some sort of rawdata error, hence:
        h.tagsoup = []

        #h.feed(sys.stdin.read())

        import urllib2
        html = urllib2.urlopen('http://www.cs.helsinki.fi/u/hendry/')
        h.feed(html.read())
    Kai I Hendry, Mar 5, 2004
    #1
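
On the `__init__` question: a likely cause, assuming the commented-out method was the whole of it, is that it never called the base class constructor. `HTMLParser.__init__` runs `reset()`, which creates the internal `rawdata` buffer that `feed()` appends to, so skipping it would give that kind of rawdata error. A minimal sketch follows, written for Python 3, where the module is `html.parser` (under Python 2 the import would be `from HTMLParser import HTMLParser` and the base call `HTMLParser.__init__(self)`). The class name `TagSoupParser` is made up for illustration.

```python
from html.parser import HTMLParser


class TagSoupParser(HTMLParser):
    """Hypothetical parser that tracks which tags are currently open."""

    def __init__(self):
        # Call the base constructor first: it runs reset(), which creates
        # the internal rawdata buffer that feed() relies on.
        super().__init__()
        self.tagsoup = []

    def handle_starttag(self, tag, attrs):
        self.tagsoup.append(tag)

    def handle_endtag(self, tag):
        # Tolerate a stray end tag instead of raising ValueError.
        if tag in self.tagsoup:
            self.tagsoup.remove(tag)


parser = TagSoupParser()
parser.feed("<p>Hello <b>world</b>")
parser.close()
print(parser.tagsoup)  # → ['p']  (the <p> was never closed)
```

The leftover `'p'` also shows what happens with a tag that is never closed: `handle_endtag` is simply not called for it, so nothing needs to be written back.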

  2. Stephen Ferg

    Stephen Ferg Guest

    You're right. The example is REALLY feeble. Maybe this will help:

    """HTMLParserDemoProgram
    Use HTMLParser to read in an HTML file and write it out again.
    This will put all tag and attribute names into lowercase.
    """

    """
    REVISION HISTORY
    2 2004-01-05 added handle_pi and improved attribute processing
    """

    from HTMLParser import HTMLParser

    class CustomizedParser(HTMLParser):

        def setOutfileName(self, argOutfileName):
            """Remember the output file, so it is easy to write to it.
            """
            self.OutfileName = argOutfileName
            self.Outfile = open(self.OutfileName, "w")

        def closeOutfile(self):
            self.Outfile.close()

        def write(self, argString):
            self.Outfile.write(argString)

        def handle_starttag(self, argTag, argAttrs):
            """ argAttrs is a list of tuples.
            Each tuple is a pair of (attribute_name, attribute_value)
            """
            attributes = "".join([' %s="%s"' % (key, value) for key, value in argAttrs])
            self.Outfile.write("<%s%s>" % (argTag, attributes))

        def handle_startendtag(self, argTag, argAttrs):
            """ argAttrs is a list of tuples.
            Each tuple is a pair of (attribute_name, attribute_value)
            """
            attributes = "".join([' %s="%s"' % (key, value) for key, value in argAttrs])
            self.Outfile.write("<%s%s/>" % (argTag, attributes))


        def handle_endtag(self, argTag):
            self.write("</%s>" % argTag)

        def handle_data(self, argString):
            self.write(argString)

        def handle_charref(self, argString):
            self.write("&#%s;" % argString)

        def handle_entityref(self, argString):
            self.write("&%s;" % argString)

        def handle_comment(self, argString):
            self.write("<!--%s-->" % argString)

        def handle_decl(self, argString):
            self.write("<!%s>" % argString)

        def handle_pi(self, argString):
            # handle a processing instruction
            self.write("<?%s>" % argString)

    def main(myInfileName, myOutfileName ):
        myInfile = open(myInfileName, "r")
        myParser = CustomizedParser()
        myParser.setOutfileName(myOutfileName)

        myParser.feed(myInfile.read())

        myInfile.close()
        myParser.closeOutfile()


    def dq(s):
        """Enclose a string argument in double quotes"""
        return '"'+ s + '"'

    if __name__ == "__main__":
        print "Starting HTMLParserDemoProgram"
        # Raw strings keep the backslashes in the Windows paths literal
        main(r"c:\junk\slide01.html", r"c:\junk\slide01a.html")
        print "Ending HTMLParserDemoProgram"
    Stephen Ferg, Mar 5, 2004
    #2
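
On the attribute question: rebuilding each attribute with `' %s="%s"'` works for simple cases, but the parser hands over already-unescaped values and uses `None` for valueless attributes such as `disabled`, so the output can differ from the input. One alternative, sketched here under the assumption of Python 3 (`html.parser`), is to echo `get_starttag_text()`, which returns the start tag as it appeared in the source. The names `EchoParser`, `roundtrip` and the sample string are illustrative only.

```python
import io
from html.parser import HTMLParser


class EchoParser(HTMLParser):
    """Hypothetical parser that writes back what it reads."""

    def __init__(self, out):
        # convert_charrefs=False reports &amp; and &#38; as separate
        # events, so they can be written back unchanged.
        super().__init__(convert_charrefs=False)
        self.out = out

    def handle_starttag(self, tag, attrs):
        # The raw start tag text, so attributes are not rebuilt by hand.
        self.out.write(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Same for XHTML-style tags such as <br/>.
        self.out.write(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.out.write("</%s>" % tag)

    def handle_data(self, data):
        self.out.write(data)

    def handle_entityref(self, name):
        self.out.write("&%s;" % name)

    def handle_charref(self, name):
        self.out.write("&#%s;" % name)

    def handle_comment(self, data):
        self.out.write("<!--%s-->" % data)

    def handle_decl(self, decl):
        self.out.write("<!%s>" % decl)

    def handle_pi(self, data):
        self.out.write("<?%s>" % data)


def roundtrip(source):
    """Parse source and return the text the parser wrote back."""
    out = io.StringIO()
    parser = EchoParser(out)
    parser.feed(source)
    parser.close()
    return out.getvalue()


sample = ('<!DOCTYPE html><p class="a&amp;b">x &lt; y<br>'
          '<input disabled><!-- c --></p>')
print(roundtrip(sample) == sample)
```

End tags are still rebuilt from the lowercased tag name, so something like `</P >` in the input would come back as `</p>`; start tags, by contrast, are copied verbatim.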