sgmllib problem & proposed fix.

Discussion in 'Python' started by C. Titus Brown, Dec 17, 2004.

  1. Hi all,

    while playing with PBP/mechanize/ClientForm, I ran into a problem with
    the way htmllib.HTMLParser was handling encoded tag attributes.

    Specifically, the following HTML was not being handled correctly:

    <option value="Small (6&quot;)">Small (6)</option>

    The 'value' attr was being given the escaped value, not the
    correct unescaped value, 'Small (6")'.

    It turns out that sgmllib.SGMLParser (on which htmllib.HTMLParser is
    based) does not unescape tag attributes. However, HTMLParser.HTMLParser
    (the newer, more XHTML-friendly class) does do so.
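
    For readers without Python 2.4 at hand, the unescaping side of this can be reproduced on Python 3, where HTMLParser.HTMLParser lives on as html.parser.HTMLParser (sgmllib and htmllib were removed in Python 3, so the non-unescaping side cannot be shown there). This is only a minimal sketch; the class name AttrCollector is made up for illustration.

```python
from html.parser import HTMLParser


class AttrCollector(HTMLParser):
    """Hypothetical helper: records the attrs of every <option> start tag."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs, already unescaped
        if tag == "option":
            self.seen.append(attrs)


collector = AttrCollector()
collector.feed('<option value="Small (6&quot;)">Small (6)</option>')
collector.close()
print(collector.seen)
```

    The recorded value should contain a literal double quote rather than the &quot; entity.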

    My proposed fix is to change sgmllib to unescape tag attribute values
    in the same way that HTMLParser.HTMLParser does. A context diff against
    sgmllib.py from Python 2.4 is at the bottom of this message.

    I'm posting to this newsgroup before submitting the patch because I'm
    not too familiar with these classes and I want to make sure this
    behavior is correct.

    One question I had was this: as you can see from the code below, a
    simple chain of s.replace() calls is used to replace the escaped
    entities with their literal characters. Should handle_entityref be
    used instead, as with standard HTML text?
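
    One possible middle ground, sketched here purely as an illustration: handle_entityref as written passes the decoded text to handle_data, so calling it on attribute values would presumably emit them as document text. A table lookup driven by the same five-entry entitydefs mapping avoids that while still honouring the table. The names ENTITYDEFS and unescape_with_table are hypothetical and are not part of sgmllib.

```python
import re

# Assumption: the same five entities as the default entitydefs table.
ENTITYDEFS = {'lt': '<', 'gt': '>', 'amp': '&', 'quot': '"', 'apos': "'"}

# Matches named references such as &quot; (numeric references are ignored here).
_ENTITY_RE = re.compile(r'&([a-zA-Z][a-zA-Z0-9]*);')


def unescape_with_table(s, table=ENTITYDEFS):
    """Hypothetical helper: resolve &name; references in a single pass."""
    def replace(match):
        name = match.group(1)
        # Unknown entities are left untouched rather than dropped.
        return table.get(name, match.group(0))
    return _ENTITY_RE.sub(replace, s)


print(unescape_with_table('Small (6&quot;)'))
```

    Because the substitution runs in one pass, the order of entries in the table does not matter, and a subclass that overrides the table would get its extra entities decoded in attributes as well.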

    Another question: should this fix, if appropriate, be back-ported to
    older versions of Python? (I doubt sgmllib has changed much, so it
    should be pretty simple to do.)

    thanks for any advice,
    --titus

    *** /u/t/software/Python-2.4/Lib/sgmllib.py 2004-09-08 18:49:58.000000000 -0700
    --- sgmllib.py 2004-12-16 23:30:51.000000000 -0800
    ***************
    *** 272,277 ****
    --- 272,278 ----
                  elif attrvalue[:1] == '\'' == attrvalue[-1:] or \
                       attrvalue[:1] == '"' == attrvalue[-1:]:
                      attrvalue = attrvalue[1:-1]
    +                 attrvalue = self.unescape(attrvalue)
                  attrs.append((attrname.lower(), attrvalue))
                  k = match.end(0)
              if rawdata[j] == '>':
    ***************
    *** 414,419 ****
    --- 415,432 ----
          def unknown_charref(self, ref): pass
          def unknown_entityref(self, ref): pass

    +     # Internal -- helper to remove special character quoting
    +     def unescape(self, s):
    +         if '&' not in s:
    +             return s
    +         s = s.replace("&lt;", "<")
    +         s = s.replace("&gt;", ">")
    +         s = s.replace("&apos;", "'")
    +         s = s.replace("&quot;", '"')
    +         s = s.replace("&amp;", "&") # Must be last
    +
    +         return s
    +

      class TestSGMLParser(SGMLParser):
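
    The "Must be last" comment in the patch is easy to check in isolation. The sketch below copies the helper into a plain function and adds a deliberately mis-ordered variant (unescape_amp_first, a made-up name) to show the double-decoding that the ordering prevents.

```python
def unescape(s):
    """Standalone copy of the helper from the patch, for experimentation."""
    if '&' not in s:
        return s
    s = s.replace("&lt;", "<")
    s = s.replace("&gt;", ">")
    s = s.replace("&apos;", "'")
    s = s.replace("&quot;", '"')
    s = s.replace("&amp;", "&")  # must be last
    return s


def unescape_amp_first(s):
    """Hypothetical wrong ordering: decoding &amp; first decodes twice."""
    s = s.replace("&amp;", "&")
    s = s.replace("&lt;", "<")
    s = s.replace("&gt;", ">")
    s = s.replace("&apos;", "'")
    s = s.replace("&quot;", '"')
    return s


# "&amp;lt;" is the escaped form of the literal text "&lt;".
print(unescape("&amp;lt;"))            # &lt;
print(unescape_amp_first("&amp;lt;"))  # <
```

    With &amp; handled last, an escaped ampersand can never combine with the text that follows it to form a second entity.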
     
    C. Titus Brown, Dec 17, 2004

  2. Whoops! Forgot an executable example ;).

    Attached, and also available at

    http://issola.caltech.edu/~t/transfer/test-enc.py
    http://issola.caltech.edu/~t/transfer/test-enc.html

    Run 'python test-enc.py test-enc.html' and note that
    htmllib.HTMLParser-based parsers give different output than
    HTMLParser.HTMLParser-based parsers.

    cheers,
    --titus

    #!/usr/bin/env python2.4
    import htmllib
    import HTMLParser
    import formatter

    ### a simple mix-in to demonstrate the problem.

    class MixinTest:
        def start_option(self, attrs):
            print '==> OPTION starting', attrs

        # Definition of entities -- derived classes may override
        entitydefs = \
            {'lt': '<', 'gt': '>', 'amp': '&', 'quot': '"', 'apos': '\''}

        def handle_entityref(self, name):
            print '==> HANDLING ENTITY', name
            table = self.entitydefs
            if name in table:
                self.handle_data(table[name])
            else:
                self.unknown_entityref(name)
                return

    ####

    class htmllib_Parser(MixinTest, htmllib.HTMLParser):
        def __init__(self):
            htmllib.HTMLParser.__init__(self, formatter.NullFormatter())

    class nonhtmllib_Parser(MixinTest, HTMLParser.HTMLParser):
        def handle_starttag(self, name, attrs):
            "Redirect OPTION tag ==> MixinTest.start_option"
            if name == 'option':
                self.start_option(attrs)

        pass

    ###

    import sys
    data = open(sys.argv[1]).read()

    print 'PARSING with htmllib.HTMLParser'

    htmllib_p = htmllib_Parser()
    htmllib_p.feed(data)

    print '\nPARSING with HTMLParser.HTMLParser'

    nonhtmllib_p = nonhtmllib_Parser()
    nonhtmllib_p.feed(data)
     
    C. Titus Brown, Dec 17, 2004
