HTMLParser chokes on bad end tag in comment

Discussion in 'Python' started by Rene Pijlman, May 28, 2006.

  1. Rene Pijlman

    Rene Pijlman Guest

    The code below results in an exception (Python 2.4.2):

    HTMLParser.HTMLParseError: bad end tag: "</foo' + 'bar>", at line 4,
    column 6

    Should it? The end tag it chokes on is in comment, isn't it?

    import HTMLParser
    HTMLParser.HTMLParser().feed("""
    <html><head><title></title></head><body><script>
    <!--
    x = '</foo' + 'bar>'
    // -->
    </script></body></html>
    """)

    --
    René Pijlman
     
    Rene Pijlman, May 28, 2006
    #1

  2. Rene Pijlman wrote:

    > The code below results in an exception (Python 2.4.2):
    >
    > HTMLParser.HTMLParseError: bad end tag: "</foo' + 'bar>", at line 4,
    > column 6
    >
    > Should it? The end tag it chokes on is in comment, isn't it?


    no. STYLE and SCRIPT elements contain character data, not parsed
    character data, so comments are treated as characters, and the first
    "</" ends the element.

    if you have broken documents, you can tweak this by setting the
    CDATA_CONTENT_ELEMENTS parser attribute before you start parsing.
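
A minimal sketch of that tweak, written against Python 3's html.parser (in Python 2.4 the module is called HTMLParser). The class names EventCollector and MarkupScriptParser and the parse() helper are made up for illustration. Current Python 3 releases are more lenient than 2.4 and do not raise on this input, so the sketch only compares how the script body is reported with and without the tweak.

```python
from html.parser import HTMLParser

DOC = """
<html><head><title></title></head><body><script>
<!--
x = '</foo' + 'bar>'
// -->
</script></body></html>
"""


class EventCollector(HTMLParser):
    """Hypothetical helper: records comments and the text seen inside <script>."""

    def __init__(self):
        super().__init__()
        self.in_script = False
        self.script_text = []
        self.comments = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False

    def handle_data(self, data):
        if self.in_script:
            self.script_text.append(data)

    def handle_comment(self, data):
        self.comments.append(data)


class MarkupScriptParser(EventCollector):
    # The tweak: with no CDATA elements declared, SCRIPT and STYLE bodies
    # are parsed like ordinary markup, so "<!-- ... -->" becomes a comment.
    CDATA_CONTENT_ELEMENTS = ()


def parse(parser_cls):
    parser = parser_cls()
    parser.feed(DOC)
    parser.close()
    return parser


default = parse(EventCollector)
tweaked = parse(MarkupScriptParser)

# Default: the comment delimiters are plain characters of the script body.
print(len(default.comments), repr("".join(default.script_text)))
# Tweaked: the same text is reported through handle_comment instead.
print(len(tweaked.comments), repr(tweaked.comments))
```

Overriding the attribute on a subclass keeps the change local to the parsers that need it; assigning to it on a single instance before calling feed() should work as well.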

    </F>
     
    Fredrik Lundh, May 29, 2006
    #2

  3. Rene Pijlman

    Rene Pijlman Guest

    Fredrik Lundh:
    >Rene Pijlman:
    >[end tag in html comment in script element]
    >The end tag it chokes on is in comment, isn't it?
    >
    >no. STYLE and SCRIPT elements contain character data, not parsed
    >character data, so comments are treated as characters, and the first
    >"</" ends the element.


    Ah, I see. I'll report the problem to the application that's generating
    this broken code (vBulletin forum)...

    >if you have broken documents, you can tweak this by setting the
    >CDATA_CONTENT_ELEMENTS parser attribute before you start parsing.


    ... and in the meantime that's a good workaround.

    Thank you very much Fredrik.

    --
    René Pijlman
     
    Rene Pijlman, May 29, 2006
    #3
  4. Miki

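    Miki Guest

    You can also check out BeautifulSoup
    (http://www.crummy.com/software/BeautifulSoup/) which is less strict
    than the regular HTML parser.
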
    Miki, May 29, 2006
    #4
  5. Rene Pijlman

    Rene Pijlman Guest

    Miki:
    >You can also check out BeautifulSoup
    >(http://www.crummy.com/software/BeautifulSoup/) which is less strict
    >than the regular HTML parser.


    Yes, thanks. In this case it was my site checker, which checks for syntax
    errors and broken links, so it was supposed to find the syntax error.
    BeautifulSoup is not very well suited for validators :)

    --
    René Pijlman
     
    Rene Pijlman, May 29, 2006
    #5
  6. Fredrik Lundh wrote:

    >> Should it? The end tag it chokes on is in comment, isn't it?

    >
    > no. STYLE and SCRIPT elements contain character data, not parsed
    > character data, so comments are treated as characters, and the first
    > "</" ends the element.


    Rather than take your word for it, I checked the W3C HTML4 DTD and found
    this:

    http://www.w3.org/TR/html4/appendix/notes.html#notes-specifying-data

    Element content

    When script or style data is the content of an element (SCRIPT and STYLE),
    the data begins immediately after the element start tag and ends at the
    first ETAGO ("</") delimiter followed by a name start character ([a-zA-Z]);
    note that this may not be the element's end tag. Authors should therefore
    escape "</" within the content. Escape mechanisms are specific to each
    scripting or style sheet language.

    ILLEGAL EXAMPLE:
    The following script data incorrectly contains a "</" sequence (as part of
    "</EM>") before the SCRIPT end tag:

    <SCRIPT type="text/javascript">
    document.write ("<EM>This won't work</EM>")
    </SCRIPT>

    In JavaScript, this code can be expressed legally by hiding the ETAGO
    delimiter before an SGML name start character:

    <SCRIPT type="text/javascript">
    document.write ("<EM>This will work<\/EM>")
    </SCRIPT>
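
The rule quoted above can be sketched in a few lines of Python. The names first_etago and escape_etago are hypothetical helpers, not part of any library, and the escape is only safe under the assumption that every "</" in the script sits inside a JavaScript string or regex literal, where "<\/" means the same thing.

```python
import re

# HTML4 rule: script data ends at the first "</" followed by a name start character.
ETAGO = re.compile(r"</(?=[a-zA-Z])")


def first_etago(script_body):
    """Return the offset where an HTML4 parser would end the script data, or -1."""
    match = ETAGO.search(script_body)
    return match.start() if match else -1


def escape_etago(js_source):
    """Hide every "</" from the HTML parser by writing it as "<\\/".

    Assumes each occurrence is inside a JavaScript string or regex literal.
    """
    return js_source.replace("</", "<\\/")


illegal = 'document.write ("<EM>This won\'t work</EM>")'
legal = escape_etago(illegal)

print(first_etago(illegal))  # an offset >= 0: the data would end early here
print(first_etago(legal))    # no early end once "</" is escaped
print(legal)
```

A checker could run first_etago over the body of each inline script it is about to emit and flag any result other than -1.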


    Guess you learn something new every day. Too bad there's so much illegal
    code in the wild. :(

    --
    Edward Elliott
    UC Berkeley School of Law (Boalt Hall)
    complangpython at eddeye dot net
     
    Edward Elliott, May 29, 2006
    #6
  7. Edward Elliott wrote:

    > Guess you learn something new every day. Too bad there's so much illegal
    > code in the wild. :(


    if more people learned something new every day, the wild would look a
    lot different.

    </F>
     
    Fredrik Lundh, May 29, 2006
    #7
