Should HTML entity translation accept "&amp"?

Discussion in 'Python' started by John Nagle, Jan 7, 2008.

  1. John Nagle

    John Nagle Guest

    Another in our ongoing series on "Parsing Real-World HTML".

    It's wrong, of course. But Firefox will accept as HTML escapes

    &amp
    &gt
    &lt

    as well as the correct forms

    &amp;
    &gt;
    &lt;

    To be "compatible", a Python screen scraper at

    http://zesty.ca/python/scrape.py

    has a function "htmldecode", which is supposed to recognize
    HTML escapes and generate Unicode. (Why isn't this a standard
    Python library function? Its inverse is available.)

    This uses the regular expression

    charrefpat = re.compile(r'&(#(\d+|x[\da-fA-F]+)|[\w.:-]+);?',re.UNICODE)

    to recognize HTML escapes.

    Note the ";?", which makes the closing ";" optional.

    This seems fine until we hit something valid but unusual like

    http://www.example.com?foo=1&#1234567

    for which "htmldecode" tries to convert "1234567" into
    a Unicode character with that decimal number, and gets a
    Unicode overflow.
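
The failure can be reproduced in a few lines. This is a minimal sketch, assuming Python 3, where `chr` stands in for the Python 2 `unichr` call that a 2008-era script would have used; the variable names are illustrative and not taken from scrape.py.

```python
import re

# The pattern quoted above, with the optional trailing ";".
charrefpat = re.compile(r'&(#(\d+|x[\da-fA-F]+)|[\w.:-]+);?', re.UNICODE)

url = "http://www.example.com?foo=1&#1234567"
match = charrefpat.search(url)

matched_text = match.group(0)    # the text the pattern treats as an escape
codepoint = int(match.group(2))  # larger than the Unicode maximum, 0x10FFFF

try:
    chr(codepoint)
    overflowed = False
except (ValueError, OverflowError):
    # Code points above 0x10FFFF cannot be converted to a character.
    overflowed = True

print(matched_text, codepoint, overflowed)
```

Because the ";" is optional, the tail of the query string is taken to be a numeric character reference even though it is ordinary URL text.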

    For our own purposes, I rewrote "htmldecode" to require a
    sequence ending in ";", which means some bogus HTML escapes won't
    be recognized, but correct HTML will be processed correctly.
    What's general opinion of this behavior? Too strict, or OK?
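
A strict decoder along these lines might look like the sketch below. It is not the actual rewrite, only an illustration under stated assumptions: Python 3, the standard `html.entities.name2codepoint` table for named escapes, and hypothetical names `strict_charrefpat` and `strict_htmldecode`.

```python
import re
from html.entities import name2codepoint

# Hypothetical strict pattern: same alternatives, but the ";" is mandatory.
strict_charrefpat = re.compile(r'&(#(\d+|x[\da-fA-F]+)|[\w.:-]+);')


def strict_htmldecode(text):
    """Decode only escapes that end in ";" and leave everything else alone."""
    def replace(match):
        body, number = match.group(1), match.group(2)
        try:
            if number is not None:
                # "#60" is decimal, "#x3C" is hexadecimal.
                if number.startswith('x'):
                    code = int(number[1:], 16)
                else:
                    code = int(number)
            else:
                code = name2codepoint[body]
            return chr(code)
        except (KeyError, ValueError, OverflowError):
            # Unknown name or out-of-range number: keep the original text.
            return match.group(0)
    return strict_charrefpat.sub(replace, text)


print(strict_htmldecode("a &amp; b &lt c"))
print(strict_htmldecode("http://www.example.com?foo=1&#1234567"))
```

With this version "&lt" without the semicolon passes through unchanged, and so does the URL, which is the trade-off being asked about.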

    John Nagle
    SiteTruth
     
    John Nagle, Jan 7, 2008
    #1

  2. Ben Finney

    Ben Finney Guest

    John Nagle writes:

    > For our own purposes, I rewrote "htmldecode" to require a sequence
    > ending in ";", which means some bogus HTML escapes won't be
    > recognized, but correct HTML will be processed correctly. What's
    > general opinion of this behavior? Too strict, or OK?


    I think it's fine. In the face of ambiguity (and deviation from the
    published standards), refuse the temptation to guess.

    More specifically, I don't see any reason to contort your code to
    understand some non-entity sequence that would be flagged as invalid
    by HTML validator tools.

    --
    \ "Those who write software only for pay should go hurt some |
    `\ other field." -- Erik Naggum, in _gnu.misc.discuss_ |
    _o__) |
    Ben Finney
     
    Ben Finney, Jan 7, 2008
    #2

  3. On Mon, 07 Jan 2008 12:25:07 +1100, Ben Finney wrote:

    > John Nagle writes:
    >
    >> For our own purposes, I rewrote "htmldecode" to require a sequence
    >> ending in ";", which means some bogus HTML escapes won't be recognized,
    >> but correct HTML will be processed correctly. What's general opinion of
    >> this behavior? Too strict, or OK?

    >
    > I think it's fine. In the face of ambiguity (and deviation from the
    > published standards), refuse the temptation to guess.


    That's good advice for a library function. But...

    > More specifically, I don't see any reason to contort your code to
    > understand some non-entity sequence that would be flagged as invalid by
    > HTML validator tools.


    ... it is questionable advice for a program which is designed to make
    sense of invalid HTML.

    Like it or not, real-world applications sometimes have to work with bad
    data. I think we can all agree that the world would have been better off
    if the major browsers had followed your advice, but given that they do
    not, and thus leave open the opportunity for websites to exist with
    invalid HTML, John is left in the painful position of having to write
    code that has to make sense of invalid HTML.

    I think only John can really answer his own question. What are the
    consequences of false positives versus false negatives? If it raises an
    exception, can he shunt the code to another function and use some
    heuristics to make sense of it, or is it "game over, another site can't
    be analyzed"?
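
One way to picture the "shunt it to another function" option is sketched below. It assumes Python 3.4 or later, where the standard library provides `html.unescape`, which follows HTML5's browser-style rules; `naive_decode_numeric` and `decode_with_fallback` are hypothetical names, and the naive decoder is only a stand-in for code that raises on an out-of-range number.

```python
import html
import re

# The lenient pattern from the original post (";" optional).
charrefpat = re.compile(r'&(#(\d+|x[\da-fA-F]+)|[\w.:-]+);?')


def naive_decode_numeric(text):
    """Stand-in decoder: converts numeric escapes, raises if out of range."""
    def replace(match):
        number = match.group(2)
        if number is None:
            return match.group(0)  # named escapes are left alone here
        if number.startswith('x'):
            code = int(number[1:], 16)
        else:
            code = int(number)
        return chr(code)           # ValueError when code > 0x10FFFF
    return charrefpat.sub(replace, text)


def decode_with_fallback(text):
    """Try the primary decoder; on failure, hand the text to a fallback."""
    try:
        return naive_decode_numeric(text)
    except (ValueError, OverflowError):
        return html.unescape(text)


print(decode_with_fallback("x &#60; y"))
print(repr(decode_with_fallback("http://www.example.com?foo=1&#1234567")))
```

Here the fallback turns the out-of-range reference into U+FFFD instead of aborting. Whether that is better than leaving the text untouched depends on the cost of false positives, which is the question only John can answer. Note that `html.unescape` also accepts "&amp", "&gt" and "&lt" without the semicolon.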



    --
    Steven
     
    Steven D'Aprano, Jan 7, 2008
    #3
  4. Paddy

    Paddy Guest

    On Jan 7, 1:09 am, John Nagle wrote:
    > Another in our ongoing series on "Parsing Real-World HTML".
    >
    > It's wrong, of course. But Firefox will accept as HTML escapes
    >
    > &amp
    > &gt
    > &lt
    >
    > as well as the correct forms
    >
    > &amp;
    > &gt;
    > &lt;
    >
    > To be "compatible", a Python screen scraper at
    >
    > http://zesty.ca/python/scrape.py
    >
    > has a function "htmldecode", which is supposed to recognize
    > HTML escapes and generate Unicode. (Why isn't this a standard
    > Python library function? Its inverse is available.)
    >
    > This uses the regular expression
    >
    > charrefpat = re.compile(r'&(#(\d+|x[\da-fA-F]+)|[\w.:-]+);?',re.UNICODE)
    >
    > to recognize HTML escapes.
    >
    > Note the ";?", which makes the closing ";" optional.
    >
    > This seems fine until we hit something valid but unusual like
    >
    > http://www.example.com?foo=1&#1234567
    >
    > for which "htmldecode" tries to convert "1234567" into
    > a Unicode character with that decimal number, and gets a
    > Unicode overflow.
    >
    > For our own purposes, I rewrote "htmldecode" to require a
    > sequence ending in ";", which means some bogus HTML escapes won't
    > be recognized, but correct HTML will be processed correctly.
    > What's general opinion of this behavior? Too strict, or OK?
    >
    > John Nagle
    > SiteTruth


    Maybe htmltidy could help:
    http://tidy.sourceforge.net/
    ?
     
    Paddy, Jan 7, 2008
    #4
