Should HTML entity translation accept "&amp"?

John Nagle

Another in our ongoing series on "Parsing Real-World HTML".

It's wrong, of course. But Firefox will accept as HTML escapes

&amp
&gt
&lt

as well as the correct forms

&amp;
&gt;
&lt;

To be "compatible", a Python screen scraper at

http://zesty.ca/python/scrape.py

has a function "htmldecode", which is supposed to recognize
HTML escapes and generate Unicode. (Why isn't this a standard
Python library function? Its inverse is available.)

This uses the regular expression

charrefpat = re.compile(r'&(#(\d+|x[\da-fA-F]+)|[\w.:-]+);?',re.UNICODE)

to recognize HTML escapes.

Note the ";?", which makes the closing ";" optional.

This seems fine until we hit something valid but unusual like

http://www.example.com?foo=1&#1234567

for which "htmldecode" tries to convert "1234567" into
a Unicode character with that decimal number, and fails with an
out-of-range error, because 1234567 is above the highest Unicode
code point, 0x10FFFF (1114111).
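
To make the failure concrete, here is a small sketch that runs the pattern quoted above against that URL. It assumes Python 3, where chr() plays the role that unichr() played in Python 2; the variable names are only for illustration.

```python
import re

# Pattern quoted above from scrape.py; the trailing ";?" makes the semicolon optional.
charrefpat = re.compile(r'&(#(\d+|x[\da-fA-F]+)|[\w.:-]+);?', re.UNICODE)

url = "http://www.example.com?foo=1&#1234567"

match = charrefpat.search(url)
matched_text = match.group(0)   # the whole would-be escape, with no ";" present
digits = match.group(2)         # the decimal digits after "&#"

try:
    chr(int(digits))            # 1234567 is above the largest code point, 0x10FFFF
    overflowed = False
except (ValueError, OverflowError):
    overflowed = True
```

Because the semicolon is optional, the query-string fragment is treated as a numeric character reference, and the conversion is what blows up.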

For our own purposes, I rewrote "htmldecode" to require a
sequence ending in ";", which means some bogus HTML escapes won't
be recognized, but correct HTML will be processed correctly.
What's the general opinion of this behavior? Too strict, or OK?
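
For illustration only, a decoder with that stricter rule might look roughly like the sketch below. This is not the actual rewritten code: the name htmldecode_strict, the exact pattern, and the use of html.entities.name2codepoint (Python 3) are all assumptions. Anything that does not decode cleanly is left in the text untouched.

```python
import re
from html.entities import name2codepoint

# Like the pattern above, but the closing ";" is mandatory (assumed variant).
strict_charref = re.compile(r'&(#(\d+|[xX][\da-fA-F]+)|\w+);')

def htmldecode_strict(text):
    """Decode only well-formed, ';'-terminated escapes (hypothetical helper)."""
    def replace(match):
        body = match.group(1)
        try:
            if body.startswith('#'):
                num = match.group(2)
                if num[0] in 'xX':
                    code = int(num[1:], 16)   # hexadecimal reference
                else:
                    code = int(num)           # decimal reference
                return chr(code)
            return chr(name2codepoint[body])  # named entity such as "amp"
        except (KeyError, ValueError, OverflowError):
            # Unknown name or out-of-range number: leave the text as it was.
            return match.group(0)
    return strict_charref.sub(replace, text)
```

With this rule, "&amp;" decodes to "&", while "&amp" and the "&#1234567" in the URL above pass through unchanged. For comparison, html.unescape in the Python 3 standard library follows the HTML5 rules and does accept "&amp" without the semicolon.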

John Nagle
SiteTruth
 
Ben Finney

John Nagle said:
For our own purposes, I rewrote "htmldecode" to require a sequence
ending in ";", which means some bogus HTML escapes won't be
recognized, but correct HTML will be processed correctly. What's
the general opinion of this behavior? Too strict, or OK?

I think it's fine. In the face of ambiguity (and deviation from the
published standards), refuse the temptation to guess.

More specifically, I don't see any reason to contort your code to
understand some non-entity sequence that would be flagged as invalid
by HTML validator tools.
 
Steven D'Aprano

Ben Finney said:
I think it's fine. In the face of ambiguity (and deviation from the
published standards), refuse the temptation to guess.

That's good advice for a library function. But...

Ben Finney said:
More specifically, I don't see any reason to contort your code to
understand some non-entity sequence that would be flagged as invalid by
HTML validator tools.

... it is questionable advice for a program which is designed to make
sense of invalid HTML.

Like it or not, real-world applications sometimes have to work with bad
data. I think we can all agree that the world would have been better off
if the major browsers had followed your advice, but given that they do
not, and thus leave open the opportunity for websites to exist with
invalid HTML, John is left in the painful position of having to write
code that has to make sense of invalid HTML.

I think only John can really answer his own question. What are the
consequences of false positives versus false negatives? If it raises an
exception, can he shunt the code to another function and use some
heuristics to make sense of it, or is it "game over, another site can't
be analyzed"?
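
One way to picture the "shunt it to another function" option is sketched below, assuming Python 3. The lenient pass uses the pattern quoted earlier in the thread; if it raises, a more cautious pass takes over. The function names and the SEMICOLON_OPTIONAL whitelist are hypothetical, and the heuristic (tolerate a missing ";" only for a few very common names, never for numeric references) is just one possible choice.

```python
import re
from html.entities import name2codepoint

# Lenient pattern quoted earlier in the thread (semicolon optional).
charrefpat = re.compile(r'&(#(\d+|x[\da-fA-F]+)|[\w.:-]+);?', re.UNICODE)

# Hypothetical whitelist of names tolerated without a closing ";".
SEMICOLON_OPTIONAL = {'amp', 'lt', 'gt', 'quot'}

def _numeric(num):
    """Convert the digits of a numeric reference to a character."""
    return chr(int(num[1:], 16) if num.startswith('x') else int(num))

def decode_lenient(text):
    """Decode everything the pattern matches; may raise on bogus input."""
    def replace(match):
        if match.group(2) is not None:
            return _numeric(match.group(2))
        return chr(name2codepoint[match.group(1)])
    return charrefpat.sub(replace, text)

def decode_cautious(text):
    """Heuristic pass: never raises, leaves doubtful sequences alone."""
    def replace(match):
        terminated = match.group(0).endswith(';')
        try:
            if match.group(2) is not None:
                if terminated:
                    return _numeric(match.group(2))
            elif terminated or match.group(1) in SEMICOLON_OPTIONAL:
                return chr(name2codepoint[match.group(1)])
        except (KeyError, ValueError, OverflowError):
            pass
        return match.group(0)
    return charrefpat.sub(replace, text)

def decode(text):
    """Try the lenient decoder first; fall back instead of giving up."""
    try:
        return decode_lenient(text)
    except (KeyError, ValueError, OverflowError):
        return decode_cautious(text)
```

Whether the fallback is worth having depends on the answer to the question above: if a wrongly decoded character is worse than an undecoded one, the cautious pass could simply be used on its own.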
 
Paddy

Maybe HTML Tidy could help?
http://tidy.sourceforge.net/
 
