BeautifulSoup bug when ">>>" found in attribute value

John Nagle

This, which is from a real web site, went into BeautifulSoup:

<param name="movie" value="/images/offersBanners/sw04.swf?binfot=We offer
fantastic rates for selected weeks or days!!&blinkt=Click here

And this came out, via prettify:

<addresssnippet siteurl="http%3A//apartmentsapart.com"
url="http%3A//www.apartmentsapart.com/Europe/Spain/Madrid/FAQ">
<param name="movie" value="/images/offersBanners/sw04.swf?binfot=We offer
fantastic rates for selected weeks or days!!&amp;blinkt=Click here
&gt;&gt;&gt;&amp;linkurl=/Europe/Spain/Madrid/Apartments/Offer/2408"></param>

BeautifulSoup seems to have become confused by the ">>>" within
a quoted attribute value. It first parsed it right, but then stuck
in an extra, totally bogus line. Note the entity "&linkurl;", which
appears nowhere in the original. It looks like code to handle a missing
quote mark did the wrong thing.

John Nagle
 
Duncan Booth

John Nagle said:
And this came out, via prettify:

<addresssnippet siteurl="http%3A//apartmentsapart.com"
url="http%3A//www.apartmentsapart.com/Europe/Spain/Madrid/FAQ">
<param name="movie"
value="/images/offersBanners/sw04.swf?binfot=We offer
fantastic rates for selected weeks or days!!&amp;blinkt=Click here
&gt;&gt;&gt;&amp;linkurl=/Europe/Spain/Madrid/Apartments/Offer/2408">
</param>

BeautifulSoup seems to have become confused by the ">>>" within
a quoted attribute value. It first parsed it right, but then stuck
in an extra, totally bogus line. Note the entity "&linkurl;", which
appears nowhere in the original. It looks like code to handle a
missing quote mark did the wrong thing.

I don't think I would quibble with what BeautifulSoup extracted from that
mess. The input isn't valid HTML so any output has to be guessing at what
was meant. A lot of code for parsing html would assume that there was a
quote missing and the tag was terminated by the first '>'. IE and Firefox
seem to assume that the '>' is allowed inside the attribute. BeautifulSoup
seems to have given you the best of both worlds: the attribute is parsed to
the closing quote, but the tag itself ends at the first '>'.
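
To make the two strategies concrete, here is a minimal sketch comparing a scanner that ends the tag at the first '>' with one that skips over quoted attribute values. The sample markup and both regular expressions are stand-ins written for illustration; they are not the code BeautifulSoup or any browser uses.

```python
import re

# Strategy 1: the tag ends at the first '>' (assumes a quote is missing).
NAIVE_TAG = re.compile(r"<[^>]*>")

# Strategy 2: '>' is allowed inside single- or double-quoted values.
QUOTE_AWARE_TAG = re.compile(r"""<(?:[^>"']|"[^"]*"|'[^']*')*>""")

# Hypothetical, shortened stand-in for the markup in this thread.
SAMPLE = (
    '<param name="movie" '
    'value="banner.swf?label=Click here >>>&next=/offer" />trailing text'
)

naive = NAIVE_TAG.match(SAMPLE).group(0)
aware = QUOTE_AWARE_TAG.match(SAMPLE).group(0)

print(naive)  # stops inside the attribute value
print(aware)  # runs to the real end of the tag
```

A parser that mixes the two, taking the attribute value from the second strategy and the end of the tag from the first, would leave the rest of the value behind as stray text, which matches the symptom described above.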

As for inserting a semicolon after linkurl, I think you'll find it is just
being nice and cleaning up an unterminated entity. Browsers (or at least
IE) will often accept entities without the terminating semicolon, so that's
a common problem in badly formed html that BeautifulSoup can fix.
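
As a rough illustration of lenient entity handling, the sketch below uses html.unescape from Python's standard library, which follows the HTML5 decoding rules. It is not what BeautifulSoup does internally; it only shows that a known entity name is decoded without its semicolon, while an unknown name such as "linkurl" is left as plain text. The sample strings are made up for the example.

```python
from html import unescape

# A known entity name is decoded even without the terminating semicolon.
known_without_semicolon = unescape("Tom &amp Jerry")

# "linkurl" is not an entity name, so the text is left untouched.
unknown_name = unescape("&linkurl=/offer")

# Properly terminated entities decode as usual.
terminated = unescape("Click here &gt;&gt;&gt;")

print(known_without_semicolon)
print(unknown_name)
print(terminated)
```

Under these rules "&linkurl" would stay a literal ampersand followed by text rather than becoming an entity reference.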
 
John Nagle

Duncan said:
I don't think I would quibble with what BeautifulSoup extracted from that
mess. The input isn't valid HTML so any output has to be guessing at what
was meant. A lot of code for parsing html would assume that there was a
quote missing and the tag was terminated by the first '>'. IE and Firefox
seem to assume that the '>' is allowed inside the attribute. BeautifulSoup
seems to have given you the best of both worlds: the attribute is parsed to
the closing quote, but the tag itself ends at the first '>'.

As for inserting a semicolon after linkurl, I think you'll find it is just
being nice and cleaning up an unterminated entity. Browsers (or at least
IE) will often accept entities without the terminating semicolon, so that's
a common problem in badly formed html that BeautifulSoup can fix.

It's worse than that. Look at the last line of BeautifulSoup output:

&linkurl;=/Europe/Spain/Madrid/Apartments/Offer/2408" />

That "/>" doesn't match anything. We're outside a tag at that point.
And it was introduced by BeautifulSoup. That's both wrong and
puzzling; given that this was created from a parse tree, that type
of error shouldn't ever happen. This looks like the parser didn't
delete a string item after deciding it was actually part of a tag.

John Nagle
 
Duncan Booth

John Nagle said:
It's worse than that. Look at the last line of BeautifulSoup
output:

&linkurl;=/Europe/Spain/Madrid/Apartments/Offer/2408" />

That "/>" doesn't match anything. We're outside a tag at that point.
And it was introduced by BeautifulSoup. That's both wrong and
puzzling; given that this was created from a parse tree, that type
of error shouldn't ever happen. This looks like the parser didn't
delete a string item after deciding it was actually part of a tag.

The /> was in the original input that you gave it:

<param name="movie" value="/images/offersBanners/sw04.swf?binfot=We
offer fantastic rates for selected weeks or days!!&blinkt=Click here

You don't actually *have* to escape > when it appears in html.

As I said before, it looks like BeautifulSoup decided that the tag ended
at the first > although it took text beyond that up to the closing " as
the value of the attribute. The remaining text was then simply treated
as text content of the unclosed param tag. Finally it inserted a
</param> to close the unclosed param tag.

.... some time later ...

Ok, it looks like I was wrong and this is a bug in BeautifulSoup: it
seems that it *is* legal to have an unescaped > in an attribute value,
although it should (not must) be escaped:

From the HTML 4.01 spec:
Similarly, authors should use "&gt;" (ASCII decimal 62) in text
instead of ">" to avoid problems with older user agents that
incorrectly perceive this as the end of a tag (tag close delimiter)
when it appears in quoted attribute values.
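
For comparison, a small check with the html.parser module from Python 3's standard library (not BeautifulSoup) suggests that a tolerant parser can keep an unescaped ">" inside a quoted attribute value. The TagCollector class, the parse_tags helper, and the sample markup are hypothetical names introduced for this sketch.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Record start tags with their attributes, plus any non-blank text."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        self.tags.append((tag, dict(attrs)))

    def handle_data(self, data):
        if data.strip():
            self.text.append(data)


def parse_tags(markup):
    parser = TagCollector()
    parser.feed(markup)
    parser.close()  # flush any buffered text
    return parser.tags, parser.text


# Hypothetical stand-in for the tag discussed in this thread.
SAMPLE = '<param name="movie" value="banner.swf?label=Click here >>>" />'
tags, text = parse_tags(SAMPLE)
print(tags)
print(text)
```

If the parser had ended the tag at the first ">", part of the attribute value would show up in the collected text instead.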

Thank you, it looks like I just learned something new.

Mind you, the sentence before that says 'should' for quoting < characters
which is just plain silly.
 
Anne van Kesteren

Duncan Booth said:
The /> was in the original input that you gave it:

<param name="movie" value="/images/offersBanners/sw04.swf?binfot=We
offer fantastic rates for selected weeks or days!!&blinkt=Click here

You don't actually *have* to escape > when it appears in html.

You don't have to escape it in XML either, except when it's preceded by
]].
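
The XML rule can be checked with a short sketch using xml.etree.ElementTree from Python's standard library. The is_well_formed helper and the sample documents are made up for this example.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the parser accepts xml_text, False on a parse error."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


# A bare '>' is accepted in attribute values and in text.
bare_gt = is_well_formed('<a title="x > y">1 > 0</a>')

# The sequence ']]>' is not allowed in text content.
cdata_end_in_text = is_well_formed("<a>]]></a>")

# Escaping the '>' makes the same content acceptable.
escaped = is_well_formed("<a>]]&gt;</a>")

print(bare_gt, cdata_end_in_text, escaped)
```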

As I said before, it looks like BeautifulSoup decided that the tag ended
at the first > although it took text beyond that up to the closing " as
the value of the attribute. The remaining text was then simply treated
as text content of the unclosed param tag. Finally it inserted a
</param> to close the unclosed param tag.

The param element doesn't have a closing tag.

http://www.w3.org/TR/html401/struct/objects.html#h-13.3.2
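
As a sketch of what this means for a serializer, the code below writes no end tag for elements that HTML 4.01 declares EMPTY. The element list is taken from the HTML 4.01 DTD; the function names are hypothetical, and this is not how BeautifulSoup's prettify is implemented.

```python
import html

# Elements declared EMPTY in HTML 4.01: a start tag only, never an end tag.
EMPTY_ELEMENTS = frozenset({
    "area", "base", "basefont", "br", "col", "frame", "hr",
    "img", "input", "isindex", "link", "meta", "param",
})


def render_start_tag(name, attrs):
    """Build a start tag, escaping attribute values."""
    parts = [name.lower()]
    for key, value in attrs.items():
        parts.append('%s="%s"' % (key, html.escape(value, quote=True)))
    return "<" + " ".join(parts) + ">"


def render_element(name, attrs, content=""):
    """Render an element, omitting the end tag for EMPTY elements."""
    start = render_start_tag(name, attrs)
    if name.lower() in EMPTY_ELEMENTS:
        return start
    return start + content + "</%s>" % name.lower()


print(render_element("param", {"name": "movie", "value": "a.swf"}))
print(render_element("p", {}, "hi"))
```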

Mind you, the sentence before that says 'should' for quoting < characters
which is just plain silly.

For quoted attribute values it isn't silly at all. It's actually part
of how HTML works.
 
Duncan Booth

Anne van Kesteren said:
For quoted attribute values it isn't silly at all. It's actually part
of how HTML works.

Yes, but the sentence I was complaining about isn't talking specifically
about attribute values. It says:
Authors wishing to put the "<" character in text should use "&lt;"
(ASCII decimal 60) to avoid possible confusion with the beginning of a
tag (start tag open delimiter).

Not requiring "<" to be quoted in text is, IMHO, silly. However I fully
admit that all the browsers I tried will happily accept < followed by a
space character as not starting a tag.
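
Python 3's html.parser (standard library, not BeautifulSoup) behaves the same way as those browsers, as far as I can tell. In the sketch below, the TextAndTags class and the split_markup helper are hypothetical names, and the sample strings are made up.

```python
from html.parser import HTMLParser


class TextAndTags(HTMLParser):
    """Collect text chunks and start-tag names separately."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_data(self, data):
        self.chunks.append(data)


def split_markup(markup):
    parser = TextAndTags()
    parser.feed(markup)
    parser.close()  # flush any buffered text
    # Text may arrive in several chunks, so join them back together.
    return "".join(parser.chunks), parser.tags


# '<' followed by a space is kept as text, not treated as a tag.
text, tags = split_markup("if a < b then stop")

# '<' followed by a letter does start a tag.
text2, tags2 = split_markup("if a <b>then</b> stop")

print(repr(text), tags)
print(repr(text2), tags2)
```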
 
