Some <head> clauses cause BeautifulSoup to choke?

Frank Stutzman · Nov 19, 2007

I've got a simple script that looks like this:
---------------------------------------------------
import BeautifulSoup, urllib

ifile = urllib.urlopen("http://www.naco.faa.gov/digital_tpp_search.asp?fldIdent=klax&fld_ident_type=ICAO&ver=0711&bnSubmit=Complete+Search").read()

soup=BeautifulSoup.BeautifulSoup(ifile)
print soup.prettify()
----------------------------------------------------

and all I get out of it is garbage. Other similar URLs from the same site
work fine (use
http://www.naco.faa.gov/digital_tpp_search.asp?fldIdent=klax&fld_ident_type=ICAO&ver=0711&bnSubmit=Complete+Search
as one example).

I did some poking and prodding, and it seems that there is something in the
<head> clause that is causing the problem. Heck if I can see what it is.

I'm new to BeautifulSoup (heck, I'm new to Python). If I'm doing something
dumb, you don't need to be gentle.

Chris Mellon · Nov 19, 2007

You have the same URL as both your good and bad example.

Marc Christiansen · Nov 19, 2007

Frank Stutzman said:
I've got a simple script that looks like this:
---------------------------------------------------
import BeautifulSoup, urllib

ifile = urllib.urlopen("http://www.naco.faa.gov/digital_tpp_search.asp?fldIdent=klax&fld_ident_type=ICAO&ver=0711&bnSubmit=Complete+Search").read()

soup=BeautifulSoup.BeautifulSoup(ifile)
print soup.prettify()

Same for me.

I did some poking and prodding, and it seems that there is something in the
<head> clause that is causing the problem. Heck if I can see what it is.

The problem is this line:
<META http-equiv="Content-Type" content="text/html; charset=UTF-16">

Which is wrong. The content is not utf-16 encoded. The line after that
declares the charset as utf-8, which is correct, although ascii would be
ok too.

If I save the search result and remove this line, everything works. So,
you could:
- ignore problematic pages
- save and edit them, then reparse them (not always practical)
- use the fromEncoding argument:
soup=BeautifulSoup.BeautifulSoup(ifile, fromEncoding="utf-8")
(or 'ascii'). Of course, this only works if you guess/predict the
encoding correctly, which can be difficult. Since BeautifulSoup uses
"an encoding discovered in the document itself" (quote from
<http://www.crummy.com/software/BeautifulSoup/documentation.html#Beautiful Soup Gives You Unicode, Dammit>)
when the encoding you supply does not work, using fromEncoding="ascii"
should not hurt too much. But this being Usenet, I'm sure someone will
tell me that I'm wrong and there is some weird 7bit encoding in use
somewhere on the web...
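
The fromEncoding idea above can be sketched without BeautifulSoup. What follows is a minimal Python 3 standard-library sketch, not the BeautifulSoup 3 API used in this thread. The helper name decode_with_override, the fallback list, and the sample bytes are all made up for illustration. It tries a caller-supplied encoding first and moves on to other candidates only when strict decoding fails, which is roughly the behaviour the quoted documentation describes.

```python
def decode_with_override(raw, override, fallbacks=("utf-8", "latin-1")):
    """Return (text, encoding_used); try `override` first, then the fallbacks."""
    for encoding in (override,) + tuple(fallbacks):
        try:
            # Strict decoding raises UnicodeDecodeError on bytes that do not fit.
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError("no candidate encoding could decode the data")


# Hypothetical page: it claims UTF-16 in a meta tag but is really plain ASCII.
page = (b'<html><head>'
        b'<META http-equiv="Content-Type" content="text/html; charset=UTF-16">'
        b'</head><body>KLAX</body></html>')

text, used = decode_with_override(page, "ascii")
print(used)

# Non-ASCII bytes make the "ascii" guess fail, so a fallback is used instead.
text2, used2 = decode_with_override("caf\u00e9".encode("utf-8"), "ascii")
print(used2)
```

Guessing "ascii" is cheap in this sketch because a wrong guess fails loudly and the next candidate gets its turn. That is the same reasoning as the fromEncoding="ascii" suggestion above.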

I'm new to BeautifulSoup (heck, I'm new to Python). If I'm doing something
dumb, you don't need to be gentle.

No, you did nothing dumb. The server sent you broken content.

Ciao
Marc

Duncan Booth · Nov 19, 2007

Frank Stutzman said:
I did some poking and prodding, and it seems that there is something in the
<head> clause that is causing the problem. Heck if I can see what it is.

Maybe BeautifulSoup believes the incorrect encoding in the meta tags?

Chris Mellon · Nov 19, 2007

No, you did nothing dumb. The server sent you broken content.

Correct. However, this is the sort of real-life broken HTML that BS is
tasked to handle. It looks like the major browsers handle this by using
the last content type (header or meta tag) encountered before other
content. Right now, it looks like BS has a number of fallback
mechanisms, but its meta-tag fallback only looks at the first tag.

Posting a feature request or whatever through whatever mechanism BS
uses to handle this sort of thing would probably be nice.
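
A "last declaration wins" rule like the one described above could be sketched as follows. This is a Python 3 standard-library illustration only. The regular expression is deliberately loose, the function names are hypothetical, and the second META line in the sample is assumed (the thread says a utf-8 declaration follows the UTF-16 one but does not show it). Whether browsers really resolve the conflict this way is the observation above, not something the sketch verifies.

```python
import re

# Loose pattern: a charset=... value anywhere inside a <meta ...> tag.
META_CHARSET = re.compile(
    rb'''<meta[^>]*charset\s*=\s*["']?([A-Za-z0-9_.:-]+)''', re.IGNORECASE)


def declared_charsets(raw):
    """Return every charset named in a <meta> tag, in document order."""
    return [m.group(1).decode("ascii").lower()
            for m in META_CHARSET.finditer(raw)]


def last_declared_charset(raw):
    """Return the last meta charset, or None if the page declares none."""
    charsets = declared_charsets(raw)
    return charsets[-1] if charsets else None


# Hypothetical head section with two conflicting declarations.
page = (b'<html><head>\n'
        b'<META http-equiv="Content-Type" content="text/html; charset=UTF-16">\n'
        b'<META http-equiv="Content-Type" content="text/html; charset=utf-8">\n'
        b'</head><body>KLAX</body></html>')

print(declared_charsets(page))
print(last_declared_charset(page))
```

The result could then be handed to the parser as an explicit encoding, for example through the fromEncoding argument mentioned earlier in the thread.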

Frank Stutzman · Nov 20, 2007

Some kind person replied:

You have the same URL as both your good and bad example.

Oops, dang Emacs cut buffer (yeah, that's what did it). A working
example URL would be (again, mind the wrap):

http://www.naco.faa.gov/digital_tpp...t_type=ICAO&ver=0711&bnSubmit=Complete+Search

Marc Christiansen said:
The problem is this line:
<META http-equiv="Content-Type" content="text/html; charset=UTF-16">

Which is wrong. The content is not utf-16 encoded. The line after that
declares the charset as utf-8, which is correct, although ascii would be
ok too.

Ah, er, hmmm. Take a look at the 'good' URL I mentioned above. You will
notice that it has the same utf-16, utf-8 encoding that the 'bad' one
has. And BeautifulSoup works great on it.

I'm still scratchin' ma head...

If I save the search result and remove this line, everything works. So,
you could:
- ignore problematic pages

Not an option for my application.

- save and edit them, then reparse them (not always practical)

That's what I'm doing at the moment during my development. Sure
seems inelegant.

- use the fromEncoding argument:
soup=BeautifulSoup.BeautifulSoup(ifile, fromEncoding="utf-8")
(or 'ascii'). Of course, this only works if you guess/predict the
encoding correctly, which can be difficult. Since BeautifulSoup uses
"an encoding discovered in the document itself" (quote from
<http://www.crummy.com/software/BeautifulSoup/documentation.html#Beautiful Soup Gives You Unicode, Dammit>)

I'll try that. For what I'm doing it ought to be safe enough.

Much appreciate all the comments so far.

Marc Christiansen · Nov 20, 2007

Frank Stutzman said:
Some kind person replied:

Oops, dang Emacs cut buffer (yeah, that's what did it). A working
example URL would be (again, mind the wrap):

http://www.naco.faa.gov/digital_tpp...t_type=ICAO&ver=0711&bnSubmit=Complete+Search

Ah, er, hmmm. Take a look at the 'good' URL I mentioned above. You will
notice that it has the same utf-16, utf-8 encoding that the 'bad' one
has. And BeautifulSoup works great on it.

I'm still scratchin' ma head...

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.5/encodings/utf_16.py", line 16, in decode
    return codecs.utf_16_decode(input, errors, True)
UnicodeDecodeError: 'utf16' codec can't decode byte 0x0a in position 41176: truncated data

bad contains the content of the 'bad' url, good the content of the
'good' url. Because of the UnicodeDecodeError, BeautifulSoup tries
either the next encoding or the next step from the url below.
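
One plausible reading of the traceback above is that the two pages differ in byte length. Plain ASCII data of even length decodes as UTF-16 without any error (into nonsense characters), while odd-length data raises UnicodeDecodeError, so only the second case pushes a parser on to its next guess. The Python 3 sketch below illustrates this with made-up byte strings. The names bad and good follow the text above, but the contents are invented, since the real pages are not reproduced in this thread.

```python
def try_utf16(raw):
    """Return the decoded text, or None if strict UTF-16 decoding fails."""
    try:
        return raw.decode("utf-16")
    except UnicodeDecodeError:
        return None


# Invented stand-ins for the two pages: same markup, lengths differ by one byte.
bad = b"<html><body>KLAX</body></html>"      # 30 bytes, even length
good = b"<html><body>KLAX</body></html>\n"   # 31 bytes, odd length

# Even length: decoding "succeeds" but pairs up bytes into unrelated characters.
print(len(bad), try_utf16(bad) is not None)

# Odd length: the last byte has no partner, so decoding fails outright.
print(len(good), try_utf16(good) is None)
```

If this is what happens with the real pages, the 'bad' page is simply the one unlucky enough to have an even number of bytes, and the wrong UTF-16 declaration is accepted for it without complaint.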

Much appreciate all the comments so far.

You're welcome.

Marc
