HTMLParser problem

Valkyrie

I fed some data to an HTML parser that I built myself on top of HTMLParser.
Here is the beginning of that data:
=====
<!doctype html public "-//W3C//DTD HTML 4.01//EN"
"http://www.w3.org/TR/html4/strict.dtd">
<html><head><meta http-equiv="content-type" content="text/html;
charset=ISO-8859-1"><link rel="stylesheet"
href="http://us.i1.yimg.com/us.yimg.com/lib/s/yschx_040927.css" type="text/css"
media="all">

<![if !IE]>
....
=====
However, when "<![if !IE]>" is encountered, handle_data() is called but not
handle_decl() (I made handle_decl() print something to the screen, and nothing
appeared), and the following error is displayed:

.......
HTMLParser.HTMLParseError: unknown declaration: 'if !IE', at line 4, column 1

May I ask why this error is raised? Thanks in advance!
 
Richard Brodie

Valkyrie said:
<![if !IE]>

HTMLParser.HTMLParseError: unknown declaration: 'if !IE', at line 4, column 1

May I ask why this error is raised?

HTMLParser isn't very forgiving of bad HTML: feed it syntactically invalid HTML
and it tends to give you errors. That includes Microsoft-only extensions like <![if !IE]>.
Unless you know your sources are valid, it may be best to use one of
the forgiving parsers: Beautiful Soup, uTidylib, libxml, etc. (see many past discussions).
Uche's article may be of interest: http://www.xml.com/pub/a/2004/09/08/pyxml.html
 
Valkyrie

Thank you. Does that mean there is no way to deal with it using simple built-in
Python functions?


 
Dirk-Jan C. Binnema

Valkyrie said:
Thank you. Does that mean there is no way to deal with it using simple built-in
Python functions?

Well, you can always preprocess your HTML by replacing dubious
constructs. It's ugly but it works. You might even do something smart
and put them back after processing.
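
For what it's worth, here is a minimal sketch of that "replace, then put back" idea. It is written against Python 3's html.parser rather than the older HTMLParser module in the traceback, and the names MARKER, COMMENT, markers_to_comments, comments_to_markers and Recorder are made up for illustration. It assumes the only offending constructs are <![if ...]>, <![else]> and <![endif]>, and rewrites each one as an ordinary comment, which the parser reports through handle_comment():

```python
import re
from html.parser import HTMLParser

# Assumption: only these Microsoft conditional markers need handling.
MARKER = re.compile(r"<!\[(if\b[^\]]*|else|endif)\]>", re.IGNORECASE)
COMMENT = re.compile(r"<!--\[(if\b[^\]]*|else|endif)\]-->", re.IGNORECASE)


def markers_to_comments(html):
    """Rewrite <![if !IE]> as <!--[if !IE]--> so the parser sees a comment."""
    return MARKER.sub(r"<!--[\1]-->", html)


def comments_to_markers(html):
    """Undo markers_to_comments() after processing."""
    return COMMENT.sub(r"<![\1]>", html)


class Recorder(HTMLParser):
    """Hypothetical parser that records start tags and comments in order."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("tag", tag))

    def handle_comment(self, data):
        self.events.append(("comment", data))


source = '<head><![if !IE]><link rel="stylesheet" href="a.css"><![endif]></head>'
rewritten = markers_to_comments(source)

recorder = Recorder()
recorder.feed(rewritten)
recorder.close()

restored = comments_to_markers(rewritten)
print(recorder.events)
```

The markers survive as comments, so your handler can still see where the conditional block starts and ends, and comments_to_markers() gives back the original text if you need to write the page out again.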

Good luck,
Dirk.

-------------------------------------
Dirk-Jan C. Binnema (djcb)
mail: djcb [at] djcbsoftware [dot] nl
blog: www.djcbsoftware.nl/ChangeLog
im : (e-mail address removed)
-------------------------------------
 
Valkyrie

In the end I used a regular expression to deal with the problem, sigh...

Thanks everyone!
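
The exact expression isn't shown in the thread, so the following is only a sketch of what such a regular-expression cleanup might look like, assuming the goal is simply to drop the <![if ...]>, <![else]> and <![endif]> markers before feeding the page to the parser. It uses Python 3's html.parser, and COND_MARKER, strip_conditional_markers and TagCollector are hypothetical names; the sample data is a shortened stand-in for the page in the question:

```python
import re
from html.parser import HTMLParser

# Assumption: the markers never contain a ']' before the closing ']>'.
COND_MARKER = re.compile(r"<!\[(?:if\b[^\]]*|else|endif)\]>", re.IGNORECASE)


def strip_conditional_markers(html):
    """Remove the markers but keep whatever markup sits between them."""
    return COND_MARKER.sub("", html)


class TagCollector(HTMLParser):
    """Hypothetical parser that just collects start-tag names."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


sample = (
    '<html><head><meta http-equiv="content-type" '
    'content="text/html; charset=ISO-8859-1">\n'
    '<link rel="stylesheet" href="style.css" type="text/css" media="all">\n'
    '<![if !IE]>\n'
    '<p>seen by non-IE browsers</p>\n'
    '<![endif]>\n'
    '</head></html>'
)

cleaned = strip_conditional_markers(sample)

parser = TagCollector()
parser.feed(cleaned)
parser.close()
print(parser.tags)
```

Note that this keeps the content between the markers, which matches what a non-IE browser would show; if you want the IE-only view instead, the pattern would have to remove or keep whole blocks depending on the condition.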


 
