HTMLParser problem

Valkyrie · Nov 16, 2004

I've fed some data to an HTML parser that I constructed myself. Here is the
beginning of the data I fed it:
=====
<!doctype html public "-//W3C//DTD HTML 4.01//EN"
"http://www.w3.org/TR/html4/strict.dtd">
<html><head><meta http-equiv="content-type" content="text/html;
charset=ISO-8859-1"><link rel="stylesheet"
href="http://us.i1.yimg.com/us.yimg.com/lib/s/yschx_040927.css" type="text/css"
media="all">

<![if !IE]>
....
=====
However, when "<![if !IE]>" is encountered, handle_data() is called but
handle_decl() is not (I made handle_decl() print something to the screen, and
nothing appeared), and the following error is displayed:

.......
HTMLParser.HTMLParseError: unknown declaration: 'if !IE', at line 4, column 1

May I ask why such error is raised? Thanks in advance!

Richard Brodie · Nov 16, 2004

Valkyrie said:
<![if !IE]>

HTMLParser.HTMLParseError: unknown declaration: 'if !IE', at line 4, column 1

May I ask why such error is raised?

HTMLParser isn't very forgiving of bad HTML: if you feed it syntactically invalid HTML,
it tends to give you errors. That includes Microsoft-only extensions like <![if !IE]>.
Unless you know your sources are valid, it may be best to use one of
the forgiving parsers: Beautiful Soup, uTidylib, libxml, etc. (see many past discussions).
Uche's article, http://www.xml.com/pub/a/2004/09/08/pyxml.html, may be of interest.

Valkyrie · Nov 16, 2004

Thank you. Does that mean there is no way to deal with it using simple Python
built-in functions?
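
On the built-in question, a minimal sketch for later readers, assuming a modern Python 3 where the module lives at `html.parser`: the documented `unknown_decl()` hook is called for unrecognized `<![...]>` declarations, so a subclass can record them instead of failing. `TolerantParser`, `tags`, and `skipped` are hypothetical names used only for illustration, and some Python versions may report the same markup through `handle_comment()` instead, so both hooks are overridden here.

```python
from html.parser import HTMLParser


class TolerantParser(HTMLParser):
    """Hypothetical subclass: collects start tags and tolerates <![if ...]> markers."""

    def __init__(self):
        super().__init__()
        self.tags = []     # start tags seen, in document order
        self.skipped = []  # conditional markers reported by the parser

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def unknown_decl(self, data):
        # Called for unrecognized <![...]> declarations; record instead of raising.
        self.skipped.append(data)

    def handle_comment(self, data):
        # Assumption: some Python versions hand the same markup to this hook instead.
        self.skipped.append(data)


parser = TolerantParser()
parser.feed('<html><head><![if !IE]><link rel="stylesheet"><![endif]></head></html>')
parser.close()
print(parser.tags)     # ['html', 'head', 'link']
print(parser.skipped)  # exact contents depend on the Python version
```

This keeps everything inside the standard library, at the cost of relying on how a given Python version classifies the Microsoft markers.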


Dirk-Jan C. Binnema · Nov 16, 2004

Valkyrie said:
Thank you. Does that mean there is no way to deal with it using simple Python
built-in functions?

Well, you can always preprocess your HTML by replacing dubious
constructs. It's ugly, but it works. You might even do something smart
and put the original constructs back after processing.

Good luck,
Dirk.

-------------------------------------
Dirk-Jan C. Binnema (djcb)
mail: djcb [at] djcbsoftware [dot] nl
blog: www.djcbsoftware.nl/ChangeLog
im : (e-mail address removed)
-------------------------------------

Valkyrie · Nov 17, 2004

In the end I used a regular expression to deal with the problem, sigh...

Thanks everyone!
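
The regular expression actually used was not posted. As a rough sketch of the preprocessing approach discussed in this thread, the following strips `<![if ...]>`, `<![else]>`, and `<![endif]>` markers before the text reaches the parser. The pattern, `strip_conditional_markers`, and `TagCollector` are assumptions made for illustration, and the pattern assumes the condition never contains a literal `]`.

```python
import re
from html.parser import HTMLParser

# Matches Microsoft conditional markers such as <![if !IE]> and <![endif]>.
# Assumption: the condition text never contains a literal ']'.
CONDITIONAL_MARKER = re.compile(
    r'<!\[\s*(?:if\b[^\]]*|else|endif)\s*\]\s*>', re.IGNORECASE
)


def strip_conditional_markers(html):
    """Remove the markers themselves; keep the content between them."""
    return CONDITIONAL_MARKER.sub('', html)


class TagCollector(HTMLParser):
    """Hypothetical parser that only records start tags."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


raw = '<head><![if !IE]><link rel="stylesheet" href="a.css"><![endif]></head>'
cleaned = strip_conditional_markers(raw)

collector = TagCollector()
collector.feed(cleaned)
collector.close()
print(cleaned)         # <head><link rel="stylesheet" href="a.css"></head>
print(collector.tags)  # ['head', 'link']
```

Dropping the markers keeps the enclosed markup, which mirrors how a non-IE browser would treat `<![if !IE]>`; whether that is the right choice depends on what the parsed output is used for.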

