HTML File Parsing

Felipe De Bene · Oct 28, 2008

I'm having problems parsing an HTML file with the following syntax:

<TABLE cellspacing=0 cellpadding=0 ALIGN=CENTER BORDER=1 width='100%'>
<TH BGCOLOR='#c0c0c0' Width='3%'>User ID</TH>
<TH Width='10%' BGCOLOR='#c0c0c0'>Name</TH><TH width='7%'
BGCOLOR='#c0c0c0'>Date</TH>
and so on....

Whenever I feed the parser such a file, I get this error:

Traceback (most recent call last):
  File "C:\Documents and Settings\Administrator\My Documents\workspace\thread\src\parser.py", line 91, in <module>
    p.parse(thechange)
  File "C:\Documents and Settings\Administrator\My Documents\workspace\thread\src\parser.py", line 16, in parse
    self.feed(s)
  File "C:\Python25\lib\HTMLParser.py", line 110, in feed
    self.goahead(0)
  File "C:\Python25\lib\HTMLParser.py", line 152, in goahead
    k = self.parse_endtag(i)
  File "C:\Python25\lib\HTMLParser.py", line 316, in parse_endtag
    self.error("bad end tag: %r" % (rawdata[i:j],))
  File "C:\Python25\lib\HTMLParser.py", line 117, in error
    raise HTMLParseError(message, self.getpos())
HTMLParser.HTMLParseError: bad end tag: "</TH BGCOLOR='#c0c0c0'>", at line 515, column 45

Googling around, I found the same solution to a similar situation over
and over again:
http://64.233.169.104/search?q=cach...&hl=pt-BR&ct=clnk&cd=5&gl=br&client=firefox-a

It says:

you can disable proper parsing by setting the CDATA_CONTENT_ELEMENTS
attribute on the parser instance, before you start parsing. by
default, it is set to
CDATA_CONTENT_ELEMENTS = ("script", "style")
setting it to an empty tuple disables HTML-compliant handling for
these elements:
p = HTMLParser()
p.CDATA_CONTENT_ELEMENTS = ()
p.feed(f.read())

That didn't solve my problem, so instead I made a small modification to
HTMLParser.py that did, as follows:
original: endtagfind = re.compile('</\s*([a-zA-Z][-.a-zA-Z0-9:_]*)\s*>')
my version: endtagfind = re.compile('</\s*([a-zA-Z][-.a-zA-Z0-9:_]*)?(.*)\s*>')
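
To make the difference between the two patterns concrete, here is a small check run against the end tag from the error message. The variable names are made up for illustration; only the two regular expressions come from the post.

```python
import re

# The pattern without the extra group: a tag name, optional whitespace, then '>'.
strict = re.compile(r"</\s*([a-zA-Z][-.a-zA-Z0-9:_]*)\s*>")

# The pattern with '?(.*)': anything may follow the tag name before the final '>'.
lenient = re.compile(r"</\s*([a-zA-Z][-.a-zA-Z0-9:_]*)?(.*)\s*>")

bad_end_tag = "</TH BGCOLOR='#c0c0c0'>"

strict_match = strict.match(bad_end_tag)    # no match, so the parser reports "bad end tag"
lenient_match = lenient.match(bad_end_tag)  # matches and still captures the tag name

print(strict_match)
print(lenient_match.group(1), repr(lenient_match.group(2)))
```

Only the pattern containing `(.*)` accepts an end tag that carries attributes. Note that `.` does not match a newline, so an end tag whose attributes wrap onto the next line would still be rejected.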

It worked for all the files I needed, and also for a different file
that I parse with the same library. It might sound stupid, but I was
wondering whether there's a better way to solve this problem than
modifying the standard library. Any clue?

thx in advance,
Felipe.
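
One possible way to avoid touching the standard library is to clean the markup before feeding it to the parser. The sketch below is only an illustration under stated assumptions: `strip_end_tag_attributes` and `HeaderCollector` are hypothetical names, and the import uses the Python 3 module name (on Python 2.5 it would be `from HTMLParser import HTMLParser`).

```python
import re
from html.parser import HTMLParser  # Python 3 name of the module

# An end tag whose name is followed by extra text before the closing '>'.
END_TAG_WITH_EXTRAS = re.compile(r"</\s*([a-zA-Z][-.a-zA-Z0-9:_]*)[^>]*>")


def strip_end_tag_attributes(markup):
    """Rewrite end tags such as </TH BGCOLOR='x'> to a plain </TH>."""
    return END_TAG_WITH_EXTRAS.sub(r"</\1>", markup)


class HeaderCollector(HTMLParser):
    """Hypothetical parser that collects the text of every <th> cell."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.in_header = False
        self.headers = []

    def handle_starttag(self, tag, attrs):
        if tag == "th":
            self.in_header = True
            self.headers.append("")

    def handle_endtag(self, tag):
        if tag == "th":
            self.in_header = False

    def handle_data(self, data):
        if self.in_header:
            self.headers[-1] += data


broken = "<TH BGCOLOR='#c0c0c0'>User ID</TH BGCOLOR='#c0c0c0'><TH>Name</TH>"
cleaned = strip_end_tag_attributes(broken)

collector = HeaderCollector()
collector.feed(cleaned)
collector.close()
print(collector.headers)
```

Because the substitution runs on the raw text, it also rewrites anything that looks like an end tag inside scripts or comments, so it is a workaround for a known input format rather than a general fix.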

Stefan Behnel · Oct 28, 2008

Felipe said:
I'm having problems parsing an HTML file with the following syntax:

<TABLE cellspacing=0 cellpadding=0 ALIGN=CENTER BORDER=1 width='100%'>
<TH BGCOLOR='#c0c0c0' Width='3%'>User ID</TH>
<TH Width='10%' BGCOLOR='#c0c0c0'>Name</TH><TH width='7%'
BGCOLOR='#c0c0c0'>Date</TH>
and so on....

Whenever I feed the parser such a file, I get this error:

HTMLParser.HTMLParseError: bad end tag: "</TH BGCOLOR='#c0c0c0'>", at
line 515, column 45

Your HTML page is not HTML, i.e. it is broken. Python's HTMLParser is not made
for parsing broken HTML. However, you can use the parser of lxml.html to fix up
your HTML for you.

http://codespeak.net/lxml/

Stefan
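
A minimal sketch of the lxml suggestion, assuming the third-party lxml package is installed. The sample markup is a shortened stand-in for the real file, and the variable names are illustrative only.

```python
from lxml import html  # third-party package, not part of the standard library

# Shortened sample with the same defect: attributes on an end tag.
broken = (
    "<table><tr>"
    "<th bgcolor='#c0c0c0'>User ID</TH BGCOLOR='#c0c0c0'>"
    "<th>Name</th>"
    "</tr></table>"
)

# lxml.html recovers from malformed markup instead of raising an error.
root = html.fromstring(broken)
headers = [th.text_content() for th in root.iter("th")]

# Serialize the repaired tree back to markup.
repaired = html.tostring(root, encoding="unicode")
print(headers)
print(repaired)
```

The serialized result could then be handed to existing HTMLParser-based code, or the tree could be queried directly.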

Felipe De Bene · Oct 30, 2008

Stefan said:
Your HTML page is not HTML, i.e. it is broken. Python's HTMLParser is not made
for parsing broken HTML. However, you can use the parser of lxml.html to fix up
your HTML for you.

http://codespeak.net/lxml/

Stefan

Actually, I fetch the HTML from an application that I assumed was
supposed to behave like this, and as I told you, the program is ready
to be shipped, so rewriting an entire class that has public methods
would be a real pain. I really had to find a way to work this out
using Python's parser instead of external libraries. But thanks anyway
for the clue; I might start working on a similar project next, and this
library may be a good and less painful path. Thanks.

Felipe.

worldgnat · Dec 1, 2008

Stefan said:
Your HTML page is not HTML, i.e. it is broken. Python's HTMLParser is not made
for parsing broken HTML. However, you can use the parser of lxml.html to fix up
your HTML for you.

http://codespeak.net/lxml/

Stefan

It doesn't just choke on bad HTML; it also chokes on JavaScript that
writes HTML, e.g. document.write('<scr'+'ipt language="javascript1.1"
src="http:/... also results in an error.

However, when I did:

parser = aqparser() #An implementation of HTMLParser
parser.CDATA_CONTENT_ELEMENTS = ()

it worked. Strange...

-Peter
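
To illustrate what the CDATA_CONTENT_ELEMENTS setting changes, here is a small sketch. It assumes Python 3's `html.parser` (the thread uses the Python 2.5 `HTMLParser` module, which raised errors where the modern parser does not), and `TagLogger` is a hypothetical stand-in for a subclass such as aqparser.

```python
from html.parser import HTMLParser  # Python 3 name of the module


class TagLogger(HTMLParser):
    """Hypothetical subclass that records every start tag it is told about."""

    def __init__(self):
        super().__init__()
        self.start_tags = []

    def handle_starttag(self, tag, attrs):
        self.start_tags.append(tag)


page = "<script>document.write('<b>hi</b>');</script>"

# Default: the content of <script> is treated as opaque character data.
default_parser = TagLogger()
default_parser.feed(page)
default_parser.close()

# Empty tuple: the content of <script> is parsed as ordinary markup.
markup_parser = TagLogger()
markup_parser.CDATA_CONTENT_ELEMENTS = ()
markup_parser.feed(page)
markup_parser.close()

print(default_parser.start_tags)
print(markup_parser.start_tags)
```

With the empty tuple, tags written inside script strings are reported as if they were part of the page, which may or may not be what the calling code wants.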
