html parser , unexpected '<' char in declaration

Sakcee · Feb 20, 2006

html =
'<html><!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
<head></head> <body bgcolor=#ffffff>\r\n Foo foo , blah blah
</body></html>'

Traceback (most recent call last):
File "<stdin>", line 1, in ?
File "/usr/lib/python2.4/sgmllib.py", line 95, in feed
self.goahead(0)
File "/usr/lib/python2.4/sgmllib.py", line 165, in goahead
k = self.parse_declaration(i)
File "/usr/lib/python2.4/markupbase.py", line 132, in parse_declaration
self.error(
File "/usr/lib/python2.4/htmllib.py", line 40, in error
raise HTMLParseError(message)
htmllib.HTMLParseError: unexpected '<' char in declaration

The error is caused by the unclosed DOCTYPE declaration.

What is the best way to handle this kind of document? Should I use a
regex to check and strip it, or does HTMLParser offer something? Can I
override the default sgmllib behaviour?
I have to work with htmllib because of existing modules.

Thanks

Jesus Rivero - (Neurogeek) · Feb 21, 2006

html =
'<html><!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
<head></head> <body bgcolor=#ffffff>\r\n Foo foo , blah blah
</body></html>'

html =
"""
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN">
<html>
<head>
</head>
<body bgcolor="#ffffff">
Foo foo , blah blah
</body>
</html>
"""

Try checking your HTML code. It looks really messy. A single ' is not
for multi-line strings. You can try the code above.

As a suggestion, you should really focus on learning HTML basics.

Regards

Jesus (Neurogeek)

Sakcee · Feb 21, 2006

Thanks for the reply.

Well, I should probably explain more. This is part of an email. After
the MTA delivers the email, it is stored in a local directory.
After that, the email is parsed by the parser inside a web-based
IMAP client at display time.

I don't think I have the choice of rewriting the message, and I don't
want to reject the message altogether.

I can either (1) fix the incoming HTML by tidying it up,
(2) strip out only the plain text and display a notice that you have
spam, or (3) ignore the malformed tag and display the rest.

Jesus Rivero - (Neurogeek) · Feb 21, 2006

Hmmm, that's kind of a different issue then.

I can guess, from the error you pasted earlier, that the problem shown
is due to Python interpreting a "<" as an expression and not as a
char. Review your code or try to figure out the exact input you're
receiving from the MTA.

Regards,

Jesus (Neurogeek)

Dennis Lee Bieber · Feb 21, 2006

Sakcee said:
Well, I should probably explain more. This is part of an email. After
the MTA delivers the email, it is stored in a local directory.
After that, the email is parsed by the parser inside a web-based
IMAP client at display time.

I'd suggest fixing the process that GENERATED that malformed mess,
rather than trying to read minds afterwards. Especially if all the
messages you need to process have the same flaw.

Sakcee said:
I can either (1) fix the incoming HTML by tidying it up,
(2) strip out only the plain text and display a notice that you have
spam, or (3) ignore the malformed tag and display the rest.

I'm cynical today... I'd pick #4: Reject the message; anything that
can't produce proper HTML isn't important enough to be read.

Dennis Lee Bieber · Feb 21, 2006

Jesus Rivero - (Neurogeek) said:
Hmmm, that's kind of a different issue then.

I can guess, from the error you pasted earlier, that the problem shown
is due to Python interpreting a "<" as an expression and not as a
char. Review your code or try to figure out the exact input you're
receiving from the MTA.

No... From the original text, the problem is that the <!DOCTYPE does
NOT HAVE a >, meaning the <head> in the sample is seen /inside/ an open
tag -- and HTML does not like nested open tags.

Tim Roberts · Feb 21, 2006

Jesus Rivero - (Neurogeek) said:
Hmmm, that's kind of a different issue then.

I can guess, from the error you pasted earlier, that the problem shown
is due to Python interpreting a "<" as an expression and not as a
char. Review your code or try to figure out the exact input you're
receiving from the MTA.

Well, Jesus, you are 0 for 2. Sakcee pointed out what the exact problem
was in his original message. The HTML he is being given is ill-formed; the
<!DOCTYPE declaration is never closed with a '>'.
If this is happening with more than one message, you could check for it
rather easily with a regular expression, or even just ''.find, and then
either insert a closing '>' or delete everything up to the <html> before
parsing it.
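
A minimal sketch of the "insert a closing '>'" option, assuming the only
damage is a DOCTYPE declaration that runs straight into the next tag. The
name close_unterminated_doctype and the pattern are illustrative and not
part of htmllib or sgmllib; the repaired string would then be handed to the
existing parser's feed().

```python
import re

# A DOCTYPE declaration that reaches the next "<" without ever being closed.
# Quoted strings inside the declaration are consumed as a unit, so a "<" or
# ">" inside the public identifier does not end the match early.
_UNCLOSED_DOCTYPE = re.compile(
    r'(<!DOCTYPE(?:[^<>"\']|"[^"]*"|\'[^\']*\')*)(?=<)',
    re.IGNORECASE,
)


def close_unterminated_doctype(html):
    """Return html with a '>' added after the first unclosed DOCTYPE, if any."""
    return _UNCLOSED_DOCTYPE.sub(
        lambda m: m.group(1).rstrip() + ">\n", html, count=1
    )


if __name__ == "__main__":
    sample = (
        '<html><!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"\n'
        "<head></head> <body bgcolor=#ffffff>\n Foo foo , blah blah\n"
        "</body></html>"
    )
    print(close_unterminated_doctype(sample))
```

A declaration that already ends in '>' does not match the pattern and is
returned unchanged, so the function can be applied to every message before
parsing.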

Sakcee · Feb 21, 2006

Thanks for the suggestions.

This is not happening frequently. Actually, this is the first time I
have seen this exception in the system, which means that some spam
message was generated with ill-formed HTML.
I guess the best way would be to check using a regular expression and
delete the unclosed tags.

Jesus Rivero (Neurogeek) · Feb 21, 2006

Oops!

You are totally right, guys. I did miss the closing '>' because I was
thinking about possible errors in the use of ' or ".

Jesus

Dennis Lee Bieber · Feb 22, 2006

Sakcee said:
I guess the best way would be to check using a regular expression and
delete the unclosed tags.

Which isn't the best response either... What will you do with, say:

<table
<tr><th>stuff</th><th>more</th></tr
<tr<td>word</td><td>things</td></tr>
</table>

If you delete the "<table", a good parser should then complain about
a table row outside of a table. If you delete the "</tr" & "<tr" pair,
you've now made the second row a continuation of the first row.

A simplistic method might be to maintain a running sum of < (+1) and
> (-1)... If you hit a < when the sum is non-zero, insert a > (decrement
sum) immediately before it and resume processing.
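
The running-sum idea can be sketched roughly as follows. This is only an
illustration under stated assumptions: the function name close_open_tags is
made up, and ignoring a stray '>' when the sum is already zero is a choice
made here, not something specified in the description above.

```python
def close_open_tags(html):
    """Insert '>' before any '<' that appears while a tag is still open."""
    out = []
    depth = 0  # running sum: +1 for '<', -1 for '>'
    for ch in html:
        if ch == "<":
            if depth:
                # A new tag starts inside an unclosed one: close the old one.
                out.append(">")
                depth -= 1
            depth += 1
        elif ch == ">" and depth:
            # Assumption: a '>' outside any tag leaves the sum at zero.
            depth -= 1
        out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    broken = (
        "<table\n"
        "<tr><th>stuff</th><th>more</th></tr\n"
        "<tr<td>word</td><td>things</td></tr>\n"
        "</table>"
    )
    print(close_open_tags(broken))
```

It works character by character and knows nothing about comments, scripts,
or a literal '<' in text or attribute values, so it is just as simplistic
as the description suggests.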
