html parser , unexpected '<' char in declaration

Sakcee

html =
'<html><!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
<head></head> <body bgcolor=#ffffff>\r\n Foo foo , blah blah
</body></html>'

Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "/usr/lib/python2.4/sgmllib.py", line 95, in feed
    self.goahead(0)
  File "/usr/lib/python2.4/sgmllib.py", line 165, in goahead
    k = self.parse_declaration(i)
  File "/usr/lib/python2.4/markupbase.py", line 132, in parse_declaration
    self.error(
  File "/usr/lib/python2.4/htmllib.py", line 40, in error
    raise HTMLParseError(message)
htmllib.HTMLParseError: unexpected '<' char in declaration


The error is caused by the unclosed DOCTYPE declaration.

What is the best way to handle this kind of document? Should I use a
regex to check and strip it, or does HTMLParser offer something? Can I
override the default sgmllib behaviour?
I have to work with htmllib because of existing modules.


thanks
 
Jesus Rivero - (Neurogeek)

html =
'<html><!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
<head></head> <body bgcolor=#ffffff>\r\n Foo foo , blah blah
</body></html>'

html = """
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN">
<html>
<head>
</head>
<body bgcolor="#ffffff">
Foo foo , blah blah
</body>
</html>
"""

Try checking your HTML code. It looks really messy. The ' character
cannot delimit a multi-line string. You can try the code above.

As a suggestion, you should really focus on learning HTML basics ;)

Regards

Jesus (Neurogeek)
 
Sakcee

Thanks for the reply.

Well, I should probably explain more. This is part of an email. After
the MTA delivers the email, it is stored in a local directory.
After that, the email is parsed by the parser inside a web-based
IMAP client at display time.

I don't think I have the choice of rewriting the message, and I don't
want to reject the message altogether.

I can either:
1. fix the incoming HTML by tidying it up,
2. strip out only the plain text and display a notice that you have spam, or
3. ignore the malformed tag and display the rest.
 
Jesus Rivero - (Neurogeek)

Hmm, that's a rather different issue then.

I would guess, from the error you pasted earlier, that the problem shown
is due to Python interpreting a "<" as an expression and not as a
character. Review your code, or try to figure out the exact input you're
receiving from the MTA.


Regards,

Jesus (Neurogeek)
 
Dennis Lee Bieber

Sakcee said:
Well, I should probably explain more. This is part of an email. After
the MTA delivers the email, it is stored in a local directory.
After that, the email is parsed by the parser inside a web-based
IMAP client at display time.

I'd suggest fixing the process that GENERATED that malformed mess,
rather than trying to read minds afterwards. Especially if all the
messages you need to process have the same flaw.

Sakcee said:
I can either 1. fix the incoming HTML by tidying it up, 2. strip out
only the plain text and display a notice that you have spam, or
3. ignore the malformed tag and display the rest.

I'm cynical today... I'd pick #4: Reject the message; anything that
can't produce proper HTML isn't important enough to be read.
 
Dennis Lee Bieber

Jesus Rivero - (Neurogeek) said:
Hmm, that's a rather different issue then.

I would guess, from the error you pasted earlier, that the problem shown
is due to Python interpreting a "<" as an expression and not as a
character. Review your code, or try to figure out the exact input you're
receiving from the MTA.

No... From the original text, the problem is that the <!DOCTYPE does
NOT HAVE a >, meaning the <head> in the sample is seen /inside/ an open
tag -- and HTML does not like nested open tags.
 
Tim Roberts

Jesus Rivero - (Neurogeek) said:
Hmm, that's a rather different issue then.

I would guess, from the error you pasted earlier, that the problem shown
is due to Python interpreting a "<" as an expression and not as a
character. Review your code, or try to figure out the exact input you're
receiving from the MTA.

Well, Jesus, you are 0 for 2. Sakcee pointed out what the exact problem
was in his original message. The HTML he is being given is ill-formed; the
DOCTYPE declaration is never closed.

If this is happening with more than one message, you could check for it
rather easily with a regular expression, or even just ''.find, and then
either insert a closing '>' or delete everything up to the <html> before
parsing it.
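
A minimal sketch of the "insert a closing '>'" option, assuming the only damage is a DOCTYPE declaration that runs straight into the next tag. The function name `close_doctype` and the pattern are illustrative, not part of htmllib or sgmllib; the idea is to repair the string before handing it to the existing parser's feed().

```python
import re

# Hypothetical helper pattern: a "<!DOCTYPE" declaration that reaches the
# next "<" without ever meeting its own closing ">".
_UNCLOSED_DOCTYPE = re.compile(r'(<!DOCTYPE[^<>]*?)(\s*)(?=<)', re.IGNORECASE)


def close_doctype(html):
    """Insert the missing '>' after an unclosed DOCTYPE declaration.

    Well-formed declarations are left untouched, because the pattern
    cannot match across an existing '>'.
    """
    return _UNCLOSED_DOCTYPE.sub(r'\1>\2', html, 1)


broken = ('<html><!DOCTYPE html PUBLIC '
          '"-//W3C//DTD XHTML 1.0 Transitional//EN"\n'
          '<head></head> <body bgcolor=#ffffff>\r\n Foo foo , blah blah\n'
          '</body></html>')
repaired = close_doctype(broken)
```

The repaired string can then be passed to the parser as usual. This only addresses the DOCTYPE case from the original message; other kinds of broken markup would still raise.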
 
Sakcee

Thanks for the suggestions.

This is not happening frequently. Actually, this is the first time I
have seen this exception in the system, which means that some spam
message was generated with ill-formed HTML.
I guess the best way would be to check with a regular expression and
delete the unclosed tags.
 
Jesus Rivero (Neurogeek)

Oops!

You are totally right, guys. I did miss the missing closing '>' because
I was thinking about possible errors in the use of ' or ".

Jesus
 
Dennis Lee Bieber

Sakcee said:
I guess the best way would be to check with a regular expression and
delete the unclosed tags.

Which isn't the best response either... What will you do with, say:

<table
<tr><th>stuff</th><th>more</th></tr
<tr<td>word</td><td>things</td></tr>
</table>

If you delete the "<table", a good parser should then complain about
a table row outside of a table. If you delete the "</tr" and "<tr" pair,
you've now made the second row a continuation of the first row.

A simplistic method might be to maintain a running sum of < (+1) and
> (-1)... If you hit a < when the sum is non-zero, insert a > (decrementing
the sum) immediately before it and resume processing.
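
The running-sum idea can be sketched roughly as follows. This is only an illustration of the approach described above, with a hypothetical function name (`close_open_tags`), and it assumes every '<' in the input really starts a tag, which is not true for text such as "a < b", comments, or scripts.

```python
def close_open_tags(html):
    """Insert '>' before any '<' that appears while a tag is still open."""
    out = []
    depth = 0  # running sum: +1 for '<', -1 for '>'
    for ch in html:
        if ch == '<':
            if depth:
                # A new tag starts inside an unclosed one: close the
                # dangling tag immediately before it.
                out.append('>')
                depth -= 1
            depth += 1
        elif ch == '>' and depth:
            depth -= 1
        out.append(ch)
    return ''.join(out)


table = ('<table\n'
         '<tr><th>stuff</th><th>more</th></tr\n'
         '<tr<td>word</td><td>things</td></tr>\n'
         '</table>')
fixed_table = close_open_tags(table)
```

On the table example this leaves the rows and cells in place and only adds the three missing '>' characters, so the row structure is preserved rather than deleted.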