htmllib.py and parsing malformed HTML

Discussion in 'Python' started by KC, Sep 2, 2003.

  1. KC

    KC Guest

    I have written a parser using htmllib.HTMLParser and it functions fine
    unless the HTML is malformed. For example, in some instances, the
    provider of the HTML leaves out the <TR> tags but includes the </TR> tags.

    Apparently, htmllib and more likely sgmllib do not parse an end tag if a
    corresponding start tag was not found. Does anyone know a way to "fool"
    the parser into handling the end tag if a start tag was not found?

    Thanks,

    Kevin
     
    KC, Sep 2, 2003
    #1

  2. KC wrote:

    > I have written a parser using htmllib.HTMLParser and it functions fine
    > unless the HTML is malformed. For example, in some instances, the
    > provider of the HTML leaves out the <TR> tags but includes the </TR> tags.
    >
    > Apparently, htmllib and more likely sgmllib do not parse an end tag if a
    > corresponding start tag was not found. Does anyone know a way to "fool"
    > the parser into handling the end tag if a start tag was not found?


    Hi,

    You could use tidy (http://www.w3.org/People/Raggett/tidy/) before you
    parse the html.

    thomas
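
A minimal sketch of the "tidy first, parse second" idea, assuming the `tidy` command-line tool is installed and on the PATH. The function name `tidy_html` and the logger name are made up for illustration. Tidy's diagnostics are sent to the logging module so that an unattended job keeps a record of them, and the input is returned unchanged when the executable cannot be found.

```python
import logging
import shutil
import subprocess

log = logging.getLogger("tidy_prefilter")  # hypothetical logger name


def tidy_html(html, exe="tidy"):
    """Return html cleaned by the tidy CLI, or the input unchanged if tidy is missing."""
    path = shutil.which(exe)
    if path is None:
        log.warning("%s not found on PATH; passing HTML through unchanged", exe)
        return html
    result = subprocess.run(
        [path, "-quiet", "-utf8", "--force-output", "yes"],
        input=html.encode("utf-8"),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    # tidy exits with 0 (clean), 1 (warnings) or 2 (errors).
    if result.returncode:
        log.warning(
            "tidy exit status %d: %s",
            result.returncode,
            result.stderr.decode("utf-8", "replace").strip(),
        )
    # Fall back to the original markup if tidy produced nothing at all.
    return result.stdout.decode("utf-8", "replace") or html
```

The cleaned string can then be fed to whatever parser is already in place.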
     
    Thomas Güttler, Sep 2, 2003
    #2

  3. KC

    KC Guest

    Thomas Güttler wrote:
    >
    > Hi,
    >
    > You could use tidy (http://www.w3.org/People/Raggett/tidy/) before you
    > parse the html.


    I appreciate the suggestion but unfortunately this will not work well
    for me as the parser runs as part of a cron job. I wouldn't be able to
    review the tidy error log in a timely fashion if there was a problem.

    What would be really nice is a way to tell the parser it was "inside" a
    <TR> when I encountered a <TD> after a closing </TR>. Browsers still
    display the HTML correctly without a starting <TR>, but if the closing
    </TR> is omitted everything gets mangled.

    Any other suggestions?
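
One way to picture the "pretend we are inside a <TR>" idea is sketched below. It uses Python 3's `html.parser.HTMLParser` rather than htmllib (which no longer ships with Python 3), so it is a swapped-in technique and not the htmllib API. The class name `RowRepairParser` and its attributes are hypothetical. `html.parser` reports every end tag without matching it against a start tag, so the sketch only has to open a row implicitly when a cell appears outside one.

```python
from html.parser import HTMLParser


class RowRepairParser(HTMLParser):
    """Collect table cell text per row, opening a row implicitly when <TR> is missing."""

    def __init__(self):
        super().__init__()
        self.rows = []        # list of rows, each a list of cell strings
        self.in_row = False
        self.in_cell = False

    def _open_row(self):
        self.rows.append([])
        self.in_row = True

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lower-cased from HTMLParser.
        if tag == "tr":
            self._open_row()
        elif tag in ("td", "th"):
            if not self.in_row:
                self._open_row()   # behave as if a <TR> had been seen
            self.rows[-1].append("")
            self.in_cell = True

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self.in_cell = False
        elif tag == "tr":
            self.in_row = False
            self.in_cell = False

    def handle_data(self, data):
        # Text may arrive in several chunks, so append to the current cell.
        if self.in_cell:
            self.rows[-1][-1] += data


parser = RowRepairParser()
parser.feed("<TABLE><TD>a</TD><TD>b</TD></TR><TD>c</TD></TR></TABLE>")
parser.close()
print(parser.rows)
```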
     
    KC, Sep 2, 2003
    #3
  4. KC

    KC Guest

    Re: htmllib.py and parsing malformed HTML [SOLVED]

    KC wrote:
    >
    > What would be really nice is a way to tell the parser it was "inside" a
    > <TR> when I encountered a <TD> after a closing </TR>. Browsers still
    > display the HTML correctly without a starting <TR>, but if the closing
    > </TR> is omitted everything gets mangled.
    >

    I solved this problem, perhaps not the most elegant way, but it is still
    solved. Any suggestions on improvements are welcome. I added the
    following method to my parser class to make this work:


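    def parse_endtag(self, i):
        # Peek at the two characters after '</' to spot a closing TR tag.
        rawdata = self.rawdata
        tag = rawdata[i+2:i+4].strip().lower()
        if tag == 'tr':
            # Write the end tag straight to the output, since the base
            # parser appears to skip an end tag with no matching start tag.
            self.fmtr.writer.send_tag('</TR>')
        return htmllib.HTMLParser.parse_endtag(self, i)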


    I should also mention that I added the send_tag method to my writer
    implementation which simply writes the given text to the output stream.
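
For readers on Python 3, where htmllib is gone, a rough equivalent of this arrangement might look like the following. It is only a sketch: `StreamWriter` and `EndTagForwarder` are hypothetical names, `send_tag` mirrors the method described above, and `html.parser.HTMLParser` stands in for htmllib. Because `html.parser` calls `handle_endtag` for every end tag, no override of `parse_endtag` is needed here.

```python
import io
from html.parser import HTMLParser


class StreamWriter:
    """Hypothetical writer: send_tag writes the given text to the output stream."""

    def __init__(self, stream):
        self.stream = stream

    def send_tag(self, text):
        self.stream.write(text)


class EndTagForwarder(HTMLParser):
    """Forward every closing TR tag to the writer, matched or not."""

    def __init__(self, writer):
        super().__init__()
        self.writer = writer

    def handle_endtag(self, tag):
        if tag == "tr":
            self.writer.send_tag("</TR>")


out = io.StringIO()
forwarder = EndTagForwarder(StreamWriter(out))
forwarder.feed("<td>x</td></TR><td>y</td></tr>")
forwarder.close()
print(out.getvalue())
```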
     
    KC, Sep 2, 2003
    #4
  5. John J. Lee

    John J. Lee Guest

    KC writes:

    > Thomas Güttler wrote:
    > > Hi,
    > > You could use tidy (http://www.w3.org/People/Raggett/tidy/) before
    > > you
    > > parse the html.

    >
    > I appreciate the suggestion but unfortunately this will not work well
    > for me as the parser runs as part of a cron job. I wouldn't be able
    > to review the tidy error log in a timely fashion if there was a
    > problem.

    [...]

    So, what about *your* code's error log (or the equivalent --
    presumably an unhandled traceback)?? It's not obvious that your
    solution (in a later post) will be any more robust than just piping
    everything through HTMLTidy. In fact, since you will find a great
    variety of nonsense in 'HTML as deployed', it seems likely that
    HTMLTidy will do the better job.


    John
     
    John J. Lee, Sep 2, 2003
    #5
  6. KC

    KC Guest

    John J. Lee wrote:

    >
    > So, what about *your* code's error log (or the equivalent --
    > presumably an unhandled traceback)?? It's not obvious that your
    > solution (in a later post) will be any more robust than just piping
    > everything through HTMLTidy. In fact, since you will find a great
    > variety of nonsense in 'HTML as deployed', it seems likely that
    > HTMLTidy will do the better job.
    >


    If this parser was handling a "great variety of nonsense" I would
    wholeheartedly agree with you. However, since this HTML is from a
    single vendor and that vendor is a government entity, this solution was
    better than integrating a third-party product. As with most
    organizations, changing *our* code is much more acceptable to the powers
    that be, than bringing in a third-party product that will have to be
    evaluated and have countless meetings over its approval. For many of
    us, business and policy decisions often forge the direction for
    technology usage within our organizations.
     
    KC, Sep 4, 2003
    #6
  7. John J. Lee

    John J. Lee Guest

    KC writes:

    > John J. Lee wrote:
    >
    > > So, what about *your* code's error log (or the equivalent --
    > > presumably an unhandled traceback)?? It's not obvious that your

    [...]
    > If this parser was handling a "great variety of nonsense" I would
    > wholeheartedly agree with you. However, since this HTML is from a
    > single vendor and that vendor is a government entity, this solution


    Oh, got you. Fair enough.


    [...]
    > for technology usage within our organizations.


    You can always tell when someone's 'business button' has been pushed
    when they use the word 'within' ;-)


    John
     
    John J. Lee, Sep 4, 2003
    #7
  8. On Thu, 04 Sep 2003 11:50:07 -0400, KC wrote:
    > As with most organizations,
    > changing *our* code is much more acceptable to the powers that be, than
    > bringing in a third-party product that will have to be evaluated and have
    > countless meetings over its approval. For many of us, business and policy
    > decisions often forge the direction for technology usage within our
    > organizations.


    If you are having real problems with poor HTML, HTMLTidy may be worth
    going to bat over. If you can find a simple solution that works on the
    HTML you are processing, great, go with it, and it's worth researching in
    your situation first. But HTML can go bad in more ways than you can
    imagine (which is in fact part of the problem); if you are getting HTML
    that's bad in a lot of little ways, you'll find the "apply a hack to fix
    this file, apply a hack to fix that file" will start stepping on its own
    toes.

    HTMLTidy represents a ***lot*** of grunt work and a ***lot*** of
    functionality that you can *not* replicate in a reasonable amount of time;
    it's one of those packages that isn't so much a program that "does
    something" as a program that represents many, many man-years of "knowledge
    acquired".

    I'm not trying to push anything, since I don't know your situation, but
    HTMLTidy is one of those rare projects that you really shouldn't allow NMH
    to scuttle unless you *really* need to. (Again, I mention if there's some
    simple way you can characterize the bad HTML coming out of one single
    program, go ahead and try to fix it; maybe you'll get lucky and a regex
    will be enough.)
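
As an illustration of the "maybe a regex will be enough" case, here is a small sketch for the one defect described in this thread (cells that follow </TR> or <TABLE> with no opening <TR>). The pattern and the function name `insert_missing_tr` are assumptions made for the example, and it deliberately handles nothing else, such as cells that follow <TBODY>.

```python
import re

# A row-closing tag or a table-opening tag, followed (after optional
# whitespace) by a cell start tag with no <tr> in between.
MISSING_TR = re.compile(
    r"(</tr\s*>|<table\b[^>]*>)(\s*)(?=<t[dh]\b)",
    re.IGNORECASE,
)


def insert_missing_tr(html):
    """Insert <tr> before any cell that directly follows </tr> or <table>."""
    return MISSING_TR.sub(r"\1\2<tr>", html)


broken = "<table><td>a</td></tr>\n<td>b</td></tr></table>"
print(insert_missing_tr(broken))
```

Rows that already start with <tr> are left alone because the lookahead only matches a cell tag.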
     
    Jeremy Bowers, Sep 5, 2003
    #8
  9. KC

    KC Guest

    Jeremy Bowers wrote:
    > On Thu, 04 Sep 2003 11:50:07 -0400, KC wrote:
    >

    ....

    > that's bad in a lot of little ways, you'll find the "apply a hack to fix
    > this file, apply a hack to fix that file" will start stepping on its own
    > toes.

    Oh yeah, I couldn't agree more. Any more requests for "hacks" and
    HTMLTidy gets brought into the picture.
    >
    > HTMLTidy represents a ***lot*** of grunt work and a ***lot*** of
    > functionality that you can *not* replicate in a reasonable amount of time;
    > it's one of those packages that isn't so much a program that "does
    > something" as a program that represents many, many man-years of "knowledge
    > acquired".
    >

    Agreed. I like HTMLTidy very much and it's obvious it could save us
    developers a lot of effort.
     
    KC, Sep 5, 2003
    #9