Unexpected behaviour with HTMLParser...

Discussion in 'Python' started by Just Another Victim of the Ambient Morality, Oct 9, 2007.

  1. HTMLParser is behaving in, what I find to be, strange ways and I would
    like to better understand what it is doing and why.

    First, it doesn't appear to translate HTML escape characters. I don't
    know the actual terminology but things like &amp; don't get translated into
    & as one would like. Furthermore, not only does HTMLParser not translate it
    properly, it seems to omit it altogether! This prevents me from even doing
    the translation myself, so I can't even work around the issue.
    Why is it doing this? Is there some mode I need to set? Can anyone
    else duplicate this behaviour? Is it a bug?

    Secondly, HTMLParser often calls handle_data() consecutively, without
    any calls to handle_starttag() in between. I did not expect this. In HTML,
    you either have text or you have tags. Why split up my text into successive
    handle_data() calls? This makes no sense to me. At the very least, it does
    this in response to text with &amp; like escape sequences (or whatever
    they're called), so that it may successively avoid those translations.
    Again, why is it doing this? Is there some mode I need to set? Can
    anyone else duplicate this behaviour? Is it a bug?

    These are serious problems for me and I would greatly appreciate a
    deeper understanding of these issues.
    Thank you...
     
    Just Another Victim of the Ambient Morality, Oct 9, 2007
    #1

  2. Just Another Victim of the Ambient Morality wrote:
    > HTMLParser is behaving in, what I find to be, strange ways and I would
    > like to better understand what it is doing and why.
    >
    > First, it doesn't appear to translate HTML escape characters. I don't
    > know the actual terminology but things like &amp; don't get translated into
    > & as one would like. Furthermore, not only does HTMLParser not translate it
    > properly, it seems to omit it altogether! This prevents me from even doing
    > the translation myself, so I can't even work around the issue.
    > Why is it doing this? Is there some mode I need to set? Can anyone
    > else duplicate this behaviour? Is it a bug?


    Without code, that's hard to determine. But you are aware of e.g.

    handle_entityref(name)
    handle_charref(ref)

    ?
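
To make the report reproducible, here is a minimal sketch that logs every callback in order. It assumes Python 3's html.parser module (the thread itself concerns the older Python 2 HTMLParser module), and it passes convert_charrefs=False to get the behaviour discussed here, since newer Python 3 releases decode references inside handle_data() by default. The names EventLogger and log_events are made up for illustration.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Record every parser callback so the order of events is visible."""

    def __init__(self):
        # Assumption: convert_charrefs=False reproduces the older behaviour
        # from this thread, where references get their own callbacks.
        super().__init__(convert_charrefs=False)
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("data", data))

    def handle_entityref(self, name):
        # Called for named references such as &amp; (name == "amp").
        self.events.append(("entityref", name))

    def handle_charref(self, name):
        # Called for numeric references such as &#169; (name == "169").
        self.events.append(("charref", name))


def log_events(html_text):
    """Parse html_text and return the list of recorded events."""
    parser = EventLogger()
    parser.feed(html_text)
    parser.close()
    return parser.events


events = log_events("<p>Tom &amp; Jerry &#169; 2007</p>")
for event in events:
    print(event)
```

In this mode the references are not lost; they arrive through their own callbacks between the handle_data() calls, which is also why the surrounding text shows up in several pieces.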

    > Secondly, HTMLParser often calls handle_data() consecutively, without
    > any calls to handle_starttag() in between. I did not expect this. In HTML,
    > you either have text or you have tags. Why split up my text into successive
    > handle_data() calls? This makes no sense to me. At the very least, it does
    > this in response to text with &amp; like escape sequences (or whatever
    > they're called), so that it may successively avoid those translations.


    That's the way XML/HTML is defined - there is no guarantee that you get
    the text as a whole. If you must, you can collect the snippets yourself, and
    on the next end-tag deliver them as a whole.
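
Collecting the snippets could look roughly like the sketch below. It is written against Python 3's html.parser with convert_charrefs=False, which is an assumption (the thread concerns the Python 2 module). TextCollector and collect_text are hypothetical names, and flushing on start tags as well as end tags is one possible design choice among several. References are kept in their raw form here so that nothing is dropped.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Buffer text snippets and deliver them whole at tag boundaries."""

    def __init__(self):
        # Assumption: convert_charrefs=False, so references arrive through
        # handle_entityref() / handle_charref() as in the thread.
        super().__init__(convert_charrefs=False)
        self._chunks = []   # snippets seen since the last tag
        self.texts = []     # completed runs of text

    def _flush(self):
        # Join the buffered snippets into one string and start a new buffer.
        if self._chunks:
            self.texts.append("".join(self._chunks))
            self._chunks = []

    def handle_starttag(self, tag, attrs):
        self._flush()

    def handle_endtag(self, tag):
        self._flush()

    def handle_data(self, data):
        self._chunks.append(data)

    def handle_entityref(self, name):
        # Keep the reference verbatim so it is not lost from the text.
        self._chunks.append("&%s;" % name)

    def handle_charref(self, name):
        self._chunks.append("&#%s;" % name)

    def close(self):
        super().close()
        self._flush()       # deliver any text after the last tag


def collect_text(html_text):
    """Return the runs of text found between tags in html_text."""
    parser = TextCollector()
    parser.feed(html_text)
    parser.close()
    return parser.texts
```

The buffer only grows to the size of one run of text between two tags, so the memory cost is paid by the caller who wants whole strings, not by every user of the parser.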


    > Again, why is it doing this? Is there some mode I need to set? Can
    > anyone else duplicate this behaviour? Is it a bug?


    No. It's the way it is, because it would require buffering with
    unlimited capacity to ensure this property.
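
One way to see the buffering point is to feed a document in pieces, as a program reading from a file or socket would. The sketch below assumes Python 3's html.parser with convert_charrefs=False; ChunkRecorder and feed_in_pieces are illustrative names only.

```python
from html.parser import HTMLParser


class ChunkRecorder(HTMLParser):
    """Record each piece of text exactly as the parser reports it."""

    def __init__(self):
        # Assumption: convert_charrefs=False, matching the behaviour
        # discussed in this thread.
        super().__init__(convert_charrefs=False)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def feed_in_pieces(pieces):
    """Feed the parser one piece at a time and return the text chunks."""
    parser = ChunkRecorder()
    for piece in pieces:
        parser.feed(piece)
    parser.close()
    return parser.chunks


# The text "Hello, world" is split across two feed() calls.
chunks = feed_in_pieces(["<p>Hello, wo", "rld</p>"])
print(chunks)
```

The parser reports the text it has already seen instead of holding it back until the closing tag arrives, so a very long run of text never has to sit in the parser's memory.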

    > These are serious problems for me and I would greatly appreciate a
    > deeper understanding of these issues.


    HTH, and read the docs.

    Diez
     
    Diez B. Roggisch, Oct 9, 2007
    #2

  3. "Diez B. Roggisch" <> wrote in message
    news:-berlin.de...
    > Just Another Victim of the Ambient Morality wrote:
    >> HTMLParser is behaving in, what I find to be, strange ways and I
    >> would like to better understand what it is doing and why.
    >>
    >> First, it doesn't appear to translate HTML escape characters. I
    >> don't know the actual terminology but things like &amp; don't get
    >> translated into & as one would like. Furthermore, not only does
    >> HTMLParser not translate it properly, it seems to omit it altogether!
    >> This prevents me from even doing the translation myself, so I can't even
    >> work around the issue.
    >> Why is it doing this? Is there some mode I need to set? Can anyone
    >> else duplicate this behaviour? Is it a bug?

    >
    > Without code, that's hard to determine. But you are aware of e.g.
    >
    > handle_entityref(name)
    > handle_charref(ref)
    >
    > ?


    Actually, I am not aware of these methods but I will certainly look into
    them!
    I was hoping that the issue would be known or simple before I committed
    to posting code, something that is, to my chagrin, not easily done with my
    news client...


    >> Secondly, HTMLParser often calls handle_data() consecutively, without
    >> any calls to handle_starttag() in between. I did not expect this. In
    >> HTML, you either have text or you have tags. Why split up my text into
    >> successive handle_data() calls? This makes no sense to me. At the very
    >> least, it does this in response to text with &amp; like escape sequences
    >> (or whatever they're called), so that it may successively avoid those
    >> translations.

    >
    > That's the way XML/HTML is defined - there is no guarantee that you get
    > text as whole. If you must, you can collect the snippets yourself, and on
    > the next end-tag deliver them as whole.


    I think there's some miscommunication, here.
    You can't mean "That's the way XML/HTML is defined" because those format
    specifications say nothing about how the format must be parsed. As far as I
    can tell, you either meant to say that that's the way HTMLParser is
    specified or you're referring to how text in XML/HTML can be broken up by
    tags, in which case I've already addressed that in my post. I expected to
    see handle_starttag() calls in between calls to handle_data().
    Unless I'm missing something, it simply makes no sense to break up
    contiguous text into multiple handle_data() calls...


    >> Again, why is it doing this? Is there some mode I need to set? Can
    >> anyone else duplicate this behaviour? Is it a bug?

    >
    > No. It's the way it is, because it would require buffering with unlimited
    > capacity to ensure this property.


    It depends on what you mean by "unlimited capacity." Is it so bad to
    buffer with as much memory as you have? ...or, at least, have a setting for
    such operation? Moreover, you know that you'll never have to buffer more
    than there is HTML, so you hardly need "unlimited capacity..." For
    instance, I believe Xerces does this translation for you 'cause, really, why
    wouldn't you want it to?


    >> These are serious problems for me and I would greatly appreciate a
    >> deeper understanding of these issues.

    >
    > HTH, and read the docs.


    This does help, thank you. I have obviously read the docs, since I can
    use HTMLParser enough to find this behaviour. I don't find the docs to be
    very explanatory (perhaps I'm reading the wrong docs) and I think they
    assume you already know a lot about HTML and parsing, which may be necessary
    assumptions but are not necessarily true...
     
    Just Another Victim of the Ambient Morality, Oct 9, 2007
    #3
  4. Just Another Victim of the Ambient Morality wrote:
    > "Diez B. Roggisch" <> wrote in message
    > news:-berlin.de...
    >> Just Another Victim of the Ambient Morality wrote:
    >>> HTMLParser is behaving in, what I find to be, strange ways and I
    >>> would like to better understand what it is doing and why.
    >>>
    >>> First, it doesn't appear to translate HTML escape characters. I
    >>> don't know the actual terminology but things like &amp; don't get
    >>> translated into & as one would like. Furthermore, not only does
    >>> HTMLParser not translate it properly, it seems to omit it altogether!
    >>> This prevents me from even doing the translation myself, so I can't even
    >>> work around the issue.
    >>> Why is it doing this? Is there some mode I need to set? Can anyone
    >>> else duplicate this behaviour? Is it a bug?

    >> Without code, that's hard to determine. But you are aware of e.g.
    >>
    >> handle_entityref(name)
    >> handle_charref(ref)
    >>
    >> ?

    >
    > Actually, I am not aware of these methods but I will certainly look into
    > them!
    > I was hoping that the issue would be known or simple before I committed
    > to posting code, something that is, to my chagrin, not easily done with my
    > news client...
    >
    >
    >>> Secondly, HTMLParser often calls handle_data() consecutively, without
    >>> any calls to handle_starttag() in between. I did not expect this. In
    >>> HTML, you either have text or you have tags. Why split up my text into
    >>> successive handle_data() calls? This makes no sense to me. At the very
    >>> least, it does this in response to text with &amp; like escape sequences
    >>> (or whatever they're called), so that it may successively avoid those
    >>> translations.

    >> That's the way XML/HTML is defined - there is no guarantee that you get
    >> text as whole. If you must, you can collect the snippets yourself, and on
    >> the next end-tag deliver them as whole.

    >
    > I think there's some miscommunication, here.
    > You can't mean "That's the way XML/HTML is defined" because those format
    > specifications say nothing about how the format must be parsed. As far as I
    > can tell, you either meant to say that that's the way HTMLParser is
    > specified or you're referring to how text in XML/HTML can be broken up by
    > tags, in which case I've already addressed that in my post. I expected to
    > see handle_starttag() calls in between calls to handle_data().
    > Unless I'm missing something, it simply makes no sense to break up
    > contiguous text into multiple handle_data() calls...



    I meant that's the way XML/HTML-parsing is defined, yes.

    >>> Again, why is it doing this? Is there some mode I need to set? Can
    >>> anyone else duplicate this behaviour? Is it a bug?

    >> No. It's the way it is, because it would require buffering with unlimited
    >> capacity to ensure this property.

    >
    > It depends on what you mean by "unlimited capacity." Is it so bad to
    > buffer with as much memory as you have? ...or, at least, have a setting for
    > such operation? Moreover, you know that you'll never have to buffer more
    > than there is HTML, so you hardly need "unlimited capacity..." For
    > instance, I believe Xerces does this translation for you 'cause, really, why
    > wouldn't you want it to?


    I've been dealing with XML files that are several gigabytes in size and
    never fit into physical memory. So buffering would severely impact the
    whole system if it were the parser's default.

    And you are wrong - xerces (the SAX-parser, which is the equivalent to
    HTMLParser) explicitly does not do that. It is not guaranteed that the
    character-data is passed in one chunk.
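
The same pattern can be sketched with Python's own xml.sax module, used here as a stand-in for Xerces. Its ContentHandler.characters() callback may likewise deliver contiguous text in one chunk or in several, so the handler joins the chunks itself. TextPerElement and element_texts are hypothetical names, and the sketch only handles elements without nested children.

```python
import xml.sax


class TextPerElement(xml.sax.ContentHandler):
    """Join the character chunks of each element into one string."""

    def __init__(self):
        super().__init__()
        self._chunks = []      # chunks seen since the last start tag
        self.chunk_count = 0   # how many characters() calls were made
        self.texts = []        # one joined string per element

    def startElement(self, name, attrs):
        self._chunks = []

    def characters(self, content):
        # May be called once or several times for one run of text.
        self.chunk_count += 1
        self._chunks.append(content)

    def endElement(self, name):
        self.texts.append("".join(self._chunks))
        self._chunks = []


def element_texts(xml_bytes):
    """Parse xml_bytes and return the text of each element."""
    handler = TextPerElement()
    xml.sax.parseString(xml_bytes, handler)
    return handler.texts


print(element_texts(b"<p>Tom &amp; Jerry</p>"))
```

How many characters() calls are made for a given input is up to the underlying parser, which is why the sketch counts them instead of relying on a particular number.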

    DOM is an entirely different subject, it _has_ to be fully parsed. But
    then, it's often problematic because of that.

    >>> These are serious problems for me and I would greatly appreciate a
    >>> deeper understanding of these issues.

    >> HTH, and read the docs.

    >
    > This does help, thank you. I have obviously read the docs, since I can
    > use HTMLParser enough to find this behaviour. I don't find the docs to be
    > very explanatory (perhaps I'm reading the wrong docs) and I think they
    > assume you already know a lot about HTML and parsing, which may be necessary
    > assumptions but are not necessarily true...


    Well, you at least overlooked the methods I mentioned.

    Diez
     
    Diez B. Roggisch, Oct 10, 2007
    #4
  5. Just Another Victim of the Ambient Morality wrote:
    > HTMLParser is behaving in, what I find to be, strange ways and I would
    > like to better understand what it is doing and why.


    In case you also want an HTML library that is easy to use (and powerful and
    flexible and...), look at lxml.html.

    http://codespeak.net/lxml/dev/lxmlhtml.html

    It's part of lxml 2.0, which is currently in alpha status (which does not mean
    it's unstable or something, just not as complete as its authors want it to be).

    http://codespeak.net/lxml/dev/

    Stefan
     
    Stefan Behnel, Oct 10, 2007
    #5
  6. On 10/9/07, Just Another Victim of the Ambient Morality
    <> wrote:
    >
    > "Diez B. Roggisch" <> wrote in message
    > news:-berlin.de...
    > >
    > > Without code, that's hard to determine. But you are aware of e.g.
    > >
    > > handle_entityref(name)
    > > handle_charref(ref)
    > >
    > > ?

    >
    > Actually, I am not aware of these methods but I will certainly look into
    > them!
    > I was hoping that the issue would be known or simple before I committed
    > to posting code, something that is, to my chagrin, not easily done with my
    > news client...


    For example, here's something simple/simplistic you can do to handle
    character and entity references:

    from htmlentitydefs import name2codepoint

    ....

    def handle_charref(self, ref):
        # ref is the text between "&#" and ";", e.g. "38" or "x26"
        try:
            if ref.lower().startswith('x'):
                char = unichr(int(ref[1:], 16))
            else:
                char = unichr(int(ref))
        except (TypeError, ValueError):
            char = ' '
        # Do something with char

    def handle_entityref(self, ref):
        # ref is the entity name, e.g. "amp" for "&amp;"
        try:
            char = unichr(name2codepoint[ref])
        except (KeyError, ValueError):
            char = ' '
        # Do something with char
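
For readers on Python 3, the same idea can be put into a complete, runnable form. This is a sketch under assumptions: it uses html.parser and html.entities (the Python 3 names of the modules above), passes convert_charrefs=False so that the two handlers are actually called, and introduces the hypothetical names DecodingCollector and extract_text. Unknown or invalid references become a space, as in the snippet above.

```python
from html.entities import name2codepoint
from html.parser import HTMLParser


class DecodingCollector(HTMLParser):
    """Collect text, decoding character and entity references by hand."""

    def __init__(self):
        # Assumption: with convert_charrefs=False the parser calls
        # handle_charref() / handle_entityref() instead of decoding itself.
        super().__init__(convert_charrefs=False)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

    def handle_charref(self, ref):
        # ref is e.g. "169" or "xA9" (the text between "&#" and ";").
        try:
            if ref.lower().startswith("x"):
                char = chr(int(ref[1:], 16))
            else:
                char = chr(int(ref))
        except (ValueError, OverflowError):
            char = " "   # not a valid code point
        self.parts.append(char)

    def handle_entityref(self, ref):
        # ref is the entity name, e.g. "amp".
        try:
            char = chr(name2codepoint[ref])
        except KeyError:
            char = " "   # unknown entity name
        self.parts.append(char)


def extract_text(html_text):
    """Return the text of html_text with references decoded."""
    parser = DecodingCollector()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.parts)


print(extract_text("<p>Tom &amp; Jerry &#169; &#xA9;</p>"))
```

If only the decoding is needed, Python 3 also offers html.unescape() for strings, and leaving convert_charrefs at its default makes the parser decode references before calling handle_data().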


    A.
     
    Andrew Durdin, Oct 10, 2007
    #6