Is there a HTML parser who can reconstruct the original html EXACTLY?

Discussion in 'Python' started by ioscas@gmail.com, Jan 23, 2008.

  1. Guest

    Hi, I am looking for an HTML parser that can parse a given page into
    a DOM tree and can reconstruct the exact original HTML source.
    Strictly speaking, I should be able to retrieve the original
    source at each internal node of the DOM tree.
    I have tried Beautiful Soup, which is really nice when dealing with
    those god damned ill-formed documents, but it's a pity to find
    that it cannot retrieve the original source because of its great tidying
    job.
    Since Beautiful Soup, like most of the other HTML parsers in
    Python, is to some extent a subclass of sgmllib.SGMLParser, I have
    investigated the source code of sgmllib.SGMLParser to see if there is
    anything I can do to tell Beautiful Soup where it can find every tag
    segment in the HTML source, but this will be a time-consuming job.
    So... any ideas?


    cheers
    kai liu
    , Jan 23, 2008
    #1

  2. A.T.Hofkamp Guest

    On 2008-01-23, <> wrote:
    > Hi, I am looking for an HTML parser that can parse a given page into
    > a DOM tree and can reconstruct the exact original HTML source.


    Why not keep a copy of the original data instead?

    That would be VERY MUCH SIMPLER than trying to reconstruct a parsed tree back
    to original source text.


    sincerely,
    Albert
    A.T.Hofkamp, Jan 23, 2008
    #2

  3. kliu Guest

    Re: Is there a HTML parser who can reconstruct the original html EXACTLY?

    On Jan 23, 7:39 pm, "A.T.Hofkamp" <> wrote:
    > On 2008-01-23, <> wrote:
    >
    > > Hi, I am looking for an HTML parser that can parse a given page into
    > > a DOM tree and can reconstruct the exact original HTML source.

    >
    > Why not keep a copy of the original data instead?
    >
    > That would be VERY MUCH SIMPLER than trying to reconstruct a parsed tree back
    > to original source text.
    >
    > sincerely,
    > Albert


    Thank you for your reply, but what I really need is the mapping between
    each DOM node and the corresponding original source segment.
    kliu, Jan 23, 2008
    #3
  4. Paul Boddie Guest

    Re: Is there a HTML parser who can reconstruct the original html EXACTLY?

    On 23 Jan, 14:20, kliu <> wrote:
    >
    > Thank you for your reply, but what I really need is the mapping between
    > each DOM node and the corresponding original source segment.


    At the risk of promoting unfashionable DOM technologies, you can at
    least serialise fragments of the DOM in libxml2dom [1]:

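    import libxml2dom

    # Parse the page with libxml2's forgiving HTML parser (html=1).
    d = libxml2dom.parseURI("http://www.diveintopython.org/", html=1)
    # Serialise only the eighth <p> element, as the parser sees it.
    print(d.xpath("//p")[7].toString())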

    Storage and retrieval of the original line and offset information may
    be supported by libxml2, but such information isn't exposed by
    libxml2dom.

    Paul

    [1] http://www.python.org/pypi/libxml2dom
    Paul Boddie, Jan 23, 2008
    #4
  5. Stefan Behnel

    Re: Is there a HTML parser who can reconstruct the original html EXACTLY?

    Hi,

    kliu wrote:
    > what I really need is the mapping between each DOM node and
    > the corresponding original source segment.


    I don't think that will be easy to achieve. You could get away with a parser
    that provides access to the position of an element in the source, and then map
    changes back into the document. But that won't work well in the case where the
    parser inserts or deletes content to fix up the structure.

    Anyway, the normal focus of broken HTML parsing is in fixing the source
    document, not in writing out a broken document. Maybe we could help you better
    if you explained what your actual intention is?

    Stefan
    Stefan Behnel, Jan 23, 2008
    #5
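
    A minimal sketch of the position-based idea Stefan mentions, assuming
    Python 3's standard html.parser module rather than any parser named in
    the thread. HTMLParser.getpos() and get_starttag_text() are real methods;
    PositionRecorder and line_col_to_offset are hypothetical names made up
    for illustration. It only locates start tags, and it does nothing about
    content a repairing parser would insert or delete.

```python
from html.parser import HTMLParser


class PositionRecorder(HTMLParser):
    """Record (tag, line, column, raw start-tag text) for every start tag."""

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.records = []

    def handle_starttag(self, tag, attrs):
        # getpos() reports where the tag begins: 1-based line, 0-based column.
        line, col = self.getpos()
        self.records.append((tag, line, col, self.get_starttag_text()))


def line_col_to_offset(source, line, col):
    """Convert a (line, column) pair from getpos() into a string offset.

    Assumes lines are separated by "\n", which is what the parser counts.
    """
    lines = source.split("\n")
    return sum(len(ln) + 1 for ln in lines[: line - 1]) + col


source = "text <b>bold</b>\n  <A HREF=x.html >two</a>\n"
recorder = PositionRecorder()
recorder.feed(source)
recorder.close()

# Slice the original source at each recorded position.
segments = []
for tag, line, col, raw in recorder.records:
    start = line_col_to_offset(source, line, col)
    segments.append(source[start:start + len(raw)])
print(segments)
```

    The slices come straight from the original string, so case and odd
    spacing inside the tag survive, which is the part a tidying parser
    normally throws away.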
  6. A.T.Hofkamp Guest

    Re: Is there a HTML parser who can reconstruct the original html EXACTLY?

    On 2008-01-23, kliu <> wrote:
    > On Jan 23, 7:39 pm, "A.T.Hofkamp" <> wrote:
    >> On 2008-01-23, <> wrote:
    >>
    >> > Hi, I am looking for an HTML parser that can parse a given page into
    >> > a DOM tree and can reconstruct the exact original HTML source.

    >>
    >> Why not keep a copy of the original data instead?
    >>
    >> That would be VERY MUCH SIMPLER than trying to reconstruct a parsed tree back
    >> to original source text.

    >
    > Thank you for your reply, but what I really need is the mapping between
    > each DOM node and the corresponding original source segment.


    Why do you think there is a simple one-to-one relation between nodes in some
    abstract DOM tree and pieces of source? For example, the outermost tag
    <HTML>...</HTML> is not an explicit point in the tree. If it is, what piece
    of source should be attached to it? Everything? Just the text before and after
    it? If so, what about the source text of the second tag? Last but not least,
    what do you intend to do with the source text before the <HTML> and after
    the </HTML> tags?

    In other words, you are going to have a huge problem deciding what
    "corresponding original source segment" means for each tag. This is exactly the
    reason why current tools do not do what you want.

    If you really want this, you probably have to do it yourself mostly from
    scratch (i.e. starting with a parsing framework and writing a custom parser
    yourself). That usually boils down to attaching source text to tokens in the
    lexical parsing phase. If you have a good understanding of the meaning of
    "corresponding original source segment", AND you have perfect HTML, this is
    doable, but not very nice.

    There exist parsers that can do what you want IF YOU HAVE PERFECT HTML, but
    using those tools implies a very steep learning curve of about 2-3 months under
    the assumption that you know functional languages (if you don't, add another
    2-3 months or so of steep learning curve :) ).


    If you don't have perfect HTML, you are probably more or less lost. Most tools
    cannot deal with that situation, and those that can, do smart re-shuffling to
    make things parsable, which means there is really no one-to-one mapping any
    more (after re-shuffling).


    In other words, I think you really don't want what you want, at least not in
    the way that you consider now.


    Please give us information about your goal, so we can think about alternative
    approaches to solve your problem.

    sincerely,
    Albert
    A.T.Hofkamp, Jan 23, 2008
    #6
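
    To make "attaching source text to tokens in the lexical parsing phase"
    concrete, here is a rough sketch under the same restriction Albert
    states: it only works for perfectly nested HTML. Everything in it (Node,
    tokenize, build_tree, segment, the VOID set) is a hypothetical name, and
    it settles the "which segment belongs to a node" question by one
    arbitrary choice: from the first character of the opening tag through
    the last character of the closing tag.

```python
import re
from dataclasses import dataclass, field

# Comments, tags, runs of text, and finally a lone '<'. Deliberately naive:
# it ignores <script>/<style> contents and '>' inside attribute values.
TOKEN_RE = re.compile(r"<!--.*?-->|<[^>]*>|[^<]+|<", re.DOTALL)
TAG_NAME_RE = re.compile(r"</?\s*([A-Za-z][A-Za-z0-9]*)")
VOID = {"br", "hr", "img", "input", "meta", "link"}  # no closing tag expected


@dataclass
class Node:
    name: str
    start: int                 # offset of the first character of the segment
    end: int = -1              # offset one past the last character
    children: list = field(default_factory=list)


def tokenize(source):
    """Yield (start, end) spans that cover the source without gaps."""
    for m in TOKEN_RE.finditer(source):
        yield m.start(), m.end()


def build_tree(source):
    """Build a tree whose nodes remember their span in the source."""
    root = Node("#document", 0, len(source))
    stack = [root]
    for start, end in tokenize(source):
        text = source[start:end]
        m = TAG_NAME_RE.match(text)
        if m is None:          # text, comment, doctype, or stray '<'
            continue
        name = m.group(1).lower()
        if text.startswith("</"):
            if stack[-1].name != name:
                raise ValueError(f"unbalanced </{name}> at offset {start}")
            stack.pop().end = end
        else:
            node = Node(name, start, end)
            stack[-1].children.append(node)
            if name not in VOID and not text.endswith("/>"):
                stack.append(node)
    if len(stack) != 1:
        raise ValueError(f"unclosed <{stack[-1].name}>")
    return root


def segment(source, node):
    """Return the exact original text that belongs to a node."""
    return source[node.start:node.end]


source = ("<!DOCTYPE html>\n<HTML><body><p class=a>Hi <b>there</b></p>"
          "<br></body></HTML>\n")
root = build_tree(source)
html = root.children[0]
p = html.children[0].children[0]
print(segment(source, p))
```

    Because the tokens cover the input without gaps, joining them gives back
    the source exactly. On ill-formed input build_tree simply raises, which
    is the point where the re-shuffling problem described above starts.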
  7. Fuzzyman Guest

    Re: Is there a HTML parser who can reconstruct the original html EXACTLY?

    wrote:
    > Hi, I am looking for an HTML parser that can parse a given page into
    > a DOM tree and can reconstruct the exact original HTML source.
    > Strictly speaking, I should be able to retrieve the original
    > source at each internal node of the DOM tree.
    > I have tried Beautiful Soup, which is really nice when dealing with
    > those god damned ill-formed documents, but it's a pity to find
    > that it cannot retrieve the original source because of its great tidying
    > job.
    > Since Beautiful Soup, like most of the other HTML parsers in
    > Python, is to some extent a subclass of sgmllib.SGMLParser, I have
    > investigated the source code of sgmllib.SGMLParser to see if there is
    > anything I can do to tell Beautiful Soup where it can find every tag
    > segment in the HTML source, but this will be a time-consuming job.
    > So... any ideas?
    >



    A while ago I had a similar need, but my solution may not solve your
    problem.

    I wanted to rewrite URLs contained in links and images etc, but not
    modify any of the rest of the HTML. I created an HTML parser (based on
    sgmllib) with callbacks as it encounters tags and attributes etc.

    It is easy to process a stream without 'damaging' the beautiful
    original structure of crap HTML - but it doesn't provide a DOM.


    http://www.voidspace.org.uk/python/recipebook.shtml#scraper

    All the best,

    Michael Foord
    http://www.manning.com/foord

    >
    > cheers
    > kai liu
    Fuzzyman, Jan 23, 2008
    #7
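
    The stream approach Michael describes might look roughly like the sketch
    below. It is not his recipe (that one is built on sgmllib); it assumes
    Python 3's standard html.parser, and UrlRewriter, URL_ATTR_RE and the
    rewrite callback are hypothetical names. Only href and src values inside
    start tags are touched; every other character is copied from the source
    by offset. The attribute regex is naive and could be fooled by "href="
    appearing inside another attribute's quoted value.

```python
import re
from html.parser import HTMLParser

# An href/src attribute with a double-quoted, single-quoted or bare value.
URL_ATTR_RE = re.compile(
    r"""(?<=\s)(href|src)(\s*=\s*)("[^"]*"|'[^']*'|[^\s>]+)""",
    re.IGNORECASE,
)


class UrlRewriter(HTMLParser):
    """Copy the source through unchanged, except URL attribute values."""

    def __init__(self, source, rewrite):
        super().__init__(convert_charrefs=False)
        self.source = source
        self.rewrite = rewrite      # hypothetical callback: str -> str
        self.out = []
        self.copied = 0             # source offset emitted so far
        # Offset at which each line starts, to turn getpos() into an offset.
        self.line_starts = [0]
        for m in re.finditer("\n", source):
            self.line_starts.append(m.end())

    def handle_starttag(self, tag, attrs):
        line, col = self.getpos()
        start = self.line_starts[line - 1] + col
        raw = self.get_starttag_text()
        self.out.append(self.source[self.copied:start])    # untouched text
        self.out.append(URL_ATTR_RE.sub(self._fix, raw))   # edited tag
        self.copied = start + len(raw)

    def _fix(self, m):
        value = m.group(3)
        quote = value[0] if value[0] in "\"'" else ""
        url = value[1:-1] if quote else value
        return m.group(1) + m.group(2) + quote + self.rewrite(url) + quote

    def result(self):
        """Run the parser once and return the rewritten document."""
        self.feed(self.source)
        self.close()
        return "".join(self.out) + self.source[self.copied:]


source = ('<P>See <A HREF="a.html">this</a >\n'
          '<img src=pic.png alt="x"> and more</P>')
rewritten = UrlRewriter(source, lambda url: "http://example.com/" + url).result()
print(rewritten)
```

    With an identity callback the output equals the input character for
    character, sloppy markup included, which is the property the original
    question was after, only without a DOM.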