HTML Parser which allows low-keyed local changes?

Discussion in 'Python' started by Robert, Jan 31, 2010.

  1. Robert

    Robert Guest

    I tried lxml, but after walking the element tree and making
    changes, I'm forced to do a full serialization of the whole
    document (etree.tostring(tree)), which destroys the "human edited"
    format of the original HTML code and makes it rather unreadable.

    Is there an existing HTML parser which supports tracking/writing
    back particular changes in a cautious way, by making only local
    changes? Or one that at least tracks the tag start/end positions
    in the file?


    Robert
     
    Robert, Jan 31, 2010
    #1

  2. Robert, 31.01.2010 20:57:
    > I tried lxml, but after walking and making changes in the element tree,
    > I'm forced to do a full serialization of the whole document
    > (etree.tostring(tree)) - which destroys the "human edited" format of the
    > original HTML code. makes it rather unreadable.


    What do you mean? Could you give an example? lxml certainly does not
    destroy anything it parsed, unless you tell it to do so.

    Stefan
     
    Stefan Behnel, Feb 1, 2010
    #2

  3. Robert

    Robert Guest

    Re: HTML Parser which allows low-keyed local changes (upon serialization)

    Stefan Behnel wrote:
    > Robert, 31.01.2010 20:57:
    >> I tried lxml, but after walking and making changes in the element tree,
    >> I'm forced to do a full serialization of the whole document
    >> (etree.tostring(tree)) - which destroys the "human edited" format of the
    >> original HTML code. makes it rather unreadable.

    >
    > What do you mean? Could you give an example? lxml certainly does not
    > destroy anything it parsed, unless you tell it to do so.
    >


    Of course it does not destroy anything during parsing.(?)

    I mean: I want to walk through the parsed HTML tree with a Python
    script and modify things here and there (automatic alt attributes
    from a DB or similar, link corrections, text sections/translated
    sentences...), driven by HTML code and content checks.

    Then I want to output the changed tree, but as close to the
    original format as possible. No changes to my whitespace,
    indentation, etc. Only local changes, where tags were really
    changed.

    That is similar to what a good HTML editor does: after you make
    small changes, it doesn't reformat and re-spit-out your whole
    code layout from the tree/attribute logic alone. You get local
    changes only.
    But a simple HTML editor like the one in Mozilla SeaMonkey outputs
    a whole new HTML file, generated from the logical tree only (in
    its own (ugly) style). It destroys my whitespace layout and much
    more, forgetting everything about the original layout.

    Such a "good HTML editor" must somehow track the original
    positions of the tags in the file. And with each logical change
    in the tree it must track the resulting file position
    changes/offsets. That seems to be missing in lxml and
    BeautifulSoup, which I have tried so far.

    This is a frequent need of mine. Nobody else's?

    It seems I need to write my own parser or patch BS to do that
    extra tracking?


    Robert
     
    Robert, Feb 1, 2010
    #3
  4. Robert

    Robert Guest

    Re: HTML Parser which allows low-keyed local changes (upon serialization)


    Basic features of such a parser, perhaps:

    * Can it tell, for each tag object in the parsed tree, at what
    original file position (start:end) it resided? Even more basic:
    can it tell me the line number (e.g. for warning/analysis reports)?

    * Do the tree objects automatically track/know whether they were
    changed? (That is for convenience; a copy of the tree may serve
    this purpose otherwise.)

    Creating an output with only local changes would be rather
    simple from that ...
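
    A minimal sketch of that last step, assuming the start:end offsets of
    the changed tags are already known from some position-tracking parser.
    The function name apply_local_edits and the sample markup are
    hypothetical, chosen only for illustration. Spans are spliced from the
    end of the string backwards so that earlier offsets stay valid.

```python
def apply_local_edits(source, edits):
    """Return source with each (start, end, replacement) span replaced.

    Everything outside the given spans is copied through untouched,
    so the original whitespace and layout survive.
    """
    ordered = sorted(edits)
    # Reject overlapping spans; they would make the result ambiguous.
    for (_, prev_end, _), (next_start, _, _) in zip(ordered, ordered[1:]):
        if next_start < prev_end:
            raise ValueError("overlapping edit spans")
    result = source
    # Work from the last span to the first so offsets remain correct.
    for start, end, replacement in reversed(ordered):
        result = result[:start] + replacement + result[end:]
    return result


SOURCE = '<p>\n    <img src="a.png">\n</p>\n'
old_tag = '<img src="a.png">'
start = SOURCE.index(old_tag)
patched = apply_local_edits(
    SOURCE, [(start, start + len(old_tag), '<img src="a.png" alt="Logo">')]
)
```

    Only the img tag differs between SOURCE and patched, so a "diff -u"
    of the two would show a single changed line.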


    Robert
     
    Robert, Feb 1, 2010
    #4
  5. Re: HTML Parser which allows low-keyed local changes (upon serialization)

    Robert, 01.02.2010 14:36:
    > Stefan Behnel wrote:
    >> Robert, 31.01.2010 20:57:
    >>> I tried lxml, but after walking and making changes in the element tree,
    >>> I'm forced to do a full serialization of the whole document
    >>> (etree.tostring(tree)) - which destroys the "human edited" format of the
    >>> original HTML code. makes it rather unreadable.

    >>
    >> What do you mean? Could you give an example? lxml certainly does not
    >> destroy anything it parsed, unless you tell it to do so.

    >
    > of course it does not destroy during parsing.(?)


    I meant "parsed" in the sense of "has parsed and is now working on".


    > I mean: I want to walk with a Python script through the parsed tree HTML
    > and modify here and there things (auto alt tags from DB/similar, link
    > corrections, text sections/translated sentences... due to HTML code and
    > content checks.)


    Sure, perfectly valid use case.


    > Then I want to output the changed tree - but as close to the original
    > format as far as possible. No changes to my white space identation,
    > etc.. Only lokal changes, where really tags where changed.


    That's up to you. If you only apply local changes that do not change any
    surrounding whitespace, you'll be fine.


    > Thats similiar like that what a good HTML editor does: After you made
    > little changes, it doesn't reformat/re-spit-out your whole code layout
    > from tree/attribute logic only. you have lokal changes only.


    HTML editors don't work that way. They always "re-spit-out" the whole code
    when you click on "save". They certainly don't track the original file
    position of tags. What they preserve is the content, including whitespace
    (or not, if they reformat the code, but that's usually an *option*).


    > Such a "good HTML editor" must somehow track the original positions of
    > the tags in the file. And during each logical change in the tree it must
    > tracks the file position changes/offsets.


    Sorry, but that's nonsense. The file position of a tag is determined by
    whitespace, i.e. line endings and indentation. lxml does not alter that,
    unless you tell it to do so.

    Since you keep claiming that it *does* alter it, please come up with a
    reproducible example that shows a) what you do in your code, b) what your
    input is and c) what unexpected output it creates. Do not forget to include
    the version number of lxml and libxml2 that you are using, as well as a
    comment on /how/ the output differs from what you expected.

    My stab in the dark is that you forgot to copy the tail text of elements
    that you replace by new content, and that you didn't properly indent new
    content that you added. But that's just that, a stab in the dark. You
    didn't provide enough information for even an educated guess.
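
    A minimal sketch of the tail-text point, using the standard library's
    xml.etree.ElementTree as a stand-in (it shares the .text/.tail model
    of the ElementTree API that lxml.etree implements). The helper name
    replace_keeping_tail and the sample markup are illustrative
    assumptions, and the input is well-formed XML rather than arbitrary
    HTML.

```python
import xml.etree.ElementTree as ET

SOURCE = "<p>\n  <b>old</b> trailing text\n  <i>kept</i>\n</p>"


def replace_keeping_tail(parent, old, new):
    """Swap old for new under parent, carrying over the tail text.

    The tail holds the text and whitespace that follow the element's
    end tag, so dropping it changes the surrounding layout.
    """
    new.tail = old.tail
    index = list(parent).index(old)
    parent.remove(old)
    parent.insert(index, new)


root = ET.fromstring(SOURCE)
replacement = ET.Element("strong")
replacement.text = "new"
replace_keeping_tail(root, root.find("b"), replacement)
result = ET.tostring(root, encoding="unicode")
```

    With the tail copied, the text after the replaced element and the
    indentation of the following line are serialized unchanged.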

    Stefan
     
    Stefan Behnel, Feb 1, 2010
    #5
  6. Robert

    Robert Guest

    Re: HTML Parser which allows low-keyed local changes (upon serialization)


    I think you confused the logical level of what I meant by "file
    position":
    Of course it's not (necessarily) about writing back to the same
    open file (OS level), but about the whole serialization string,
    wherever it is finally written to. I typically write the
    auto-converted HTML files to a second test folder first, and
    want to use "diff -u ..." to see in human-readable form what
    changed, which again is only reasonable if the original layout
    is preserved as well as possible.

    lxml and BeautifulSoup, for example: load and parse an HTML file
    into a tree, immediately serialize the tree without changes, and
    you see big differences between the original and the serialized
    file, with almost any file.
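
    A minimal sketch of such a round-trip check, using only the standard
    library: xml.etree.ElementTree stands in for the parser (so the input
    must be well-formed XML, an assumption that real HTML often violates)
    and difflib produces the "diff -u" style output. The function name
    round_trip_diff and the sample markup are hypothetical.

```python
import difflib
import xml.etree.ElementTree as ET


def round_trip_diff(source):
    """Parse source, serialize it unchanged, and diff the two strings."""
    serialized = ET.tostring(ET.fromstring(source), encoding="unicode")
    diff = difflib.unified_diff(
        source.splitlines(keepends=True),
        serialized.splitlines(keepends=True),
        fromfile="original",
        tofile="round-trip",
    )
    return serialized, list(diff)


SOURCE = "<p class='note'>first<br/>\n   second</p>"
serialized, diff_lines = round_trip_diff(SOURCE)
print("".join(diff_lines))
```

    In this sample the whitespace inside the paragraph survives, but the
    attribute quoting and the form of the empty tag are rewritten, since
    the tree does not record how they were spelled in the source.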

    The main issue: those libs seem not to track any information about
    the original string/file positions of the objects they parse. They
    just forget the past. Thus, it seems, they cannot in principle do
    what I want ...

    Or does anybody see attributes of the tree objects which I
    overlooked? Or a lib which can do this source-connected editing,
    or at least enables it better?


    Robert
     
    Robert, Feb 1, 2010
    #6
  7. Tim Arnold

    Tim Arnold Guest

    Re: HTML Parser which allows low-keyed local changes (upon serialization)

    "Robert" <> wrote in message
    news:hk729b$naa$...
    > Stefan Behnel wrote:
    >> Robert, 01.02.2010 14:36:
    >>> Stefan Behnel wrote:
    >>>> Robert, 31.01.2010 20:57:
    >>>>> I tried lxml, but after walking and making changes in the element tree,
    >>>>> I'm forced to do a full serialization of the whole document
    >>>>> (etree.tostring(tree)) - which destroys the "human edited" format of the
    >>>>> original HTML code. makes it rather unreadable.
    >>>> What do you mean? Could you give an example? lxml certainly does not
    >>>> destroy anything it parsed, unless you tell it to do so.
    >>> of course it does not destroy during parsing.(?)

    >>


    I think I understand what you want, but I don't understand why yet. Do you
    want to view the differences in an IDE or something like that? If so, why
    not pretty-print both and compare that?
    --Tim
     
    Tim Arnold, Feb 1, 2010
    #7
  8. Nobody

    Nobody Guest

    On Sun, 31 Jan 2010 20:57:31 +0100, Robert wrote:

    > I tried lxml, but after walking and making changes in the element
    > tree, I'm forced to do a full serialization of the whole document
    > (etree.tostring(tree)) - which destroys the "human edited" format
    > of the original HTML code.
    > makes it rather unreadable.
    >
    > is there an existing HTML parser which supports tracking/writing
    > back particular changes in a cautious way by just making local
    > changes? or a least tracks the tag start/end positions in the file?


    HTMLParser, sgmllib.SGMLParser and htmllib.HTMLParser all allow you to
    retrieve the literal text of a start tag (but not an end tag).
    Unfortunately, they're only tokenisers, not parsers, so you'll need to
    handle minimisation yourself.
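
    A minimal sketch of that tokeniser approach, written against the
    Python 3 module html.parser (the successor of the Python 2 HTMLParser
    module named above). The class name StartTagLocator and the sample
    markup are illustrative assumptions. getpos() reports the line and
    column where the current tag begins, and get_starttag_text() returns
    its literal source text, which together give a start:end span for
    each start tag. End tags are not recorded here.

```python
from html.parser import HTMLParser


class StartTagLocator(HTMLParser):
    """Collect (tag, line, start, end) for every start tag in source."""

    def __init__(self, source):
        super().__init__()
        # Absolute offset of the first character of each line.
        self.line_starts = [0]
        for index, char in enumerate(source):
            if char == "\n":
                self.line_starts.append(index + 1)
        self.tags = []
        self.feed(source)
        self.close()

    def handle_starttag(self, tag, attrs):
        line, column = self.getpos()  # line is 1-based, column 0-based
        start = self.line_starts[line - 1] + column
        end = start + len(self.get_starttag_text())
        self.tags.append((tag, line, start, end))


SOURCE = '<p>\n  <a href="x.html">link</a>\n</p>'
located = StartTagLocator(SOURCE).tags
for tag, line, start, end in located:
    print(tag, line, repr(SOURCE[start:end]))
```

    Slicing the original string with the recorded offsets gives back the
    tag exactly as it was written, which is what a local rewrite needs.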
     
    Nobody, Feb 2, 2010
    #8