XML/XHTML/HTML differences, bugs... and howto

Discussion in 'Python' started by Andrew Robinson, Jan 23, 2013.

  1. Good day :),

    I've been exploring XML parsers in Python, particularly
    xml.etree.cElementTree, and I'm trying to figure out how to parse
    incrementally, for very large XML files -- although I don't think the
    problems are restricted to incremental parsing.

    First problem:
    I've come across an issue where etree appears to drop text silently,
    without raising any error.

    I am under the impression that XHTML is a subset of XML (e.g. with a
    defined set of tags), and that once an HTML file is converted to XHTML,
    the body of the document can be handled entirely as XML.

    If I convert a (partial/contrived) html file like:

    <html>
    <div>
    <p> This is example <b>bold</b> text.
    </div>
    </html>

    to XHTML, I might do --right or wrong-- (1):

    <html>
    <div>
    <p /> This is example <b>bold</b> text.
    </div>
    </html>

    or, alternatively, (2): "<p> This is example <b>bold</b> text. </p>"

    But when I parse with etree, in example (1) both "This is example"
    and "text." are dropped.
    The missing text is part of the start, or end event tags, in the
    incrementally parsed method.

    Likewise, in example (2), only "text." gets dropped.

    So, etree is silently dropping all text following a close tag, but
    before another open tag happens.

    Q:
    Isn't an XML parser supposed to error out when invalid XML is parsed?
    Is there a way in etree to recover/access the dropped text?
    If not -- is this a Python library issue, or an issue in the underlying
    expat.so, etc. library?
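
A note on the first problem, offered as a sketch and not a definitive answer: both snippets are well-formed XML (text mixed with elements is allowed), so the parser has no reason to raise an error. ElementTree stores text that follows an element's end tag in that element's "tail" attribute, not in "text", so the text that looks dropped should be reachable there. The example below assumes xml.etree.ElementTree (the cElementTree alias is gone in recent Python 3 releases) and writes example (1) on a single line; the names doc1, p, b and full_text are illustrative only.

```python
import xml.etree.ElementTree as ET

# Example (1) from the post, on one line. <p /> is an empty element,
# so the text after it is not inside <p>.
doc1 = "<html><div><p /> This is example <b>bold</b> text.</div></html>"
root = ET.fromstring(doc1)

p = root.find("div/p")
b = root.find("div/b")

print(repr(p.text))  # None: nothing is inside the empty <p />
print(repr(p.tail))  # ' This is example ': text after <p />, before <b>
print(repr(b.text))  # 'bold'
print(repr(b.tail))  # ' text.': text after </b>, before </div>

# itertext() yields text and tail values in document order.
full_text = "".join(root.itertext())
print(repr(full_text))  # ' This is example bold text.'
```

With iterparse, as far as I know, an element's tail is not guaranteed to be filled in yet when its "end" event is delivered, so it is safer to read tail only after a later event has arrived.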

    Second problem:
    I have an XML file which will grow larger than memory on a target
    machine, so here's what I want to do:

    Given a source XML file, and a destination file:
    1) Iteratively scan part of the source tree.
    2) Optionally modify some of the scanned tree.
    3) Write the partial scan/tree out to the destination file.
    4) Free the memory of the no-longer-needed (partial) source XML.
    5) Continue scanning a new section of the source file, i.e. go to step 1,
    until the source file is exhausted.

    But, I don't see a way to write portions of an XML tree, or iteratively
    write a tree to disk.
    How can this be done?
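
The five steps above can be sketched with iterparse plus ET.tostring, which serializes a single element. This is one possible approach under stated assumptions, not the only way to do it: the function name stream_copy and its modify callback are hypothetical, the large file is assumed to be a root element holding many modest-sized direct children, no XML namespaces are used, and any text between those children (root text and child tails) is treated as discardable whitespace.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import quoteattr


def stream_copy(src, dst, modify=None):
    """Copy src to dst one top-level child at a time; return the child count.

    Assumptions: no namespaces, and text between the root's direct
    children is only whitespace that may be dropped.
    """
    events = ET.iterparse(src, events=("start", "end"))
    _, root = next(events)  # first event is the start of the root element
    depth = 0               # nesting depth below the root
    count = 0
    with open(dst, "w", encoding="utf-8") as out:
        attrs = "".join(" %s=%s" % (k, quoteattr(v)) for k, v in root.attrib.items())
        out.write("<%s%s>\n" % (root.tag, attrs))
        for event, elem in events:           # step 1: scan incrementally
            if event == "start":
                depth += 1
                continue
            if depth == 0:
                continue                     # end event of the root itself
            depth -= 1
            if depth == 0:                   # a direct child is complete
                if modify is not None:
                    modify(elem)             # step 2: optional modification
                elem.tail = None             # discard inter-record text
                out.write(ET.tostring(elem, encoding="unicode") + "\n")  # step 3
                root.remove(elem)            # step 4: let the subtree be freed
                count += 1
        out.write("</%s>\n" % root.tag)
    return count
```

Usage would look like stream_copy("big.xml", "out.xml", modify=some_function), where both file names are placeholders. Memory use is then bounded by the largest single child of the root, since each child is removed from the tree as soon as it has been written.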

    :) Thanks!
     
    Andrew Robinson, Jan 23, 2013