How to get an XML DOM while offline?

Discussion in 'Python' started by william tanksley, Mar 19, 2008.

  1. I want to parse my iTunes Library xml. All was well, until I unplugged
    and left for the train (where I get most of my personal projects
    done). All of a sudden, I discovered that apparently the presence of a
    DOCTYPE in the iTunes XML makes xml.dom.minidom insist on accessing
    the Internet... So suddenly I was unable to do any work.

    I don't want to modify the iTunes XML; iTunes rewrites it too often.
    How can I prevent xml.dom.minidom from dying when it can't access the
    Internet?

    Is there a simpler way to read the iTunes XML? (It's merely a plist,
    so the format is much simpler than general XML.)

    -Wm
    william tanksley, Mar 19, 2008
    #1

  2. william tanksley wrote:

    > I want to parse my iTunes Library xml. All was well, until I unplugged
    > and left for the train (where I get most of my personal projects
    > done). All of a sudden, I discovered that apparently the presence of a
    > DOCTYPE in the iTunes XML makes xml.dom.minidom insist on accessing
    > the Internet... So suddenly I was unable to do any work.
    >
    > I don't want to modify the iTunes XML; iTunes rewrites it too often.
    > How can I prevent xml.dom.minidom from dying when it can't access the
    > Internet?
    >
    > Is there a simpler way to read the iTunes XML? (It's merely a plist,
    > so the format is much simpler than general XML.)


    Normally, this should be solved using an entity-handler that prevents the
    remote fetching. I presume the underlying implementation of a SAX-parser
    does use one, but you can't override that (at least I didn't find anything
    in the docs)
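For what it's worth, xml.sax does expose a switch for this. The following is a minimal sketch, assuming Python 3 and assuming the fetch comes from the SAX expat reader's handling of external entities; the sample document and the helper name `parse_offline` are made up for illustration. It turns off `feature_external_ges` on a SAX parser and hands that parser to `xml.dom.minidom.parse`:

```python
import io
import xml.sax
from xml.sax.handler import feature_external_ges
from xml.dom import minidom

# Hypothetical stand-in for the iTunes file; the real one is far larger.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple Computer//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0"><dict><key>Name</key><string>Song</string></dict></plist>
"""

def parse_offline(stream):
    """Build a minidom Document without resolving external entities."""
    parser = xml.sax.make_parser()
    # With this feature off, the expat reader skips external entities
    # (including the external DTD subset) instead of opening them.
    parser.setFeature(feature_external_ges, False)
    # Passing a parser makes minidom build the tree from SAX events.
    return minidom.parse(stream, parser)

doc = parse_offline(io.BytesIO(SAMPLE))
first_key = doc.getElementsByTagName("key")[0].firstChild.data
```

The same function should accept an open file object for the real library file. Recent Python 3 releases already default this feature to off, so the explicit call mostly matters on older interpreters.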

    The most pragmatic solution would be to rip the doctype out using simple
    string methods and/or regexes.
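A minimal sketch of that approach, assuming Python 3 and assuming the DOCTYPE has no internal subset (no `[...]` block), which holds for the usual plist header. The sample text and the name `strip_doctype` are hypothetical. The file on disk is left untouched; only the in-memory copy is changed:

```python
import re
from xml.dom import minidom

# Matches a DOCTYPE declaration without an internal subset.
DOCTYPE_RE = re.compile(r"<!DOCTYPE[^>]*>\s*")

def strip_doctype(xml_text):
    """Return xml_text with its first DOCTYPE declaration removed."""
    return DOCTYPE_RE.sub("", xml_text, count=1)

# Hypothetical stand-in for the text read from the iTunes file.
SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple Computer//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0"><dict><key>Name</key><string>Song</string></dict></plist>
"""

cleaned = strip_doctype(SAMPLE)
doc = minidom.parseString(cleaned)
```

Because the stripping happens after reading, it does not matter how often iTunes rewrites the file.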

    Diez
    Diez B. Roggisch, Mar 19, 2008
    #2

  3. Paul Boddie

    On 19 Mar, 16:27, "Diez B. Roggisch" <> wrote:
    > william tanksley wrote:
    > > I want to parse my iTunes Library xml. All was well, until I unplugged
    > > and left for the train (where I get most of my personal projects
    > > done). All of a sudden, I discovered that apparently the presence of a
    > > DOCTYPE in the iTunes XML makes xml.dom.minidom insist on accessing
    > > the Internet... So suddenly I was unable to do any work.


    The desire to connect to the Internet for DTDs is documented in the
    following bug:

    http://bugs.python.org/issue2124

    However, I can't reproduce the problem using xml.dom.minidom.parse/
    parseString and plain XHTML, although I may be missing something which
    activates the retrieval of the DTD.

    > > I don't want to modify the iTunes XML; iTunes rewrites it too often.
    > > How can I prevent xml.dom.minidom from dying when it can't access the
    > > Internet?

    >
    > > Is there a simpler way to read the iTunes XML? (It's merely a plist,
    > > so the format is much simpler than general XML.)

    >
    > Normally, this should be solved using an entity-handler that prevents the
    > remote fetching. I presume the underlying implementation of a SAX-parser
    > does use one, but you can't override that (at least I didn't find anything
    > in the docs)


    There's a lot of complicated stuff in the xml.dom package, but I found
    that the DOMBuilder class (in xml.dom.xmlbuilder) probably contains
    the things which switch such behaviour on or off. That said, I've
    hardly ever used the most formal DOM classes to parse XML in Python
    (where you get the DOM implementation and then create other factory
    classes - it's all very "Java" in nature), so the precise incantation
    is unknown/forgotten to me.

    > The most pragmatic solution would be to rip the doctype out using simple
    > string methods and/or regexes.


    Maybe, but an example fragment of the XML might help us diagnose the
    problem, ideally with some commentary from the people who wrote the
    xml.dom software in the first place.

    Paul
    Paul Boddie, Mar 19, 2008
    #3
  4. "Diez B. Roggisch" <> wrote:
    > The most pragmatic solution would be to rip the doctype out using simple
    > string methods and/or regexes.


    Thank you, Diez and Paul; I took Diez's solution, and it works well
    enough for me.

    > Diez


    -Wm
    william tanksley, Mar 31, 2008
    #4
  5. william tanksley wrote:
    > I want to parse my iTunes Library xml. All was well, until I unplugged
    > and left for the train (where I get most of my personal projects
    > done). All of a sudden, I discovered that apparently the presence of a
    > DOCTYPE in the iTunes XML makes xml.dom.minidom insist on accessing
    > the Internet... So suddenly I was unable to do any work.
    >
    > I don't want to modify the iTunes XML; iTunes rewrites it too often.
    > How can I prevent xml.dom.minidom from dying when it can't access the
    > Internet?
    >
    > Is there a simpler way to read the iTunes XML? (It's merely a plist,
    > so the format is much simpler than general XML.)


    Try lxml. Since version 2.0, its parsers will not access the network unless
    you tell them to do so.

    http://codespeak.net/lxml

    It's also much easier to use than minidom and much faster:

    http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/

    Stefan
    Stefan Behnel, Apr 7, 2008
    #5
  6. Stefan Behnel wrote:

    >> Is there a simpler way to read the iTunes XML? (It's merely a plist,
    >> so the format is much simpler than general XML.)

    >
    > Try lxml. Since version 2.0, its parsers will not access the network unless
    > you tell them to do so.
    >
    > http://codespeak.net/lxml


    which makes it true for all ET implementations (the whole idea that
    parsing a file should result in unexpected network access is of course a
    potential security risk and one of a number of utterly stupid design
    decisions in XML).

    you'll find plist reading code here, btw:

    http://effbot.org/zone/element-iterparse.htm#incremental-decoding

    replace the import with "from xml.etree import cElementTree" if you're
    running 2.5.

    (not sure if that one works with lxml, though, but that should be
    fixable. you can at least reuse the unmarshaller dict).
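To give an idea of what ElementTree-based plist reading looks like, here is a rough sketch, assuming Python 3. It is not the code from the linked page: the function name `plist_value` and the sample data are invented for illustration, and `date` and `data` elements are simply left as text. It walks the parsed tree recursively and converts each element to a plain Python value:

```python
import xml.etree.ElementTree as ET

def plist_value(elem):
    """Convert one plist element into a plain Python object."""
    tag = elem.tag
    if tag == "dict":
        # Children alternate: <key>, value, <key>, value, ...
        children = list(elem)
        return {k.text: plist_value(v)
                for k, v in zip(children[0::2], children[1::2])}
    if tag == "array":
        return [plist_value(child) for child in elem]
    if tag == "integer":
        return int(elem.text)
    if tag == "real":
        return float(elem.text)
    if tag == "true":
        return True
    if tag == "false":
        return False
    # string, date, and data are returned as raw text in this sketch.
    return elem.text or ""

# Hypothetical miniature of an iTunes library file.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple Computer//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
  <key>Major Version</key><integer>1</integer>
  <key>Tracks</key>
  <dict>
    <key>101</key>
    <dict>
      <key>Name</key><string>Example Song</string>
      <key>Compilation</key><true/>
    </dict>
  </dict>
</dict>
</plist>
"""

root = ET.fromstring(SAMPLE)    # the DOCTYPE is not fetched
library = plist_value(root[0])  # first child of <plist> is the top-level dict
```

The standard library's plistlib module reads the same format directly (`plistlib.load` in Python 3), which may be simpler still if a dict is all that is needed.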

    </F>
    Fredrik Lundh, Apr 7, 2008
    #6