nonstandard XML character entities?

Discussion in 'Python' started by Paul Rubin, Apr 14, 2007.

  1. Paul Rubin

    Paul Rubin Guest

    I'm new to xml mongering so forgive me if there's an obvious
    well-known answer to this. It's not real obvious from the library
    documentation I've looked at so far. Basically I have to munch on a
    bunch of xml files which contain character entities like &uacute;
    which are apparently nonstandard. They appear in w3.org tables but
    xml.etree.cElementTree.ElementTree.parse barfs at them and xmllint
    barfs at them.

    Basically I want to know if there's a way to supply the regular parser
    (preferably xml.etree but I guess I can switch to another one if
    necessary) with some kind of entity table, and/or if the info is
    supposed to be found in the DTD or someplace like that. Right now I'm
    ignoring the DTD and simply figuring out the doc structure by
    eyeballing the xml files, maybe not a perfectly approved method but
    it seems to be what most people do.

    Thanks
     
    Paul Rubin, Apr 14, 2007
    #1

  2. Martin v. Löwis

    > I'm new to xml mongering so forgive me if there's an obvious
    > well-known answer to this. It's not real obvious from the library
    > documentation I've looked at so far. Basically I have to munch on a
    > bunch of xml files which contain character entities like &uacute;
    > which are apparently nonstandard.


    If they contain such things, and do not contain a document type
    definition, they are not well-formed XML files (i.e. can't be
    called "XML" in a meaningful sense).

    It would have been helpful if you had given an example of such
    a document.

    > Basically I want to know if there's a way to supply the regular parser
    > (preferably xml.etree but I guess I can switch to another one if
    > necessary) with some kind of entity table, and/or if the info is
    > supposed to be found in the DTD or someplace like that. Right now I'm
    > ignoring the DTD and simply figuring out the doc structure by
    > eyeballing the xml files, maybe not a perfectly approved method but
    > it seems to be what most people do.


    If there is a document type declaration in the document, the best
    way is to parse it in a mode where the parser downloads the DTD
    when parsing it, and resolves the entity references itself.

    In SAX, you can put an EntityResolver into the parser, and then
    return a file-like object from resolveEntity. This can be used
    to avoid the network download; the document type declaration
    would still have to be present.

    Alternatively, you can implement a skippedEntity callback in
    the SAX content handler.
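
    The skippedEntity route can be sketched in current Python 3 roughly as
    follows. This is an illustration under assumptions, not code from this
    thread: the sample document, the TextCollector class and the
    collect_text helper are made up, and it borrows the HTML 4 entity table
    from html.entities on the guess that the entities in question are the
    HTML ones. The DTD named in the DOCTYPE is never fetched.

```python
import io
import xml.sax
from html.entities import name2codepoint
from xml.sax.handler import ContentHandler, feature_external_ges

# Hypothetical sample: the DOCTYPE points at an external DTD, and the body
# uses an entity that the document itself never declares.
SAMPLE = (
    b'<!DOCTYPE ONIXmessage SYSTEM "onix-international.dtd">\n'
    b'<ONIXmessage release="2.1"><b036>Diana Montan&eacute;</b036></ONIXmessage>'
)


class TextCollector(ContentHandler):
    """Collect character data, expanding entities the parser skipped."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def characters(self, content):
        self.chunks.append(content)

    def skippedEntity(self, name):
        # Look the name up in the HTML 4 table (an assumption about the data);
        # names that are not in the table are kept as literal references.
        codepoint = name2codepoint.get(name)
        if codepoint is None:
            self.chunks.append("&%s;" % name)
        else:
            self.chunks.append(chr(codepoint))


def collect_text(data):
    """Return all character data in the XML bytes as one string."""
    parser = xml.sax.make_parser()
    # Make sure the external DTD is not loaded; unresolved entity
    # references are then reported through skippedEntity.
    parser.setFeature(feature_external_ges, False)
    handler = TextCollector()
    parser.setContentHandler(handler)
    parser.parse(io.BytesIO(data))
    return "".join(handler.chunks)


text = collect_text(SAMPLE)
```

    A real handler would also track elements; this one only gathers text to
    show where the skipped entity shows up in the event stream.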

    In ElementTree, the XMLTreeBuilder has an attribute entity
    which is a dictionary used to map entity names in entity references
    to their definitions. Whether you can make the parser download
    the DTD itself, I don't know.
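
    For the entity dictionary, here is a minimal sketch in current Python 3,
    where the parser class is xml.etree.ElementTree.XMLParser and the
    existing dictionary is updated in place. The sample document and the
    parse_with_html_entities helper are hypothetical, and using the HTML 4
    table from html.entities is an assumption about which entities the
    files contain. As far as I know this relies on the document carrying a
    DOCTYPE; without one the underlying expat parser may reject the
    reference before the dictionary is consulted.

```python
import xml.etree.ElementTree as ET
from html.entities import name2codepoint

# Hypothetical sample: the DOCTYPE names an external DTD (never fetched here)
# and the body uses an entity that the document itself does not declare.
SAMPLE = (
    '<!DOCTYPE ONIXmessage SYSTEM "onix-international.dtd">\n'
    '<ONIXmessage release="2.1"><b036>Diana Montan&eacute;</b036></ONIXmessage>'
)


def parse_with_html_entities(text):
    """Parse XML text, resolving HTML-style named entities such as &eacute;."""
    parser = ET.XMLParser()
    # parser.entity maps entity names to replacement text; fill it from the
    # HTML 4 table that ships with the standard library.
    parser.entity.update((name, chr(cp)) for name, cp in name2codepoint.items())
    return ET.fromstring(text, parser=parser)


root = parse_with_html_entities(SAMPLE)
name = root.findtext("b036")
```

    Entities outside the HTML 4 table would still raise a parse error, so a
    vendor-specific name would have to be added to the dictionary by hand.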

    Regards,
    Martin
     
    Martin v. Löwis, Apr 14, 2007
    #2

  3. Chuck Rhode

    Chuck Rhode Guest

    Martin v. Löwis wrote this on Sat, 14 Apr 2007 09:10:44 +0200. My
    reply is below.

    > Paul Rubin:


    >> I'm new to xml mongering so forgive me if there's an obvious
    >> well-known answer to this. It's not real obvious from the library
    >> documentation I've looked at so far. Basically I have to munch on
    >> a bunch of xml files which contain character entities like &uacute;
    >> which are apparently nonstandard.


    -snip-

    > In ElementTree, the XMLTreeBuilder has an attribute entity which is
    > a dictionary used to map entity names in entity references to their
    > definitions. Whether you can make the parser download the DTD
    > itself, I don't know.


    What he said....

    Try this on your piano:

    : import xml.etree.ElementTree # or elementtree.ElementTree prior to 2.5
    : ElementTree = xml.etree.ElementTree
    : import htmlentitydefs


    : class XmlFile(ElementTree.ElementTree):

    :     def __init__(self, file=None, tag='global', **extra):
    :         ElementTree.ElementTree.__init__(self)
    :         # Element was undefined in the snippet as posted; the stock
    :         # ElementTree.Element factory is assumed here.
    :         parser = ElementTree.XMLTreeBuilder(
    :             target=ElementTree.TreeBuilder(ElementTree.Element))
    :         # Map entity names (e.g. eacute) to their HTML definitions.
    :         parser.entity = htmlentitydefs.entitydefs
    :         self.parse(source=file, parser=parser)
    :         return


    It looks goofy as can be, but it works for me.

    --
    ... Chuck Rhode, Sheboygan, WI, USA
    ... Weather: http://LacusVeris.com/WX
    ... 32° — Wind Calm
     
    Chuck Rhode, Apr 14, 2007
    #3
  4. Chuck Rhode

    Chuck Rhode Guest

    Chuck Rhode wrote this on Sat, 14 Apr 2007 09:04:45 -0500. My reply is
    below.

    Fixed text wrap:

    > import xml.etree.ElementTree # or elementtree.ElementTree prior to 2.5
    > ElementTree = xml.etree.ElementTree
    > import htmlentitydefs


    > class XmlFile(ElementTree.ElementTree):

    >     def __init__(self, file=None, tag='global', **extra):
    >         ElementTree.ElementTree.__init__(self)
    >         # Element was undefined in the snippet as posted; the stock
    >         # ElementTree.Element factory is assumed here.
    >         parser = ElementTree.XMLTreeBuilder(
    >             target=ElementTree.TreeBuilder(ElementTree.Element))
    >         # Map entity names (e.g. eacute) to their HTML definitions.
    >         parser.entity = htmlentitydefs.entitydefs
    >         self.parse(source=file, parser=parser)
    >         return



    --
    ... Chuck Rhode, Sheboygan, WI, USA
    ... Weather: http://LacusVeris.com/WX
    ... 32° — Wind Calm
     
    Chuck Rhode, Apr 14, 2007
    #4
  5. Paul Rubin

    Paul Rubin Guest

    "Martin v. Löwis" writes:
    > If they contain such things, and do not contain a document type
    > definition, they are not well-formed XML files (i.e. can't be
    > called "XML" in a meaningful sense).


    The documents do have a DTD; however, the DTD file doesn't say anything
    about these entities.

    > It would have been helpful if you had given an example of such
    > a document.


    I can't post a whole document because these docs are very large
    and I'm not sure that the data is public. It does look like the DTD
    is public: the document begins with

    <?xml version="1.0" encoding="ISO-8859-1"?>
    <!DOCTYPE ONIXmessage SYSTEM "http://www.editeur.org/onix/2.1/short/onix-international.dtd">
    <ONIXmessage release="2.1">
    ...

    and that url points to the DTD which is online.

    Basically the doc has elements like

    <b036>Diana Montan&eacute;</b036>

    and both ElementTree and xmllint complain about the character entities
    (and there are a lot of them).

    > If there is a document type declaration in the document, the best
    > way is to parse it in a mode where the parser downloads the DTD
    > when parsing it, and resolves the entity references itself.


    Hmm, ok, I see there are a lot of <!ENTITY ...> directives in the
    DTD but nothing about those character entities--am I looking in
    the right place?

    > In ElementTree, the XMLTreeBuilder has an attribute entity
    > which is a dictionary used to map entity names in entity references
    > to their definitions. Whether you can make the parser download
    > the DTD itself, I don't know.


    Chuck Rhode posted some code for something like this so I'll try it
    on Monday.

    Thanks!
     
    Paul Rubin, Apr 14, 2007
    #5
