nonstandard XML character entities?


Paul Rubin

I'm new to xml mongering so forgive me if there's an obvious
well-known answer to this. It's not real obvious from the library
documentation I've looked at so far. Basically I have to munch through a
bunch of xml files which contain character entities like &uacute;
which are apparently nonstandard. They appear in w3.org tables but
xml.etree.cElementTree.ElementTree.parse barfs at them and xmllint
barfs at them.

Basically I want to know if there's a way to supply the regular parser
(preferably xml.etree but I guess I can switch to another one if
necessary) with some kind of entity table, and/or if the info is
supposed to be found in the DTD or someplace like that. Right now I'm
ignoring the DTD and simply figuring out the doc structure by
eyeballing the xml files, maybe not a perfectly approved method but
it seems to be what most people do.

Thanks
 

Martin v. Löwis

Paul Rubin said:
I'm new to xml mongering so forgive me if there's an obvious
well-known answer to this. It's not real obvious from the library
documentation I've looked at so far. Basically I have to munch through a
bunch of xml files which contain character entities like &uacute;
which are apparently nonstandard.

If they contain such things, and do not contain a document type
definition, they are not well-formed XML files (i.e. can't be
called "XML" in a meaningful sense).

It would have been helpful if you had given an example of such
a document.

Paul Rubin said:
Basically I want to know if there's a way to supply the regular parser
(preferably xml.etree but I guess I can switch to another one if
necessary) with some kind of entity table, and/or if the info is
supposed to be found in the DTD or someplace like that. Right now I'm
ignoring the DTD and simply figuring out the doc structure by
eyeballing the xml files, maybe not a perfectly approved method but
it seems to be what most people do.

If there is a document type declaration in the document, the best
way is to parse it in a mode where the parser downloads the DTD
when parsing it, and resolves the entity references itself.

In SAX, you can put an EntityResolver into the parser, and then
return a file-like object from resolveEntity. This can be used
to avoid the network download; the document type declaration
would still have to be present.
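
A minimal sketch of the EntityResolver approach with Python 3's xml.sax, not a definitive recipe. The sample document, the placeholder DTD URL, the one-line stand-in DTD, and the names LocalDTDResolver, TextCollector and parse_with_local_dtd are all assumptions made up for illustration. Since Python 3.7.1 the SAX parser skips external entities by default, so the sketch switches feature_external_ges on; as far as I can tell that flag also governs whether the expat driver asks the resolver for the external DTD subset. Only do this for input you trust.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler, EntityResolver, feature_external_ges
from xml.sax.xmlreader import InputSource

# Hypothetical stand-in for a locally cached DTD; a real one would live on disk.
LOCAL_DTD = b'<!ENTITY eacute "&#233;">'

# Hypothetical sample document; the DTD URL is a placeholder and is never fetched.
SAMPLE = (
    b'<?xml version="1.0" encoding="ISO-8859-1"?>\n'
    b'<!DOCTYPE ONIXmessage SYSTEM "http://example.invalid/onix.dtd">\n'
    b'<ONIXmessage><b036>Diana Montan&eacute;</b036></ONIXmessage>'
)


class LocalDTDResolver(EntityResolver):
    """Answer every external entity request from LOCAL_DTD, not the network."""

    def resolveEntity(self, publicId, systemId):
        source = InputSource(systemId)
        source.setByteStream(io.BytesIO(LOCAL_DTD))
        return source


class TextCollector(ContentHandler):
    """Collect all character data so the resolved text can be inspected."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def characters(self, content):
        self.chunks.append(content)


def parse_with_local_dtd(data):
    parser = xml.sax.make_parser()
    # Let the parser ask the resolver for the external DTD subset.
    parser.setFeature(feature_external_ges, True)
    parser.setEntityResolver(LocalDTDResolver())
    handler = TextCollector()
    parser.setContentHandler(handler)
    parser.parse(io.BytesIO(data))
    return "".join(handler.chunks)


resolved_text = parse_with_local_dtd(SAMPLE)
```

The document type declaration still has to be present in the document, as noted above; the resolver only changes where the DTD bytes come from.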

Alternatively, you can implement a skippedEntity callback in
the SAX content handler.
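
A rough sketch of the skippedEntity route, under some assumptions: Python 3, a document that carries a DOCTYPE with an external ID (so the parser reports the unknown reference as skipped instead of raising an error), and HTML entity names taken from html.entities.name2codepoint. The sample document and the names EntityFillingHandler and parse_filling_entities are made up for illustration.

```python
import io
import xml.sax
from html.entities import name2codepoint
from xml.sax.handler import ContentHandler, feature_external_ges

# Hypothetical sample document; the DTD URL is a placeholder and is never fetched.
SAMPLE = (
    b'<?xml version="1.0" encoding="ISO-8859-1"?>\n'
    b'<!DOCTYPE ONIXmessage SYSTEM "http://example.invalid/onix.dtd">\n'
    b'<ONIXmessage><b036>Diana Montan&eacute;</b036></ONIXmessage>'
)


class EntityFillingHandler(ContentHandler):
    """Collect text, substituting known HTML entity names for skipped entities."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def characters(self, content):
        self.chunks.append(content)

    def skippedEntity(self, name):
        codepoint = name2codepoint.get(name)
        if codepoint is not None:
            self.chunks.append(chr(codepoint))
        else:
            # Keep unknown references visible instead of dropping them.
            self.chunks.append("&%s;" % name)


def parse_filling_entities(data):
    parser = xml.sax.make_parser()
    # Make sure the DTD is not fetched; undeclared entities are then skipped.
    parser.setFeature(feature_external_ges, False)
    handler = EntityFillingHandler()
    parser.setContentHandler(handler)
    parser.parse(io.BytesIO(data))
    return "".join(handler.chunks)


filled_text = parse_filling_entities(SAMPLE)
```

This avoids touching the DTD at all, at the cost of trusting that the entity names mean what the HTML tables say they mean.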

In ElementTree, the XMLTreeBuilder has an attribute entity
which is a dictionary used to map entity names in entity references
to their definitions. Whether you can make the parser download
the DTD itself, I don't know.

Regards,
Martin
 

Chuck Rhode

Martin v. Löwis wrote this on Sat, 14 Apr 2007 09:10:44 +0200. My
reply is below.
Paul Rubin:
-snip-

In ElementTree, the XMLTreeBuilder has an attribute entity which is
a dictionary used to map entity names in entity references to their
definitions. Whether you can make the parser download the DTD
itself, I don't know.

What he said....

Try this on your piano:

import xml.etree.ElementTree  # or elementtree.ElementTree prior to 2.5
ElementTree = xml.etree.ElementTree
import htmlentitydefs


class XmlFile(ElementTree.ElementTree):

    def __init__(self, file=None, tag='global', **extra):
        ElementTree.ElementTree.__init__(self)
        # Build a parser whose entity table knows the HTML entity names.
        parser = ElementTree.XMLTreeBuilder(
            target=ElementTree.TreeBuilder(ElementTree.Element))
        parser.entity = htmlentitydefs.entitydefs
        # Parse the file with the customized parser.
        self.parse(source=file, parser=parser)
        return


It looks goofy as can be, but it works for me.
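
For readers on Python 3, where htmlentitydefs has become html.entities and XMLTreeBuilder is gone, here is a rough equivalent of the same idea. It is a sketch under assumptions, not a drop-in replacement: the sample document and the function name parse_with_html_entities are made up, and it relies on the document carrying a DOCTYPE with an external ID, as the ONIX files in this thread do. Without one, the parser appears to reject the unknown reference before the entity table is ever consulted. The table is updated in place instead of reassigned, since I'm not certain the attribute can be rebound on the accelerated parser.

```python
import xml.etree.ElementTree as ET
from html.entities import name2codepoint

# Hypothetical sample document; the DTD URL is a placeholder and is never fetched.
SAMPLE = (
    b'<?xml version="1.0" encoding="ISO-8859-1"?>\n'
    b'<!DOCTYPE ONIXmessage SYSTEM "http://example.invalid/onix.dtd">\n'
    b'<ONIXmessage><b036>Diana Montan&eacute;</b036></ONIXmessage>'
)


def parse_with_html_entities(data):
    """Parse XML bytes, resolving HTML entity names via the parser's entity table."""
    parser = ET.XMLParser()
    # Map each HTML entity name to the character it stands for.
    parser.entity.update((name, chr(cp)) for name, cp in name2codepoint.items())
    return ET.fromstring(data, parser=parser)


root = parse_with_html_entities(SAMPLE)
contributor = root.findtext("b036")
```

Using name2codepoint instead of entitydefs gives every entity a real character, so nothing ends up in the tree as a literal numeric reference.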
 

Chuck Rhode

Chuck Rhode wrote this on Sat, 14 Apr 2007 09:04:45 -0500. My reply is
below.

Fixed text wrap:
 

Paul Rubin

Martin v. Löwis said:
If they contain such things, and do not contain a document type
definition, they are not well-formed XML files (i.e. can't be
called "XML" in a meaningful sense).

The documents do have a DTD; however, the DTD file doesn't say anything
about these entities.

Martin v. Löwis said:
It would have been helpful if you had given an example of such
a document.

I can't post a whole document because these docs are very large
and I'm not sure that the data is public. It does look like the DTD
is public: the document begins with

<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE ONIXmessage SYSTEM "http://www.editeur.org/onix/2.1/short/onix-international.dtd">
<ONIXmessage release="2.1">
...

and that url points to the DTD which is online.

Basically the doc has elements like

<b036>Diana Montan&eacute;</b036>

and both ElementTree and xmllint complain about the character entities
(and there are a lot of them).

Martin v. Löwis said:
If there is a document type declaration in the document, the best
way is to parse it in a mode where the parser downloads the DTD
when parsing it, and resolves the entity references itself.

Hmm, ok, I see there are a lot of <!ENTITY ...> directives in the
DTD but nothing about those character entities--am I looking in
the right place?

Martin v. Löwis said:
In ElementTree, the XMLTreeBuilder has an attribute entity
which is a dictionary used to map entity names in entity references
to their definitions. Whether you can make the parser download
the DTD itself, I don't know.

Chuck Rhode posted some code for something like this so I'll try it
on Monday.

Thanks!
 
