nonstandard XML character entities?

Paul Rubin · Apr 14, 2007

I'm new to xml mongering so forgive me if there's an obvious
well-known answer to this. It's not real obvious from the library
documentation I've looked at so far. Basically I have to munch through a
bunch of xml files which contain character entities like &uacute;
which are apparently nonstandard. They appear in w3.org tables but
xml.etree.cElementTree.ElementTree.parse barfs at them and xmllint
barfs at them.

Basically I want to know if there's a way to supply the regular parser
(preferably xml.etree but I guess I can switch to another one if
necessary) with some kind of entity table, and/or if the info is
supposed to be found in the DTD or someplace like that. Right now I'm
ignoring the DTD and simply figuring out the doc structure by
eyeballing the xml files, maybe not a perfectly approved method but
it seems to be what most people do.

Thanks

Martin v. Löwis · Apr 14, 2007

Paul Rubin said:
> I'm new to xml mongering so forgive me if there's an obvious
> well-known answer to this. It's not real obvious from the library
> documentation I've looked at so far. Basically I have to munch through a
> bunch of xml files which contain character entities like &uacute;
> which are apparently nonstandard.

If they contain such things, and do not contain a document type
definition, they are not well-formed XML files (i.e. can't be
called "XML" in a meaningful sense).

It would have been helpful if you had given an example of such
a document.

> Basically I want to know if there's a way to supply the regular parser
> (preferably xml.etree but I guess I can switch to another one if
> necessary) with some kind of entity table, and/or if the info is
> supposed to be found in the DTD or someplace like that. Right now I'm
> ignoring the DTD and simply figuring out the doc structure by
> eyeballing the xml files, maybe not a perfectly approved method but
> it seems to be what most people do.

If there is a document type declaration in the document, the best
way is to parse it in a mode where the parser downloads the DTD
when parsing it, and resolves the entity references itself.

In SAX, you can put an EntityResolver into the parser, and then
return a file-like object from resolveEntity. This can be used
to avoid the network download; the document type declaration
would still have to be present.

Alternatively, you can implement a skippedEntity callback in
the SAX content handler.
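
As a rough illustration of the skippedEntity route, here is a minimal Python 3 sketch. The names TextCollector and collect_text are made up for the example, and the HTML entity table in html.entities is only assumed to cover the names used in the files. It also assumes the document has a DOCTYPE pointing at an external DTD that the parser does not read; without one, expat reports an undefined entity as an error instead of skipping it.

```python
import io
import xml.sax
from html.entities import name2codepoint
from xml.sax.handler import ContentHandler, feature_external_ges


class TextCollector(ContentHandler):
    """Hypothetical handler: gathers character data, filling in skipped entities."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def characters(self, content):
        self.chunks.append(content)

    def skippedEntity(self, name):
        # Called for entity references the parser could not expand.
        codepoint = name2codepoint.get(name)
        if codepoint is not None:
            self.chunks.append(chr(codepoint))
        else:
            # Unknown name: keep the reference visible rather than drop it.
            self.chunks.append("&%s;" % name)


def collect_text(data):
    """Parse XML given as bytes and return all character data as one string."""
    handler = TextCollector()
    parser = xml.sax.make_parser()
    # Do not fetch the external DTD; undefined entities are then skipped.
    parser.setFeature(feature_external_ges, False)
    parser.setContentHandler(handler)
    parser.parse(io.BytesIO(data))
    return "".join(handler.chunks)
```

A real handler would also track startElement and endElement; this one only shows where the substitution happens.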

In ElementTree, the XMLTreeBuilder has an attribute entity
which is a dictionary used to map entity names in entity references
to their definitions. Whether you can make the parser download
the DTD itself, I don't know.

Regards,
Martin

Chuck Rhode · Apr 14, 2007

Martin v. Löwis wrote this on Sat, 14 Apr 2007 09:10:44 +0200. My
reply is below.

Paul Rubin:
-snip-

> In ElementTree, the XMLTreeBuilder has an attribute entity which is
> a dictionary used to map entity names in entity references to their
> definitions. Whether you can make the parser download the DTD
> itself, I don't know.

What he said....

Try this on your piano:

import xml.etree.ElementTree  # or elementtree.ElementTree prior to 2.5
ElementTree = xml.etree.ElementTree
import htmlentitydefs

class XmlFile(ElementTree.ElementTree):

    def __init__(self, file=None, tag='global', **extra):
        ElementTree.ElementTree.__init__(self)
        # The snippet as posted passed an undefined name, Element, here;
        # the stock ElementTree.Element factory is assumed instead.
        parser = ElementTree.XMLTreeBuilder(
            target=ElementTree.TreeBuilder(ElementTree.Element))
        # Map entity names such as 'uacute' to their replacement text.
        parser.entity = htmlentitydefs.entitydefs
        self.parse(source=file, parser=parser)
        return

It looks goofy as can be, but it works for me.
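
On Python 3 the same idea needs different names: htmlentitydefs became html.entities, and XMLTreeBuilder has been removed in favour of XMLParser. What follows is a sketch under those assumptions, not a tested drop-in for the class above. The helper name parse_with_html_entities and the sample document are invented for illustration. The entity dict is updated in place because the attribute itself may be read-only, and the lookup is only consulted when the document has a DOCTYPE with an external DTD.

```python
import io
import xml.etree.ElementTree as ET
from html.entities import name2codepoint


def parse_with_html_entities(source):
    """Parse a file name or file object, expanding HTML-style named entities."""
    parser = ET.XMLParser()
    # Fill the parser's entity table: 'uacute' -> 'ú', 'eacute' -> 'é', ...
    parser.entity.update(
        (name, chr(codepoint)) for name, codepoint in name2codepoint.items()
    )
    return ET.parse(source, parser=parser)


# Made-up sample in the shape of the documents discussed in this thread.
sample = (
    b'<?xml version="1.0" encoding="ISO-8859-1"?>\n'
    b'<!DOCTYPE ONIXmessage SYSTEM '
    b'"http://www.editeur.org/onix/2.1/short/onix-international.dtd">\n'
    b'<ONIXmessage release="2.1"><b036>Diana Montan&eacute;</b036></ONIXmessage>'
)

if __name__ == "__main__":
    tree = parse_with_html_entities(io.BytesIO(sample))
    print(tree.getroot().findtext("b036"))
```

The DTD is never downloaded here; the parser only sees the DOCTYPE line and then falls back to the table for names it has no declaration for.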

Chuck Rhode · Apr 14, 2007

Chuck Rhode wrote this on Sat, 14 Apr 2007 09:04:45 -0500. My reply is
below.

Fixed text wrap:

Paul Rubin · Apr 14, 2007

Martin v. Löwis said:
> If they contain such things, and do not contain a document type
> definition, they are not well-formed XML files (i.e. can't be
> called "XML" in a meaningful sense).

The documents do have a DTD, however the DTD file doesn't say anything
about these entities.

> It would have been helpful if you had given an example of such
> a document.

I can't post a whole document because these docs are very large
and I'm not sure that the data is public. It does look like the DTD
is public: the document begins with

<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE ONIXmessage SYSTEM "http://www.editeur.org/onix/2.1/short/onix-international.dtd">
<ONIXmessage release="2.1">
...

and that url points to the DTD which is online.

Basically the doc has elements like

<b036>Diana Montan&eacute;</b036>

and both ElementTree and xmllint complain about the character entities
(and there are a lot of them).

> If there is a document type declaration in the document, the best
> way is to parse it in a mode where the parser downloads the DTD
> when parsing it, and resolves the entity references itself.

Hmm, ok, I see there are a lot of <!ENTITY ...> directives in the
DTD but nothing about those character entities--am I looking in
the right place?

> In ElementTree, the XMLTreeBuilder has an attribute entity
> which is a dictionary used to map entity names in entity references
> to their definitions. Whether you can make the parser download
> the DTD itself, I don't know.

Chuck Rhode posted some code for something like this so I'll try it
on Monday.

Thanks!
