DTDs and XML: another "not well formed" question


seven.reeds

Hi,

I'm new to parsing/using xml but this project seemed reasonable to cut
my teeth on. I have a few dozen "articles" that are local
announcements of interest for my group's customers. They have a
simple format of a title, zero or more static or hyper-linked images
and one or more paragraphs of text.

A "Title" will hold plain text. The "Text"s will hold plain or mixed
content. The "Image"s will need to know about the hyper-link URL (if
any); the image source URL and possibly "height" and "width"
attributes.

I have made a stab at making a DTD

<!ELEMENT article (title, image*, text+)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT image (src, width?, height?, link?)>
<!ATTLIST src CDATA #REQUIRED>
<!ATTLIST link CDATA #IMPLIED>
<!ATTLIST width PCDATA #IMPLIED>
<!ATTLIST height PCDATA #IMPLIED>
<!ELEMENT text (#CDATA)>

A sample xml doc looks like

<?xml version="1.0" ?>
<!DOCTYPE article SYSTEM "http://www.itg.uiuc.edu/publications/news/news.dtd">
<article>
<title> Applied Physics Letters Features ITG Image on Cover </title>
<image link="http://scitation.aip.org/dbt/dbt.jsp?KEY=APPLAB&Volume=90&Issue=21"
src="/images/apl_cover-130.jpg" />
<text> The cover for the <a href="http://scitation.aip.org/dbt/dbt.jsp?KEY=APPLAB&Volume=90&Issue=21">May
21, 2007 edition of Applied Physics Letters</a> features an image
produced in the ...
</text>
</article>

Now, I have spent time searching this group and a couple of others
related to the scripting language and the XML parser I am using. I
*know* what my problem is... what I don't know is why I have it.

My XML parser chokes on the first "&" (ampersand) in the "link"
attribute of the "image" tag. I know that being "well-formed" means
the amps should be "quoted" but I thought that the "CDATA" bits in the
DTD meant that *ALL* characters are accepted in this context.

Is my DTD wrong for the xml I have? Is my parser/validator not
picking up on the DTD?

I know that I can pre-process the incoming XML file and change the
amps to the HTML entity version but that feels wasteful if CDATA is
doing what I thought it should do.

Other than a clue :), what am I missing?
 
Richard Tobin

seven.reeds said:
<!ELEMENT article (title, image*, text+)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT image (src, width?, height?, link?)>
<!ATTLIST src CDATA #REQUIRED>
<!ATTLIST link CDATA #IMPLIED>
<!ATTLIST width PCDATA #IMPLIED>
<!ATTLIST height PCDATA #IMPLIED>

Are src, link, width and height subelements or attributes?
You've listed them as child elements in the content model, and
then declared them as attributes.

You seem to be using attributes, so declare image as

<!ELEMENT image EMPTY>

And you've declared width and height as PCDATA which doesn't mean
anything; presumably you meant CDATA.

<!ELEMENT text (#CDATA)>

And here you've declared the content of text to be #CDATA, presumably
you meant #PCDATA.
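
Putting those corrections together, here is a rough sketch of what the DTD might look like, checked with Python's standard xml.etree.ElementTree. Two things in it are assumptions that go beyond the advice above: each ATTLIST names the element it belongs to (the XML grammar requires `<!ATTLIST element attribute type default>`), and text is given mixed content so it can hold the <a> links that appear in the sample. The DTD is carried as an internal subset only to keep the example self-contained. ElementTree does not validate against the DTD; it only checks that the declarations and the document are well-formed.

```python
import xml.etree.ElementTree as ET

# Hypothetical corrected DTD, embedded as an internal subset.
# Assumptions: the ATTLIST is attached to <image>, and <text> uses
# mixed content so that <a> children are allowed.
DOC = """<?xml version="1.0"?>
<!DOCTYPE article [
<!ELEMENT article (title, image*, text+)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT image EMPTY>
<!ATTLIST image
    src    CDATA #REQUIRED
    link   CDATA #IMPLIED
    width  CDATA #IMPLIED
    height CDATA #IMPLIED>
<!ELEMENT text (#PCDATA | a)*>
<!ELEMENT a (#PCDATA)>
<!ATTLIST a href CDATA #REQUIRED>
]>
<article>
<title>Applied Physics Letters Features ITG Image on Cover</title>
<image src="/images/apl_cover-130.jpg"/>
<text>The cover for the <a href="http://scitation.aip.org/dbt/dbt.jsp?KEY=APPLAB&amp;Volume=90&amp;Issue=21">May 21, 2007 edition</a> features an image.</text>
</article>
"""

# Parsing succeeds only if the DTD syntax and the document are well-formed.
root = ET.fromstring(DOC)
image = root.find("image")
anchor = root.find("text/a")

print(image.get("src"))    # attribute that was supplied
print(image.get("link"))   # optional attribute that was omitted -> None
print(anchor.get("href"))  # &amp; is decoded back to & by the parser
```

To actually validate against the DTD you would need a validating parser, which the standard library does not provide.
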

A sample xml doc looks like

<?xml version="1.0" ?>
<!DOCTYPE article SYSTEM "http://www.itg.uiuc.edu/publications/news/news.dtd">
<article>
<title> Applied Physics Letters Features ITG Image on Cover </title>
<image link="http://scitation.aip.org/dbt/dbt.jsp?KEY=APPLAB&Volume=90&Issue=21"

All your ampersands need to be replaced with &amp;

src="/images/apl_cover-130.jpg" />
<text> The cover for the <a href="http://scitation.aip.org/dbt/dbt.jsp?KEY=APPLAB&Volume=90&Issue=21">May

Your dtd doesn't say anything about <text> being allowed to contain
<a> elements.

My XML parser chokes on the first "&" (ampersand) in the "link"
attribute of the "image" tag. I know that being "well-formed" means
the amps should be "quoted" but I thought that the "CDATA" bits in the
DTD meant that *ALL* characters are accepted in this context.

A CDATA marked section in text, such as <![CDATA[hello & goodbye]]>
has that effect. In attributes (whether declared as CDATA or something
else) you have to quote ampersands. There's no way around it.
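
A small sketch of that difference, using Python's standard library purely for illustration (the poster's actual scripting language and parser are not named). The helper is_well_formed is a made-up name for this example; quoteattr is the real helper from xml.sax.saxutils that escapes a string and wraps it in quotes for use as an attribute value.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import quoteattr

URL = "http://scitation.aip.org/dbt/dbt.jsp?KEY=APPLAB&Volume=90&Issue=21"


def is_well_formed(doc: str) -> bool:
    """Hypothetical helper: True if the parser accepts the document."""
    try:
        ET.fromstring(doc)
        return True
    except ET.ParseError:
        return False


# Raw ampersands in an attribute value: rejected, whatever the DTD says.
raw = '<image link="%s" src="/images/apl_cover-130.jpg" />' % URL

# quoteattr() turns & into &amp; and adds the surrounding quotes.
escaped = '<image link=%s src="/images/apl_cover-130.jpg" />' % quoteattr(URL)

# A CDATA marked section works in element content, not in attributes.
cdata = "<text><![CDATA[hello & goodbye]]></text>"

print(is_well_formed(raw))                        # False
print(is_well_formed(escaped))                    # True
print(ET.fromstring(escaped).get("link") == URL)  # True
print(ET.fromstring(cdata).text)                  # hello & goodbye
```

The parser hands back the original URL with plain ampersands, so the escaping only exists in the file on disk and the application never sees it.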

The CDATA/PCDATA terminology is certainly confusing.

-- Richard
 
seven.reeds

Richard Tobin said:
Are src, link, width and height subelements or attributes?
You've listed them as child elements in the content model, and
then declared them as attributes.

Oh man, sorry. I've been spinning on this for a long time now and
have been making changes left and right to the DTD and XML file. I
pasted in an incorrect version. My intent is for src, link, etc. to be
attributes.

Thanks for your other advice. I'll try it all.

cheers,
 
