DTDs and XML: another "not well formed" question

Discussion in 'XML' started by seven.reeds, Jul 1, 2007.

  1. seven.reeds

    Hi,

    I'm new to parsing/using xml but this project seemed reasonable to cut
    my teeth on. I have a few dozen "articles" that are local
    announcements of interest for my group's customers. They have a
    simple format of a title, zero or more static or hyper-linked images
    and one or more paragraphs of text.

    A "Title" will hold plain text. The "Text"s will hold plain or mixed
    content. The "Image"s will need to know about the hyper-link URL (if
    any); the image source URL and possibly "height" and "width"
    attributes.

    I have made a stab at making a DTD:

    <!ELEMENT article (title, image*, text+)>
    <!ELEMENT title (#PCDATA)>
    <!ELEMENT image (src, width?, height?, link?)>
    <!ATTLIST src CDATA #REQUIRED>
    <!ATTLIST link CDATA #IMPLIED>
    <!ATTLIST width PCDATA #IMPLIED>
    <!ATTLIST height PCDATA #IMPLIED>
    <!ELEMENT text (#CDATA)>

    A sample xml doc looks like:

    <?xml version="1.0" ?>
    <!DOCTYPE article SYSTEM "http://www.itg.uiuc.edu/publications/news/
    news.dtd">
    <article>
    <title> Applied Physics Letters Features ITG Image on Cover </title>
    <image link="http://scitation.aip.org/dbt/dbt.jsp?
    KEY=APPLAB&Volume=90&Issue=21"
    src="/images/apl_cover-130.jpg" />
    <text> The cover for the <a href="http://scitation.aip.org/dbt/
    dbt.jsp?KEY=APPLAB&Volume=90&Issue=21">May
    21, 2007 edition of Applied Physics Letters</a>features an image
    produced in the ...
    </text>
    </article>

    Now, I have spent time searching this group and a couple others
    related to the scripting language and the XML parser I am using. I
    *know* what my problem is... what I don't know is why I have it.

    My XML parser chokes on the first "&" (ampersand) in the "link"
    attribute of the "image" tag. I know that being "well-formed" means
    the amps should be "quoted" but I thought that the "CDATA" bits in the
    DTD meant that *ALL* characters are accepted in this context.

    Is my DTD wrong for the xml I have? Is my parser/validator not
    picking up on the DTD?

    I know that I can pre-process the incoming xml file and change the
    amps to the html entity version but that feels wasteful if CDATA is
    doing what I thought it should do.

    Other than a clue :), what am I missing?
     
    seven.reeds, Jul 1, 2007

  2. Richard Tobin

    seven.reeds wrote:

    ><!ELEMENT article (title, image*, text+)>
    ><!ELEMENT title (#PCDATA)>
    ><!ELEMENT image (src, width?, height?, link?)>
    > <!ATTLIST src CDATA #REQUIRED>
    > <!ATTLIST link CDATA #IMPLIED>
    > <!ATTLIST width PCDATA #IMPLIED>
    > <!ATTLIST height PCDATA #IMPLIED>


    Are src, link, width and height subelements or attributes?
    You've listed them as child elements in the content model, and
    then declared them as attributes.

    You seem to be using attributes, so declare image as

    <!ELEMENT image EMPTY>

    And you've declared width and height as PCDATA, which is not a valid
    attribute type; presumably you meant CDATA.

    ><!ELEMENT text (#CDATA)>


    And here you've declared the content of text to be #CDATA; presumably
    you meant #PCDATA.

    >A sample xml doc looks like
    >
    ><?xml version="1.0" ?>
    ><!DOCTYPE article SYSTEM "http://www.itg.uiuc.edu/publications/news/
    >news.dtd">
    ><article>
    > <title> Applied Physics Letters Features ITG Image on Cover </title>
    > <image link="http://scitation.aip.org/dbt/dbt.jsp?
    >KEY=APPLAB&Volume=90&Issue=21"


    All your ampersands need to be replaced with &amp;
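
A minimal sketch of the difference, using Python's standard-library ElementTree purely as an illustration (the thread does not say which scripting language or parser is in use). The helper name `is_well_formed` is made up for this example. A raw "&" makes the document not well-formed, so the parser rejects it whatever the DTD says; once it is written as &amp;, the parser hands back the original URL.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import quoteattr

URL = "http://scitation.aip.org/dbt/dbt.jsp?KEY=APPLAB&Volume=90&Issue=21"


def is_well_formed(xml_text):
    """Return True if xml_text parses, False on a well-formedness error."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Raw "&" inside the attribute value: not well-formed.
raw = '<image link="' + URL + '" src="/images/apl_cover-130.jpg" />'

# quoteattr() escapes "&", "<" and ">" and adds the surrounding quotes.
escaped = '<image link=' + quoteattr(URL) + ' src="/images/apl_cover-130.jpg" />'

print(is_well_formed(raw))                        # False
print(is_well_formed(escaped))                    # True
print(ET.fromstring(escaped).get("link") == URL)  # True
```

If the files are pre-processed rather than generated, a blind replace of "&" with "&amp;" would also double-escape any references that are already correct, so escaping at the point where the XML is written is probably the safer route.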

    > src="/images/apl_cover-130.jpg" />
    > <text> The cover for the <a href="http://scitation.aip.org/dbt/
    >dbt.jsp?KEY=APPLAB&Volume=90&Issue=21">May


    Your DTD doesn't say anything about <text> being allowed to contain
    <a> elements. You'll need to change the declaration of text and add
    a declaration for a.
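
Putting the corrections so far together, a DTD along the following lines should be close to what is needed. This is a sketch under assumptions: the declaration for <a> (plain text content, a required href) is a guess, and each ATTLIST names the element it belongs to, which the posted version left out. To keep it self-contained the DTD is placed in an internal subset and run through Python's expat bindings; the helper name `declared_attributes` is invented for the example. Expat is non-validating, so this only shows that the declarations and the sample document are syntactically acceptable, not that the document is valid against the DTD.

```python
from xml.parsers import expat

# Assumed corrected DTD (internal subset) plus the sample document with
# its ampersands escaped. The declarations for <a> are a guess.
DOC = """<?xml version="1.0"?>
<!DOCTYPE article [
<!ELEMENT article (title, image*, text+)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT image EMPTY>
<!ATTLIST image
    src    CDATA #REQUIRED
    link   CDATA #IMPLIED
    width  CDATA #IMPLIED
    height CDATA #IMPLIED>
<!ELEMENT text (#PCDATA | a)*>
<!ELEMENT a (#PCDATA)>
<!ATTLIST a href CDATA #REQUIRED>
]>
<article>
<title>Applied Physics Letters Features ITG Image on Cover</title>
<image link="http://scitation.aip.org/dbt/dbt.jsp?KEY=APPLAB&amp;Volume=90&amp;Issue=21"
       src="/images/apl_cover-130.jpg" />
<text>The cover for the <a href="http://scitation.aip.org/dbt/dbt.jsp?KEY=APPLAB&amp;Volume=90&amp;Issue=21">May
21, 2007 edition of Applied Physics Letters</a> features an image
produced in the ...</text>
</article>
"""


def declared_attributes(xml_text):
    """Parse xml_text and return {element name: [attribute names]} from its DTD."""
    declared = {}
    parser = expat.ParserCreate()

    def on_attlist(elname, attname, att_type, default, required):
        # Called once for every attribute declared in an ATTLIST.
        declared.setdefault(elname, []).append(attname)

    parser.AttlistDeclHandler = on_attlist
    parser.Parse(xml_text, True)  # raises expat.ExpatError on a syntax error
    return declared


print(declared_attributes(DOC))
```

Checking the document against the DTD itself would need a validating parser, which is outside what this sketch covers.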

    >My XML parser chokes on the first "&" (ampersand) in the "link"
    >attribute of the "image" tag. I know that being "well-formed" means
    >the amps should be "quoted" but I thought that the "CDATA bits in the
    >DTD meant that *ALL* characters are accepted in this context.


    A CDATA marked section in text, such as <![CDATA[hello & goodbye]]>
    has that effect. In attributes (whether declared as CDATA or something
    else) you have to quote ampersands. There's no way around it.
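
A small illustration of that contrast, again assuming Python's ElementTree only for the sake of a runnable example (the `link` value here is a shortened stand-in, not the real URL):

```python
import xml.etree.ElementTree as ET

# Element content: a CDATA marked section lets "&" appear literally.
in_text = ET.fromstring("<text><![CDATA[hello & goodbye]]></text>")
print(in_text.text)          # hello & goodbye

# Attribute value: the ampersand has to be written as "&amp;".
in_attr = ET.fromstring('<image link="a?x=1&amp;y=2" />')
print(in_attr.get("link"))   # a?x=1&y=2

# A CDATA section cannot be used inside an attribute value: the "<"
# alone already makes the document not well-formed.
try:
    ET.fromstring('<image link="<![CDATA[a?x=1&y=2]]>" />')
    cdata_in_attr_ok = True
except ET.ParseError:
    cdata_in_attr_ok = False
print(cdata_in_attr_ok)      # False
```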

    The CDATA/PCDATA terminology is certainly confusing.

    -- Richard
    --
    "Consideration shall be given to the need for as many as 32 characters
    in some alphabets" - X3.4, 1963.
     
    Richard Tobin, Jul 1, 2007

  3. seven.reeds

    > Are src, link, width and height subelements or attributes?
    > You've listed them as child elements in the content model, and
    > then declared them as attributes.


    oh man, sorry. I've been spinning on this for a long time now and
    have been making changes left and right to the dtd and xml file. I
    pasted in an incorrect version. My intent is for src, link etc to be
    attributes.

    Thanks for your other advice. I'll try it all.

    cheers,
     
    seven.reeds, Jul 1, 2007