embedding xml in xml as non-xml :)

Discussion in 'XML' started by Mark Van Orman, Sep 14, 2004.

  1. Hi all,

    I have an application that logs in xml.

    Assume <xmlLog></xmlLog>. In this element the app logs anything it gets
    from foreign hosts. Now if the host sends xml data, the structure of the
    document changes, e.g. <xmlLog><somTag></somTag></xmlLog>. This will
    cause problems with my log reader, because it assumes that <xmlLog/>
    contains non-xml data.

    My question is, is there a way to treat the data in the <xmlLog/>
    element as non xml data. Something I can do that would treat anything
    this element contains as a literal?

    Any help or suggestions would be greatly appreciated.



    Regards,


    Mark
    Mark Van Orman, Sep 14, 2004
    #1

  2. Mark Van Orman

    William Park Guest

    Mark Van Orman <> wrote:
    > Hi all,
    >
    > I have an application that logs in xml.
    >
    > Assume <xmlLog></xmlLog>. In this element the app logs
    > anything it gets from foreign hosts. Now if the host sends xml
    > data, the structure of the document changes, e.g.
    > <xmlLog><somTag></somTag></xmlLog>. This will cause problems
    > with my log reader, because it assumes that <xmlLog/> contains
    > non-xml data.
    >
    > My question is, is there a way to treat the data in the
    > <xmlLog/> element as non xml data. Something I can do that
    > would treat anything this element contains as a literal?
    >
    > Any help or suggestions would be greatly appreciated.


    Modify your "log reader". If the remote host can send any ASCII, then
    why does the log reader assume a particular format? '<somTag></somTag>'
    is just an ASCII string to me.

    --
    William Park <>
    Open Geometry Consulting, Toronto, Canada
    William Park, Sep 14, 2004
    #2

  3. Mark Van Orman

    Andy Dingley Guest

    On Mon, 13 Sep 2004 23:51:39 -0500, Mark Van Orman
    <> wrote:

    >In this element the app logs anything it gets from foreign hosts.


    Your problem is to map "input" to well-formed character data according
    to the rules of
    http://www.w3.org/TR/2004/REC-xml11-20040204/#syntax

    This is a task as old as computer programming with input files. There
    are several techniques to solve it, broadly by "escaping" or by
    "wrapping".


    Your example of
    > <xmlLog><somTag></somTag></xmlLog>

    is quite easy, and could indeed be stored and read back, then treated
    as ASCII.

    However a foreign host that sends "<notATag<><>>" will break things,
    because
    <xmlLog><notATag<><>></xmlLog>
    isn't well-formed XML and so parsers will choke on it.


    The main problem is to handle the mapping of arbitrary characters into
    "character data" (this is a term carefully defined in the XML spec).

    The "escaping" way to do this is quite simple, and can be done with a
    handful of character substitutions (from the XML spec):

    :>The ampersand character (&) and the left angle bracket (<) MUST NOT
    :> appear in their literal form, [...] they MUST be escaped using
    :> either numeric character references or the strings "&amp;" and "&lt;"
    :> respectively. The right angle bracket (>) MAY be represented using
    :> the string "&gt;", and MUST, for compatibility, be escaped using
    :> either "&gt;" or a character reference when it appears in the string
    :> "]]>" in content,

    So your example of
    <xmlLog><somTag></somTag></xmlLog>
    becomes
    <xmlLog>&lt;somTag&gt;&lt;/somTag&gt;</xmlLog>
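As a rough illustration of the escaping approach, here is a minimal Python sketch. The helper name `log_entry` is hypothetical, and it assumes the input has already been decoded to text and contains no control characters that XML forbids. It relies on the standard library's `xml.sax.saxutils.escape`, which replaces `&`, `<` and `>`.

```python
from xml.sax.saxutils import escape
import xml.etree.ElementTree as ET


def log_entry(raw: str) -> str:
    """Wrap untrusted text in <xmlLog>, escaping &, < and > so it stays character data."""
    return "<xmlLog>" + escape(raw) + "</xmlLog>"


entry = log_entry("<somTag></somTag>")
print(entry)

# A conforming parser hands the original text back as the element's content.
print(ET.fromstring(entry).text)
```

Because the escaping is undone by the parser, the log reader sees the foreign host's data as plain text, whether or not it looked like markup.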


    You could also use a "CDATA section", which would be the "wrapping"
    approach. This takes the dubious input content and places it between
    two markers that say "Between these points is CDATA, not XML markup"

    The markers are <![CDATA[ and ]]>

    Your example of
    <xmlLog><somTag></somTag></xmlLog>
    becomes
    <xmlLog><![CDATA[<somTag></somTag>]]></xmlLog>

    Be warned that you'll still need escaping in case the input contains a
    copy of the end marker! (Read the XML spec, or ask again.)



    The second problem is to define "input". This is important because in
    today's world we're really having to face up to internationalization,
    character sets and encodings. It's likely that you can redefine input
    from "anything" to "anything that is in UTF-8", which will make your
    life easier, but be aware you _have_ made a deliberate choice here.

    It's OK to write code that breaks in Japanese - just be aware that
    you've done so, and know what would need changing if you needed to
    remedy this.
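One way to make the "anything that is in UTF-8" choice explicit is to normalise the raw bytes before they go anywhere near the log. The sketch below is only an assumption about how that might look: the function name `to_loggable_text` is hypothetical, it targets the XML 1.0 character rules, and replacing bad input with U+FFFD is just one possible policy (rejecting the input is another).

```python
import re

# Anything outside the XML 1.0 "Char" production: most control characters,
# surrogates, U+FFFE and U+FFFF.
_ILLEGAL_XML_CHARS = re.compile(
    "[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def to_loggable_text(data: bytes) -> str:
    """Decode as UTF-8; undecodable bytes and XML-illegal characters become U+FFFD."""
    text = data.decode("utf-8", errors="replace")
    return _ILLEGAL_XML_CHARS.sub("\uFFFD", text)


print(to_loggable_text("日本語".encode("utf-8")))  # valid UTF-8 passes through
print(repr(to_loggable_text(b"ok\x00\xff")))       # NUL and a stray 0xFF are replaced
```

The result still needs escaping or CDATA wrapping before it is written into the element; this step only settles what "input" means.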


    You'll find that RSS has this same problem when embedding HTML content
    within it. Some RSS versions handle this better than others, and
    there's an excellent overview here
    http://diveintomark.org/archives/2004/02/04/incompatible-rss

    --
    Smert' spamionam
    Andy Dingley, Sep 14, 2004
    #3
  4. Andy Dingley wrote:


    > It's OK to write code that breaks in Japanese - just be aware that
    > you've done so, and know what would need changing if you needed to
    > remedy this.
    >

    Andy,

    Why would code break only in Japanese and why is that ok?

    Regards,
    Kenneth
    Kenneth Stephen, Sep 14, 2004
    #4
  5. Mark Van Orman

    Andy Dingley Guest

    On Tue, 14 Sep 2004 12:51:49 GMT, Kenneth Stephen
    <> wrote:

    > Why would code break only in Japanese and why is that ok?


    That's just an example. Most European-written XML code fails in
    CJKV countries (China, Japan, Korea, Vietnam). Most American-written
    XML fails in France. Just look how many RSS feeds choke when they meet
    é, or more usually &eacute; without the entity having been defined.

    XML _itself_ (and the major tools) are very good at supporting a wide
    range of character sets and encodings, but there are rules you have to
    follow. For most _applications_, coders don't bother to do this. If
    you _know_ your app will never receive something outside ASCII, then
    that's all you need - but you should still be aware of what you've
    built.
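To illustrate the é case, here is a small Python sketch. The helper name `ascii_xml_text` is hypothetical. It shows that a named reference such as `&eacute;` is rejected when nothing declares it, while a numeric character reference needs no declaration. It uses Python's built-in `xmlcharrefreplace` error handler, which only deals with non-ASCII characters and does not escape markup.

```python
import xml.etree.ElementTree as ET


def ascii_xml_text(text: str) -> bytes:
    """Encode as ASCII, turning each non-ASCII character into a numeric reference."""
    return text.encode("ascii", errors="xmlcharrefreplace")


print(ascii_xml_text("café"))  # b'caf&#233;'

# The numeric reference parses without any entity declarations.
print(ET.fromstring(b"<item>" + ascii_xml_text("café") + b"</item>").text)

# The named HTML entity is not predefined in XML, so the parser rejects it.
try:
    ET.fromstring("<item>caf&eacute;</item>")
except ET.ParseError as err:
    print("rejected:", err)
```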

    --
    Smert' spamionam
    Andy Dingley, Sep 14, 2004
    #5
  6. In article <>,
    Andy Dingley <> wrote:

    [...]

    % The markers are <![CDATA[ and ]]>
    %
    % Your example of
    % <xmlLog><somTag></somTag></xmlLog>
    % becomes
    % <xmlLog><![CDATA[<somTag></somTag>]]></xmlLog>
    %
    % be warned that you'll still need escaping in case the input contains a
    % copy of the end marker! (read the XML spec, or ask again)

    You don't need escaping so much as you need to end and restart the
    CDATA section

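    <xmlLog><![CDATA[<somTag><![CDATA[with a CDATA section]]]]><![CDATA[></somTag>]]></xmlLog>

    The first section ends right after the "]]" of the embedded marker and
    the second begins with its ">", so the "]]>" in the data never appears
    whole. (A bare "]]>" left between two sections would not be well-formed
    either, per the spec text Andy quoted.)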
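The end-and-restart idea can be sketched in Python as follows. The helper name `cdata_wrap` is hypothetical, and the input is assumed to be text that is otherwise legal in XML. Each "]]>" in the data is split so that "]]" closes one section and ">" opens the next.

```python
import xml.etree.ElementTree as ET


def cdata_wrap(raw: str) -> str:
    """Wrap text in CDATA, splitting every "]]>" across two adjacent sections."""
    return "<![CDATA[" + raw.replace("]]>", "]]]]><![CDATA[>") + "]]>"


payload = "<somTag><![CDATA[with a CDATA section]]></somTag>"
entry = "<xmlLog>" + cdata_wrap(payload) + "</xmlLog>"
print(entry)

# The parser joins the adjacent sections back into the original text.
print(ET.fromstring(entry).text == payload)
```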
    --

    Patrick TJ McPhee
    East York Canada
    Patrick TJ McPhee, Sep 15, 2004
    #6