embedding xml in xml as non-xml :)

Mark Van Orman

Hi all,

I have an application that logs in xml.

Assume <xmlLog></xmlLog>. In this element the app logs anything it gets
from foreign hosts. Now if the host sends xml data, the structure of the
document changes, i.e. <xmlLog><somTag></somTag></xmlLog>. This will
cause problems with my log reader, because it assumes that <xmlLog/>
contains non-xml data.

My question is: is there a way to treat the data in the <xmlLog/>
element as non-XML data? Is there something I can do that would treat
anything this element contains as a literal?

Any help or suggestions would be greatly appreciated.



Regards,


Mark
 
William Park

Mark Van Orman said:
Hi all,

I have an application that logs in xml.

Assume <xmlLog></xmlLog>. In this element the app logs
anything it gets from foreign hosts. Now if the host sends xml
data, the structure of the document changes. ie.
<xmlLog><somTag></somTag></xmlLog>. This will cause problems
with my log reader, because it assumes that <xmlLog/> contains
non-xml data.

My question is, is there a way to treat the data in the
<xmlLog/> element as non xml data. Something I can do that
would treat anything this element contains as a literal?

Any help or suggestions would be greatly appreciated.

Modify your "log reader". If the remote host can send any ASCII, then why
does the log reader assume a particular format? '<somTag></somTag>' is just
an ASCII string to me.
 
Andy Dingley

In this element the app logs anything it gets from foreign hosts.

Your problem is to map "input" to well-formed character data according
to the rules of
http://www.w3.org/TR/2004/REC-xml11-20040204/#syntax

This is a task as old as computer programming with input files. There
are several techniques to solve it, broadly by "escaping" or by
"wrapping".


Your example of
<xmlLog><somTag></somTag></xmlLog>
is quite easy, and could indeed be stored and read back, then treated
as ASCII.

However a foreign host that sends "<notATag<><>>" will break things,
because
<xmlLog><notATag<><>></xmlLog>
isn't well-formed XML and so parsers will choke on it.
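
A quick way to see the difference is to feed both strings to a parser. This is only a minimal sketch in Python, assuming the standard-library xml.etree.ElementTree parser; is_well_formed is an illustrative helper name, not something from the original application:

```python
import xml.etree.ElementTree as ET


def is_well_formed(doc: str) -> bool:
    """Return True if the parser accepts doc, False if it raises ParseError."""
    try:
        ET.fromstring(doc)
        return True
    except ET.ParseError:
        return False


# Foreign input that happens to be well-formed XML: parses, but as markup.
print(is_well_formed("<xmlLog><somTag></somTag></xmlLog>"))  # True

# Foreign input that is not well-formed: the whole log entry is rejected.
print(is_well_formed("<xmlLog><notATag<><>></xmlLog>"))      # False
```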


The main problem is to handle the mapping of arbitrary characters into
"character data" (this is a term carefully defined in the XML spec).

The "escaping" way to do this is quite simple, and can be done with a
handful of character substitutions (from the XML spec):

:>The ampersand character (&) and the left angle bracket (<) MUST NOT
:> appear in their literal form, [...] they MUST be escaped using
:> either numeric character references or the strings "&amp;" and "&lt;"
:> respectively. The right angle bracket (>) MAY be represented using
:> the string "&gt;", and MUST, for compatibility, be escaped using
:> either "&gt;" or a character reference when it appears in the string
:> "]]>" in content,

So your example of
<xmlLog><somTag></somTag></xmlLog>
becomes
<xmlLog>&lt;somTag&gt;&lt;/somTag&gt;</xmlLog>
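
As a rough illustration, here is the same substitution in Python using the standard library's xml.sax.saxutils.escape, which replaces &, < and >. The log_entry helper is hypothetical; how your app actually writes its log will differ:

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def log_entry(raw: str) -> str:
    """Wrap untrusted text in <xmlLog>, escaping &, < and > first."""
    return "<xmlLog>" + escape(raw) + "</xmlLog>"


entry = log_entry("<somTag></somTag>")
print(entry)  # <xmlLog>&lt;somTag&gt;&lt;/somTag&gt;</xmlLog>

# Reading it back: the parser undoes the escaping and returns plain text.
print(ET.fromstring(entry).text)  # <somTag></somTag>

# The input that broke the unescaped version now round-trips as well.
nasty = "<notATag<><>>"
assert ET.fromstring(log_entry(nasty)).text == nasty
```

Read back through any XML parser, the element's text is the original string, so the log reader sees a literal rather than child elements.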


You could also use a "CDATA section", which would be the "wrapping"
approach. This takes the dubious input content and places it between
two markers that say "Between these points is CDATA, not XML markup"

The markers are <![CDATA[ and ]]>

Your example of
<xmlLog><somTag></somTag></xmlLog>
becomes
<xmlLog><![CDATA[<somTag></somTag>]]></xmlLog>

Be warned that you'll still need escaping in case the input contains a
copy of the end marker! (Read the XML spec, or ask again.)



The second problem is to define "input". This is important because in
today's world we're really having to face up to internationalization,
character sets and encodings. It's likely that you can redefine input
from "anything" to "anything that is in UTF-8", which will make your
life easier, but be aware you _have_ made a deliberate choice here.

It's OK to write code that breaks in Japanese - just be aware that
you've done so, and know what would need changing if you needed to
remedy this.
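
To make that choice visible in code, one option is to decode the incoming bytes explicitly. This is a sketch only, assuming the input arrives as raw bytes and that replacing undecodable bytes with U+FFFD is acceptable for a log; decode_input is a made-up name:

```python
def decode_input(data: bytes) -> str:
    """Treat input as UTF-8; bytes that are not valid UTF-8 become U+FFFD."""
    return data.decode("utf-8", errors="replace")


# Japanese text sent as UTF-8 survives intact.
print(decode_input("日本語".encode("utf-8")))

# The same text sent as Shift_JIS is not valid UTF-8, so it is damaged
# (replacement characters) rather than crashing the logger.
print(decode_input("日本語".encode("shift_jis")))
```

Passing errors="strict" instead would reject such input outright; either way the behaviour is a decision you made rather than an accident.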


You'll find that RSS has this same problem when embedding HTML content
within it. Some RSS versions handle this better than others, and
there's an excellent overview here
http://diveintomark.org/archives/2004/02/04/incompatible-rss
 
Kenneth Stephen

Andy Dingley wrote:

It's OK to write code that breaks in Japanese - just be aware that
you've done so, and know what would need changing if you needed to
remedy this.

Andy,

Why would code break only in Japanese and why is that ok?

Regards,
Kenneth
 
Andy Dingley

Why would code break only in Japanese and why is that ok?

That's just an example. Most European-written XML code fails in
CJKV countries (China, Japan, Korea, Vietnam). Most American-written
XML fails in France. Just look at how many RSS feeds choke when they meet
é, or more usually &eacute; without the entity having been defined.

XML _itself_ (and the major tools) is very good at supporting a wide
range of character sets and encodings, but there are rules you have to
follow. For most _applications_, coders don't bother to do this. If
you _know_ your app will never receive something outside ASCII, then
that's all you need - but you should still be aware of what you've
built.
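
The &eacute; case is easy to reproduce. A small sketch, again assuming Python's standard xml.etree.ElementTree; the parses helper and the <title> element are only for illustration:

```python
import xml.etree.ElementTree as ET


def parses(doc: str) -> bool:
    """Return True if the parser accepts doc, False on a ParseError."""
    try:
        ET.fromstring(doc)
        return True
    except ET.ParseError:
        return False


# &eacute; is an HTML entity; XML predefines only amp, lt, gt, apos and quot.
print(parses("<title>caf&eacute;</title>"))  # False

# A numeric character reference or the literal character is fine.
print(parses("<title>caf&#233;</title>"))    # True
print(parses("<title>café</title>"))         # True
```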
 
Patrick TJ McPhee

[...]

% The markers are <![CDATA[ and ]]>
%
% Your example of
% <xmlLog><somTag></somTag></xmlLog>
% becomes
% <xmlLog><![CDATA[<somTag></somTag>]]></xmlLog>
%
% be warned that you'll still need escaping in case the input contains a
% copy of the end marker! (read the XML spec, or ask again)

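You don't need escaping so much as you need to end and restart the
CDATA section, splitting the end marker across the two sections

<xmlLog><![CDATA[<somTag><![CDATA[with a CDATA section]]]]><![CDATA[></somTag>]]></xmlLog>

The first ]]> ends the first CDATA section, whose data finishes with
"]]". The second section then begins with the ">". The string ]]> must
not be left bare between the sections, because the spec does not allow
it in content outside a CDATA section.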
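
The end-and-restart idea can be sketched in a few lines of Python. This is an illustration under assumptions, not a reference implementation: cdata_wrap is a hypothetical helper, and it splits every "]]>" in the input so that "]]" closes one section and ">" opens the next, leaving no bare "]]>" in content:

```python
import xml.etree.ElementTree as ET


def cdata_wrap(raw: str) -> str:
    """Wrap raw text in CDATA, splitting any embedded end marker."""
    # "]]>" becomes "]]" + "]]>" (end section) + "<![CDATA[" (restart) + ">"
    return "<![CDATA[" + raw.replace("]]>", "]]]]><![CDATA[>") + "]]>"


raw = "<somTag><![CDATA[with a CDATA section]]></somTag>"
doc = "<xmlLog>" + cdata_wrap(raw) + "</xmlLog>"
print(doc)

# The parser joins the adjacent sections back into the original string.
assert ET.fromstring(doc).text == raw
```

Note that CDATA does nothing for characters that are not legal in XML at all, so it is a convenience for readability of the log rather than a stronger guarantee than escaping.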
 
