Handling &quot; entity in attribute value

Mateusz Loskot

Hi,

I'd like to ask how XML parsers should handle attributes whose value
contains &quot; entities. I know XML allows both single and double
quotes as attribute value delimiters. That's clear.
But how should a parser react in the following situation?

I have a COORDSYS element with a 'string' attribute whose value contains
many &quot; entities:

<COORDSYS
string="GEOGCS[&quot;GCS_WGS_1984&quot;,DATUM[&quot;WGS84&quot;,SPHEROID[&quot;WGS84&quot;,6378137,298.257223563]],PRIMEM[&quot;Greenwich&quot;,0],UNIT[&quot;Degree&quot;,0.0174532925199433]]"/>

So, when I read it into a DOM and, after some operations, try to save it
to a file, the parser replaces the double-quote value delimiters with
single quotes, as follows:

<COORDSYS
string='GEOGCS[&quot;GCS_WGS_1984&quot;,DATUM[&quot;WGS84&quot;,SPHEROID[&quot;WGS84&quot;,6378137,298.257223563]],PRIMEM[&quot;Greenwich&quot;,0],UNIT[&quot;Degree&quot;,0.0174532925199433]]'/>

Please explain how a parser is expected to handle this element in a
save operation.

Best regards
 
Jukka K. Korpela

Mateusz Loskot said:
I'd like to ask how XML parsers should handle attributes whose value
contains &quot; entities.

As data that contains the ASCII quotation mark.

Mateusz Loskot said:
I have a COORDSYS element with a 'string' attribute whose value contains
many &quot; entities:

OK.

Mateusz Loskot said:
So, when I read it into a DOM and, after some operations, try to save it
to a file, the parser replaces the double-quote value delimiters with
single quotes, as follows:

That's external to XML parsing. You are not processing XML any more but
data constructed by parsing an XML document and representing it as a tree.
What happens then depends on the tools you use. Most probably the internal
representation does not contain the enclosing quotation marks or the entity
references but the parsed attribute values as strings. When you later output
the data in some format, perhaps linearizing it as XML, the results depend
on how you do that.
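
The point about the internal representation can be checked with any conforming parser. A minimal sketch, using Python's standard xml.etree.ElementTree rather than TinyXML, and a shortened COORDSYS element (shortened only for illustration):

```python
import xml.etree.ElementTree as ET

# Shortened version of the COORDSYS element from the question (illustrative).
source = '<COORDSYS string="GEOGCS[&quot;GCS_WGS_1984&quot;,DATUM[&quot;WGS84&quot;]]"/>'

element = ET.fromstring(source)

# The parsed value holds plain quotation marks, not entity references
# and not the enclosing delimiters.
value = element.get("string")
print(value)  # GEOGCS["GCS_WGS_1984",DATUM["WGS84"]]

# Writing the tree back out is a separate step; the serializer decides
# how to delimit and escape the value.
saved = ET.tostring(element, encoding="unicode")

# Whatever form the serializer chose, parsing it again gives the same value.
reparsed = ET.fromstring(saved).get("string")
print(reparsed == value)
```

The exact text of `saved` depends on the serializer, which is the point being made: only the parsed value is guaranteed to survive the round trip.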

If all occurrences of ASCII quote and ASCII apostrophe in the attribute
values are "escaped" using entity or character references, it does not
matter whether you use quotes or apostrophes as delimiters when converting
the data back to XML format. (Naturally you need to use matching
delimiters, i.e. the same character as opening and as closing delimiter.)
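
As a rough check of this claim, the sketch below escapes both kinds of quote and then wraps the value in either delimiter. It is Python with the standard library only; `serialize` is a hypothetical helper written for this illustration, not part of any parser's API.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

def serialize(value, delim):
    """Hypothetical helper: escape both quote characters, then wrap the
    value in the requested delimiter (the same one at both ends)."""
    body = escape(value, {'"': "&quot;", "'": "&apos;"})
    return f"<e a={delim}{body}{delim}/>"

value = "O'Neil said \"hi\" & left"

# Both documents are well formed and carry the same attribute value.
with_quotes = serialize(value, '"')
with_apostrophes = serialize(value, "'")

parsed = [ET.fromstring(doc).get("a") for doc in (with_quotes, with_apostrophes)]
print(parsed[0] == parsed[1] == value)
```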
 
Mateusz Loskot

Jukka said:
That's external to XML parsing. You are not processing XML any more but
data constructed by parsing an XML document and representing it as a tree.

Yes, I know.

Jukka said:
What happens then depends on the tools you use.

Yes, I use the TinyXML DOM parser.

Jukka said:
Most probably the internal
representation does not contain the enclosing quotation marks or the entity
references but the parsed attribute values as strings. When you later output
the data in some format, perhaps linearizing it as XML, the results depend
on how you do that.

I did some investigation and now I know the internals of TinyXML. During
the Save operation, TinyXML checks whether the attribute value contains a
double-quote character ("); if it does, it encloses the attribute value
in single quotes ('). Certainly, that's correct from the XML spec's point
of view.
The check is done with something like find('\"') on the attribute value.
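
The rule described above can be sketched as follows. This is Python, not TinyXML's actual source: `write_attribute` is a hypothetical helper, and escaping the quotation marks even when apostrophes are used as delimiters is an assumption based on the saved output shown in the first post.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

def write_attribute(name, value):
    """Hypothetical sketch of the save rule described in the post."""
    # Switch to apostrophes as delimiters when the value holds a double quote.
    delim = "'" if '"' in value else '"'
    # Assumption: quote characters are escaped regardless of the delimiter.
    body = escape(value, {'"': "&quot;", "'": "&apos;"})
    return f"{name}={delim}{body}{delim}"

attr = write_attribute("string", 'GEOGCS["WGS84"]')
print(attr)  # string='GEOGCS[&quot;WGS84&quot;]'

# The result is still well-formed XML with the same attribute value.
roundtrip = ET.fromstring(f"<COORDSYS {attr}/>").get("string")
print(roundtrip)  # GEOGCS["WGS84"]
```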

TinyXML can be compiled in, let's say, C style, in which case it uses its
own string class, or with STL support, in which case it uses std::string.
When TinyXML is compiled in C style, all &quot; entities are "visible" to
the parser as double quotes, so if you printf the value of my 'string'
attribute as TinyXML holds it, you get double quotes instead of &quot;
entities. But when TinyXML is compiled with STL support, everything works
fine: TinyXML holds the 'string' attribute with &quot; entities and does
not convert them to double quotes internally.

Here is the longer story, with some source code:
http://sourceforge.net/forum/forum.php?thread_id=1370207&forum_id=172103

I'm not sure if this approach is correct. I'm also not sure if this is
a TinyXML bug. That's why I've asked this question.
I'm going to discuss it further with the TinyXML development team.

Thanks a lot
 
Jukka K. Korpela

Mateusz Loskot said:
During the Save operation, TinyXML checks whether the attribute value
contains a double-quote character ("); if it does, it encloses the
attribute value in single quotes ('). Certainly, that's correct from the
XML spec's point of view.

It is, but if the attribute value contains _both_ an ASCII quotation
mark " _and_ an ASCII apostrophe ' (which is admittedly rare), then
either of them _must_ be "escaped".
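
For comparison, Python's standard library has a helper that handles exactly this case, xml.sax.saxutils.quoteattr. The sketch below only illustrates the rule with that helper; it says nothing about how TinyXML behaves.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import quoteattr

samples = ["plain", 'say "hi"', "it's", "it's \"quoted\""]

# quoteattr picks a delimiter and escapes only when both characters occur.
quoted = {s: quoteattr(s) for s in samples}
for original, rendered in quoted.items():
    print(rendered)

# Every rendering parses back to the original value.
roundtrips = [ET.fromstring(f"<e a={q}/>").get("a") == s for s, q in quoted.items()]
print(all(roundtrips))
```

With only one kind of quote in the value, switching the delimiter is enough; with both, the quotation marks are written as &quot; inside double-quote delimiters.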

Mateusz Loskot said:
I'm not sure if this approach is correct.

I still don't know what the problem or question is about. You are saying
that the output format is correct. The internal format is not really an XML
issue and mostly a practical question: you need to know the internal format
in order to play with it.

What we _can_ say is that in processing XML data, &quot; and " (assuming a
context where " may appear) must be treated as identical. The distinction
should normally be lost in parsing, but if it is preserved in the internal
format, it should not affect processing of the data as XML. (The
distinction could be retained e.g. in order to be able to print out the
original XML source verbatim for some purpose.)
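
A small check of this equivalence, again sketched with Python's standard xml.etree.ElementTree (any conforming parser should behave the same way): four spellings of the same attribute value, including character references, all parse to one string.

```python
import xml.etree.ElementTree as ET

variants = [
    '<e a="say &quot;hi&quot;"/>',   # entity reference
    "<e a='say \"hi\"'/>",           # literal quote inside apostrophes
    '<e a="say &#34;hi&#34;"/>',     # decimal character reference
    '<e a="say &#x22;hi&#x22;"/>',   # hexadecimal character reference
]

# Collect the distinct parsed values; there should be exactly one.
values = {ET.fromstring(doc).get("a") for doc in variants}
print(values)

# Searching the parsed value for a quotation mark matches in every case.
found = all('"' in ET.fromstring(doc).get("a") for doc in variants)
print(found)
```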
 
Richard Tobin

Jukka K. Korpela said:
It is, but if the attribute value contains _both_ an ASCII quotation
mark " _and_ an ASCII apostrophe ' (which is admittedly rare)

Not that rare: in an XSLT stylesheet an XPath may well contain a
string containing a quote. If you want an XPath string containing
both you're stuck!

-- Richard
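
XPath 1.0 string literals have no escape mechanism, which is why a string containing both characters cannot be written as a single literal. One commonly used workaround is to assemble it with the XPath concat() function. A minimal sketch in Python; `xpath_literal` is a hypothetical helper name, and the resulting expression would still need XML-level escaping when placed inside an attribute.

```python
def xpath_literal(text):
    """Hypothetical helper: build an XPath 1.0 expression that evaluates
    to `text`, falling back to concat() when both quote kinds occur."""
    if "'" not in text:
        return f"'{text}'"
    if '"' not in text:
        return f'"{text}"'
    pieces = []
    for index, part in enumerate(text.split("'")):
        if index:
            # Each apostrophe becomes its own double-quoted literal.
            pieces.append('"\'"')
        if part:
            # Parts contain no apostrophe, so apostrophe delimiters are safe.
            pieces.append(f"'{part}'")
    return "concat(" + ", ".join(pieces) + ")"

expression = xpath_literal('He said "it\'s"')
print(expression)  # concat('He said "it', "'", 's"')
```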
 
Mateusz Loskot

Jukka said:
I still don't know what the problem or question is about. You are saying
that the output format is correct. The internal format is not really an XML
issue and mostly a practical question: you need to know the internal format
in order to play with it.

What we _can_ say is that in processing XML data, &quot; and " (assuming a
context where " may appear) must be treated as identical.

Yes, I understand it. The problem seems to be more technical and
implementation-related:

http://sourceforge.net/forum/forum.php?thread_id=1370207&forum_id=172103

You can see that the TinyXML parser behaves differently depending on
whether it is compiled in C style or with STL support.

So we are sure that, with any XML parser, when I search an XML element
for ", both &quot; and a literal " (double quote) are expected to
match.

Cheers
 
