[XSLT] Vanishing tab character in attribute value

Christian Roth

Hello,

when using this "identity" processing sheet:


<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" encoding="iso-8859-1" />

  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>


on this XML instance document:

<?xml version="1.0" encoding="iso-8859-1" ?>
<element attr="a&#9;tab" />


the result is:

<?xml version="1.0" encoding="iso-8859-1"?>
<element attr="a tab"/>
                ^^
Tabulator(0x9)--^^

i.e. the numeric character reference from the input document is not
recreated at serialization time; the literal character, a tab, is
written in its place.

Unfortunately, this means that re-applying the identity stylesheet from
above on this document makes the tab character get replaced by a single
space character according to the Attribute-Value Normalization rules
(<http://www.w3.org/TR/REC-xml#AVNormalize>):

<?xml version="1.0" encoding="iso-8859-1"?>
<element attr="a tab"/>
                ^
Space(0x20)-----^
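
The difference between the two parses can be checked with a short script. This is a minimal sketch in Python, which is not part of the setup described in this thread; it assumes the input attribute was written with the character reference `&#9;` and relies on the standard library's expat-based parser applying attribute-value normalization:

```python
import xml.etree.ElementTree as ET

# Input document: the tab is written as a numeric character reference.
with_reference = '<element attr="a&#9;tab" />'
# Serializer output: a literal tab (0x9) inside the attribute value.
with_literal_tab = '<element attr="a\ttab"/>'

value_from_reference = ET.fromstring(with_reference).get("attr")
value_from_literal = ET.fromstring(with_literal_tab).get("attr")

print(repr(value_from_reference))  # the referenced tab survives parsing
print(repr(value_from_literal))    # the literal tab is normalized to a space
```

Only the character reference yields a tab in the parsed attribute value; the literal tab comes back as a space.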

In short: The above "identity" processing sheet does not deliver a
semantically identical document. For it to do so, the tab character in
the attribute value would have to be written as a numeric character
reference, so that a later parser would recreate the tab character in
the attribute value (and not normalize it away to a single space).

I'm using the Xalan J2 2.5D1 XSLT processor. Is this a bug in that
implementation (or rather, in its XML serializer)?

Regards,
Christian
 
Dimitre Novatchev

Christian Roth said:
In short: The above "identity" processing sheet does not deliver a
semantically identical document.

XSLT is a language defining a transformation on a *tree*. It takes as input
a tree (regardless of what the parser did with the input stream of
characters) and produces as its output again a tree (then a serializer will
produce a string output -- only if necessary).

The identity transformation is really an identity -- assuming that we are
transforming a tree into a tree.

Your problem arises due to the fact that you serialize the result of the
first identity transformation and feed the second transformation with
something you consider different.

The solution is not to serialize the intermediate results.
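
The tree-to-tree view can be illustrated outside XSLT. The following is a sketch in Python rather than the Xalan pipeline from this thread; the `identity` function is a hypothetical stand-in for the identity stylesheet, copying one tree into a new tree. The input is assumed to carry the tab as the character reference `&#9;`. Chaining two passes in memory never involves a serializer, so the tab is still present afterwards:

```python
import xml.etree.ElementTree as ET

def identity(node):
    """Hypothetical stand-in for the identity stylesheet: copy a tree into a new tree."""
    copy = ET.Element(node.tag, dict(node.attrib))
    copy.text, copy.tail = node.text, node.tail
    for child in node:
        copy.append(identity(child))
    return copy

source = ET.fromstring('<element attr="a&#9;tab" />')

once = identity(source)
twice = identity(once)  # fed directly from the first result, no serialization in between

print(repr(twice.get("attr")))  # still contains the tab character
```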

Also, XSLT 2.0 has so-called "character maps"
(http://www.w3.org/TR/xslt20/#character-maps).
Using a character map, one can specify that a specific character should be
substituted by a specified string of characters during serialization.



=====
Cheers,

Dimitre Novatchev.
http://fxsl.sourceforge.net/ -- the home of FXSL
 
Christian Roth

Dimitre Novatchev said:
Your problem arises due to the fact that you serialize the result of the
first identity transformation and feed the second transformation with
something you consider different.

The solution is not to serialize the intermediate results.

I probably misphrased my statement: it should not read that the XSLT
processor itself is at fault, but specifically the XML serializer (which
I think you are pointing to as well).

However, it seems that many widely available XSLT processors use, by
default, what I consider a buggy XML serializer implementation: it would
need to write the tab character as a numeric character reference when
serializing an XML element's attribute. Otherwise, when reading the XML
back, the internal tree is different from what it was just before
serialization, which I consider the XML serializer's fault.
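
As a sketch of the behaviour asked for here (in Python, not the Xalan serializer itself), a serializer can write literal tab, line feed and carriage return characters in attribute values as numeric character references. `escape_attribute` and `serialize_empty_element` are hypothetical helpers written only for this illustration:

```python
import xml.etree.ElementTree as ET

def escape_attribute(value):
    """Hypothetical helper: escape an attribute value so it survives re-parsing."""
    # Markup-significant characters first; '&' must be replaced before the others.
    value = value.replace("&", "&amp;").replace("<", "&lt;").replace('"', "&quot;")
    # A literal tab, LF or CR would be normalized to a space by the next parser,
    # so each is written as a numeric character reference instead.
    return value.replace("\t", "&#9;").replace("\n", "&#10;").replace("\r", "&#13;")

def serialize_empty_element(name, attributes):
    """Hypothetical helper: serialize an element that has attributes but no content."""
    attribute_text = "".join(
        ' {}="{}"'.format(key, escape_attribute(val)) for key, val in attributes.items()
    )
    return "<{}{}/>".format(name, attribute_text)

original = {"attr": "a\ttab"}
first_pass = serialize_empty_element("element", original)

# Read the serialized form back and serialize it again.
reparsed = ET.fromstring(first_pass)
second_pass = serialize_empty_element(reparsed.tag, reparsed.attrib)

print(first_pass)
print(reparsed.attrib == original)
```

With the tab written as a character reference, the re-parsed attribute equals the original value and a second pass produces the same text as the first.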

Am I overlooking something?

Regards, Christian.
 
Richard Tobin

Christian Roth said:
I probably misphrased my statement, it should not read that the XSLT
processor itself is at fault, but specifically the XML serializer (which
I think you are pointing to, as well).

The definition of the XML output method in the XSLT spec - which
basically says that if you read it in again you should get the same
data model - implies that the serializer should use a character
reference in this case.

So although it is a bug in the serialization, it is still a violation
of the XSLT spec.

-- Richard
 
Dimitre Novatchev

Richard Tobin said:
The definition of the XML output method in the XSLT spec - which
basically says that if you read it in again you should get the same
data model - implies that the serializer should use a character
reference in this case.

So although it is a bug in the serialization, it is still a violation
of the XSLT spec.

Perhaps "the same data model" means that two representations of an XML
document, one of which is the normalised version of the other, are
considered the same?

Is there any clarity about this?


=====
Cheers,

Dimitre Novatchev.
http://fxsl.sourceforge.net/ -- the home of FXSL
 
Andy Fish

I know I made this point in an earlier reply, but I believe that a tab
entity reference is semantically the same as a tab character, i.e. any XML
parser or serialiser is free to replace one by the other.
 
Christian Roth

Andy Fish said:
I know I made this point in an earlier reply, but I believe that a tab
entity reference is semantically the same as a tab character, i.e. any XML
parser or serialiser is free to replace one by the other.

No, it is not, at least not in attribute values in XML 1.0 conforming
documents. See <http://www.w3.org/TR/REC-xml#AVNormalize> for a
description of why; for the problems that would arise if it were so, see
my original post in this thread.

Regards, Christian.
 
Christian Roth

Dimitre Novatchev said:
Probably "the same data model" means that two representations of an xml
document, one of which is the normalised version of the other, are the same?

The problem is that both of these documents have already been normalized
according to the XML 1.0 normalization rules for attributes, and they
still do not match. So they are different, which is a bug, IMO.
 
Christian Roth

Richard Tobin said:
So although it is a bug in the serialization, it is still a violation
of the XSLT spec.

Thank you for the confirmation. I'll go ahead and file a bug against
Xalan J2 2.5D1.

Regards, Christian.
 
