Copying and indenting XML files

Harrie

Hi group,

I want to indent existing XML files so they are more readable (at least
to me). At the moment I'm looking at the XML files that OpenOffice.org's
Writer application produces in its zipped "SXW" format (they're a single
line, probably to save space, which I find hard to read). At first I
thought I was going to do it with sed/awk or something like that, but
then I remembered the xsl:output element with the indent attribute of
XSL and this seems more natural to me. What I'm using now is this XSL file:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

<xsl:output method="xml" indent="yes" encoding="UTF-8"/>

<xsl:template match="*">
<xsl:copy-of select=".">
<xsl:apply-templates/>
</xsl:copy-of>
</xsl:template>

</xsl:stylesheet>

This works like a charm, but I cannot copy the DOCTYPE declaration (and
XML declaration, but that's of less importance to me at this moment).

I've done some Googling and found out that it's not possible using XSL
since the document type declaration is not part of the tree model of the
XML file.

http://www.biglist.com/lists/xsl-list/archives/200106/msg00585.html

I'm using xsltproc as XSL processor and I know you can pass arguments to
it, so I'm looking for a way to extract the PUBLIC and/or SYSTEM
identifier of an XML file with other tools and pass it as an argument to
xsltproc, so it can generate a DOCTYPE declaration with the doctype-public
and/or doctype-system attributes of xsl:output, but I'm not really sure how
to tackle this.

Has somebody already done something like this? Does someone have some
pointers for me?
 
Joe Kesselman

Harrie said:
This works like a charm, but I cannot copy the DOCTYPE declaration (and
XML declaration, but that's of less importance to me at this moment).

I've done some Googling and found out that it's not possible using XSL
since the document type declaration is not part of the tree model of the
XML file.

That's correct. You can explicitly specify the Public and System
Identifiers to be used in XSLT's output (see the doctype-public and
doctype-system options on the xsl:output directive), but as far as I
know there's no standard way to retrieve those values from the source
document in XSLT or XPath 1.0. (2.0 may change that.)

I believe both the DOM and SAX APIs expose these fields, though, so if
you really want to, it shouldn't be too hard to write a front-end tool
to obtain them and then pass those to your XSLT processor as parameters.
Or you could just write an indenting tool that uses those APIs to parse
the document in, explicitly modify it to add the indentation, and
serialize it back out.
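
A minimal sketch of such a front-end, assuming Python and its bundled Expat binding (xml.parsers.expat) rather than any tool named in this thread. The function name read_doctype and the sample document are made up for illustration. Expat calls the doctype handler with the root name, the SYSTEM identifier, the PUBLIC identifier and a flag for an internal subset, so a single long line is no obstacle:

```python
import xml.parsers.expat


def read_doctype(xml_text):
    """Return (root_name, public_id, system_id), or None if there is no DOCTYPE."""
    found = []

    def on_doctype(name, system_id, public_id, has_internal_subset):
        # Expat passes the SYSTEM identifier before the PUBLIC one.
        found.append((name, public_id, system_id))

    parser = xml.parsers.expat.ParserCreate()
    parser.StartDoctypeDeclHandler = on_doctype
    parser.Parse(xml_text, True)
    return found[0] if found else None


# Hypothetical one-line input standing in for an unzipped Writer file.
sample = (
    '<?xml version="1.0" encoding="UTF-8"?>'
    '<!DOCTYPE doc PUBLIC "-//EXAMPLE//DTD Doc 1.0//EN" "doc.dtd">'
    '<doc><p>text</p></doc>'
)
print(read_doctype(sample))
```

Either identifier comes back as None when the declaration does not contain it, so the caller can decide which output attributes to set.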

WARNING: Changing indentation means changing the text content of the
document, and may change its actual meaning. Don't assume the
pretty-printed version is usable in place of the original; know what the
requirements are of the program you're working with. (Or you could avoid
changing the file at all, and use an XML-aware editor to make its
structure more visible.)
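
To make the warning concrete, here is a small sketch, assuming Python's standard xml.dom.minidom as the pretty-printer (other indenters insert whitespace in different places). The helper text_of and the sample paragraph are illustrative only. It compares the character data of a mixed-content element before and after indentation:

```python
from xml.dom import minidom


def text_of(node):
    """Concatenate all character data below a node, in document order."""
    if node.nodeType == node.TEXT_NODE:
        return node.data
    return "".join(text_of(child) for child in node.childNodes)


original = "<p>Hello <b>world</b>!</p>"
doc = minidom.parseString(original)
indented = doc.toprettyxml(indent="  ")

before = text_of(doc.documentElement)
after = text_of(minidom.parseString(indented).documentElement)
print(repr(before))
print(repr(after))
print(before == after)
```

The words are the same in both versions, but the indented one carries extra newlines and spaces inside the paragraph, which is exactly what an application that treats whitespace as significant would notice.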
 
Martin Honnen

Harrie wrote:

but
then I remembered the xsl:output element with the indent attribute of
XSL and this seems more natural to me. What I'm using now is this XSL file:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

<xsl:output method="xml" indent="yes" encoding="UTF-8"/>

<xsl:template match="*">
<xsl:copy-of select=".">
<xsl:apply-templates/>
</xsl:copy-of>
</xsl:template>

</xsl:stylesheet>

This works like a charm,

Really? xsl:copy-of will copy the element, its attributes and its child
nodes, then you additionally use xsl:apply-templates to process the
child nodes again, so you should get a lot of duplicated content that way.

but I cannot copy the DOCTYPE declaration (and
XML declaration, but that's of less importance to me at this moment).

You can't copy those, but you can output them with the xsl:output
instruction, e.g.
<xsl:output omit-xml-declaration="no" />

<xsl:output encoding="utf-8" />

<xsl:output
doctype-public="public id here"
doctype-system="system id here" />

Of course it will be a problem if you want to use one stylesheet to
indent lots of different XML documents with different doctype
declarations but if you know the doctype all those documents need then
you can make sure that is output with the above instruction.
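
One possible way around the one-stylesheet-per-doctype problem, sketched in Python under stated assumptions: in XSLT 1.0 the attributes of xsl:output are fixed text and cannot be filled in from a stylesheet parameter, so a small script can instead write out a stylesheet per document with the identifiers already in place. The function name make_stylesheet and the template below are illustrative, and the identifiers are assumed to be known already, however they were obtained:

```python
from xml.sax.saxutils import quoteattr

# Illustrative template: copies the whole input and lets xsl:output indent it.
TEMPLATE = """<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes" encoding="UTF-8"{doctype}/>
  <xsl:template match="/">
    <xsl:copy-of select="."/>
  </xsl:template>
</xsl:stylesheet>
"""


def make_stylesheet(public_id=None, system_id=None):
    """Build an indenting stylesheet with the given identifiers filled in."""
    attrs = ""
    if public_id:
        # quoteattr escapes the value and adds the surrounding quotes.
        attrs += " doctype-public=" + quoteattr(public_id)
    if system_id:
        attrs += " doctype-system=" + quoteattr(system_id)
    return TEMPLATE.format(doctype=attrs)


print(make_stylesheet("-//EXAMPLE//DTD Doc 1.0//EN", "doc.dtd"))
```

The printed stylesheet can be saved to a file and handed to the XSLT processor like any other. Note that the XSLT 1.0 Recommendation says doctype-public is ignored unless doctype-system is also given, so a document with only a PUBLIC identifier would need separate handling.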

In addition to that some XSLT processors have extensions, like Saxon 6
for instance
<http://saxon.sourceforge.net/saxon6.5.5/extensions.html#saxon:doctype>
to allow you to output doctype declarations.
 
Harrie

Joe Kesselman said the following on 2/27/2006 01:13 +0200:
Harrie wrote:

That's correct. You can explicitly specify the Public and System
Identifiers to be used in XSLT's output (see the doctype-public and
doctype-system options on the xsl:output directive), but as far as I
know there's no standard way to retrieve those values from the source
document in XSLT or XPath 1.0. (2.0 may change that.)

I believe both the DOM and SAX APIs expose these fields, though, so if
you really want to, it shouldn't be too hard to write a front-end tool
to obtain them and then pass those to your XSLT processor as parameters.

This is what I had in mind and described at the end of my original
posting, but I don't have any experience with APIs. I have installed
XMLgawk, which uses Expat, and hope that it might help me, but I have a
hard time digesting the language (I do have some experience with (g)awk
itself, but I find this quite different).

Or you could just write an indenting tool that uses those APIs to parse
the document in, explicitly modify it to add the indentation, and
serialize it back out.

Before I thought about using XSLT for indenting, I was thinking about a
POSIX shell script which uses some awk and sed. I suppose this would give
me more flexibility, since I have no control over the amount of indenting
with 'xsl:output indent="yes"', and when I write something myself I
probably can.
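
As a sketch of what that extra control could look like, swapping Python's standard xml.dom.minidom in for the awk/sed script described above (an assumption for illustration, not something tried in this thread), the indentation width becomes an ordinary parameter. The function name indent_xml is made up:

```python
from xml.dom import minidom


def indent_xml(xml_text, width=2):
    """Re-serialize xml_text with `width` spaces per nesting level."""
    doc = minidom.parseString(xml_text)
    return doc.toprettyxml(indent=" " * width)


print(indent_xml("<a><b><c/></b></a>", width=4))
```

Passing a different width, or a tab via a small variation of the function, changes the layout without touching the parsing side.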

WARNING: Changing indentation means changing the text content of the
document, and may change its actual meaning. Don't assume the
pretty-printed version is usable in place of the original; know what the
requirements are of the program you're working with. (Or you could avoid
changing the file at all, and use an XML-aware editor to make its
structure more visible.)

Yes, the same is true for HTML, where white space can be significant. I
hadn't thought about it in this particular case, 'cause all I want to do
is reformat it so I can read it more easily and try to understand it
(it's just for educational purposes). I won't use the indented results
for anything else, but thanks for reminding me.
 
Harrie

Martin Honnen said the following on 2/27/2006 14:01 +0200:
Harrie wrote:
[stripped xsl file]
This works like a charm,

Really? xsl:copy-of will copy the element, its attributes and its child
nodes, then you additionally use xsl:apply-templates to process the
child nodes again, so you should get a lot of duplicated content that way.

I've read up on xsl:copy-of and see you're quite right. The
xsl:apply-templates is a leftover from my start with xsl:copy, which
didn't work for me 'cause it doesn't copy the attributes and child nodes.

Strangely enough, I don't have duplicated content, and without the
xsl:apply-templates rule I get exactly the same result (I compared it
with "diff").

But thanks for pointing this out to me.

You can't copy those, but you can output them with the xsl:output
instruction, e.g.
<xsl:output omit-xml-declaration="no" />

<xsl:output encoding="utf-8" />

<xsl:output
doctype-public="public id here"
doctype-system="system id here" />

Yes, this is what I have in mind; it was at the end of my original posting.

Of course it will be a problem if you want to use one stylesheet to
indent lots of different XML documents with different doctype
declarations but if you know the doctype all those documents need then
you can make sure that is output with the above instruction.

At this moment I'm looking at the XML files OpenOffice.org's Writer
application is producing, so in this case I can hard code the DOCTYPE,
but I want a general solution.

In addition to that some XSLT processors have extensions, like Saxon 6
for instance
<http://saxon.sourceforge.net/saxon6.5.5/extensions.html#saxon:doctype>
to allow you to output doctype declarations.

Thanks, but outputting it is not my problem (at least, not yet). I can do
that with XSL already, as we both described earlier, but I need to find
a way to extract it from the source document first. Since
OpenOffice.org's Writer files are (when unzipped) XML files with only 1
long line, I find it hard to extract only the DOCTYPE (if it had been
multiple lines, I would have used grep with awk).

I've read that XMLgawk can read files by element, so I'm hoping that can
help me, but as I said in my reply to Joe, I have a hard time mastering
its syntax.

Hmmm, if the DOCTYPE is not a part of the document tree, is it an element?
 
Harrie

Joe Kesselman said the following on 2/27/2006 01:13 +0200:
WARNING: Changing indentation means changing the text content of the
document, and may change its actual meaning. Don't assume the
pretty-printed version is usable in place of the original; know what the
requirements are of the program you're working with. (Or you could avoid
changing the file at all, and use an XML-aware editor to make its
structure more visible.)

Just out of curiosity:

I just read section 16.1 of the XSLT Recommendation [1] and there is a
warning (NOTE) about using indent with mixed content. I understand that
white space is significant there, but mixed content is not a good way of
writing XML anyway.

Above that, there is a paragraph about indent. I don't understand the
last long sentence of that paragraph (starting with: "The xml output method
should use an algorithm .." till the end).

Can somebody give an example of what is meant there?

[1] http://www.w3.org/TR/xslt#section-XML-Output-Method
 
