XML configurable formatter/"Pretty Printer"

Chris

Please set your reader to fixed width for readability's sake...

I'm aware of the plethora of XML formatters (e.g. those that come with
Xerces, JDOM, DOM4J), but I have a few special wants.

- To be able to drop attributes down to the next line if there are more
than a certain number of them, or if they exceed a certain length.
- To be able to not only wrap text content but indent it at the current
indentation level.
- To be able to set wrap and indentation levels for CDATA sections.

For example:

<one two="asdfghjasdfhjk"
     three="asdfhsajkdl"
     four="asdfasdfhjkl"
     five="asfahjkle">
    <six>content</six>
    <seven>content</seven>
    <eight>
        This content is longer than eight characters and will ....
        This content is longer than eight characters and will ....
        This content is longer than eight characters and will ....
    </eight>
    <nine>
        <![CDATA[
        This is CDATA content that I would like to be formatted as well.
        This is CDATA content that I would like to be formatted as well.
        This is CDATA content that I would like to be formatted as well.
        ]]>
    </nine>
</one>
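
The first wish (dropping attributes to the next line once there are too many of them, or once the tag gets too long) could be sketched roughly as follows. This is only a sketch under assumptions: `StartTagLayout`, `render`, `maxAttrs` and `maxWidth` are made-up names, and attribute values are assumed to be escaped already.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: lays out one start tag, wrapping attributes when
// there are too many of them or the single-line form is too long.
class StartTagLayout {

    static String render(String name, Map<String, String> attrs,
                         int indent, int maxAttrs, int maxWidth) {
        String oneLine = build(name, attrs, indent, false);
        if (attrs.size() <= maxAttrs && oneLine.length() <= maxWidth) {
            return oneLine;
        }
        return build(name, attrs, indent, true);
    }

    private static String build(String name, Map<String, String> attrs,
                                int indent, boolean wrap) {
        StringBuilder sb = new StringBuilder();
        pad(sb, indent);
        sb.append('<').append(name);
        // Column of the first attribute: indent + '<' + name + one space.
        int column = indent + name.length() + 2;
        boolean first = true;
        for (Map.Entry<String, String> e : attrs.entrySet()) {
            if (first || !wrap) {
                sb.append(' ');
            } else {
                sb.append('\n');
                pad(sb, column);
            }
            first = false;
            // Assumes the value is already escaped for use inside quotes.
            sb.append(e.getKey()).append("=\"").append(e.getValue()).append('"');
        }
        return sb.append('>').toString();
    }

    private static void pad(StringBuilder sb, int n) {
        for (int i = 0; i < n; i++) {
            sb.append(' ');
        }
    }

    public static void main(String[] args) {
        Map<String, String> attrs = new LinkedHashMap<>();
        attrs.put("two", "asdfghjasdfhjk");
        attrs.put("three", "asdfhsajkdl");
        attrs.put("four", "asdfasdfhjkl");
        // More than two attributes, so each one after the first gets its own line.
        System.out.println(render("one", attrs, 0, 2, 80));
    }
}
```

A `LinkedHashMap` is used so the attributes come out in insertion order; whether a real formatter should keep document order or sort them is a separate choice.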

I've googled for a couple of days, and while there are a lot of
formatters out there, none seem to be as powerful as I'd like. Ideally
there should be a formatter that allows you to set the position of
every token (element, attribute, attribute text, text, CDATA etc.) in a
nice configurable and extensible way. Maybe even different element
formatting depending on its nesting level.

We have to look at a *lot* of XML that is produced elsewhere and some
of it abuses XML design so much it makes my eyes bleed. We already
have a custom XML viewer/search tool (using DOM4J's OutputFormat to
display), but we still have a lot of XML that is terrible to look at.

So two questions: is there anything Java based that can do this? I'm
not expecting there to be, so that leads me to the next question: how
would you go about implementing this? Custom SAX parser? Extend an
already existing formatter? Use a parser generator? Remember I'm
pretty much stuck with a Java solution.

Ideas?

Thanks for your time,

~Chris
 
Oliver Wong

Chris said:
Please set your reader to fixed width for readability's sake...

I'm aware of the plethora of XML formatters (e.g. those that come with Xerces,
JDOM, DOM4J), but I have a few special wants.

- To be able to drop attributes down to the next line if there are more
than a certain number of them, or if they exceed a certain length.

So far so good...

- To be able to not only wrap text content but indent it at the current
indentation level.
- To be able to set wrap and indentation levels for CDATA sections.

Doesn't this change the content of the XML document, so that the
"pretty-printed" document is no longer semantically equivalent to the
original document?
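
To make the concern concrete, here is a small sketch (class and method names are made up for illustration) that parses two documents with the JDK's DOM parser and compares them with `Node.isEqualNode`. Moving attributes onto separate lines leaves the parsed content unchanged, while wrapping or indenting text content does not.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical helper for comparing the parsed content of two XML strings.
class WhitespaceCheck {

    static Element parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse: " + xml, e);
        }
    }

    // True when both documents have equal root elements (names, attributes,
    // and child nodes, including every character of text content).
    static boolean sameContent(String a, String b) {
        return parse(a).isEqualNode(parse(b));
    }

    public static void main(String[] args) {
        // Whitespace between attributes is not part of the content.
        System.out.println(sameContent(
                "<e a=\"1\" b=\"2\">x</e>",
                "<e a=\"1\"\n   b=\"2\">x</e>"));
        // Whitespace added around text content is.
        System.out.println(sameContent(
                "<e>x</e>",
                "<e>\n  x\n</e>"));
    }
}
```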

- Oliver
 
Chris

Yep, good point.

But as this utility is just formatting for human eyes, it doesn't
matter much if whitespace is added. I already have something roughly
equal to "view original" source in my XML viewer, if you wanted to see
the exact placement. I guess that feature would be more of an XML
renderer than a formatter.
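
For display purposes only, wrapping text content at the current indentation could look something like the sketch below. `TextWrapper`, `wrap`, `indent` and `width` are hypothetical names, and the code deliberately collapses runs of whitespace, so it must never be used to write the document back out.

```java
// Hypothetical display-only helper: re-flows text to a maximum line width,
// prefixing every line with the current indentation.
class TextWrapper {

    static String wrap(String text, int indent, int width) {
        StringBuilder pad = new StringBuilder();
        for (int i = 0; i < indent; i++) {
            pad.append(' ');
        }
        StringBuilder out = new StringBuilder();
        StringBuilder line = new StringBuilder();
        for (String word : text.trim().split("\\s+")) {
            if (word.isEmpty()) {
                continue; // happens only when the text is blank
            }
            // Start a new line if adding this word would pass the width.
            // A single word longer than the width is left on its own line.
            if (line.length() > 0
                    && indent + line.length() + 1 + word.length() > width) {
                out.append(pad).append(line).append('\n');
                line.setLength(0);
            }
            if (line.length() > 0) {
                line.append(' ');
            }
            line.append(word);
        }
        if (line.length() > 0) {
            out.append(pad).append(line).append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String text = "This content is longer than eight characters and will "
                + "be wrapped at the indentation of its parent element.";
        System.out.print(wrap(text, 8, 40));
    }
}
```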

Still, just being able to format how attributes are placed would be
handy; I can live without formatting for text and CDATA sections. Some
people give me XML that has dozens of attributes (ugh... I know), so
placing them intelligently so that they can be read without scrolling
would be a godsend.

Thanks,

~Chris
 
Roedy Green

So two questions: is there anything Java based that can do this? I'm
not expecting there to be, so that leads me to the next question: how
would you go about implementing this? Custom SAX parser? Extend an
already existing formatter? Use a parser generator? Remember I'm
pretty much stuck with a Java solution.

You made a comment that suggested you might need a DTD or equivalent
to implement what you consider acceptable formatting rules. Those
schemas (is that the correct plural?) could be hard to come by. I
think the original idea was that every XML document would have a
corresponding DTD, but there are lots of supposed XML documents with
other schemas, many without any at all, and some, I gather, that could
not even in theory have a DTD.

So ... it seems to me you have two options: concoct a set of rules
that can be implemented without a DTD, or use a tool such as Altova
which, if memory serves, will generate a DTD from a document,
perhaps with a little manual tweaking.
http://www.altova.com/matrix_x.html

Altova Enterprise is very expensive. Perhaps it by itself would be
sufficient.

If you are lucky, perhaps the documents you are dealing with all come
with a single type of schema.

I enjoy this sort of coding. I could write you such a beast for around
$300 US, price agreed in advance. You would give me your spec, and
sample documents you had manually formatted to your taste to clarify
the meaning of your spec. This would give you non-exclusive rights to
the code.
 
Chris Smith

Roedy Green said:
schemas (is that the correct plural?)

Technically, it's supposed to be "schemata". That sounds silly, though,
and everyone I've talked to says "schemas". Some dictionaries are even
listing "schemas" as a valid plural now.

I think the original idea was that every XML document would have a
corresponding DTD, but there are lots of supposed XML documents with
other schemas, many without any at all, and some, I gather, that could
not even in theory have a DTD.

The only possible set of well-formed XML documents that truly could not
have a DTD is one that allows an arbitrary element or attribute name on
the document element. Aside from that, a DTD can be devised that
includes any possible set of XML documents. For any non-trivial type of
data, it would be impossible to precisely describe the set of correct
documents using a DTD (or XML Schema, or anything else). Of course, the
true challenge is to create a DTD that does not contain *too many* XML
documents outside of the set, and/or that excludes certain likely
errors.

In all cases, though, you have this relationship:

well-formed ⊇ valid ⊇ correct

That is, the set of well-formed XML documents is a superset of those
that are valid in a particular instance, and the set of valid documents
is a superset of those that are really correct. In almost all cases,
you can replace "superset" with "strict superset" above.

--
www.designacourse.com
The Easiest Way To Train Anyone... Anywhere.

Chris Smith - Lead Software Developer/Technical Trainer
MindIQ Corporation
 
Chris

Roedy,

I already have XMLSpy, as it's pretty useful for writing schemas, but
I'm not sure I get what you are talking about when you suggest DTD.

Yes, DTDs and XML schemas (XSDs) do set the format of an XML document
in that they specify the order, type, constraints and plurality of
elements and attributes. But they have no bearing whatsoever on how
an XML file is textually laid out. Perhaps "format" is the wrong
word, but that's why I put "pretty print" in the subject. To be more
clear, I want a tool that will take:

<one two="asdfghjasdfhjk" three="asdfhsajkdl" four="asdfasdfhjkl"
five="asfahjkle"> <six>content</six>
<seven>content</seven> <eight> This content is longer than eight
characters and will .... This content is longer than eight characters
and will .... This content is longer than eight characters and will
..... </eight> <nine> <![CDATA[ This is CDATA content that I
would like to be formatted as well. This is CDATA content that I
would like to be formatted as well. This is CDATA content that I
would like to be formatted as well. ]]> </nine></one>

And display it as laid out in my original post. Both examples would be
validated by the same DTD or schema (though I may have missed a bracket
when pasting, so they probably aren't valid here), but the difference in
readability is striking.

Thanks,

~Chris
 
Oliver Wong

Chris said:
Yep, good point.

But as this utility is just formatting for human eyes, it doesn't
matter much if whitespace is added. I already have something roughly
equal to "view original" source in my XML viewer, if you wanted to see
the exact placement. I guess that feature would be more of an XML
renderer than a formatter.

Still, just being able to format how attributes are placed would be
handy; I can live without formatting for text and CDATA sections. Some
people give me XML that has dozens of attributes (ugh... I know), so
placing them intelligently so that they can be read without scrolling
would be a godsend.

Have you considered using an XML viewer with word wrapping support?

If you just want an XML pretty printer which puts every attribute on its
own line (instead of measuring the length of the attributes, and then
putting them on a newline if they exceed 80 characters, for example), I'd
imagine writing such a pretty printer would be "relatively" trivial.

Wouldn't you just be writing a SAX Parser that keeps an "indentation"
variable (perhaps as an int), increments it with every "startElement"
event, decrements it with every "endElement" event, and does some slightly
clever iterating for the attributes within the startElement event?
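
Something along these lines, as a rough sketch rather than a finished tool. It uses the JDK's own SAX classes; the class name `IndentingHandler`, the `format` method and the two-space indent are arbitrary choices made for illustration, and comments, CDATA boundaries and processing instructions are ignored entirely.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: tracks nesting depth and puts each attribute
// after the first on its own line, aligned under the first one.
class IndentingHandler extends DefaultHandler {
    private final StringBuilder out = new StringBuilder();
    private int indent = 0; // current nesting depth

    private void pad(int columns) {
        for (int i = 0; i < columns; i++) {
            out.append(' ');
        }
    }

    private static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace("\"", "&quot;");
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        pad(indent * 2);
        out.append('<').append(qName);
        // Column where the first attribute starts: indent + '<' + name + space.
        int attrColumn = indent * 2 + qName.length() + 2;
        for (int i = 0; i < atts.getLength(); i++) {
            if (i == 0) {
                out.append(' ');
            } else {
                out.append('\n');
                pad(attrColumn);
            }
            out.append(atts.getQName(i)).append("=\"")
               .append(escape(atts.getValue(i))).append('"');
        }
        out.append(">\n");
        indent++;
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        indent--;
        pad(indent * 2);
        out.append("</").append(qName).append(">\n");
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Display only: trims each chunk, so whitespace is NOT preserved.
        // A parser may also deliver one text run in several chunks.
        String text = new String(ch, start, length).trim();
        if (!text.isEmpty()) {
            pad(indent * 2);
            out.append(escape(text)).append('\n');
        }
    }

    static String format(String xml) {
        try {
            IndentingHandler handler = new IndentingHandler();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            return handler.out.toString();
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse input", e);
        }
    }

    public static void main(String[] args) {
        System.out.print(format("<a x=\"1\" y=\"2\"><b>hi</b></a>"));
    }
}
```

SAX does not promise to report attributes in document order, although the parser bundled with the JDK does in practice; a real tool might want to sort them or go through a DOM instead.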

- Oliver
 
Oliver Wong

Chris Smith said:
The only possible set of well-formed XML documents that truly could not
have a DTD is one that allows an arbitrary element or attribute name on
the document element. Aside from that, a DTD can be devised that
includes any possible set of XML documents. For any non-trivial type of
data, it would be impossible to precisely describe the set of correct
documents using a DTD (or XML Schema, or anything else). Of course, the
true challenge is to create a DTD that does not contain *too many* XML
documents outside of the set, and/or that excludes certain likely
errors.

Assuming one wants a DTD which exactly describes the acceptable set of
XML documents (i.e. which contains *ZERO* XML documents outside of the set),
I'd assume there's an infinite number of sets of XML documents which could
not be described by a DTD. I haven't actually done an analysis of the
expressive power of DTDs, but I'd imagine that they are either as powerful
as context-free grammars, or as powerful as Turing Machines, both of which
have limitations on what languages (i.e. sets of documents) they can
describe. (E.g. accept only the XML documents which, given some suitable
encoding mechanism, represent program/input pairs which will eventually
halt).

If one allows the DTD to describe a set which is a superset of the
desired set of legal XML documents, then one could always trivially write
whatever the DTD equivalent is of "accept everything and anything".

- Oliver
 
Chris Smith

Oliver Wong said:
Assuming one wants a DTD which exactly describes the acceptable set of
XML documents (i.e. which contains *ZERO* XML documents outside of the set),
I'd assume there's an infinite number of sets of XML documents which could
not be described by DTD.

Yes, that's what I said.

If one allows the DTD to describe a set which is a superset of the
desired set of legal XML documents, then one could always trivially write
whatever DTD's equivalent is to "accept everything and anything".

Yes, that's what I said, as well... except that there are certain
restrictions to what can be said in a DTD. The document (root) element
must match a specific tag name in the DTD, so at least one tag in the
document -- the root element -- must be described in the DTD in order
for the document to validate against the DTD.

 
Oliver Wong

Chris Smith said:
Yes, that's what I said.

Sorry, I didn't read your message carefully enough. In particular, I
misread your statement "a DTD can be devised that includes any possible set
of XML documents" to mean "a DTD can be devised that describes any possible
set of XML documents."

To address the OP's original concern though, I don't think DTDs even
enter into the picture for what his/her concerns are, given that (s)he
doesn't even care if the pretty-printed document doesn't have the same
semantics as the original.

- Oliver
 
Roedy Green

The only possible set of well-formed XML documents that truly could not
have a DTD is one that allows an arbitrary element or attribute name on
the document element.

E.g. Ant, where you can concoct your own tasks.
 
Chris

"Have you considered using an XML viewer with word wrapping support?"

The application I want to put it in *is* a specialized XML Viewer. :)
Choosing a general one would negate several of my very specific
requirements.

"Wouldn't you just be writing a SAX Parser that keeps an "indentation"
variable..."

That's exactly what I'm doing, but another piece of software means
another piece of software to maintain. I thought I'd do my due
diligence on software reuse, but since there's nothing like what I'm
looking for, it's off to Eclipse I go...
 
Jaakko Kangasharju

Oliver Wong said:
Assuming one wants a DTD which exactly describes the acceptable
set of XML documents (i.e. which contains *ZERO* XML documents
outside of the set), I'd assume there's an infinite number of sets
of XML documents which could not be described by DTD.

You're right, and this has been known for over a hundred years[1] :)

I haven't actually done an analysis on the expressive power of DTDs,
but I'd imagine that they are either as powerful as context-free
grammars, or as powerful as Turing Machines

DTDs actually form a subset of what are called regular tree grammars.
The "regular" doesn't quite mean the same as with strings; regular
tree grammars resemble context-free languages more. The paper at
http://www.mulberrytech.com/Extreme/Proceedings/html/2001/Murata01/EML2001Murata01.html
by Murata et al. classifies different XML schema languages according
to their expressive power.

[1] It follows directly from the fact that a set is smaller than its
power set.
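
Spelled out as a counting argument, under the assumption that both XML documents and DTDs are finite strings over a finite alphabet: let $X$ be the (countably infinite) set of well-formed XML documents, $D$ the set of all DTDs, and $L(d) \subseteq X$ the set of documents valid against a DTD $d$. These symbols are introduced here only for illustration.

```latex
\[
  \bigl|\{\, L(d) : d \in D \,\}\bigr|
  \;\le\; |D|
  \;\le\; \aleph_0
  \;<\; 2^{\aleph_0}
  \;=\; |\mathcal{P}(X)|
\]
```

So only countably many of the uncountably many subsets of $X$ can be described exactly by a DTD, and the same argument applies to any schema language whose schemas are finite strings.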
 
