XML configurable formatter/"Pretty Printer"

Discussion in 'Java' started by Chris, Sep 28, 2005.

  1. Chris

    Chris Guest

    Please set your reader to fixed width for readability's sake...

    I'm aware of the plethora of XML formatters (e.g. those that come with
    Xerces, JDOM, DOM4J), but I have a few special wants.

    - To be able to drop attributes down to the next line if there are more
    than a certain number of them, or if they exceed a certain length.
    - To be able to not only wrap text content but indent it at the current
    indentation level.
    - To be able to set wrap and indentation levels for CDATA sections.

    For example:

    <one two="asdfghjasdfhjk"
         three="asdfhsajkdl"
         four="asdfasdfhjkl"
         five="asfahjkle"
    >

      <six>content</six>
      <seven>content</seven>
      <eight>
        This content is longer than eight characters and will ....
        This content is longer than eight characters and will ....
        This content is longer than eight characters and will ....
      </eight>
      <nine>
        <![CDATA[
        This is CDATA content that I would like to be formatted as well.
        This is CDATA content that I would like to be formatted as well.
        This is CDATA content that I would like to be formatted as well.
        ]]>
      </nine>
    </one>

    I've googled for a couple of days, and while there are a lot of
    formatters out there, none seem to be as powerful as I'd like. Ideally
    there should be a formatter that allows you to set the position of
    every token (element, attribute, attribute text, text, CDATA etc.) in a
    nice configurable and extensible way. Maybe even different element
    formatting depending on its nesting level.

    We have to look at a *lot* of XML that is produced elsewhere and some
    of it abuses XML design so much it makes my eyes bleed. We already
    have a custom XML viewer/search tool (using DOM4J's OutputFormat to
    display), but we still have a lot of XML that is terrible to look at.

    So two questions: is there anything Java based that can do this? I'm
    not expecting there to be, so that leads me to the next question: how
    would you go about implementing this? Custom SAX parser? Extend an
    already existing formatter? Use a parser generator? Remember I'm
    pretty much stuck with a Java solution.

    Ideas?

    Thanks for your time,

    ~Chris
    Chris, Sep 28, 2005
    #1

  2. Chris

    Oliver Wong Guest

    "Chris" <> wrote in message
    news:...
    > Please set your reader to fixed width for readability's sake...
    >
    > I'm aware of the plethora of XML formatters (e.g. those that come with
    > Xerces, JDOM, DOM4J), but I have a few special wants.
    >
    > - To be able to drop attributes down to the next line if there are more
    > than a certain number of them, or if they exceed a certain length.


    So far so good...

    > - To be able to not only wrap text content but indent it at the current
    > indentation level.
    > - To be able to set wrap and indentation levels for CDATA sections.


    Doesn't this change the content of the XML document, so that the
    "pretty-printed" document is no longer semantically equivalent to the
    original document?

    - Oliver
    Oliver Wong, Sep 28, 2005
    #2

  3. Chris

    Chris Guest

    Yep, good point.

    But as this utility is just formatting for human eyes, it doesn't
    matter much if whitespace is added. I already have something roughly
    equal to "view original" source in my XML viewer, if you wanted to see
    the exact placement. I guess that feature would be more of an XML
    renderer than a formatter.

    Still, just being able to format how attributes are placed would be
    handy; I can live without formatting for text and CDATA sections. Some
    people give me XML that has dozens of attributes (ugh... I know), so
    placing them intelligently, so that they can be read without scrolling,
    would be a godsend.
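
To make the attribute rule concrete, here is a minimal sketch of the layout decision on its own, separate from any parser. The class name `AttributeLayout`, the method `startTag` and the two thresholds `maxAttrs` and `maxWidth` are made up for illustration; they are not part of DOM4J or any other library. The tag stays on one line when it fits both limits, otherwise every attribute after the first drops to its own line, aligned under the first one.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: decides whether a start tag's attributes stay on one line. */
final class AttributeLayout {
    private AttributeLayout() {}

    /**
     * Renders a start tag. Attributes stay on one line if there are at most
     * maxAttrs of them and the whole tag is at most maxWidth characters wide;
     * otherwise each attribute after the first goes on its own aligned line.
     */
    static String startTag(String name, Map<String, String> attrs,
                           int maxAttrs, int maxWidth, String indent) {
        // Try the compact single-line form first.
        StringBuilder oneLine = new StringBuilder(indent).append('<').append(name);
        for (Map.Entry<String, String> e : attrs.entrySet()) {
            oneLine.append(' ').append(attr(e));
        }
        oneLine.append('>');
        if (attrs.size() <= maxAttrs && oneLine.length() <= maxWidth) {
            return oneLine.toString();
        }
        // Too many or too wide: align later attributes under the first one.
        String pad = indent + spaces(name.length() + 2); // width of "<name "
        StringBuilder multi = new StringBuilder(indent).append('<').append(name);
        boolean first = true;
        for (Map.Entry<String, String> e : attrs.entrySet()) {
            multi.append(first ? " " : "\n" + pad).append(attr(e));
            first = false;
        }
        return multi.append('>').toString();
    }

    private static String attr(Map.Entry<String, String> e) {
        return e.getKey() + "=\"" + escape(e.getValue()) + "\"";
    }

    /** Escapes the characters that may not appear raw in a quoted attribute value. */
    private static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace("\"", "&quot;");
    }

    private static String spaces(int n) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < n; i++) {
            sb.append(' ');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, String> attrs = new LinkedHashMap<String, String>();
        attrs.put("two", "asdfghjasdfhjk");
        attrs.put("three", "asdfhsajkdl");
        attrs.put("four", "asdfasdfhjkl");
        attrs.put("five", "asfahjkle");
        // Four attributes exceed the limit of three, so they are dropped down.
        System.out.println(startTag("one", attrs, 3, 60, ""));
    }
}
```

Keeping the decision in one pure function means the thresholds can come from configuration, and the same function can be called from whatever parser callback ends up producing the start tag.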

    Thanks,

    ~Chris
    Chris, Sep 28, 2005
    #3
  4. Chris

    Roedy Green Guest

    On 28 Sep 2005 07:33:11 -0700, "Chris" <> wrote
    or quoted :

    >So two questions: is there anything Java based that can do this? I'm
    >not expecting there to be, so that leads me to the next question: how
    >would you go about implementing this? Custom SAX parser? Extend an
    >already existing formatter? Use a parser generator? Remember I'm
    >pretty much stuck with a Java solution.


    You made a comment that suggested you might need a DTD or equivalent
    to implement what you consider acceptable formatting rules. Those
    schemas (is that the correct plural?) could be hard to come by. I
    think the original idea was that every XML document would have a
    corresponding DTD, but there are lots of supposedly XML documents with
    other schemas, many without any at all, and some, I gather, that could
    not even in theory have a DTD.

    So ... it seems to me you have two options: concoct a set of rules
    that can be implemented without a DTD, or use a tool such as Altova
    which, if memory serves, will generate a DTD from a document for you,
    perhaps with a little manual tweaking.
    http://www.altova.com/matrix_x.html

    Altova Enterprise is very expensive. Perhaps it by itself would be
    sufficient.

    If you are lucky, perhaps the documents you are dealing with all come
    with a single type of schema.

    I enjoy this sort of coding. I could write you such a beast for around
    $300 US, price agreed in advance. You would give me your spec, and
    sample documents you had manually formatted to your taste to clarify
    the meaning of your spec. This would give you non-exclusive rights to
    the code.

    --
    Canadian Mind Products, Roedy Green.
    http://mindprod.com Again taking new Java programming contracts.
    Roedy Green, Sep 28, 2005
    #4
  5. Chris

    Chris Smith Guest

    Roedy Green <> wrote:
    > schemas (is that the correct plural?)


    Technically, it's supposed to be "schemata". That sounds silly, though,
    and everyone I've talked to says "schemas". Some dictionaries are even
    listing "schemas" as a valid plural now.

    > I think the original idea was that every XML document would have a
    > corresponding DTD, but there are lots of supposedly XML documents with
    > other schemas, many without any at all, and some, I gather, that could
    > not even in theory have a DTD.


    The only possible set of well-formed XML documents that truly could not
    have a DTD is one that allows an arbitrary element or attribute name on
    the document element. Aside from that, a DTD can be devised that
    includes any possible set of XML documents. For any non-trivial type of
    data, it would be impossible to precisely describe the set of correct
    documents using a DTD (or XML Schema, or anything else). Of course, the
    true challenge is to create a DTD that does not contain *too many* XML
    documents outside of the set, and/or that excludes certain likely
    errors.

    In all cases, though, you have this relationship:

    well-formed >= valid >= correct

    That is, the set of well-formed XML documents is a superset of those
    that are valid in a particular instance, and the set of valid documents
    is a superset of those that are really correct. In almost all cases,
    you can replace "superset" with "strict superset" above.
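
The first two levels of that relationship can be checked mechanically with the JAXP parser that ships with the JDK. The sketch below is only an illustration: the class name `XmlChecks` and its two methods are invented here, and the third level ("correct") is application-specific, so no parser setting can test it.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical helper that separates "well-formed" from "valid against its DTD". */
final class XmlChecks {
    private XmlChecks() {}

    /** True if the text parses as XML at all (DTD validation switched off). */
    static boolean isWellFormed(String xml) {
        return parses(xml, false);
    }

    /** True if the text is well-formed and also satisfies the DTD it declares. */
    static boolean isValid(String xml) {
        return parses(xml, true);
    }

    private static boolean parses(String xml, boolean validate) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(validate); // DTD validation on or off
            DocumentBuilder builder = factory.newDocumentBuilder();
            builder.setErrorHandler(new DefaultHandler() {
                // Validity problems arrive as recoverable errors; make them fail.
                // Well-formedness problems arrive as fatal errors, which
                // DefaultHandler already rethrows.
                @Override
                public void error(SAXParseException e) throws SAXException {
                    throw e;
                }
            });
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (Exception e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String dtd = "<!DOCTYPE note [<!ELEMENT note (to)><!ELEMENT to (#PCDATA)>]>";
        String doc = dtd + "<note><from>x</from></note>";
        // Well-formed, but <from> is not allowed by the DTD.
        System.out.println(isWellFormed(doc) + " " + isValid(doc));
    }
}
```

A document with no DOCTYPE at all is reported as not valid by this sketch, because a validating parser has no grammar to check it against.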

    --
    www.designacourse.com
    The Easiest Way To Train Anyone... Anywhere.

    Chris Smith - Lead Software Developer/Technical Trainer
    MindIQ Corporation
    Chris Smith, Sep 28, 2005
    #5
  6. Chris

    Chris Guest

    Roedy,

    I already have XMLSpy, as it's pretty useful for writing schemas, but
    I'm not sure I get what you are talking about when you suggest DTD.

    Yes, DTDs and XML schemas (XSDs) do set the format of an XML document
    in that they specify the order, type, constraints and plurality of
    elements and attributes. But they don't have any use whatsoever in how
    an XML file is textually laid out. Perhaps "format" is the wrong
    word, but that's why I put "pretty print" in the subject. To be more
    clear, I want a tool that will take:

    <one two="asdfghjasdfhjk" three="asdfhsajkdl" four="asdfasdfhjkl"
    five="asfahjkle"> <six>content</six>
    <seven>content</seven> <eight> This content is longer than eight
    characters and will .... This content is longer than eight characters
    and will .... This content is longer than eight characters and will
    .... </eight> <nine> <![CDATA[ This is CDATA content that I
    would like to be formatted as well. This is CDATA content that I
    would like to be formatted as well. This is CDATA content that I
    would like to be formatted as well. ]]> </nine></one>

    And display it as laid out in my original post. Both examples would be
    validated by the same DTD or schema (though I may have missed a bracket
    when pasting, so they probably aren't valid here), but the difference in
    readability is striking.

    Thanks,

    ~Chris
    Chris, Sep 28, 2005
    #6
  7. Chris

    Oliver Wong Guest

    "Chris" <> wrote in message
    news:...
    > Yep, good point.
    >
    > But as this utility is just formatting for human eyes, it doesn't
    > matter much if whitespace is added. I already have something roughly
    > equal to "view original" source in my XML viewer, if you wanted to see
    > the exact placement. I guess that feature would be more of an XML
    > renderer than a formatter.
    >
    > Still, just being able to format how attributes are placed would be
    > handy; I can live without formatting for text and CDATA sections. Some
    > people give me XML that has dozens of attributes (ugh... I know), so
    > placing them intelligently, so that they can be read without scrolling,
    > would be a godsend.


    Have you considered using an XML viewer with word wrapping support?

    If you just want an XML pretty printer which puts every attribute on its
    own line (instead of measuring the length of the attributes, and then
    putting them on a newline if they exceed 80 characters, for example), I'd
    imagine writing such a pretty printer would be "relatively" trivial.

    Wouldn't you just be writing a SAX Parser that keeps an "indentation"
    variable (perhaps as an int), and increment it with every "startElement"
    event, decrement it with every "endElement" event, and do some slightly
    clever iterating for the attributes within the startElement event?
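
A rough sketch of that idea, using only the SAX classes bundled with the JDK, might look like the following. Strictly speaking it is a SAX handler rather than a parser. The class name `IndentingHandler`, the two-space indent and the `format` helper are assumptions made for this example. It puts every attribute after the first on its own line, and it trims and re-indents text, so it changes whitespace in the content, as already discussed in this thread.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical SAX handler that re-emits a document with one attribute per line. */
final class IndentingHandler extends DefaultHandler {
    private final StringBuilder out = new StringBuilder();
    private int depth = 0; // the "indentation" variable: current nesting level

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        indent();
        out.append('<').append(qName);
        for (int i = 0; i < atts.getLength(); i++) {
            if (i > 0) {
                // Later attributes go on their own line, aligned under the first.
                out.append('\n');
                indent();
                spaces(qName.length() + 1);
            }
            out.append(' ').append(atts.getQName(i))
               .append("=\"").append(escape(atts.getValue(i))).append('"');
        }
        out.append(">\n");
        depth++;
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        depth--;
        indent();
        out.append("</").append(qName).append(">\n");
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Trimming alters the content's whitespace; acceptable for display only.
        // A parser may also deliver one text node in several chunks.
        String text = new String(ch, start, length).trim();
        if (!text.isEmpty()) {
            indent();
            out.append(escape(text)).append('\n');
        }
    }

    private void indent() {
        spaces(depth * 2);
    }

    private void spaces(int n) {
        for (int i = 0; i < n; i++) {
            out.append(' ');
        }
    }

    private static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace("\"", "&quot;");
    }

    /** Parses the given XML text and returns the re-indented rendering. */
    static String format(String xml) {
        try {
            IndentingHandler handler = new IndentingHandler();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            return handler.out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("could not format XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.print(format("<a x=\"1\" y=\"2\"><b>hi</b></a>"));
    }
}
```

Comments, CDATA boundaries and processing instructions are not reported through `DefaultHandler` callbacks used here, so a real tool would also need a `LexicalHandler`; that part is left out of the sketch.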

    - Oliver
    Oliver Wong, Sep 28, 2005
    #7
  8. Chris

    Oliver Wong Guest

    "Chris Smith" <> wrote in message
    news:...
    > The only possible set of well-formed XML documents that truly could not
    > have a DTD is one that allows an arbitrary element or attribute name on
    > the document element. Aside from that, a DTD can be devised that
    > includes any possible set of XML documents. For any non-trivial type of
    > data, it would be impossible to precisely describe the set of correct
    > documents using a DTD (or XML Schema, or anything else). Of course, the
    > true challenge is to create a DTD that does not contain *too many* XML
    > documents outside of the set, and/or that excludes certain likely
    > errors.


    Assuming one wants a DTD which exactly describes the acceptable set of
    XML documents (i.e. which contains *ZERO* XML documents outside of the set),
    I'd assume there's an infinite number of sets of XML documents which could
    not be described by a DTD. I haven't actually done an analysis on the
    expressive power of DTDs, but I'd imagine that they are either as powerful
    as context-free grammars, or as powerful as Turing Machines, both of which
    have limitations on what languages (i.e. sets of documents) they can
    describe. (E.g. accept only the XML documents which, given some suitable
    encoding mechanism, represent program/input pairs which will eventually
    halt.)

    If one allows the DTD to describe a set which is a superset of the
    desired set of legal XML documents, then one could always trivially write
    whatever DTD's equivalent is to "accept everything and anything".

    - Oliver
    Oliver Wong, Sep 28, 2005
    #8
  9. Chris

    Chris Smith Guest

    Oliver Wong <> wrote:
    > Assuming one wants a DTD which exactly describes the acceptable set of
    > XML documents (i.e. which contains *ZERO* XML documents outside of the set),
    > I'd assume there's an infinite number of sets of XML documents which could
    > not be described by DTD.


    Yes, that's what I said.

    > If one allows the DTD to describe a set which is a superset of the
    > desired set of legal XML documents, then one could always trivially write
    > whatever DTD's equivalent is to "accept everything and anything".


    Yes, that's what I said, as well... except that there are certain
    restrictions to what can be said in a DTD. The document (root) element
    must match a specific tag name in the DTD, so at least one tag in the
    document -- the root element -- must be described in the DTD in order
    for the document to validate against the DTD.

    --
    www.designacourse.com
    The Easiest Way To Train Anyone... Anywhere.

    Chris Smith - Lead Software Developer/Technical Trainer
    MindIQ Corporation
    Chris Smith, Sep 28, 2005
    #9
  10. Chris

    Oliver Wong Guest

    "Chris Smith" <> wrote in message
    news:...
    > Oliver Wong <> wrote:
    >> Assuming one wants a DTD which exactly describes the acceptable set of
    >> XML documents (i.e. which contains *ZERO* XML documents outside of the
    >> set), I'd assume there's an infinite number of sets of XML documents
    >> which could not be described by DTD.
    >
    > Yes, that's what I said.


    Sorry, I didn't read your message carefully enough. In particular, I
    misread your statement "a DTD can be devised that includes any possible set
    of XML documents" to mean "a DTD can be devised that describes any possible
    set of XML documents."

    To address the OP's original concern though, I don't think DTDs even
    enter into the picture for what his/her concerns are, given that (s)he
    doesn't even care if the pretty-printed document doesn't have the same
    semantics as the original.

    - Oliver
    Oliver Wong, Sep 28, 2005
    #10
  11. Chris

    Roedy Green Guest

    On Wed, 28 Sep 2005 12:55:47 -0600, Chris Smith <>
    wrote or quoted :

    >The only possible set of well-formed XML documents that truly could not
    >have a DTD is one that allows an arbitrary element or attribute name on
    >the document element.

    E.g. Ant, where you can concoct your own tasks.
    --
    Canadian Mind Products, Roedy Green.
    http://mindprod.com Again taking new Java programming contracts.
    Roedy Green, Sep 28, 2005
    #11
  12. Chris

    Chris Guest

    "Have you considered using a XML Viewer with word wrapping support?"

    The application I want to put it in *is* a specialized XML Viewer. :)
    Choosing a general one would negate several of my very specific
    requirements.

    "Wouldn't you just be writing a SAX Parser that keeps an "indentation"
    variable..."

    That's exactly what I'm doing, but another piece of software means
    another piece of software to maintain. I thought I'd do my due
    diligence on software reuse, but since there's nothing like what I'm
    looking for, it's off to Eclipse I go...
    Chris, Sep 29, 2005
    #12
  13. Chris

    Jaakko Kangasharju Guest

    "Oliver Wong" <> writes:

    > Assuming one wants a DTD which exactly describes the acceptable
    > set of XML documents (i.e. which contains *ZERO* XML documents
    > outside of the set), I'd assume there's an infinite number of sets
    > of XML documents which could not be described by DTD.


    You're right, and this has been known for over a hundred years[1] :)

    > I haven't actually done an analysis on the expressive power of DTDs,
    > but I'd imagine that they are either as powerful as context-free
    > grammars, or as powerful as Turing Machines


    DTDs actually form a subset of what is called regular tree grammars.
    The "regular" doesn't quite mean the same as with strings; regular
    tree grammars resemble context-free languages more. The paper at
    http://www.mulberrytech.com/Extreme/Proceedings/html/2001/Murata01/EML2001Murata01.html
    by Murata et al. classifies different XML schema languages according
    to their expressive power.

    [1] It follows directly from the fact that a set is smaller than its
    power set.
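
Spelled out, the counting argument behind [1] runs roughly as follows. The symbols $D$ (all XML documents) and $T$ (all DTDs) are introduced here only for this sketch.

```latex
Let $D$ be the set of all XML documents and $T$ the set of all DTDs.
Both consist of finite strings over a countable alphabet, so
\[
  |T| \le \aleph_0 = |D| .
\]
A ``set of XML documents'' is an element of the power set $\mathcal{P}(D)$,
and by Cantor's theorem
\[
  |D| < |\mathcal{P}(D)| = 2^{\aleph_0} .
\]
Hence $|T| < |\mathcal{P}(D)|$, so no map $T \to \mathcal{P}(D)$ is onto.
Removing the countably many sets that some DTD does describe from the
uncountable $\mathcal{P}(D)$ still leaves uncountably many sets of
documents that no DTD describes exactly.
```

The same argument applies to any schema language whose schemas are finite strings, which is why it says nothing about how DTDs compare to other schema languages; that is what the regular tree grammar classification is for.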

    --
    Jaakko Kangasharju, Helsinki Institute for Information Technology
    () ASCII RIBBON CAMPAIGN
    /\ AGAINST HTML MAIL
    Jaakko Kangasharju, Sep 29, 2005
    #13