partial DTD?

Discussion in 'XML' started by Rainer Gerhards, Jun 21, 2010.

  1. Hi All,

    please forgive me if this question is too basic. I am an XML beginner (at
    best ;)). For my open source project rsyslog [1] I am trying to find a
    better configuration file format. One of the candidates is an XML-based
    format [2]. If we take that route, I'd like to have the ability to at least
    partially verify a configuration file.

    However, in rsyslog nothing is static. Instead, functionality is loaded via
    modules, which can be written by third parties. These modules have (and
    need) the ability to add configuration parameters to the base set. So I
    never know exactly which parameters are valid. This makes it somewhat hard
    for me to define a DTD. I understand that probably the best option would be
    to have a mechanism that permits a plugin to modify the DTD before it is
    used. However, this sounds like a scary amount of work for which there is no
    other justification.

    So I wonder if it is possible to specify a DTD in a way that says "these are
    the rules for the elements specified inside this DTD, but additional
    containers may be added and are expected to be valid".

    Any advice on this topic would be most welcome.

    Thanks,
    Rainer

    [1] http://www.rsyslog.com
    [2] http://lists.adiscon.net/pipermail/rsyslog/2010-June/003749.html
     
    Rainer Gerhards, Jun 21, 2010
    #1

  2. Rainer Gerhards wrote:

    > So I wonder if it is possible to specify a DTD in a way that says "these
    > are the rules for the elements specified inside this DTD, but additional
    > containers may be added and are expected to be valid".


    I am not aware of any such feature in DTDs. The W3C XML Schema
    specification, however, allows wildcards
    (http://www.w3.org/TR/xmlschema-0/#any) and schema composition
    (http://www.w3.org/TR/xmlschema-0/#import), so you could consider using
    schemas instead of a DTD.


    --

    Martin Honnen
    http://msmvps.com/blogs/martin_honnen/
     
    Martin Honnen, Jun 21, 2010
    #2

  3. Peter Flynn (Guest)

    Manuel Collado wrote:
    > Rainer Gerhards wrote:
    >> Hi All,
    >>
    >> please forgive me if this question is too basic. I am an XML beginner
    >> (at best ;)). For my open source project rsyslog [1] I am trying to
    >> find a better configuration file format. One of the candidates is an
    >> XML-based format [2]. If we take that route, I'd like to have the
    >> ability to at least partially verify a configuration file.
    >>
    >> However, in rsyslog nothing is static. Instead, functionality is
    >> loaded via modules, which can be written by third parties. These
    >> modules have (and need) the ability to add configuration parameters to
    >> the base set. So I never know exactly which parameters are valid. This
    >> makes it somewhat hard for me to define a DTD. I understand that
    >> probably the best option would be to have a mechanism that permits a
    >> plugin to modify the DTD before it is used. However, this sounds
    >> like a scary amount of work for which there is no other justification.
    >>
    >> So I wonder if it is possible to specify a DTD in a way that says
    >> "these are the rules for the elements specified inside this DTD, but
    >> additional containers may be added and are expected to be valid".
    >>
    >> Any advice on this topic would be most welcome.

    >
    > Not sure about what is really your problem:
    >
    > (1) Open set of valid parameter values
    > (2) Open set of module/parameter names
    >
    > If (1), the usual answer is to not constrain the set of valid values at
    > the XML markup level - implement validation checks at the application
    > level.
    >
    > If (2), do not use parameter/module names as tag names. Use attribute or
    > element values instead:
    > <param name="xxx">value</param>


    I'd agree very much with this: it makes it extensible to almost any case.

    If your application follows the conventional pattern, there are probably
    some base-level settings which apply globally, some which may be
    customised on (perhaps) a per-user or per-group basis, and some which
    apply to specific modules. This usually means a structure something like
    this:

    <?xml version="1.0"?>
    <!DOCTYPE config SYSTEM "config-v00.dtd">
    <config application="rsyslog" version="00" YYYY-MM-DD="2010-06-21">
      <base>
        <param name="verbosity">full</param>
      </base>
      <groups>
        <group type="user" name="rainer">
          <param name="autostart">no</param>
        </group>
        <group type="app" name="Google">
          <param name="domain">reverse-lookup</param>
        </group>
      </groups>
      <modules>
        <module name="gui">
          <param name="window-system">X</param>
        </module>
      </modules>
    </config>

    with config-v00.dtd:

    <!ELEMENT config (base,groups,modules)>
    <!ATTLIST config application CDATA #FIXED "rsyslog"
                     version     CDATA #REQUIRED
                     YYYY-MM-DD  CDATA #REQUIRED>
    <!ELEMENT base (param)+>
    <!ELEMENT param (#PCDATA)>
    <!ATTLIST param name NMTOKEN #REQUIRED>
    <!ELEMENT groups (group)+>
    <!ELEMENT group (param)+>
    <!ATTLIST group type (user|app|call) #REQUIRED
                    name CDATA           #REQUIRED>
    <!ELEMENT modules (module)+>
    <!ELEMENT module (param)+>
    <!ATTLIST module name NMTOKEN #REQUIRED>

    If it's possible to constrain module authors to make their module names
    and parameter names stick with A-Za-z0-9\.\-\_ then it makes checking a
    lot easier, but if not, make the attribute types CDATA.
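
    As a rough illustration of why this layout is easy to consume, here is a
    minimal sketch that reads such a file with Python's standard
    xml.etree.ElementTree. The sample is a trimmed copy of the document above
    (DOCTYPE left out, since ElementTree does not validate against a DTD), and
    the function names read_params and load_config are invented for this
    example. New modules and parameters need no change to the reading code.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample config from the post (assumption: no DOCTYPE,
# because ElementTree does not validate against a DTD anyway).
CONFIG = """<config application="rsyslog" version="00">
  <base>
    <param name="verbosity">full</param>
  </base>
  <groups>
    <group type="user" name="rainer">
      <param name="autostart">no</param>
    </group>
  </groups>
  <modules>
    <module name="gui">
      <param name="window-system">X</param>
    </module>
  </modules>
</config>"""


def read_params(parent):
    """Collect the <param name="...">value</param> children of one element."""
    return {p.get("name"): (p.text or "").strip() for p in parent.findall("param")}


def load_config(xml_text):
    """Turn the whole document into plain dictionaries keyed by name."""
    root = ET.fromstring(xml_text)
    return {
        "base": read_params(root.find("base")),
        "groups": {
            (g.get("type"), g.get("name")): read_params(g)
            for g in root.iterfind("groups/group")
        },
        "modules": {
            m.get("name"): read_params(m) for m in root.iterfind("modules/module")
        },
    }


settings = load_config(CONFIG)
print(settings["modules"]["gui"])
```

    Whether a given parameter name or value makes sense for a module is then
    checked by the application, not by the DTD.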

    ///Peter
    --
    XML FAQ: http://xml.silmaril.ie/
     
    Peter Flynn, Jun 21, 2010
    #3
  4. Hello everyone,

    many thanks for the good advice; this is very useful to me.

    I also have a related question. Probably this should have been the first
    question, but I wasn't smart enough to realize that ;) Is there any
    documentation on best practices for XML-based config files available? I
    tried to find such things, but I failed. Maybe I used the wrong search
    words, but in the majority of cases I got information on .NET but nothing
    that applies to XML config files in general.

    If you happen to know useful links, I would appreciate it if you could tell me.

    Thanks again,
    Rainer
     
    Rainer Gerhards, Jun 22, 2010
    #4
  5. Peter Flynn (Guest)

    Rainer Gerhards wrote:
    > I have also a related question. Probably this should have been the
    > first question, but I wasn't smart enough to realize that ;) Is there
    > any documentation on best practices for XML based config files
    > available?


    There is plenty on best practice for XML in general, but I have never
    seen anything specifically about XML for config files.

    Please let us know if you find any (or perhaps when you have finished
    the project, write some :)

    ///Peter
    --
    XML FAQ: http://xml.silmaril.ie/
     
    Peter Flynn, Jun 22, 2010
    #5
  6. "Peter Flynn" <> wrote in message
    news:...
    > Rainer Gerhards wrote:
    >> I have also a related question. Probably this should have been the
    >> first question, but I wasn't smart enough to realize that ;) Is there
    >> any documentation on best practices for XML based config files
    >> available?

    >
    > There is plenty on best practice for XML in general, but I have never
    > seen anything specifically about XML for config files.


    OK, at least I seem not to be too dumb to Google ;)

    > Please let us know if you find any (or perhaps when you have finished
    > the project, write some :)


    Will do when I find one. I am unsure, though, whether a single solution can
    become a "best practice". Anyhow, we had a very interesting discussion
    yesterday on the rsyslog mailing list. It started with this post:

    http://lists.adiscon.net/pipermail/rsyslog/2010-June/003764.html

    which suggests a format that I personally find highly readable, that is
    valid XML, and that seems to be quite compact. Together with a SAX
    interface, it may even provide a solution to my initial question (even
    though the solution is different from the exact question).

    Thanks again for all help!
    Rainer
     
    Rainer Gerhards, Jun 23, 2010
    #6
  7. Peter Flynn (Guest)

    Rainer Gerhards wrote:
    > "Peter Flynn" <> wrote in message
    > news:...
    >> Rainer Gerhards wrote:
    >>> I have also a related question. Probably this should have been the
    >>> first question, but I wasn't smart enough to realize that ;) Is there
    >>> any documentation on best practices for XML based config files
    >>> available?

    >>
    >> There is plenty on best practice for XML in general, but I have never
    >> seen anything specifically about XML for config files.

    >
    > OK, at least I seem not to be too dumb to Google ;)
    >
    >> Please let us know if you find any (or perhaps when you have finished
    >> the project, write some :)

    >
    > Will do when I find one. I am unsure, though, whether a single solution can
    > become a "best practice". Anyhow, we had a very interesting discussion
    > yesterday on the rsyslog mailing list. It started with this post:
    >
    > http://lists.adiscon.net/pipermail/rsyslog/2010-June/003764.html
    >
    > which suggests a format that I personally find highly readable, is valid
    > XML and seems to be quite compact. Together with a SAX interface, it may
    > even provide a solution to my initial question (even though the solution
    > is different from the exact question).


    David suggests some good points, although ultimately it is always a
    trade-off between conciseness and extensibility. Manuel suggested:

    > do not use parameter/module names as tag names. Use attribute or
    > element values instead


    and in general I agree -- for a config file format -- because when you
    come to extend or modify the software, you will find the hard-wired
    tagnames become an obstacle to extensibility, and you then need to start
    maintaining code to read obsolescent versions of config files. In the
    long term, the flexibility of using type and value attributes will make
    your life much easier, but I can understand the initial attraction that
    David expresses of matching the tagnames to the settings you want to
    configure. Have a look at the config files for a large system like
    Apache Cocoon, where (IMHO) they have achieved a reasonable balance
    between conciseness and flexibility.

    David also says:

    > note that with this approach everything important is in a tag, as
    > such you can allow arbitrary text to be in the file outside of tags
    > and just ignore it. This allows such text to be used as comments.


    This is very dangerous. It makes the use of an XML editor for managing
    the config files extremely difficult, and introduces a number of
    unexpected side-effects, including the danger of pernicious mixed
    content. Again, the concept of allowing arbitrary text is attractive,
    but it will cause serious problems for parsing and validation further
    down the line. I strongly recommend against it unless the config file is
    going to be extremely simple (in which case XML is probably the wrong
    choice anyway).

    ///Peter
    --
    XML FAQ: http://xml.silmaril.ie/
     
    Peter Flynn, Jun 23, 2010
    #7
  8. Peter Flynn wrote:
    > Again, the concept of allowing arbitrary text is attractive,
    > but it will cause serious problems for parsing and validation further
    > down the line. I strongly recommend against it unless the config file is
    > going to be extremely simple (in which case XML is probably the wrong
    > choice anyway).


    Since you can always drop in <!-- comments --> wherever needed, using
    text content for commenting isn't really all that much more convenient,
    and as Peter says it *is* more fragile. I second his recommendation:
    using XML semantics the way they're intended to be used ("say what you
    mean") makes for a much better design.
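
    A small sketch of the difference, assuming Python's standard
    xml.etree.ElementTree parser with its default settings (the element and
    attribute names are made up): a real comment never reaches the application,
    while stray text arrives as character data that the application has to
    recognise and discard on its own.

```python
import xml.etree.ElementTree as ET

# Hypothetical one-line configs: the same remark as a comment and as stray text.
WITH_COMMENT = "<config><!-- listen on all interfaces --><input port='514'/></config>"
WITH_TEXT = "<config>listen on all interfaces<input port='514'/></config>"


def visible_text(xml_text):
    """Return every non-blank piece of character data the parser hands over."""
    root = ET.fromstring(xml_text)
    return [t.strip() for t in root.itertext() if t.strip()]


comment_text = visible_text(WITH_COMMENT)  # comments are dropped by default
stray_text = visible_text(WITH_TEXT)       # stray text is delivered as data
print(comment_text)
print(stray_text)
```

    If an element later gains meaningful text content, the second style can no
    longer tell a remark from a value.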

    --
    Joe Kesselman,
    http://www.love-song-productions.com/people/keshlam/index.html

    {} ASCII Ribbon Campaign | "may'ron DaroQbe'chugh vaj bIrIQbej" --
    /\ Stamp out HTML mail! | "Put down the squeezebox & nobody gets hurt."
     
    Joe Kesselman, Jun 24, 2010
    #8
  9. "Peter Flynn" <> wrote in message
    news:...
    > Rainer Gerhards wrote:
    >> http://lists.adiscon.net/pipermail/rsyslog/2010-June/003764.html
    >>
    >> which suggests a format that I personally find highly readable, is valid
    >> XML and seems to be quite compact. Together with a SAX interface, it may
    >> even provide a solution to my initial question (even though the solution
    >> is different from the exact question).

    >
    > David suggests some good points, although ultimately it is always a
    > trade-off between conciseness and extensibility. Manuel suggested:
    >
    >> do not use parameter/module names as tag names. Use attribute or
    >> element values instead

    >
    > and in general I agree -- for a config file format -- because when you
    > come to extend or modify the software, you will find the hard-wired
    > tagnames become an obstacle to extensibility, and you then need to start
    > maintaining code to read obsolescent versions of config files.


    Actually, this is a problem I have in rsyslog all the time. The system is
    heavily based on a plug-in architecture. Each plug-in brings in its own
    entities, and the config file needs to tie all these entities together.

    So far, my idea is that each plugin, during load, registers XML entity names
    (or even a partial DTD) with the rsyslog core. Then the core can merge a DTD
    from these registrations. More importantly, I can register the
    module-specific entity names in a list of valid entity names.
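
    The merge step could look roughly like the following sketch. It reads
    "entity names" above as element names, and everything in it is an
    assumption made for illustration: the DtdRegistry class, the demo-input and
    demo-output elements, and their attributes are invented. Python's standard
    parser only checks the syntax of the merged DTD when it is embedded as an
    internal subset; actually validating a document against it would need a
    validating parser.

```python
import xml.etree.ElementTree as ET


class DtdRegistry:
    """Collects DTD fragments that (hypothetical) plugins register at load time."""

    def __init__(self, root_name):
        self.root_name = root_name
        self.fragments = {}  # element name -> DTD declarations for that element

    def register(self, element_name, declarations):
        if element_name in self.fragments:
            raise ValueError(f"element {element_name!r} registered twice")
        self.fragments[element_name] = declarations

    def merged_dtd(self):
        # The root may contain any registered element, in any order.
        if self.fragments:
            model = "(" + "|".join(sorted(self.fragments)) + ")*"
        else:
            model = "EMPTY"
        lines = [f"<!ELEMENT {self.root_name} {model}>"]
        lines += [self.fragments[name] for name in sorted(self.fragments)]
        return "\n".join(lines)


registry = DtdRegistry("config")
# Both plugin elements and their attributes are made up for this example.
registry.register(
    "demo-input",
    "<!ELEMENT demo-input EMPTY>\n<!ATTLIST demo-input file CDATA #REQUIRED>",
)
registry.register(
    "demo-output",
    "<!ELEMENT demo-output EMPTY>\n<!ATTLIST demo-output target CDATA #REQUIRED>",
)

dtd = registry.merged_dtd()
document = (
    f"<!DOCTYPE config [\n{dtd}\n]>\n"
    '<config><demo-input file="/var/log/app.log"/></config>'
)
# expat (behind ElementTree) parses the internal subset, so a syntax error in
# the merged DTD raises ParseError here; it does not validate the document.
root = ET.fromstring(document)
print(dtd)
```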

    My idea is that I can either read the DOM without validation and do the
    validation when building the actual config AST. I see some value in this
    approach as I need to do a number of semantic checks that go beyond the
    ability of DTDs or schemas (probably involving checking out some system
    features via API calls).
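
    A minimal sketch of this first variant, assuming Python's standard
    xml.etree.ElementTree (which never validates against a DTD). The registry,
    the element names listener and file-output, and the checks themselves are
    invented for illustration; a real module would register its own checker
    when it is loaded.

```python
import xml.etree.ElementTree as ET


def check_listener(elem):
    """Hypothetical module check: a port range is beyond what a DTD can express."""
    port = elem.get("port", "")
    if not port.isdecimal() or not 1 <= int(port) <= 65535:
        return [f"listener: invalid port {port!r}"]
    return []


def check_file_output(elem):
    """Hypothetical module check: the path attribute must be present."""
    return [] if elem.get("path") else ["file-output: missing path attribute"]


# element name -> checker, filled in as modules are loaded (assumption)
REGISTRY = {"listener": check_listener, "file-output": check_file_output}


def validate(xml_text):
    """Parse without DTD validation, then check each child against REGISTRY."""
    root = ET.fromstring(xml_text)
    errors = []
    for child in root:
        checker = REGISTRY.get(child.tag)
        if checker is None:
            errors.append(f"unknown element <{child.tag}>")
        else:
            errors.extend(checker(child))
    return errors


errors = validate(
    '<config><listener port="70000"/><file-output path="/tmp/x"/><bogus/></config>'
)
print(errors)
```

    The parser still guarantees well-formedness; everything else is decided by
    the registered checks.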

    Or I can parse the configs with a SAX-type of interface and my callback can
    use the core entity registrations while I go along.

    In both cases, I can identify the entity based on its name, and use the
    rsyslog core table of entity registrations to pass the entity down to the
    module in question. While doing so, I can also process some generic
    attributes that are based on the module type (we have several types of
    modules in rsyslog, each type being something like a superclass, e.g. types
    for input and output of messages). The rsyslog core will build an AST node
    based on the module type and the module entry point will add module-specific
    information it extracts from the attribute values (which is stored as an
    opaque block inside the generic AST node).
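
    The SAX variant with the dispatch described above might be sketched as
    follows, using Python's standard xml.sax. All names are assumptions made
    for the example: AstNode, the tcp-input and file-output elements, and the
    generic attributes per module type are invented, and "entity" is read as
    "element". The core splits off the attributes it knows for the module type
    and hands the rest to the module's entry point, whose result it stores
    without looking inside.

```python
import xml.sax
from dataclasses import dataclass


@dataclass
class AstNode:
    element: str
    module_type: str            # e.g. "input" or "output"
    generic: dict               # attributes the core handles per module type
    module_data: object = None  # opaque to the core, produced by the module


# Hypothetical: generic attributes the core understands for each module type.
GENERIC_ATTRS = {"input": {"ruleset"}, "output": {"queue"}}


def tcp_input(attrs):
    """Hypothetical module entry point for <tcp-input>."""
    return {"port": int(attrs["port"])}


def file_output(attrs):
    """Hypothetical module entry point for <file-output>."""
    return {"path": attrs["path"]}


# element name -> (module type, module entry point); filled at module load.
REGISTRATIONS = {
    "tcp-input": ("input", tcp_input),
    "file-output": ("output", file_output),
}


class AstBuilder(xml.sax.ContentHandler):
    """Builds one AstNode per registered element as the parser streams along."""

    def __init__(self):
        super().__init__()
        self.nodes = []

    def startElement(self, name, attrs):
        if name not in REGISTRATIONS:
            return  # containers such as <config> are simply skipped here
        module_type, entry_point = REGISTRATIONS[name]
        remaining = dict(attrs.items())
        generic = {
            key: remaining.pop(key)
            for key in GENERIC_ATTRS[module_type]
            if key in remaining
        }
        self.nodes.append(AstNode(name, module_type, generic, entry_point(remaining)))


builder = AstBuilder()
xml.sax.parseString(
    b'<config><tcp-input port="514" ruleset="remote"/>'
    b'<file-output path="/var/log/all.log"/></config>',
    builder,
)
for node in builder.nodes:
    print(node)
```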

    So this *seems* to work for me without the problems you mention. HOWEVER,
    this is my first time ever doing such a thing with XML, and my idea is
    purely based on reading up on XML and library specs. I am not sure if it is
    a good idea from the POV of someone with practical experience ;) So I'd
    appreciate learning whether you think this could work - or not...

    > In the
    > long term, the flexibility of using type and value attributes will make
    > your life much easier, but I can understand the initial attraction that
    > David expresses of matching the tagnames to the settings you want to
    > configure. Have a look at the config files for a large system like
    > Apache Cocoon, where (IMHO) they have achieved a reasonable balance
    > between conciseness and flexibility.


    Will do!

    > David also says:
    >
    >> note that with this approach everything important is in a tag, as
    >> such you can allow arbitrary text to be in the file outside of tags
    >> and just ignore it. This allows such text to be used as comments.

    >
    > This is very dangerous. It makes the use of an XML editor for managing
    > the config files extremely difficult, and introduces a number of
    > unexpected side-effects, including the danger of pernicious mixed
    > content. Again, the concept of allowing arbitrary text is attractive,
    > but it will cause serious problems for parsing and validation further
    > down the line. I strongly recommend against it unless the config file is
    > going to be extremely simple (in which case XML is probably the wrong
    > choice anyway).


    Point taken and noted. If I go for that format, I'll NOT promote that option
    (but I will not expressly forbid users from handling it that way at their
    own risk, aka "I don't care if they use it and it breaks somewhere down the
    line").

    Thanks again,
    Rainer
     
    Rainer Gerhards, Jun 24, 2010
    #9
  10. David Lang (Guest)

    On Jun 23, 3:20 pm, Peter Flynn <> wrote:
    > David also says:
    >
    > > note that with this approach everything important is in a tag, as
    > > such you can allow arbitrary text to be in the file outside of tags
    > > and just ignore it. This allows such text to be used as comments.

    >
    > This is very dangerous. It makes the use of an XML editor for managing
    > the config files extremely difficult, and introduces a number of
    > unexpected side-effects, including the danger of pernicious mixed
    > content. Again, the concept of allowing arbitrary text is attractive,
    > but it will cause serious problems for parsing and validation further
    > down the line. I strongly recommend against it unless the config file is
    > going to be extremely simple (in which case XML is probably the wrong
    > choice anyway).


    how is allowing text that's not part of a tag to be treated as a
    comment (i.e. ignored by the application) dangerous? it seems to me
    that it's just a matter of having the application ignore anything
    that's not tags.

    you have to be aware of illegal XML characters, but don't you need to
    watch for those inside a comment tag anyway?


    In this case, the difficulty with using 'normal' config file formats is
    the need to express pretty arbitrary nesting of things, and most config
    formats are really only set up for one level of nesting.

    David Lang
     
    David Lang, Jun 24, 2010
    #10
  11. David Lang (Guest)

    One key thing to remember here: modules are not created by random
    people, they are part of rsyslog itself.

    this should mean that there is not as much worry about what some
    module author is going to try and do.

    each module should be adding relatively little to the available
    configuration:

    1. it adds things to configure the module (which could be tags or
    elements depending on whether they can be specified more than once)

    2. it adds actions that can be used in many places. each action will
    have its configuration (which I think will always be attributes; I
    can't think of any case where an action would need to specify anything
    more than once)

    the problem space in rsyslog is the following:

    message processing

      define inputs (includes defining one or more parsers that convert
      arriving data to a standard data structure; the definition of the
      parser itself is not part of the config file)

      define filters
        filters can involve
          - nesting
          - if-then-else
          - discard this message (don't waste time having anything else
            process it)
          - sets of filters/actions that can be specified separately, so
            that you can have a complex set and then have other things say
            "if <simplecondition> do <complex set>" without needing to
            specify <complex set> more than once

      define outputs (or sets of outputs)

    it's the nesting and grouping of things that is complex and makes most
    config languages not really suitable for the task
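
    To make the nesting and reuse concrete, here is a small sketch in Python
    (standard xml.etree.ElementTree). The vocabulary is purely hypothetical:
    ruleset, if, then, else, call, action and discard are element names
    invented for this example, not a proposed rsyslog format. A named set is
    defined once and expanded wherever a call refers to it.

```python
import xml.etree.ElementTree as ET

# Hypothetical config: one reusable set of actions and one nested condition.
CONFIG = """<config>
  <ruleset name="archive">
    <action type="file" target="/var/log/archive.log"/>
    <action type="forward" target="loghost"/>
  </ruleset>
  <if test="severity &lt;= 3">
    <then><call ruleset="archive"/></then>
    <else><discard/></else>
  </if>
</config>"""


def describe(elem, rulesets, depth=0):
    """Return indented lines for elem, expanding <call> references in place."""
    pad = "  " * depth
    if elem.tag == "call":
        lines = []
        for child in rulesets[elem.get("ruleset")]:
            lines += describe(child, rulesets, depth)
        return lines
    if elem.tag == "action":
        return [f"{pad}{elem.get('type')} -> {elem.get('target')}"]
    if elem.tag == "discard":
        return [f"{pad}discard"]
    label = f"if {elem.get('test')}" if elem.tag == "if" else elem.tag
    lines = [pad + label]
    for child in elem:
        lines += describe(child, rulesets, depth + 1)
    return lines


def describe_config(xml_text):
    root = ET.fromstring(xml_text)
    # Named sets are collected first so that any <call> can refer to them.
    rulesets = {r.get("name"): list(r) for r in root.findall("ruleset")}
    lines = []
    for elem in root:
        if elem.tag != "ruleset":
            lines += describe(elem, rulesets)
    return lines


lines = describe_config(CONFIG)
print("\n".join(lines))
```

    The sketch does not guard against a ruleset that calls itself; a real
    implementation would need a cycle check.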

    David Lang
     
    David Lang, Jun 24, 2010
    #11
  12. David Lang wrote:
    > how is allowing text that's not part of a tag to be treated as a
    > comment (i.e. ignored by the application) dangerous?


    In the long term, it's fragile; it will cause confusion and/or breakage
    if you later want to put text inside elements rather than in attribute
    values. It's also more likely to cause users grief if they want to write
    tooling to manipulate those files.

    So I would *not* consider relying on ignoring text content to be a "best
    practice". That doesn't mean you can't get away with it, just that I
    think you're going to discover later that it wasn't the best choice.

    --
    Joe Kesselman,
    http://www.love-song-productions.com/people/keshlam/index.html

    {} ASCII Ribbon Campaign | "may'ron DaroQbe'chugh vaj bIrIQbej" --
    /\ Stamp out HTML mail! | "Put down the squeezebox & nobody gets hurt."
     
    Joe Kesselman, Jun 24, 2010
    #12
  13. Peter Flynn (Guest)

    Rainer Gerhards wrote:
    > "Peter Flynn" <> wrote in message
    > news:...
    >> Rainer Gerhards wrote:
    >>> http://lists.adiscon.net/pipermail/rsyslog/2010-June/003764.html
    >>>
    >>> which suggests a format that I personally find highly readable, is valid
    >>> XML and seems to be quite compact. Together with a SAX interface, it may
    >>> even provide a solution to my initial question (even though the solution
    >>> is different from the exact question).

    >>
    >> David suggests some good points, although ultimately it is always a
    >> trade-off between conciseness and extensibility. Manuel suggested:
    >>
    >>> do not use parameter/module names as tag names. Use attribute or
    >>> element values instead

    >>
    >> and in general I agree -- for a config file format -- because when you
    >> come to extend or modify the software, you will find the hard-wired
    >> tagnames become an obstacle to extensibility, and you then need to start
    >> maintaining code to read obsolescent versions of config files.

    >
    > Actually, this is a problem I have in rsyslog all the time. The system
    > is heavily based on a plug-in architecture. Each plug-in brings in its
    > own entities, and the config file needs to tie all these entities together.
    >
    > So far, my idea is that each plugin, during load, registers XML entity
    > names (or even a partial DTD) with the rsyslog core. Then the core can
    > merge a DTD from these registrations. More importantly, I can register
    > the module-specific entity names in a list of valid entity names.
    >
    > My idea is that I can either read the DOM without validation and do the
    > validation when building the actual config AST. I see some value in this
    > approach as I need to do a number of semantic checks that go beyond the
    > ability of DTDs or schemas (probably involving checking out some system
    > features via API calls).
    >
    > Or I can parse the configs with a SAX-type of interface and my callback
    > can use the core entity registrations while I go along.
    >
    > In both cases, I can identify the entity based on its name, and use the
    > rsyslog core table of entity registrations to pass the entity down to
    > the module in question. While doing so, I can also process some generic
    > attributes that are based on the module type (we have several types of
    > modules in rsyslog, each type being something like a superclass, e.g.
    > types for input and output of messages). The rsyslog core will build an
    > AST node based on the module type and the module entry point will add
    > module-specific information it extracts from the attribute values (which
    > is stored as an opaque block inside the generic AST node).
    >
    > So this *seems* to work for me without the problems you mention.


    That's because you haven't encountered them yet :)

    > HOWEVER, this is my first time ever at doing such a thing with XML, and
    > my idea is purely based on reading up XML and library specs. I am not
    > sure if it is a good idea from the POV of someone with practical
    > experience ;) So I'd appreciate to learn if you think this could work -
    > or not...


    What are you using to create/edit the config files? A "dumb" text-editor
    (eg Notepad)? A "smart" text-editor with XML (eg Emacs/psgml/nxml)? Or a
    multi-pane XML editor (eg oXygen, XML Spy, etc)? Or are you creating
    them programmatically from within your code? And how will the module
    authors create them?

    My point was that if you start to do unexpected things with XML, like
    allowing random text in places where it's unexpected even if permitted,
    people will eventually run up against limitations in their software
    which they may not appreciate or understand.

    I just noticed that there is a whole chapter on XML in config files in
    Benoît Marchal's book "Applied XML Solutions" (Sams, 2000, 0672320541),
    which is probably worth reading.

    ///Peter
     
    Peter Flynn, Jun 25, 2010
    #13
  14. Peter Flynn (Guest)

    David Lang wrote:
    [...]
    > how is allowing text that's not part of a tag to be treated as a
    > comment (i.e. ignored by the application) dangerous? it seems to me
    > that it's just a matter of having the application ignore anything
    > that's not tags.


    But XML is *all* tags. What I think you mean is you want to ignore all
    text nodes which have sibling element nodes. Is that correct?
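
    Spelled out against a DOM, that rule might look like the sketch below,
    which uses Python's standard xml.dom.minidom; the function name and the
    sample document are invented for illustration. It also shows the ambiguity:
    text that is the only content of an element is not caught by the rule, so
    it stays significant.

```python
from xml.dom import minidom


def stray_text_nodes(xml_text):
    """Return non-blank text nodes that have at least one element sibling."""
    doc = minidom.parseString(xml_text)
    found = []

    def walk(node):
        children = node.childNodes
        has_element = any(c.nodeType == c.ELEMENT_NODE for c in children)
        for child in children:
            if child.nodeType == child.TEXT_NODE:
                if has_element and child.data.strip():
                    found.append(child.data.strip())
            elif child.nodeType == child.ELEMENT_NODE:
                walk(child)

    walk(doc.documentElement)
    return found


# Hypothetical sample: one remark next to an element, one text-only element.
SAMPLE = (
    "<config>forward everything"
    "<output target='loghost'>not a comment</output>"
    "</config>"
)
ignored = stray_text_nodes(SAMPLE)
print(ignored)
```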

    It's not so much a question of having your application "ignore" them:
    it's specifying accurately which bits of the parse tree to omit; and
    earlier, specifying to the editing application how to signal to the user
    that text in certain places is significant but in others not.

    You should understand that the markup community has been down this road
    a thousand times before, from the late 1980s onwards. I don't know of
    any application of XML (or SGML, for that matter) which has ever adopted
    this as a matter of practice -- if it has been done, it certainly has
    not survived AFAIK. That's not to say you can't; but you would need to
    examine what you are proposing *very* carefully before going down that path.

    If you *do* manage to make it work, please consider submitting a paper
    describing it to the Balisage conference, which is where markup people
    love to hear about these things (www.balisage.net).

    > you have to be aware of illegal XML characters, but don't you need to
    > watch for those inside a comment tag anyway?


    You shouldn't need to: if you are using the proper software (an XML
    editor), it won't let you generate such characters in the first place.
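
    For files that are edited by hand anyway, the parser still enforces the
    rules inside comments. The following sketch, assuming Python's standard
    xml.etree.ElementTree (which uses expat), checks three hypothetical
    one-line documents: a plain comment, a comment containing a control
    character, and a comment containing a double hyphen, which XML 1.0 does
    not allow inside comments.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the parser accepts the document, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


plain = is_well_formed("<config><!-- plain comment --></config>")
# U+0001 is not a legal XML 1.0 character, not even inside a comment.
control_char = is_well_formed("<config><!-- bad \x01 char --></config>")
# "--" may not appear inside a comment.
double_hyphen = is_well_formed("<config><!-- a -- b --></config>")
print(plain, control_char, double_hyphen)
```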

    I can't emphasize this strongly enough: USE AN XML EDITOR. I know it's
    very tempting, especially for the expert programmer, to do it all in
    Notepad or whatever, but in the end it will result in tears and
    recriminations. You wouldn't write your C or Java in Notepad (at least,
    I hope not), so you shouldn't expect to be able to do so with XML: the
    syntax is at least as arcane as a programming language, and IMHO a
    syntax-directed editor is essential.

    > In this case, the difficulty with using 'normal' config file formats is
    > the need to express pretty arbitrary nesting of things and most config
    > formats are really only setup for one level of nesting


    That's an argument for getting the document type design right, not an
    argument for allowing arbitrary character data between element nodes in
    element content.

    I don't think anyone has suggested using what you call "normal" config
    file formats (by which I think you mean two-level representations of
    java.properties or X resources files) -- my earlier example specifically
    avoided doing that, and Benoît Marchal's chapter I just referred to
    explicitly makes the same point. XML is *designed* to handle arbitrarily
    deep nesting -- have a look at any standard application like DocBook or TEI.

    ///Peter
     
    Peter Flynn, Jun 25, 2010
    #14
  15. "David Lang" <> wrote in message
    news:...
    > One key thing to remember here. modules are not created by random
    > people, they are part of rsyslog itself.
    >
    > this should mean that there is not as much worry about what some
    > module author is going to try and do.


    Ah, that's not really right. While most of the modules originated from the
    project, there are some (omoracle for example) that are just distributed for
    convenience. There most probably also exist modules the rsyslog team has
    never heard about. One reason to introduce a plugin architecture was to
    enable third parties to add functionality.

    Rainer
    PS: I know the comment is a bit off-topic here, but I thought this is
    important for the overall picture.
     
    Rainer Gerhards, Jun 25, 2010
    #15
  16. "Peter Flynn" <> wrote in message
    news:...
    > Rainer Gerhards wrote:
    >> "Peter Flynn" <> wrote in message
    >> news:...
    >>> Rainer Gerhards wrote:
    >>>> http://lists.adiscon.net/pipermail/rsyslog/2010-June/003764.html
    >>>>
    >>>> which suggest a format that I personally find highly readable, is valid
    >>>> XML and seems to be quite compact. Together with a SAX interface, it
    >>>> may
    >>>> even provide a solution to my initial question (even though the
    >>>> solution
    >>>> is different from the exact question).
    >>>
    >>> David suggests some good points, although ultimately it is always a
    >>> trade-off between conciseness and extensibility. Manuel suggested:
    >>>
    >>>> do not use parameter/module names as tag names. Use attribute or
    >>>> element values instead
    >>>
    >>> and in general I agree -- for a config file format -- because when you
    >>> come to extend or modify the software, you will find the hard-wired
    >>> tagnames become an obstacle to extensibility, and you then need to start
    >>> maintaining code to read obsolescent versions of config files.

    >>
    >> Actually, this is a problem I have in rsyslog all the time. The system
    >> is heavily based on a plug-in architecture. Each plug-in brings in its
    >> own entities, and the config file needs to tie all these entities
    >> together.
    >>
    >> So far, my idea is that each plugin, during load, registers XML entity
    >> names (or even a partial DTD) with the rsyslog core. Then the core can
    >> merge a DTD from these registrations. More importantly, I can register
    >> the module-specific entity names in a list of valid entity names.
    >>
    >> My idea is that I can either read the DOM without validation and do the
    >> validation when building the actual config AST. I see some value in this
    >> approach as I need to do a number of semantic checks that go beyond the
    >> ability of DTDs or schemas (probably involving checking out some system
    >> features via API calls).
    >>
    >> Or I can parse the configs with a SAX-type of interface and my callback
    >> can use the core entity registrations while I go along.
    >>
    >> In both cases, I can identify the entity based on its name, and use the
    >> rsyslog core table of entity registrations to pass the entity down to
    >> the module in question. While doing so, I can also process some generic
    >> attributes that are based on the module type (we have several types of
    >> modules in rsyslog, each type being something like a superclass, e.g.
    >> types for input and output of messages). The rsyslog core will build an
    >> AST node based on the module type and the module entry point will add
    >> module-specific information it extracts from the attribute values (which
    >> is stored as an opaque block inside the generic AST node).
    >>
    >> So this *seems* to work for me without the problems you mention.

    >
    > That's because you haven't encountered them yet :)


    That's why I asked (with 0 implementations, you always have 0 problems ;))
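
    A minimal sketch of the registration-and-dispatch idea quoted above, in
    Python with the standard xml.sax module. The registry, the register()
    helper, the element names and their attributes are all hypothetical
    illustrations, not rsyslog's actual API:

```python
import xml.sax

# Hypothetical core registry: element name -> (module type, module entry point).
REGISTRY = {}


def register(element_name, module_type, handler):
    """Called by a (hypothetical) plugin at load time to claim an element name."""
    REGISTRY[element_name] = (module_type, handler)


class ConfigHandler(xml.sax.ContentHandler):
    """SAX callback that looks each element up in the core registry."""

    def __init__(self):
        super().__init__()
        self.nodes = []    # generic nodes built by the core
        self.unknown = []  # element names no module registered

    def startElement(self, name, attrs):
        if name == "config":  # assumed root element, handled by the core itself
            return
        entry = REGISTRY.get(name)
        if entry is None:
            self.unknown.append(name)
            return
        module_type, handler = entry
        # The core builds the generic node; the module supplies the opaque part.
        self.nodes.append({
            "type": module_type,
            "name": name,
            "opaque": handler(dict(attrs.items())),
        })


def parse_config(text):
    handler = ConfigHandler()
    xml.sax.parseString(text.encode("utf-8"), handler)
    return handler.nodes, handler.unknown


# Two hypothetical plugins registering themselves.
register("imudp", "input", lambda a: {"port": int(a["port"])})
register("omfile", "output", lambda a: {"path": a["file"]})
```

    Unregistered element names end up in a separate list, so the core can
    decide whether to reject the file or merely warn about them.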

    >> HOWEVER, this is my first time ever at doing such a thing with XML, and
    >> my idea is purely based on reading up XML and library specs. I am not
    >> sure if it is a good idea from the POV of someone with practical
    >> experience ;) So I'd appreciate to learn if you think this could work -
    >> or not...

    >
    > What are you using to create/edit the config files? A "dumb" text-editor
    > (eg Notepad)? A "smart" text-editor with XML (eg Emacs/psgml/nxml)? Or a
    > multi-pane XML editor (eg oXygen, XML Spy, etc)? Or are you creating
    > them programmatically from within your code? And how will the module
    > authors create them?


    We must assume that one common case is a sysadmin on a stripped-down system
    with just plain old vi at hand.

    > My point was that if you start to do unexpected things with XML, like
    > allowing random text in places where it's unexpected even if permitted,
    > people will eventually run up against limitations in their software
    > which they may not appreciate or understand.


    I already ruled that out...

    > I just noticed that there is a whole chapter on XML in config files in
    > Benoît Marchal's book "Applied XML Solutions" (Sams, 2000, 0672320541),
    > which is probably worth reading.


    That's a good pointer. However, digesting all the information from this
    thread, other discussions and adding a requirement I simply had forgotten
    [1], it turns out that XML is probably not a solution for rsyslog config
    files. That doesn't mean the discussion was useless. Quite the opposite is
    true: without all your good comments, I probably would not have been able
    to see that XML is not right for this specific job, and I might have
    invested a lot of time in unfruitful work :)

    If you are interested in more detail of the reasons, it requires a lot of
    explanation. For those interested, I provide it in [1] and the follow-up
    posts to it.

    Thanks again,
    Rainer

    [1] http://lists.adiscon.net/pipermail/rsyslog/2010-June/003830.html
     
    Rainer Gerhards, Jun 25, 2010
    #16
  17. Rainer Gerhards

    David Lang Guest

    On Jun 24, 5:00 pm, Peter Flynn <> wrote:
    > What are you using to create/edit the config files? A "dumb" text-editor
    > (eg Notepad)? A "smart" text-editor with XML (eg Emacs/psgml/nxml)? Or a
    > multi-pane XML editor (eg oXygen, XML Spy, etc)? Or are you creating
    > them programmatically from within your code? And how will the module
    > authors create them?


    the answer is 'all of the above' ;-)

    I expect that most of the time they are going to be created by a dumb
    text editor (vi), but it would be useful to have the config file
    definition done in such a way that you could take an off-the-shelf
    smart editor, point it at the DTD/schema and have it help the user get
    the config correct.

    > My point was that if you start to do unexpected things with XML, like
    > allowing random text in places where it's unexpected even if permitted,
    > people will eventually run up against limitations in their software
    > which they may not appreciate or understand.


    ok, my assumption was that with the definition of XML as a markup
    language, all XML editors would handle mixed text and tags. since the
    configs aren't expected to use anything but tags, the text portion
    could be used for comments.

    > I just noticed that there is a whole chapter on XML in config files in
    > Benoît Marchal's book "Applied XML Solutions" (Sams, 2000, 0672320541),
    > which is probably worth reading.


    I'll see if I can track down a copy
     
    David Lang, Jun 25, 2010
    #17
  18. Rainer Gerhards

    David Lang Guest

    On Jun 24, 5:20 pm, Peter Flynn <> wrote:
    > David Lang wrote:
    >
    > [...]
    >
    > > how is allowing text that's not part of a tag to be treated as a
    > > comment (i.e. ignored by the application) dangerous? it seems to me
    > > that it's just a matter of having the application ignore anything
    > > that's not tags.

    >
    > But XML is *all* tags.  What I think you mean is you want to ignore all
    > text nodes which have sibling element nodes. Is that correct?


    what I mean is the ability to do
    <tag>
    <tag param=value>
    <tag/>
    comment, this is why I did this
    </tag>
    <tag>

    > It's not so much a question of having your application "ignore" them:
    > it's specifying accurately which bits of the parse tree to omit; and
    > earlier, specifying to the editing application how to signal to the user
    > that text in certain places is significant but in others not.


    if all text is ignored (i.e. not processed by the application in
    defining its config) it's not a matter of ignoring text in some
    places but not in others.
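
    A minimal sketch of "the application ignores all text", in Python with
    the standard xml.etree.ElementTree module. The sample is a well-formed
    variant of the example above (attribute value quoted, tags balanced);
    load_settings() is a hypothetical name:

```python
import xml.etree.ElementTree as ET

# Well-formed variant of the example: stray text sits between the elements.
SAMPLE = """<tag>
<tag param="value"/>
comment, this is why I did this
<tag/>
</tag>"""


def load_settings(xml_text):
    """Return (name, attributes) for every element below the root.

    Only element names and attributes are read; the .text and .tail
    strings, where the free text ends up, are never consulted.
    """
    root = ET.fromstring(xml_text)
    return [(el.tag, dict(el.attrib)) for el in root.iter() if el is not root]
```

    This only shows the application side. It says nothing about whether a
    DTD or schema can describe such a file.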

    > You should understand that the markup community has been down this road
    > a thousand times before, from the late 1980s onwards. I don't know of
    > any application of XML (or SGML, for that matter) which has ever adopted
    > this as a matter of practice -- if it has been done, it certainly has
    > not survived AFAIK. That's not to say you can't; but you would need to
    > examine what you are proposing *very* carefully before going down that path.


    noted

    > If you *do* manage to make it work, please consider submitting a paper
    > describing it to the Balisage conference, which is where markup people
    > love to hear about these things (www.balisage.net).


    well, it 'works' in that I've been doing this for several years, but
    it seems such a trivial thing that I'm not sure how I would write it
    up.

    > > you have to be aware of illegal XML characters, but don't you need to
    > > watch for those inside a comment tag anyway?

    >
    > You shouldn't need to: if you are using the proper software (an XML
    > editor), it won't let you generate such characters in the first place.
    >
    > I can't emphasize this strongly enough: USE AN XML EDITOR. I know it's
    > very tempting, especially for the expert programmer, to do it all in
    > Notepad or whatever, but in the end it will result in tears and
    > recriminations. You wouldn't write your C or Java in Notepad (at least,
    > I hope not), so you shouldn't expect to be able to do so with XML: the
    > syntax is at least as arcane as a programming language, and IMHO a
    > syntax-directed editor is essential.


    for a system administration tool like syslog, this is not a
    requirement that we can impose. the system may not _have_ a XML aware
    editor on it.

    what rsyslog needs is a config file language that can be edited
    without any special editor, but we were thinking that by using XML we
    could benefit from the XML aware editors that exist by defining a DTD/
    schema that would effectively turn the generic XML editor into a
    rsyslog aware editor

    > > In this case, the difficulty with using 'normal' config file formats is
    > > the need to express pretty arbitrary nesting of things and most config
    > > formats are really only setup for one level of nesting

    >
    > That's an argument for getting the document type design right, not an
    > argument for allowing arbitrary character data between element nodes in
    > element content.
    >
    > I don't think anyone has suggested using what you call "normal" config
    > file formats (by which I think you mean two-level representations of
    > java.properties or X resources files) -- my earlier example specifically
    > avoided doing that, and Benoît Marchal's chapter I just referred to
    > explicitly makes the same point. XML is *designed* to handle arbitrarily
    > deep nesting -- have a look at any standard application like DocBook or TEI.


    the discussion on a config file format for rsyslog did not start with
    XML; it wandered around and drifted towards XML because it could
    handle the nesting well (overnight we identified the need to do
    if-then-else which I don't see a good way to do in XML). There have
    been suggestions that what we are trying to do is not a good fit for
    XML and therefore we should just use a 'normal' config language (for
    example the INI format)

    David Lang
     
    David Lang, Jun 25, 2010
    #18
  19. Rainer Gerhards

    Peter Flynn Guest

    David Lang wrote:
    > On Jun 24, 5:00 pm, Peter Flynn <> wrote:

    [...]
    > I expect that most of the time they are going to be created by a dumb
    > text editor (vi), but it would be useful to have the config file
    > definition done in such a way that you could take an off-the-shelf
    > smart editor, point it at the DTD/schema and have it help the user get
    > the config correct.


    You don't even need an editor to do that; any standalone validating
    parser can do it (onsgmls, rxp, ...)

    > ok, my assumption was that with the definition of XML as a markup
    > language, all XML editors would handle mixed text and tags.


    They will, for some value of "handle".

    > since the configs aren't expected to use anything but tags, the text
    > portion could be used for comments.


    I think there is a misunderstanding here. An element in XML usually
    consists of two tags, a start-tag and an end-tag. Between them goes
    either (a) just text, or (b) just other elements, or (c) a mixture
    (Mixed Content, like paragraphs in HTML are made of). There is a special
    case called an EMPTY Element which contains nothing at all, and is
    allowed to use the special syntax of terminating the start-tag with />
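
    The four cases can be told apart roughly as follows; a sketch in Python
    with the standard xml.etree.ElementTree module. classify() is a
    hypothetical helper, and it has to guess that whitespace-only text is
    insignificant, since without a DTD the parser cannot know:

```python
import xml.etree.ElementTree as ET


def classify(el):
    """Classify an element's content as 'empty', 'text', 'elements' or 'mixed'."""
    has_children = len(el) > 0
    # Text directly inside el is spread over el.text and each child's .tail.
    pieces = [el.text] + [child.tail for child in el]
    has_text = any(piece and piece.strip() for piece in pieces)
    if has_children and has_text:
        return "mixed"
    if has_children:
        return "elements"
    if has_text:
        return "text"
    return "empty"
```

    Note that this looks at one instance only: an element that happens to be
    empty in a given file is not necessarily declared EMPTY in the DTD.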

    "The text portion" you refer to is (c). In effect you want a content
    model which is the inverse of the normal XML application, where nothing
    is allowed to contain any text except the lowest points in the
    hierarchy. What you are looking for would allow text (a) everywhere *or*
    (b) everywhere *except* the lowest points in the hierarchy (if all your
    elements were declared EMPTY and you used attributes for the data).

    While every XML editor will accept this because it is perfectly within
    the rules, it is definitely an extreme edge case, so support for it will
    be minimal: see below for why.

    (Note that some XML applications -- including OOXML -- are at the
    opposite extreme and have no mixed content at all, not even inside
    paragraphs. Again this is perfectly within the rules, just harder to
    work with.)

    Don't forget that newlines are generally insignificant in XML: with
    other white-space they can be normalised to single spaces, under certain
    conditions. You CANNOT therefore rely on all processes treating
    beautifully-maintained "pretty-printed" XML as sacrosanct.

    So why the minimal support?

    If you allow PCDATA (text with no element markup) in between elements,
    the rules of XML will not allow you to specify sequence: the mixture has
    to allow elements and text *in any order*. So previously (using my
    example of the other day) you might have a config outer (root) element
    type, containing (in sequence) base, groups, and modules; if you allow
    interspersed text, the content model becomes "any mixture of text, base,
    groups, and modules, IN ANY ORDER" (XML Spec, 3.2.2, Mixed Content).
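
    To make the difference concrete, here is a sketch in Python. The two
    declarations are given as they might be written in a DTD, and two small
    hypothetical functions emulate what each one accepts. This is only an
    emulation over child element names, not a validating parser:

```python
import xml.etree.ElementTree as ET

# The two content models being contrasted, in DTD syntax.
ELEMENT_CONTENT = "<!ELEMENT config (base, groups, modules)>"
MIXED_CONTENT = "<!ELEMENT config (#PCDATA | base | groups | modules)*>"


def child_names(xml_text):
    """Names of the root's child elements, in document order."""
    return [child.tag for child in ET.fromstring(xml_text)]


def matches_sequence(names):
    """Emulates (base, groups, modules): each exactly once, in this order."""
    return names == ["base", "groups", "modules"]


def matches_mixed(names):
    """Emulates (#PCDATA | base | groups | modules)*: any order, any count."""
    return all(name in {"base", "groups", "modules"} for name in names)


ORDERED = "<config><base/><groups/><modules/></config>"
SHUFFLED = "<config>a note<modules/><base/>another note<base/></config>"
```

    The shuffled document, with groups missing and base repeated, fails the
    sequence model but passes the mixed one. That is the loss of structure
    described above.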

    This makes it hard-to-impossible to maintain any kind of structure to a
    document, which is why this kind of definition has been examined and
    rejected. As I said, there's nothing to stop you, but eventually you'll
    stop yourself.

    You *could* opt to use SGML instead :) where sequential mixed content
    was permitted; but one of the reasons we removed it from XML was the
    difficulty of creating, maintaining, and processing it.

    ///Peter
    --
    XML FAQ: http://xml.silmaril.ie/
     
    Peter Flynn, Jun 25, 2010
    #19
  20. Rainer Gerhards

    Peter Flynn Guest

    David Lang wrote:
    > On Jun 24, 5:20 pm, Peter Flynn <> wrote:
    >> David Lang wrote:
    >>
    >> [...]
    >>
    >>> how is allowing text that's not part of a tag to be treated as a
    >>> comment (i.e. ignored by the application) dangerous? it seems to me
    >>> that it's just a matter of having the application ignore anything
    >>> that's not tags.

    >> But XML is *all* tags. What I think you mean is you want to ignore all
    >> text nodes which have sibling element nodes. Is that correct?

    >
    > what I mean is the ability to do
    > <tag>
    > <tag param=value>
    > <tag/>
    > comment, this is why I did this
    > </tag>
    > <tag>


    I just replied to your earlier post explaining why this is so hard to
    manage. Basically XML cannot be used to constrain the order in which
    elements appear, if you permit arbitrary text to occur between them.
    It's prohibited by the Spec, and for good reason (unmaintainability).

    >> It's not so much a question of having your application "ignore" them:
    >> it's specifying accurately which bits of the parse tree to omit; and
    >> earlier, specifying to the editing application how to signal to the user
    >> that text in certain places is significant but in others not.

    >
    > if all text is ignored (i.e. not processed by the application in
    > defining its config) it's not a matter of ignoring text in some
    > places but not in others.


    That would work, but you cannot specify element order.

    > well, it 'works' in that I've been doing this for several years, but
    > it seems such a trivial thing that I'm not sure how I would write it
    > up.


    It only "works" in the sense that you have to create the documents by
    imposing a human-mediated constraint of order on the appearance of
    elements. If you define the DTD or Schema to allow intervening text, an
    XML editor will have to permit the elements to occur in any order.

    > for a system administration tool like syslog, this is not a
    > requirement that we can impose. the system may not _have_ a XML aware
    > editor on it.


    In that case I recommend not using XML at all.

    > what rsyslog needs is a config file language that can be edited
    > without any special editor, but we were thinking that by using XML we
    > could benefit from the XML aware editors that exist by defining a DTD/
    > schema that would effectively turn the generic XML editor into a
    > rsyslog aware editor


    That is precisely what it will do, but NOT for the case where you allow
    text to occur arbitrarily between elements, IFF you need to preserve the
    order in which elements can occur.

    If order is not important (and it's arguable that in a config file, it
    might well not be significant), then what you propose will work, but you
    must be VERY careful not to make the application dependent on the
    occurrence of newlines (see my previous post) because in those
    circumstances, some editors opening your example:

    <tag>
    <tag param=value>
    <tag/>
    comment, this is why I did this
    </tag>
    <tag>

    will save it as

    <tag> <tag param=value> <tag/> comment, this is why I did this
    </tag> <tag>

    (that's all on one line: some newsreaders may break it up). This may not
    be what you want.
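
    One way to keep the application independent of newlines is to collapse
    runs of white-space before looking at any text. A sketch in Python with
    the standard xml.etree.ElementTree module, using a well-formed variant
    of the example; normalise() and shape() are hypothetical helpers:

```python
import xml.etree.ElementTree as ET


def normalise(text):
    """Collapse any run of white-space (including newlines) to single spaces."""
    return " ".join((text or "").split())


def shape(el):
    """Structure of an element with all white-space differences removed."""
    return (
        el.tag,
        sorted(el.attrib.items()),
        normalise(el.text),
        [(shape(child), normalise(child.tail)) for child in el],
    )


MULTI_LINE = (
    '<tag>\n<tag param="value"/>\n<tag/>\n'
    "comment, this is why I did this\n</tag>"
)
ONE_LINE = (
    '<tag> <tag param="value"/> <tag/> comment, this is why I did this </tag>'
)
```

    With this normalisation both layouts compare equal, so it does not
    matter if an editor rewraps the file on saving.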

    > the discussion on a config file format for rsyslog did not start with
    > XML; it wandered around and drifted towards XML because it could
    > handle the nesting well (overnight we identified the need to do
    > if-then-else which I don't see a good way to do in XML). There have
    > been suggestions that what we are trying to do is not a good fit for
    > XML and therefore we should just use a 'normal' config language (for
    > example the INI format)


    Yep, that might be easier to do. But you could consider the alternative
    of allowing rsyslog comments where you want them, but using XML comment
    syntax:

    <tag>
    <tag param=value>
    <tag/>
    <!-- comment, this is why I did this -->
    </tag>
    <tag>

    That will work just fine, because then you can use properly constrained
    content models; but the sysadmin with only vi available must remember
    that the XML comment syntax would be compulsory.
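
    A sketch of what this buys on the application side, in Python with the
    standard xml.etree.ElementTree module, whose default parser discards
    comments. load_strict() is a hypothetical name and the sample is a
    well-formed variant of the example:

```python
import xml.etree.ElementTree as ET

WITH_COMMENT = """<tag>
<tag param="value"/>
<tag/>
<!-- comment, this is why I did this -->
</tag>"""

BARE_TEXT = """<tag>
<tag param="value"/>
<tag/>
comment, this is why I did this
</tag>"""


def load_strict(xml_text):
    """Return (name, attributes) per element; reject any stray character data."""
    root = ET.fromstring(xml_text)  # the default parser drops XML comments
    for el in root.iter():
        for piece in [el.text] + [child.tail for child in el]:
            if piece and piece.strip():
                raise ValueError("unexpected text in <%s>: %r" % (el.tag, piece.strip()))
    return [(el.tag, dict(el.attrib)) for el in root.iter() if el is not root]
```

    The commented file loads normally, while the same remark written as bare
    text raises an error, so a forgotten comment delimiter is caught rather
    than silently ignored.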

    ///Peter
     
    Peter Flynn, Jun 25, 2010
    #20