Looking for freely available, huge DTD

Discussion in 'XML' started by google@michaelwein.de, May 30, 2007.

  1. Guest

    Hello,

    For testing and demonstration purposes I need a freely available DTD
    that is both non-trivial and rather big, i.e. several hundred KB and
    thousands of elements. I need such a DTD because we have a problem
    with one of our tools processing XML documents that are based on such
    a DTD. Unfortunately that DTD (> 500 KB, > 4000 elements) is protected
    by an NDA and we are not allowed to disclose it to the tool vendor.
    However the tool vendor insists on getting the DTD to analyze the
    problem. So I would like to build a testcase based on a similarly
    complex DTD but unfortunately haven't found a suitable one on the
    web so far. Any help would be greatly appreciated.

    Thanks in advance
    Michael
    , May 30, 2007
    #1

  2. Guest

    On 30 May, 09:40, wrote:
    > Hello,
    >
    > For testing and demonstration purposes I need a freely available DTD
    > that is both non-trivial and rather big, i.e. several hundred KB and
    > thousands of elements. I need such a DTD because we have a problem
    > with one of our tools processing XML documents that are based on such
    > a DTD. Unfortunately that DTD (> 500 KB, > 4000 elements) is protected
    > by an NDA and we are not allowed to disclose it to the tool vendor.
    > However the tool vendor insists on getting the DTD to analyze the
    > problem. So I would like to build a testcase based on a similarly
    > complex DTD but unfortunately haven't found a suitable one on the
    > web so far. Any help would be greatly appreciated.
    >
    > Thanks in advance
    > Michael


    Just a thought...

    Have you considered obfuscating your DTD in some way, such as changing
    the names of items, and changing the order etc.? I would have thought
    that after that all DTDs look much the same! At least then your tool
    vendor is trying to fix your specific problem rather than something
    related which may not end up being the actual problem you have.

    HTH,

    Pete.
    --
    =============================================
    Pete Cordell
    Tech-Know-Ware Ltd
    for XML Schema to C++ data binding visit
    http://www.codalogic.com/lmx/
    =============================================
    , May 30, 2007
    #2

  3. Picarder Guest

    On May 30, 11:57 am, wrote:

    > Have you considered obfuscating your DTD in some way, such as changing
    > the names of items, and changing the order etc.?


    Hello Pete,

    Yes, I have already thought about this. Unfortunately the DTD in
    question is quite complex and uses lots of entity references, so
    it won't be trivial to obfuscate it and still keep it valid.
    So before writing a script/program to obfuscate the DTD I thought it
    would be simpler to get a real-life example.

    > At least then your tool vendor is trying to fix your specific problem rather than something
    > related which may not end up being the actual problem you have.


    I talked with my tool vendor and they would accept a similar
    configuration as the problem is clearly related to the size and
    complexity of the DTD (the tool works fine for small and trivial
    examples).

    Michael
    Picarder, May 30, 2007
    #3
  4. wrote:
    > For testing and demonstration purposes I need a freely available DTD
    > that is both non-trivial and rather big, i.e. several hundred KB and
    > thousands of elements.


    For examples of serious DTDs, I'd suggest checking the W3C's website
    and/or the industry standardization efforts described at xml.org... but
    I don't know whether any of them are large enough for your needs.
    "Thousands of elements" suggests either a badly designed markup system
    or a document composed of a number of sub-languages. The largest DTDs I
    know of are for things like Docbook, or are the result of combining
    several standards (such as the xhtml-plus-svg-plus-whatever
    combinations, or business documents which incorporate markup for
    multiple kinds of data about the customer and the transaction), and I
    think those top out at a few hundred elements.

    I'd suggest trying to generate a synthetic testcase; it should be
    possible to write software that would generate a large DTD which has the
    same sorts of data structures yours does, and there are already tools
    which will generate nonsense documents that conform to a DTD. Of course
    the first thing you'll have to do is confirm that this testcase actually
    provokes the same problem; there's always the risk that the bug may be
    responding to something like specific choice of element names (hash
    collision or something similar).
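
    A minimal sketch of such a generator, in Python. Everything in it is
    an assumption made for illustration: the names el0000..., the
    parameter entities grp000..., the content-model shapes and the default
    counts are invented, not taken from the DTD under NDA. It only shows
    how a large DTD with entity-based content models could be produced
    reproducibly.

```python
import random


def generate_dtd(n_elements=2400, n_entities=870, seed=0):
    """Return the text of a synthetic DTD (hypothetical names throughout)."""
    rng = random.Random(seed)  # fixed seed so the testcase is reproducible
    names = [f"el{i:04d}" for i in range(n_elements)]
    lines = []

    # Parameter entities first, so they are declared before they are used.
    # Each one groups four distinct element names into a choice list.
    for i in range(n_entities):
        choice = " | ".join(rng.sample(names, 4))
        lines.append(f'<!ENTITY % grp{i:03d} "{choice}">')

    for i, name in enumerate(names):
        if i % 3 == 0:
            model = "(#PCDATA)"  # leaf element
        else:
            # Content model built from a randomly chosen parameter entity.
            model = f"(%grp{rng.randrange(n_entities):03d};)*"
        lines.append(f"<!ELEMENT {name} {model}>")
        lines.append(f"<!ATTLIST {name} id ID #IMPLIED>")

    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    with open("synthetic.dtd", "w", encoding="utf-8") as handle:
        handle.write(generate_dtd())
```

    The counts and the mix of content models would have to be tuned until
    the generated DTD actually provokes the same failure in the tool.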


    --
    () ASCII Ribbon Campaign | Joe Kesselman
    /\ Stamp out HTML e-mail! | System architexture and kinetic poetry
    Joe Kesselman, May 30, 2007
    #4
  5. Picarder Guest

    On 30 Mai, 14:18, Joe Kesselman <> wrote:

    > For examples of serious DTDs, I'd suggest checking the W3C's website
    > and/or the industry standardization efforts described at xml.org.


    I checked the "usual suspects" but didn't find anything appropriate,
    thus my post.

    > "Thousands of elements" suggests either a badly designed markup system
    > or a document composed of a number of sub-languages.


    Actually I would consider it very badly designed too. In former
    versions I could ask Stylus Studio and XMLSpy to generate a sample XML
    file for it but that doesn't work anymore with the newest release
    ("schema to complex"). And the company responsible for it only reacts
    to my complaints when I can prove that the DTD violates the W3C specs.

    But since this DTD is the official interface for accessing the system
    I have no choice but to use it.

    > The largest DTDs I know of are for things like Docbook


    Thanks for that hint, I just looked up docbook.org and found a DTD
    containing 362 Elements. That's still far from my baby here (I must
    apologize, I just rechecked and it's not 4000 but "only" 2400 Elements
    and 870 Entities) but that might be suitable to construct a reasonable
    example.
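
    Counts like these can be approximated with a short script. The Python
    sketch below is only a rough estimate under stated assumptions: it
    strips comments and counts declaration keywords with regular
    expressions, it does not follow external parameter entities or
    evaluate conditional sections, and the function name is made up for
    this example.

```python
import re


def count_declarations(dtd_text):
    """Count <!ELEMENT and <!ENTITY declarations, ignoring comments."""
    # Remove comments first so commented-out declarations are not counted.
    without_comments = re.sub(r"<!--.*?-->", "", dtd_text, flags=re.DOTALL)
    return {
        "elements": len(re.findall(r"<!ELEMENT\s", without_comments)),
        "entities": len(re.findall(r"<!ENTITY\s", without_comments)),
    }
```

    For a modular DTD split over several files, the text of every module
    would have to be concatenated before counting.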

    > I'd suggest trying to generate a synthetic testcase


    I believe it will be simpler to obfuscate/anonymize the given DTD.

    > there's always the risk that the bug may be responding to something like
    > specific choice of element names (hash collision or something similar).


    The bug very likely comes from the fact that the processing tool
    actually isn't XML-aware but instead uses some weird internal
    representation that isn't 100% compatible with XML's concepts. E.g. it
    can't handle comments in XML output.

    BTW, are there any good tools out there that can be used to check the
    soundness of a DTD itself? Something like a DTD quality checker?
    XMLSpy says it is okay but I doubt that (otherwise it should be able
    to generate a proper sample XML file).

    Kind regards
    Michael
    Picarder, May 30, 2007
    #5
  6. Picarder wrote:
    >>I'd suggest trying to generate a synthetic testcase

    > I believe it will be simpler to obfuscate/anonymize the given DTD.


    It may well be, though of course that too runs some risk of obscuring
    the bug.

    Many moons ago, I promised myself that I would write a tool for
    anonymizing XSLT testcases, rewriting the stylesheet and input document
    in parallel. Doing that _well_ is a nontrivial task, but it'd be a
    useful thing for the community to have available. In My Copious Spare
    Time...

    > BTW, are there any good tools out there that can be used to check the
    > soundness of a DTD itself?


    Interesting question. I haven't seen one, outside of standard XML
    parsers' DTD reading logic.

    > XMLSpy says it is okay but I doubt that (otherwise it should be able
    > to generate a proper sample XML file).


    The bug may be in the generator rather than the DTD, of course. Sample
    generators are generally not all that useful and hence may not get much
    development effort put into them. Or XMLSpy may have its own limits on
    internal data structure sizes which your monster is overloading...

    Interesting little problem you've got there; good luck with it...

    --
    Joe Kesselman / Beware the fury of a patient man. -- John Dryden
    Joseph Kesselman, May 30, 2007
    #6
  7. Ixa Guest

    > a tool for anonymizing XSLT testcases, rewriting the stylesheet and
    > input document in parallel. Doing that _well_ is a nontrivial task


    Nontrivial? Perl and a few regular expressions will do the trick. ;)

    --
    Ixa
    Ixa, May 30, 2007
    #7
  8. Picarder Guest

    On 30 Mai, 21:45, Ixa <> wrote:


    [a tool for anonymizing XSLT testcases]

    > Nontrivial? Perl and a few regular expressions will do the trick. ;)


    Feel free to present your solution here :)

    Kind regards
    Michael
    Picarder, May 30, 2007
    #8
  9. Peter Flynn Guest

    Picarder wrote:
    > On 30 Mai, 14:18, Joe Kesselman <> wrote:
    >
    >> For examples of serious DTDs, I'd suggest checking the W3C's website
    >> and/or the industry standardization efforts described at xml.org.

    >
    > I checked the "usual suspects" but didn't find anything appropriate,
    > thus my post.
    >
    >> "Thousands of elements" suggests either a badly designed markup system
    >> or a document composed of a number of sub-languages.

    >
    > Actually I would consider it very badly designed too.


    Without knowing what it's for (which the NDA won't let you say) this is
    pretty much the only conclusion. This kind of monster is usually the
    result of a grotesque misunderstanding of XML, for example by someone
    who thinks it's a database management system in which an element equals
    a field.

    > But since this DTD is the official interface for accessing the system
    > I have no other chance than use it.
    >
    >> The largest DTDs I know of are for things like Docbook


    TEI is also the same magnitude.

    > Thanks for that hint, I just looked up docbook.org and found a DTD
    > containing 362 Elements. That's still far from my baby here (I must
    > apologize, I just rechecked and it's not 4000 but "only" 2400 Elements
    > and 870 Entities) but that might be suitable to construct a reasonable
    > example.


    With some effort, it's probably possible to embed TEI in DocBook, or
    vice versa (TEI has an element-renaming mechanism which makes this
    easier). That might get you over 1,000 element types, but I don't know
    of any way to fake up a DTD of 2,400 element types.

    > The bug very likely comes from the fact that the processing
    > tool actually isn't XML-aware but instead uses some weird internal
    > representation that isn't 100% compatible with XML's concepts. E.g.
    > it can't handle comments in XML output.


    Have you tried validating an instance using the standard command-line
    tools like onsgmls?

    > BTW, are there any good tools out there that can be used to check the
    > soundness of a DTD itself? Something like a DTD quality checker?


    It's called a validating parser. There is one built into every piece of
    XML software that performs validation on DTD-valid documents.

    > XMLSpy says it is okay but I doubt that (otherwise it should be able
    > to generate a proper sample XML file).


    Run your test instance through onsgmls and rxp first before making any
    decisions. They aren't perfect (purists scoff) but they're close.

    ///Peter
    --
    XML FAQ: http://xml.silmaril.ie/
    Peter Flynn, May 31, 2007
    #9
  10. Ixa Guest

    > [a tool for anonymizing XSLT testcases]
    >> Nontrivial? Perl and a few regular expressions will do the trick. ;)

    > Feel free to present your solution here :)


    OK, here goes. :)

    * * *

    The method that I have used when anonymizing SGMLs, XMLs and DTDs (or
    any textual content for that matter) is roughly the following:

    * Create scrambling key

    I do this with a simple substitution cipher (alphabet soup), in the
    manner of ROT-13, but instead of shifting the alphabet I shuffle it
    randomly. This is not perfect, because the scrambling key can be
    recovered using statistical methods and educated guesses, but it is
    good enough for most cases.

    ---8<---8<---
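    use List::Util qw(shuffle);    # shuffle() is not a Perl builtin
    $alpha = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";
    # $regex holds the shuffled alphabet, i.e. the replacement list for tr///
    $regex = join('', shuffle( split //, $alpha) );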
    ---8<---8<---

    * Define keywords

    By keyword I mean the words that should not be scrambled so that the
    end result makes sense. For example DTD keywords would be "ELEMENT",
    "ATTLIST", "PCDATA" and so forth. For XSLT there would be keywords like
    "xsl:stylesheet" and "preceding-sibling::".

    ---8<---8<---
    @keywords = ("ELEMENT", "ATTLIST", "PCDATA", [...] );
    ---8<---8<---

    * Pick up file and mangle all alphabets

    Just read the file and do normal transformation on each line according
    to the scrambling key.

    ---8<---8<---
    eval "\$line =~ tr/$alpha/$regex/";
    ---8<---8<---

    * Revert all keywords

    Seek through all keywords and do reverse scrambling.

    ---8<---8<---
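    foreach $keyword (@keywords) {
        $scramble = $keyword;
        eval "\$scramble =~ tr/$alpha/$regex/";    # scramble the keyword the same way
        # /g reverts every occurrence on the line; \Q...\E escapes metacharacters
        $line =~ s/\Q$scramble\E/$keyword/g;
    }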
    ---8<---8<---

    * Loop for all files using the same scrambling key

    This makes sure that the scrambled DTD definitions and XML tags match
    and can be used together. Just pass all the files in one invocation:

    ---8<---8<---
    $ ./anonymizer.pl *.dtd *.xml
    ---8<---8<---
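
    The steps above can also be sketched in Python rather than Perl. This
    is a variation on the method, not a transcription of the script: it
    leaves keywords untouched while scanning instead of scrambling
    everything and reverting afterwards, and it matches whole words only.
    The keyword set is a small illustrative sample (a real list would
    also need entries such as "lang" or "utf"), and the function names
    are made up for this example.

```python
import random
import re
import string

# Illustrative sample only; a real run needs the full keyword list.
KEYWORDS = {
    "ELEMENT", "ATTLIST", "ENTITY", "DOCTYPE", "PCDATA", "CDATA",
    "ID", "IDREF", "NMTOKEN", "REQUIRED", "IMPLIED", "FIXED",
    "EMPTY", "ANY", "PUBLIC", "SYSTEM", "xml", "version", "encoding",
}


def make_key(seed=None):
    """Build a substitution key by shuffling the 52 ASCII letters."""
    letters = string.ascii_letters
    shuffled = list(letters)
    random.Random(seed).shuffle(shuffled)
    return str.maketrans(letters, "".join(shuffled))


def scramble(text, key, keywords=KEYWORDS):
    """Scramble every run of letters in text unless it is a keyword."""
    def replace(match):
        word = match.group(0)
        return word if word in keywords else word.translate(key)

    return re.sub(r"[A-Za-z]+", replace, text)


if __name__ == "__main__":
    key = make_key()  # one key for all files, so DTD and XML stay in sync
    print(scramble("<!ELEMENT concept ((%title;), (%conbody;)?)>", key))
    print(scramble("<concept><title>Lawnmower</title></concept>", key))
```

    Because the same key is applied to every file, an element name is
    scrambled to the same string in the DTD and in the instance document.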

    * * *

    The result looks something like this (part of DITA concept.mod and
    lawnmower concept sample):

    ---8<---8<---
    <!ELEMENT SCySRFz ((%zTzJR;), (%zTzJRMJzZ;)?,
    (%ZaCXzfRZS; | %MuZzXMSz;)?,
    (%FXCJCB;)?, (%SCyuCfk;)?, (%XRJMzRf-JTycZ;)?,
    (%SCySRFz-TyEC-zkFRZ;)* ) >
    <!ATTLIST SCySRFz
    Tf ID #REQUIRED
    SCyXRE CDATA #IMPLIED
    %ZRJRSz-MzzZ;
    %JCSMJTmMzTCy-MzzZ;
    %MXSa-MzzZ;
    CbzFbzSJMZZ
    CDATA #IMPLIED
    fCjMTyZ CDATA "&TySJbfRf-fCjMTyZ;" >
    ---8<---8<---
    <?xml version="1.0" encoding="utf-8"?>
    <!-- daTZ ETJR TZ FMXz CE zaR vwdH AFRy dCCJcTz FXCxRSz aCZzRf Cy
    qCbXSRECXBR.yRz. qRR zaR MSSCjFMykTyB JTSRyZR.zKz ETJR ECX
    MFFJTSMuJR JTSRyZRZ.-->
    <!-- (W) WCFkXTBaz wUY WCXFCXMzTCy 2001, 2005. HJJ tTBazZ tRZRXDRf.
    *-->
    <!DOCTYPE SCySRFz PUBLIC "-//AHqwq//vdv vwdH WCySRFz//gi"
    "../../fzf/SCySRFz.fzf">
    <SCySRFz Tf="JMPyjCPRXSCySRFz" xml:lang="en-us">
    <zTzJR>IMPyjCPRX</zTzJR>
    <SCyuCfk><F>daR JMPyjCPRX TZ M jMSaTyR bZRf zC Sbz BXMZZ Ty zaR kMXf.
    IMPyjCPRXZ SMy uR
    RJRSzXTS, BMZ-FCPRXRf, CX jMybMJ.</F></SCyuCfk>
    </SCySRFz>
    ---8<---8<---

    * * *

    One can of course argue that this is not the most efficient way of
    doing the anonymizing, but it has worked for me so far. The biggest
    drawbacks in this method are:

    * URI scrambling

    All filenames, folders and paths are scrambled in the process, so the
    script should rename them at the same time. However, it could be
    difficult to spot those filenames automatically and the URIs may also
    have folder names. Adding filenames as keywords and/or manually
    renaming the files and folders afterwards might be the way to go.

    * listing keywords

    Collecting the keywords from the specifications by hand is a fairly
    big task, but it needs to be done only once.

    * not true encryption

    As mentioned, the scrambling key can be reverse engineered. It is
    fairly easy to do statistical analysis to get the key. The effort gets
    bigger if the script is extended so that instead of simple
    substitution, one letter is replaced with a varying number of random
    letters. One step harder would be to have a "lossy key" that deletes
    some of the letters, but then the script has to make sure that the end
    result makes sense (for example <i>). There are also other ways to
    improve the key but they all require more focus on the implementation.
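
    To illustrate why a simple substitution key is weak, here is a rough
    Python sketch of the first step of a frequency attack. It assumes the
    scrambled text is mostly English prose; the frequency order string is
    a commonly quoted approximation, and the function name is made up for
    this example. The result is only a first guess that an attacker would
    refine by hand.

```python
import string
from collections import Counter

# Approximate order of English letters from most to least frequent.
ENGLISH_ORDER = "etaoinshrdlcumwfgypbvkjxqz"


def guess_key(ciphertext):
    """Pair the most frequent cipher letters with common English letters."""
    counts = Counter(ch for ch in ciphertext if ch in string.ascii_letters)
    ranked = [letter for letter, _ in counts.most_common()]
    # zip() stops at the shorter sequence, so rare letters stay unmapped.
    return dict(zip(ranked, ENGLISH_ORDER))
```

    Since the key maps upper and lower case separately, each of the 52
    symbols is counted on its own; on short inputs the guess will be
    unreliable.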

    Anyway, IMO this method is a compromise between functionality, the
    amount of time spent on the script and secrecy, and it really depends
    on the overall project and NDA whether this method is suitable and
    sufficient.

    --
    Ixa
    Ixa, May 31, 2007
    #10
  11. Ixa wrote:
    >> [a tool for anonymizing XSLT testcases]
    >>> Nontrivial? Perl and a few regular expressions will do the trick. ;)

    >> Feel free to present your solution here :)

    >
    > OK, here goes. :)


    Good solution for the cases it covers, but...

    Remember, my comment was that it was nontrivial to properly anonymize
    XSLT testcases. That means rewriting the stylesheet logic in synch with
    the changes to the input document.... which means being aware of the
    XPaths and making sure the greeked input document still matches them
    when (and only when) it should, which may require being more careful
    about how you manipulate the document's content.

    Trivial cases have trivial solutions. One that's robust enough to give
    out to customers who have written nontrivial stylesheets with some trust
    of getting back valid testcases is not that simple, alas.


    --
    () ASCII Ribbon Campaign | Joe Kesselman
    /\ Stamp out HTML e-mail! | System architexture and kinetic poetry
    Joe Kesselman, May 31, 2007
    #11
  12. Ixa Guest

    > rewriting the stylesheet logic in synch with the changes to the input document

    So, you are actually looking for a method for further scrambling the
    actual logic in XSLT templates and structure in XML in addition to
    messing up the element, attribute and variable names, right?

    > Trivial cases have trivial solutions.


    Absolutely. There is a lot of room for improvement in that method, the
    most important being (from an XSLT point of view) that the anonymizer
    should be structure-aware. Right now it just blindly messes up the
    lines in text files, and the result has to be checked afterwards.

    I guess the best approach could be to use XSLT to modify the XSLT and
    XML at the same time. Then there would be total control over which
    parts of the trees are changed, and it would be possible to do more
    than just alphabet scrambling.
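
    As a rough illustration of what "structure-aware" could mean, here is
    a Python sketch that uses the standard xml.etree.ElementTree module
    instead of XSLT. It renames element and attribute names through a
    shared mapping and leaves text and attribute values untouched. The
    function name and the n1, n2, ... aliases are invented for this
    example, namespaces are not handled, and rewriting the XPaths of a
    stylesheet in sync (the hard part) is not covered.

```python
import itertools
import xml.etree.ElementTree as ET


def anonymize_names(xml_text, mapping=None):
    """Rename elements and attributes consistently; return (xml, mapping)."""
    mapping = {} if mapping is None else mapping
    # Continue numbering after any aliases already present in the mapping.
    counter = itertools.count(len(mapping) + 1)

    def alias(name):
        if name not in mapping:
            mapping[name] = f"n{next(counter)}"
        return mapping[name]

    root = ET.fromstring(xml_text)
    for node in root.iter():  # document order, elements only
        node.tag = alias(node.tag)
        old_attributes = dict(node.attrib)
        node.attrib.clear()
        for attr_name, value in old_attributes.items():
            node.set(alias(attr_name), value)
    return ET.tostring(root, encoding="unicode"), mapping


if __name__ == "__main__":
    doc = '<concept id="c1"><title>Lawnmower</title></concept>'
    anonymized, names = anonymize_names(doc)
    print(anonymized)
    print(names)
```

    Passing the returned mapping into later calls keeps the aliases
    consistent across several documents; the same mapping could then
    drive the rewriting of a DTD or stylesheet.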

    --
    Ixa
    Ixa, May 31, 2007
    #12
  13. Ixa wrote:
    > I guess the best approach could be to use XSLT to modify the XSLT and
    > XML at the same time.


    That's a bit hard to do with XSLT 1.0, unless you use the redirect
    extensions... but, yes, I was pondering that approach. (As folks may
    remember, I've already written an article on using stylesheets to
    manipulate other stylesheets.)

    But the bookkeeping for this particular N-way anonymizer may be ugly
    enough to be better handled in a more traditional programming language.

    Hey, if I had a full solution worked out, I'd have published it
    already... <smile/>

    --
    () ASCII Ribbon Campaign | Joe Kesselman
    /\ Stamp out HTML e-mail! | System architexture and kinetic poetry
    Joe Kesselman, May 31, 2007
    #13
