Looking for freely available, huge DTD

google

Hello,

For testing and demonstration purposes I need a freely available DTD
that is both non-trivial and rather big, i.e. several hundred KB and
thousands of elements. I need such a DTD because we have a problem
with one of our tools processing XML documents that are based on such
a DTD. Unfortunately that DTD (> 500 KB, > 4000 elements) is protected
by an NDA and we are not allowed to disclose it to the tool vendor.
However the tool vendor insists on getting the DTD to analyze the
problem. So I would like to build a testcase based on a similarly
complex DTD, but unfortunately I haven't found a suitable one on the
web so far. Any help would be greatly appreciated.

Thanks in advance
Michael
 
usenet

Just a thought...

Have you considered obfuscating your DTD in some way, such as changing
the names of items, and changing the order etc.? I would have thought
that after that all DTDs look much the same! At least then your tool
vendor is trying to fix your specific problem rather than something
related which may not end up being the actual problem you have.

HTH,

Pete.
--
=============================================
Pete Cordell
Tech-Know-Ware Ltd
for XML Schema to C++ data binding visit
http://www.codalogic.com/lmx/
=============================================
 
Picarder

Have you considered obfuscating your DTD in some way, such as changing
the names of items, and changing the order etc.?

Hello Pete,

Yes, I have already thought about this. Unfortunately the DTD in
question is quite complex and uses lots of entity references, so
it won't be trivial to obfuscate it and still keep it valid.
So before writing a script/program to obfuscate the DTD I thought it
would be simpler to get a real-life example.

At least then your tool vendor is trying to fix your specific problem rather than something
related which may not end up being the actual problem you have.

I talked with my tool vendor and they would accept a similar
configuration, as the problem is clearly related to the size and
complexity of the DTD (the tool works fine for small and trivial
examples).

Michael
 
Joe Kesselman

For testing and demonstration purposes I need a freely available DTD
that is both non-trivial and rather big, i.e. several hundred KB and
thousands of elements.

For examples of serious DTDs, I'd suggest checking the W3C's website
and/or the industry standardization efforts described at xml.org... but
I don't know whether any of them are large enough for your needs.
"Thousands of elements" suggests either a badly designed markup system
or a document composed of a number of sub-languages. The largest DTDs I
know of are for things like Docbook, or are the result of combining
several standards (such as the xhtml-plus-svg-plus-whatever
combinations, or business documents which incorporate markup for
multiple kinds of data about the customer and the transaction), and I
think those top out at a few hundred elements.

I'd suggest trying to generate a synthetic testcase; it should be
possible to write software that would generate a large DTD which has the
same sorts of data structures yours does, and there are already tools
which will generate nonsense documents that conform to a DTD. Of course
the first thing you'll have to do is confirm that this testcase actually
provokes the same problem; there's always the risk that the bug may be
responding to something like specific choice of element names (hash
collision or something similar).
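
A minimal sketch of such a generator, in Perl since that is what the thread uses later on. Everything in it is an assumption made for illustration: the names (`e0`, `grp0`, ...), the single `id` attribute, and the two content-model shapes. A real testcase would have to imitate the structures of the problem DTD instead. The output is meant to be saved as an external DTD file, because parameter entity references inside declarations are not allowed in the internal subset.

```perl
#!/usr/bin/perl
# Sketch only: emits a synthetic DTD on stdout. All names are invented.
use strict;
use warnings;

sub make_dtd {
    my ($n_elem, $n_ent) = @_;
    die "need at least 3 element types\n" if $n_elem < 3;
    my @names = map { "e$_" } 0 .. $n_elem - 1;
    my $dtd   = '';

    # Parameter entities first (they must be declared before use).
    # Each one names a choice of three distinct element types.
    for my $i (0 .. $n_ent - 1) {
        my @group = map { $names[ ($i * 3 + $_) % $n_elem ] } 0 .. 2;
        $dtd .= sprintf qq{<!ENTITY %% grp%d "%s">\n}, $i, join(' | ', @group);
    }

    # Element types: even-numbered ones take their content model from
    # an entity, odd-numbered ones are plain text leaves.
    for my $i (0 .. $n_elem - 1) {
        my $model = ($n_ent > 0 && $i % 2 == 0)
            ? sprintf('(%%grp%d;)*', $i % $n_ent)
            : '(#PCDATA)';
        $dtd .= "<!ELEMENT $names[$i] $model>\n";
        $dtd .= "<!ATTLIST $names[$i] id ID #IMPLIED>\n";
    }
    return $dtd;
}

# Usage (hypothetical script name): perl gen_dtd.pl 2400 870 > synthetic.dtd
print make_dtd(@ARGV) if @ARGV == 2;
```

As noted above, the result still has to be checked against the failing tool before it is of any use as a testcase.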
 
Picarder

For examples of serious DTDs, I'd suggest checking the W3C's website
and/or the industry standardization efforts described at xml.org.

I checked the "usual suspects" but didn't find anything appropriate,
thus my post.
"Thousands of elements" suggests either a badly designed markup system
or a document composed of a number of sub-languages.

Actually I would consider it very badly designed too. In former
versions I could ask Stylus Studio and XMLSpy to generate a sample XML
file for it but that doesn't work anymore with the newest release
("schema too complex"). And the company responsible for it only reacts
to my complaints when I can prove that the DTD violates the W3C specs.

But since this DTD is the official interface for accessing the system
I have no other choice than to use it.

The largest DTDs I know of are for things like Docbook

Thanks for that hint, I just looked up docbook.org and found a DTD
containing 362 elements. That's still far from my baby here (I must
apologize, I just rechecked and it's not 4000 but "only" 2400 elements
and 870 entities) but that might be suitable to construct a reasonable
example.
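
For comparing candidate DTDs by size, counts like the ones above can be approximated with a few lines of Perl. This is only a rough sketch: the function name `count_decls` is made up here, it looks only at the text it is given (modules pulled in through external parameter entities are not followed), and it ignores conditional sections.

```perl
#!/usr/bin/perl
# Sketch only: count ELEMENT, ATTLIST and ENTITY declarations in DTD text.
use strict;
use warnings;

sub count_decls {
    my ($text) = @_;
    $text =~ s/<!--.*?-->//gs;    # skip declarations inside comments
    my %count = (ELEMENT => 0, ATTLIST => 0, ENTITY => 0);
    $count{$1}++ while $text =~ /<!(ELEMENT|ATTLIST|ENTITY)\s/g;
    return \%count;
}

if (@ARGV) {
    local $/;                     # slurp whole files
    my $c = count_decls(join '', <>);
    printf "%-8s %d\n", $_, $c->{$_} for sort keys %$c;
}
```

It would be run as `perl count_decls.pl some.dtd other.mod` (hypothetical file names).
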
I'd suggest trying to generate a synthetic testcase

I believe it will be simpler to obfuscate/anonymize the given DTD.

there's always the risk that the bug may be responding to something like
specific choice of element names (hash collision or something similar).

The bug is very likely to come from the fact that the processing tool
actually isn't XML-aware but instead uses some weird internal
representation that isn't 100% compatible with XML's concepts. E.g. it
can't handle comments in XML output.

BTW, are there any good tools out there that can be used to check the
soundness of a DTD itself? Something like a DTD quality checker?
XMLSpy says it is okay but I doubt that (otherwise it should be able
to generate a proper sample XML file).

Kind regards
Michael
 
Joseph Kesselman

Picarder said:
I believe it will be simpler to obfuscate/anonymize the given DTD.

It may well be, though of course that too runs some risk of obscuring
the bug.

Many moons ago, I promised myself that I would write a tool for
anonymizing XSLT testcases, rewriting the stylesheet and input document
in parallel. Doing that _well_ is a nontrivial task, but it'd be a
useful thing for the community to have available. In My Copious Spare
Time...
BTW, are there any good tools out there that can be used to check the
soundness of a DTD itself?

Interesting question. I haven't seen one, outside of standard XML
parsers' DTD reading logic.
XMLSpy says it is okay but I doubt that (otherwise it should be able
to generate a proper sample XML file).

The bug may be in the generator rather than the DTD, of course. Sample
generators are generally not all that useful and hence may not get much
development effort put into them. Or XMLSpy may have its own limits on
internal data structure sizes which your monster is overloading...

Interesting little problem you've got there; good luck with it...
 
Ixa

a tool for anonymizing XSLT testcases, rewriting the stylesheet and
input document in parallel. Doing that _well_ is a nontrivial task

Nontrivial? Perl and a few regular expressions will do the trick. ;)
 
Picarder

[a tool for anonymizing XSLT testcases]
Nontrivial? Perl and few regular expressions will do the trick. ;)

Feel free to present your solution here :)

Kind regards
Michael
 
Peter Flynn

Picarder said:
I checked the "usual suspects" but didn't find anything appropriate,
thus my post.


Actually I would consider it very badly designed too.

Without knowing what it's for (which the NDA won't let you say) this is
pretty much the only conclusion. This kind of monster is usually the
result of a grotesque misunderstanding of XML, for example by someone
who thinks it's a database management system in which an element equals
a field.
But since this DTD is the official interface for accessing the system
I have no other chance than use it.

TEI is also of the same magnitude.

Thanks for that hint, I just lookup up docbook.org and found a DTD
containing 362 Elements. That's still far from my baby here (I must
apologize, I just rechecked and it's not 4000 but "only" 2400 Elements
and 870 Entities) but that might be suitable to construct a reasonable
example.

With some effort, it's probably possible to embed TEI in DocBook, or
vice versa (TEI has an element-renaming mechanism which makes this
easier). That might get you over 1,000 element types, but I don't know
of any way to fake up a DTD of 2,400 element types.
The bug is very likely to come from the fact that the processing
tools actually isn't XML aware but instead uses some weird internal
representation that isn't 100% compatible with XML's concepts. E.g.
it can't handle comments in XML output.

Have you tried validating an instance using the standard command-line
tools like onsgmls?
BTW, are there any good tools out there that can be used to check the
soundness of a DTD itselft? Something like a DTD quality checker?

It's called a validating parser. There is one built into every piece of
XML software that performs validation on DTD-valid documents.
XMLSpy says it is okay but I doubt that (otherwise it should be able
to generate a proper sample XML file).

Run your test instance through onsgmls and rxp first before making any
decisions. They aren't perfect (purists scoff) but they're close.

///Peter
 
Ixa

[a tool for anonymizing XSLT testcases]
Feel free to present your solution here :)

OK, here goes. :)

* * *

The method that I have used when anonymizing SGMLs, XMLs and DTDs (or
any textual content for that matter) is roughly the following:

* Create scrambling key

I do this by using a simple substitution cipher (alphabet soup) in the
style of ROT13, but instead of shifting the alphabet I shuffle it
randomly. This is not perfect, because the scrambling key can be
recovered using statistical methods and educated guesses, but it is
good enough for most cases.

---8<---8<---
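use List::Util qw(shuffle);   # shuffle() is not a Perl builtin
$alpha = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";
# scrambled alphabet; used below as the replacement list for tr///
$regex = join('', shuffle( split //, $alpha) );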
---8<---8<---

* Define keywords

By keyword I mean the words that should not be scrambled so that the
end result makes sense. For example DTD keywords would be "ELEMENT",
"ATTLIST", "PCDATA" and so forth. For XSLT there would be keywords like
"xsl:stylesheet" and "preceding-sibling::".

---8<---8<---
@keywords = ("ELEMENT", "ATTLIST", "PCDATA", [...] );
---8<---8<---

* Pick up file and mangle all letters

Just read the file and apply the substitution to each line according
to the scrambling key.

---8<---8<---
eval "\$line =~ tr/$alpha/$regex/";
---8<---8<---

* Revert all keywords

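Go through all keywords and undo the scrambling for each of them.

---8<---8<---
foreach $keyword (@keywords) {
    # scramble the keyword the same way the line was scrambled ...
    $scramble = $keyword;
    eval "\$scramble =~ tr/$alpha/$regex/";
    # ... then put the original back. /g handles a keyword that occurs
    # more than once per line; \Q...\E keeps any regex metacharacters
    # in the scrambled keyword literal.
    $line =~ s/\Q$scramble\E/$keyword/g;
}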
---8<---8<---

* Loop for all files using the same scrambling key

This makes sure that the scrambled DTD definitions and XML tags match
and they can be used together. Just pass all the files in one
invocation, like:

---8<---8<---
$ ./anonymizer.pl *.dtd *.xml
---8<---8<---
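
Put together, the steps above might look roughly like the following. This is a sketch under assumptions, not the script from the post: the keyword list is partial and chosen here for DTDs only, the `.anon` output suffix is invented, and two details differ from the fragments above on purpose. Keywords are restored in a single pass, and only where the scrambled keyword stands as a whole word, to reduce false matches inside longer scrambled names.

```perl
#!/usr/bin/perl
# Sketch only: substitution-cipher anonymizer for DTD/XML text files.
use strict;
use warnings;
use List::Util qw(shuffle);

my $alpha    = join '', 'a' .. 'z', 'A' .. 'Z';
# Partial, illustrative keyword list; extend it for your own files.
my @keywords = qw(ELEMENT ATTLIST ENTITY NOTATION DOCTYPE PCDATA CDATA
                  NMTOKEN NMTOKENS IDREF IDREFS ID REQUIRED IMPLIED FIXED
                  PUBLIC SYSTEM EMPTY ANY NDATA xml version encoding);

# A random permutation of the alphabet serves as the key.
sub make_key { return join '', shuffle(split //, $alpha) }

# Apply the substitution. tr/// does not interpolate variables,
# hence the string eval (both lists contain letters only).
sub scramble {
    my ($text, $key) = @_;
    eval "\$text =~ tr/$alpha/$key/; 1" or die $@;
    return $text;
}

# Scramble everything, then restore the keywords in one pass.
sub anonymize {
    my ($text, $key) = @_;
    $text = scramble($text, $key);
    my %back;
    $back{ scramble($_, $key) } = $_ for @keywords;
    my $alt = join '|', map { quotemeta($_) }
              sort { length($b) <=> length($a) } keys %back;
    $text =~ s/(?<![A-Za-z])($alt)(?![A-Za-z])/$back{$1}/g;
    return $text;
}

# One key for the whole run, so DTD and instance names stay in step.
sub main {
    my @files = @_;
    my $key = make_key();
    for my $file (@files) {
        open my $in, '<', $file or die "$file: $!";
        local $/;                 # slurp the whole file
        my $text = <$in>;
        close $in;
        open my $out, '>', "$file.anon" or die "$file.anon: $!";
        print {$out} anonymize($text, $key);
        close $out;
    }
}

main(@ARGV) if @ARGV;
```

Entity and character references in the instance (`&amp;`, `&#x41;`) are scrambled along with everything else, so the output still needs a check with a parser.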

* * *

The result looks something like this (part of DITA concept.mod and
lawnmower concept sample):

---8<---8<---
<!ELEMENT SCySRFz ((%zTzJR;), (%zTzJRMJzZ;)?,
(%ZaCXzfRZS; | %MuZzXMSz;)?,
(%FXCJCB;)?, (%SCyuCfk;)?, (%XRJMzRf-JTycZ;)?,
(%SCySRFz-TyEC-zkFRZ;)* ) >
<!ATTLIST SCySRFz
Tf ID #REQUIRED
SCyXRE CDATA #IMPLIED
%ZRJRSz-MzzZ;
%JCSMJTmMzTCy-MzzZ;
%MXSa-MzzZ;
CbzFbzSJMZZ
CDATA #IMPLIED
fCjMTyZ CDATA "&TySJbfRf-fCjMTyZ;" >
---8<---8<---
<?xml version="1.0" encoding="utf-8"?>
<!-- daTZ ETJR TZ FMXz CE zaR vwdH AFRy dCCJcTz FXCxRSz aCZzRf Cy
qCbXSRECXBR.yRz. qRR zaR MSSCjFMykTyB JTSRyZR.zKz ETJR ECX
MFFJTSMuJR JTSRyZRZ.-->
<!-- (W) WCFkXTBaz wUY WCXFCXMzTCy 2001, 2005. HJJ tTBazZ tRZRXDRf.
*-->
<!DOCTYPE SCySRFz PUBLIC "-//AHqwq//vdv vwdH WCySRFz//gi"
"../../fzf/SCySRFz.fzf">
<SCySRFz Tf="JMPyjCPRXSCySRFz" xml:lang="en-us">
<zTzJR>IMPyjCPRX</zTzJR>
<SCyuCfk><F>daR JMPyjCPRX TZ M jMSaTyR bZRf zC Sbz BXMZZ Ty zaR kMXf.
IMPyjCPRXZ SMy uR
RJRSzXTS, BMZ-FCPRXRf, CX jMybMJ.</F></SCyuCfk>
</SCySRFz>
---8<---8<---

* * *

One can of course argue that this is not the most efficient way of
doing the anonymizing, but it has worked for me so far. The biggest
drawbacks in this method are:

* URI scrambling

All filenames, folders and paths are scrambled in the process, so the
script should rename them at the same time. However, it could be
difficult to spot those filenames automatically and the URIs may also
have folder names. Adding filenames as keywords and/or manually
renaming the files and folders afterwards might be the way to go.

* listing keywords

Manually grepping the specifications to build the keyword list is a
fairly big task, but it needs to be done only once.

* not true encryption

As mentioned, the scrambling key can be reverse engineered. It is
fairly easy to do statistical analysis to get the key. The effort gets
bigger if the script is extended so that instead of simple
substitution, one letter is replaced with a varying number of random
letters. One step harder would be to have a "lossy key" that deletes
some of the letters, but then the script has to make sure that the end
result makes sense (for example <i>). There are also other ways to
improve the key but they all require more focus on the implementation.

Anyway, IMO this method is a compromise between functionality, the
amount of time spent on the script, and secrecy, and it really depends
on the overall project and NDA if this method is suitable and sufficient.
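
The "varying number of letters" extension mentioned among the drawbacks could be sketched as below. The construction is an assumption of this sketch, not something from the post: every letter gets a code of one to three letters, and each code starts with a letter that no other code starts with. That keeps the encoding prefix-free, so two different names can never collapse into the same scrambled name. The function names `make_table` and `encode` are made up.

```perl
#!/usr/bin/perl
# Sketch only: variable-length letter substitution.
use strict;
use warnings;
use List::Util qw(shuffle);

my @alpha = ('a' .. 'z', 'A' .. 'Z');

# Map each letter to a unique leading letter plus 0..2 random padding
# letters. The unique leading letter keeps the code prefix-free.
sub make_table {
    my @lead = shuffle(@alpha);
    my %table;
    for my $i (0 .. $#alpha) {
        my $pad = join '', map { $alpha[ rand @alpha ] } 1 .. int(rand 3);
        $table{ $alpha[$i] } = $lead[$i] . $pad;
    }
    return \%table;
}

# Replace every letter by its code; everything else is left alone.
sub encode {
    my ($text, $table) = @_;
    $text =~ s/([A-Za-z])/$table->{$1}/g;
    return $text;
}
```

Because `encode` is deterministic for a given table, the keyword step works as before: encode each keyword and substitute the original back. It is still a substitution scheme, so it only raises the effort of recovering the key.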
 
Joe Kesselman

Ixa said:
[a tool for anonymizing XSLT testcases]
Nontrivial? Perl and few regular expressions will do the trick. ;)
Feel free to present your solution here :)

OK, here goes. :)

Good solution for the cases it covers, but...

Remember, my comment was that it was nontrivial to properly anonymize
XSLT testcases. That means rewriting the stylesheet logic in synch with
the changes to the input document.... which means being aware of the
XPaths and making sure the greeked input document still matches them
when (and only when) it should, which may require being more careful
about how you manipulate the document's content.

Trivial cases have trivial solutions. One that's robust enough to give
out to customers who have written nontrivial stylesheets with some trust
of getting back valid testcases is not that simple, alas.
 
Ixa

rewriting the stylesheet logic in synch with the changes to the input document

So, you are actually looking for a method for further scrambling the
actual logic in XSLT templates and structure in XML in addition to
messing up the element, attribute and variable names, or?
Trivial cases have trivial solutions.

Absolutely. There is a lot of room for improvement in that method, the
most important being (from the XSLT point of view) that the anonymizer
should be structure-aware. Right now it just blindly mangles the lines
in text files and requires a check of the result.

I guess the best approach could be to use XSLT to modify the XSLT and
XML at the same time. Then there would be total control over which
parts of the trees are changed, and it would be possible to do more
than just alphabet scrambling.
 
Joe Kesselman

Ixa said:
I guess the best approach could be to use XSLT to modify the XSLT and
XML at the same time.

That's a bit hard to do with XSLT 1.0, unless you use the redirect
extensions... but, yes, I was pondering that approach. (As folks may
remember, I've already written an article on using stylesheets to
manipulate other stylesheets.)

But the bookkeeping for this particular N-way anonymizer may be ugly
enough to be better handled in a more traditional programming language.

Hey, if I had a full solution worked out, I'd have published it
already... <smile/>
 
