XML Not good for Big Files (vs Flat Files)

Oliver Wong · Apr 5, 2006

Roedy Green said:
And because it was so simple look what a fucking mess email is in.
People who write email clients are not simpletons. They need a
protocol that works, not one you can understand in five minutes.

SMTP was a hack to do an email demo. It was not rethought once the
problems of scale and spam became apparent.

What might have happened if they got a "good" e-mail protocol is that
only one or two proprietary e-mail clients would have been written back
then, they would have been seen as "enterprise" applications, and most
people wouldn't use e-mail outside of very large corporations.

The value of a network increases on the square of the number of people
involved. On the Internet, it's more important for something to be available
*now*, than for it to be perfect.

- Oliver

Oliver Wong · Apr 5, 2006

Roedy Green said:
So what if instead you wrote your schema, then using automated tools
created an ASN.1 binary file much more compact that you can parse 100
times faster and can turn back into fluffy XML any time you want using
the ASN.1 schema. It really amounts to a more-clever-than-usual
compression scheme for XML, in that you can read it directly rather
than having to decompress it first.

Then look on fluffy XML as a debugging dump format. For computer to
computer you exchange ASN.1, and create and parse ASN.1. The fluffy
form never exists except conceptually.

Yes, this is exactly what I've been advocating, but did not manage to
state as clearly as you have just now. Thank you.

- Oliver

Roedy Green · Apr 6, 2006

What is your take on JSON?

It is a bit tighter than XML, but it has no schemas.

The thing that replaces/augments XML needs a compact, validated, fast
to parse format.

If I were designing XML there would be a central repository of
official field names that would have validation guarantees on them as
well.

e.g. ZIP4 -- with a URL of a service that would validate that a ZIP was
a legal number.

ZIP4vSTATE -- with a URL of a service/class you can download that
would validate ZIP against state. The ZIP here is guaranteed to belong
to a nearby STATE field.

PHONEUS -- a validated American phone number, one that actually exists
in a directory.

There would be names for the various levels of validation guaranteed.

The biggest problem in importing data from other sources is cleaning
up the crud, or making invalid assumptions about how clean it is.
Identifying the field names or extracting the data is only about 5% of
the problem.

Roedy Green · Apr 6, 2006

So, what's the ASN.1 equivalent of JAXB?

Since XML and ASN.1 are interconvertible, if you have something that
needs XML, you fluff and use it.

Roedy Green · Apr 6, 2006

The problem with a "straight-to-binary" approach is that you'd have to
use custom tools to process the data. With XML, you can use a generic XML
editor, or worst case, a simple text editor.

No you don't. You use an ASN schema and a binary parser. It is just
like XML only compact.

Roedy Green · Apr 6, 2006

The only other place corruption could occur is the name of the elements,
the names of the attribute, or some of the punctuation (e.g. '<', '>', '/').
Should such corruption occur, it's trivial for a human to fix them, and some
software tools are pretty good at guessing at the fixes as well.

I have not had that sort of corruption to deal with since the 80s.
Humans monkeying with data files is still with us.

Roedy Green · Apr 6, 2006

The value of a network increases on the square of the number of people
involved. On the Internet, it's more important for something to be available
*now*, than for it to be perfect.

The phone system has the same problem blocking its evolution. The
voltages we use are far higher than we would have chosen for
microelectronics. They are the same ones Bell used.

The heterogeneity of the cell phone system is what is spurring its
rapid evolution. We need to create something similar for email -- with
connected systems but each using its own technology.

Kent Paul Dolan · Apr 6, 2006

Oliver said:
I've never used ASN.1 in the sense

I have, and it isn't the most pleasant of
experiences. Even though the standard
is perfectly well defined, it is simply
beyond the comprehension of programmers
who are not entirely comfortable with math
and with hex/binary representation, a big
fraction of all programmers.

I had the distasteful experience of trying
to explain, for several hours, _that_ a
problem decoding ASN.1 existed in a
major code suite causing widespread
failures, _what_ that problem was, in
painful detail, _how_ that problem could
be repaired, ditto, to in the end still be
staring at blank faces that showed no
comprehension, and whose owners
decided not to fix the problem as one
that "couldn't be important", since they
couldn't even comprehend the problem.

The problem was at that point symptomless,
I'd found it by studying the code (which, in
this large project, wasn't code I was allowed
to alter) to make sure my code on the other
end would be sending what it was expecting
to receive.

Within a week, that decoding problem caused
errors which brought the part of
the project on which we were working
to a standstill, embarrassingly and publicly,
at which point the miscoded part suddenly
looked a lot more worthy of attention.

ASN.1 has another problem; it fails one of the
main goodness criteria for XML -- it cannot be
picked up, sans documentation, decades later
by persons unfamiliar with the data, and be
immediately identified for what it is, as XML can.

ASN.1 is instead an excellent approximation of
line noise to the casual viewer.

xanthian.

Kent Paul Dolan · Apr 6, 2006

Monique said:
Hence my "(that I know of)" fudge =)

Just to add a bit of understanding to
previous postings: ASN.1 (BER flavor) is
at the heart of communications in the
SNMP (Simple Network Management
Protocol) arena. (SNMP is a kludge, like TCP,
that was supposed to go away when the
"real standard" was finalized, but was so
popular that the "real standard" never had a
chance.) Hence the huge number of
"networking" packages found vulnerable
to the security problem in an ASN.1
library.

ASN.1 is popular there in part because SNMP
uses lots of "event occurrence" counters,
whose values, as transported during network
management communications, can be of
arbitrary precision when coded in ASN.1/BER.
That is nice for stuff counted at
packet flyby rates of gajillions per second for
weeks, and which might need to be checked
to be identical to the least significant bit to
assure no packets got dropped unnoticed.
BER supports
arbitrary precision integers very sweetly and
compactly (if nearly incomprehensibly to
ordinary mortals staring at the encoded
version).
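
As a rough illustration of the integer encoding described above, here is
a minimal Python sketch of a BER INTEGER: a tag octet (0x02), a length,
and the value in big-endian two's complement using as few octets as
possible. The function names are made up for this example, only
definite-length encodings are handled, and it is a sketch rather than a
conforming ASN.1 library.

```python
def ber_encode_integer(value):
    """Encode a Python int as a BER INTEGER: tag, length, two's-complement content."""
    # Bits needed for the magnitude; one extra bit is needed for the sign.
    magnitude_bits = value.bit_length() if value >= 0 else (~value).bit_length()
    content = value.to_bytes(magnitude_bits // 8 + 1, "big", signed=True)

    if len(content) < 128:
        # Short form: a single length octet.
        length = bytes([len(content)])
    else:
        # Long form: 0x80 | number of length octets, then the length itself.
        size = len(content).to_bytes((len(content).bit_length() + 7) // 8, "big")
        length = bytes([0x80 | len(size)]) + size

    return b"\x02" + length + content  # 0x02 is the universal INTEGER tag


def ber_decode_integer(data):
    """Decode one definite-length BER INTEGER, e.g. from ber_encode_integer."""
    if data[0] != 0x02:
        raise ValueError("not an INTEGER tag")
    first = data[1]
    if first < 0x80:
        length, offset = first, 2
    else:
        count = first & 0x7F
        length = int.from_bytes(data[2:2 + count], "big")
        offset = 2 + count
    content = data[offset:offset + length]
    if len(content) != length:
        raise ValueError("truncated INTEGER")
    return int.from_bytes(content, "big", signed=True)


# A counter that has outgrown 64 bits still round-trips exactly.
counter = 2**64 + 5
assert ber_decode_integer(ber_encode_integer(counter)) == counter
```

Small values cost three octets in total, and the encoding grows by one
octet per eight bits of value, which is where the compactness comes
from. The hex dump of the result is also a fair sample of the "line
noise" complaint.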

FWIW

xanthian.

James McGill · Apr 6, 2006

No you don't. You use an ASN schema and a binary parser. It is just
like XML only compact.

Nobody is going to use ASN just for fun. It's so obviously a product of
some 1980s multi-tiered management bureaucracy, it's not even funny.
Don't get me wrong -- I appreciate the strong typing and hard guarantees
that are possible within the framework. There are ASN constructs for
things that would be a major pain in any representation (like the stuff
dealing with Sets -- I understand the value in data binding
applications).

But it's not *fun*. At no level is it easy to work with. It's
something you use because your boss pays you to work with it, and it's
NOT something you use simply because you enjoy it.

Timbo · Apr 6, 2006

Steve said:
Hmmm, I, as a human, find the second form *much* easier to browse. I can pick
out the actual content *much* faster. Granted, I might prefer something like:

Of course! But you can use a stylesheet to format it as you wish!

Dag Sunde · Apr 6, 2006

Steve Wampler said:
I have. I stand by my statement. What about XSD *isn't* about syntax?
Granted, XSDs provide very fine-grained control over syntactic issues.

Value ranges, min/max, n..m occurrences, enums...

OK... it borders on syntax, but after validating a given XML file against
its schema I know not only that it is syntactically correct, but also
that, e.g., the date fields contain valid dates of a particular format,
and that the <age> element really does contain only integer
values between 0 and 130... because the schema said so, etc.
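
To make the kind of guarantee described above concrete, here is a small
Python sketch that checks the same two constraints by hand. It is not an
XSD validator: a real schema would declare these rules and a validating
parser would enforce them. The <person> and <born> element names and the
check_person function are assumptions made up for this example; only
<age> and its 0..130 range come from the post.

```python
import xml.etree.ElementTree as ET
from datetime import date


def check_person(xml_text):
    """Return a list of constraint violations found in a <person> document."""
    problems = []
    root = ET.fromstring(xml_text)  # raises ParseError if not well-formed

    # <age> must be an integer between 0 and 130 inclusive.
    age_text = (root.findtext("age") or "").strip()
    try:
        age = int(age_text)
    except ValueError:
        problems.append("age is not an integer: %r" % age_text)
    else:
        if not 0 <= age <= 130:
            problems.append("age out of range 0..130: %d" % age)

    # <born> must be a real calendar date in YYYY-MM-DD form.
    born_text = (root.findtext("born") or "").strip()
    try:
        date.fromisoformat(born_text)
    except ValueError:
        problems.append("born is not a valid date: %r" % born_text)

    return problems


good = "<person><age>42</age><born>2006-04-06</born></person>"
bad = "<person><age>131</age><born>2006-02-30</born></person>"
assert check_person(good) == []
assert len(check_person(bad)) == 2
```

Both documents are equally well-formed XML. Only the value constraints
tell them apart, which is the part a schema adds beyond plain syntax.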

Timbo · Apr 6, 2006

Roedy said:
For a FLAT file there is no need to use tags. That is only when you
have a structured file.

Yes, sure. For tables etc, XML is of little value. I absolutely
agree, and I would use something like CSV for that.

Kent Paul Dolan · Apr 6, 2006

Sorry, all of that about ASN.1 should have been
qualified with "in BER" (encoding).

xanthian.

Chris Uppal · Apr 6, 2006

Oliver said:
Give me a typical XML file though, and I could probably come up with
an interpretation that is near the original, depending on how the
elements and attributes are named.

Difficult to see how this is an advantage for production purposes.

If the file contains a reference to a
DTD or XSD, then I could navigate over to that URL and gain even more
information.

Now that is a real advantage. Note that the XML is not "self-describing", but
it's certainly a good attribute of the format that it can include a link to a
description.

-- chris

Chris Uppal · Apr 6, 2006

Timbo said:
??? Which was exactly what I said in the sentence after the one
you quoted!

Then you shouldn't have shouted so loud -- my ears were still ringing and I
missed the next few words you said ;-)

In hindsight, MEANING wasn't the correct word...
and I'm not sure of what IS the correct word...

I think "formatting" is probably the right word. There's no meaning in the
tags -- it might /look/ as if there's meaning, and well-chosen tags certainly
help if you are ever in the unfortunate position of having to read or edit XML
by hand, but there's nothing real there.

Perhaps I'd accept "mnemonics"...

-- chris

Chris Uppal · Apr 6, 2006

Oliver said:
However, another nice thing about XML over the other two formats is
that there is a standardized escaping mechanism. Artists are... well...
artistic... and they sometimes do crazy things.

All the file formats I can think of have well-defined escape mechanisms (in
CSV, unfortunately, you have a choice of about 10 and it's difficult to be sure
that all parties are agreed on which is in use). XML has one too. That's
hardly an advantage for XML (especially when its mechanism is so crappy).
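
A short Python sketch of the point about CSV: the same record written
under two common escaping conventions (doubled quotes versus backslash
escapes), next to XML's entity escaping. The helper names and the sample
title are made up for illustration. Only two of the many CSV conventions
in the wild are shown.

```python
import csv
import io
from xml.sax.saxutils import escape


def csv_line(fields, **dialect_options):
    """Write one CSV record with the given dialect options; return it as text."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n", **dialect_options).writerow(fields)
    return buf.getvalue()


def csv_fields(line, **dialect_options):
    """Read one CSV record back using the given dialect options."""
    return next(csv.reader(io.StringIO(line), **dialect_options))


title = 'Rock & Roll, "Live" <remix>'  # hypothetical awkward value

# Convention 1: embedded quotes are doubled (the csv module's default).
doubled = csv_line([title, "1"])

# Convention 2: embedded quotes are escaped with a backslash.
backslash = {"doublequote": False, "escapechar": "\\"}
backslashed = csv_line([title, "1"], **backslash)

# Each convention round-trips as long as reader and writer agree on it.
assert csv_fields(doubled) == [title, "1"]
assert csv_fields(backslashed, **backslash) == [title, "1"]

# If the reader assumes the wrong convention, the data comes back mangled.
assert csv_fields(backslashed) != [title, "1"]

# XML has a single convention: entity references for markup characters.
xml_text = escape(title)
```

Nothing in a CSV file says which convention was used, so the parties
have to agree out of band. XML at least fixes one mechanism for
everybody, whatever one thinks of it.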

What the world needed, but didn't get, was a well-designed, standardised[*]
escape mechanism which could be used in almost any file format....

([*] if only by convention)

-- chris

Chris Uppal · Apr 6, 2006

James said:
Nobody is going to use ASN just for fun. It's so obviously a product of
some 1980s multi-tiered management bureaucracy, it's not even funny.

Doesn't the same thing apply to XML ?

-- chris

Timbo · Apr 6, 2006

Chris said:
Timbo wrote:

Then you shouldn't have shouted so loud -- my ears were still ringing and I
missed the next few words you said ;-)

I wanted to emphasise those two words, and many people still use
text-based newsreaders, so I don't use italics.

I think "formatting" is probably the right word. There's no meaning in the
tags -- it might /look/ as if there's meaning, and well-chosen tags certainly
help if you are ever in the unfortunate position of having to read or edit XML
by hand, but there's nothing real there.

Ah, OK... we have actually got our shared definitions crossed.

"Formatting" is definitely not the word I want. I think "meaning"
is the correct word, but "contains" is misleading. When I say that
using XML format "contains" meaning, I mean that it "has a"
meaning, not that the meaning is self-evident from the tags. That
is, the XML that is passed has a meaning that can be interpreted
by the receiver, if it shares the same definitions as the sender.

In ontological terms, "John, Smith, 555,.." is just a list of
instances of concepts, with no relation to their concepts. This
makes their meaning, at worst, impossible to derive, at best,
ambiguous. Whereas <Person> ... </Person> is an instance of a
concept, and tagging it with its concept Person allows the
receiver to derive meaning and reason about this information.

How this information is formatted is not really relevant, as long
as the "is-a" relations (and others) are present.

Stefan Ram · Apr 6, 2006

Timbo said:
How this information is formatted is not really relevant, as
long as the "is-a" relations (and others) are present.

When a new document type is to be defined, when should one
choose child elements and when attributes?

The criterion that makes sense regarding the meaning can not
be used in XML due to syntactic restrictions.

An element is describing something. A description is an
assertion. An assertion might contain unary predicates or
binary relations.

Comparing this structure of assertions with the structure
of XML, it seems to be natural to represent unary predicates
with types and binary relations with attributes.

Say, "x" is a rose and belongs to Jack. The assertion is:

rose( x ) ^ owner( x, "Jack" )

This is written in XML as:

<rose owner="Jack" />

Thus, my answer would be: use element types for unary
predicates and attributes for binary relations.

Unfortunately, in XML, this is not always possible, because in
XML:

- there may be at most one type per element,

- there may be at most one attribute value per attribute
name, and

- attribute values are not allowed to be structured in
XML.

Therefore, the designers of XML document types are forced to
abuse element /types/, to describe the /relation/ of an
element to its parent element.

This /is/ an abuse, because the designation "element type"
obviously is supposed to give the /type of an element/,
i.e., a property which is intrinsic to the element alone
and has nothing to do with its relation to other elements.

The document type designers, however, are being forced to
commit this abuse, to reinvent poorly the missing structured
attribute values using the means of XML. If a rose has two
owners, the following element is not allowed in XML:

<rose owner="Jack" owner="Jill" />

One is made to use representations such as the following:

<rose>
<owner>Jack</owner>
<owner>Jill</owner>
</rose>

Here the notion "element type" suggests that it is marked that
Jack is "an owner", in the sense that "owner" is supposed to
be the type (the kind) of Jack.

The intention of the author, however, is that "owner" is
supposed to give the /relation/ to the containing element
"rose". This is the natural field of application for
attributes, as the meaning of the word "attribute" outside of
XML makes clear, but it is not possible to use them for this
purpose in XML.

An alternative solution might be the following notation.

<rose owner="Alexander Marie" />

Here a /new/ mini language (not XML anymore) is used within an
attribute value, which, of course, can not be checked anymore
by XML validators. This is really done so, for example, in
XHTML, where classes are written this way.

So in its main language, XHTML, the W3C has to abandon XML
even to write class attributes. This is not such a good
accomplishment, given that the W3C was able to draw on the
experience gained with SGML and HTML when designing XML and that
XHTML is one of the most prominent XML applications.
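
The three notations discussed so far can be tried against a real XML
parser. The following is a minimal Python sketch using the standard
library's ElementTree. The helper name is made up for this example, and
the sample values are taken from the elements shown earlier.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the parser accepts xml_text as well-formed XML."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# 1. Repeating an attribute name is rejected outright by the parser.
duplicate_accepted = is_well_formed('<rose owner="Jack" owner="Jill" />')

# 2. Child elements carry several owners, and the parser sees each one.
rose = ET.fromstring("<rose><owner>Jack</owner><owner>Jill</owner></rose>")
owners_from_children = [owner.text for owner in rose.findall("owner")]

# 3. A space-separated attribute value is a single opaque string to XML;
#    splitting it is a convention the application has to implement itself.
packed = ET.fromstring('<rose owner="Alexander Marie" />')
raw_value = packed.get("owner")
owners_from_attribute = raw_value.split()

assert duplicate_accepted is False
assert owners_from_children == ["Jack", "Jill"]
assert owners_from_attribute == ["Alexander", "Marie"]
```

In the third case the parser hands back one string, and the call to
split() is application code. That is the sense in which the mini
language lives outside XML.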

The needless restrictions of XML inhibit the meaningful use of
syntax. This leaves many document type designers wondering
when attributes and when elements are supposed to be used,
which is actually evidence of a failing in the design of
XML, which does not have many more notations than attributes
and elements. And the W3C failed to give even these two
notations a clear and meaningful purpose!

Without the restrictions described, XML alone would have
nearly the expressive power of RDF/XML, which has to repair
painfully some of the errors made in the XML-design.

Now, some recommend /always/ using subelements, because one
can never know whether an attribute value that seems to be
unstructured today might need to become structured tomorrow.
(Or it is recommended to use attributes only when one is quite
confident that they never will need to be structured.) This
recommendation does not even try to make sense of
attributes, but just explains how to circumvent the obstacles
the W3C has built into XML.

Others recommend using attributes for something they
call "metadata".

Others use an XML editor that happens to make the input of
attributes more comfortable than the input of elements and
therefore seriously suggest using as many attributes as
possible.

Still others have studied how to use CSS to format XML
documents and are using this to give recommendations about
when to use attributes and when to use subelements.

Of course: Mixing all these criteria (structured vs.
unstructured, data vs. "metadata", by CSS, by the ease of
editing, ...) often will give conflicting recommendations.

Other notations than XML have solved the problem by either
omitting attributes altogether or by allowing structured
attributes. I believe that notations with structured
attributes, which also allow multiple element types and
multiple attribute values for the same attribute name,
are helpful.
