XML Not good for Big Files (vs Flat Files)

Steve Wampler · Apr 5, 2006

Joe said:
Hierarchical data, dude. What if someone has more than one phone
number? With the comma-delimited flat file approach, it's not readily
apparent how you could implement that.

<Person>
<PhoneNumber>...</PhoneNumber>
<PhoneNumber>...</PhoneNumber>
...
</Person>

we can have as many PhoneNumbers as we want that are associated with a
person, and because it's all hierarchical we can just walk up the
hierarchy to see who these PhoneNumbers belong to.

Eh? That's still syntax. Are you saying all syntax is non-hierarchical?

People have represented hierarchical data in many ways *well before XML*,
including, yes, flat files - and it's not that hard. It's still a syntax issue.
Heck, even arbitrary graph data (hardly "hierarchical") has many syntactic
representations, including flat files.
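
As a rough illustration of the flat-file point, here is a minimal Python sketch. It assumes a made-up two-column layout (person, phone) in which someone with several numbers simply appears on several rows; the layout and sample values are hypothetical.

```python
import csv
import io
from collections import defaultdict

# Hypothetical flat file: one row per (person, phone) pair.
FLAT = """person,phone
John Smith,555-0100
John Smith,555-0101
Jane Doe,555-0199
"""

def phones_by_person(text):
    """Group the repeated rows back into a person -> [phones] hierarchy."""
    grouped = defaultdict(list)
    for row in csv.DictReader(io.StringIO(text)):
        grouped[row["person"]].append(row["phone"])
    return dict(grouped)

people = phones_by_person(FLAT)
print(people)
```

The hierarchy lives in the convention (rows sharing a key belong together), which both ends still have to agree on.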

Look, I *like* XML *for some things*, but wish people would take the time
to recognize what it is and what it isn't, please.

Timbo · Apr 5, 2006

Steve said:
I haven't seen anything in XML
that does more than provide a guarantee that the syntax is right.

Ok, so say you are writing an application that deploys an agent to
find you the best prices for CDs on the web. If you share the same
ontological definition of CD attributes, you could have the
following album embedded in a webpage:

<Album>
<Artist> Stevie Wonder </Artist>
<Title> Innervisions </Title>
<Producer> .. </Producer>
<Track number="1" name=".."/>
<Track number="2" name=".."/>
... etc..
<Price> £5</Price>
</Album>

Compare that to the text:

Stevie Wonder, Innervisions, 1: ..., 2: ..., £5

You can see that clearly, any online CD store that follows the XML
definition in the first one (which could be defined in a schema)
would be easier to browse than one that has free text, especially
if some CDs have data that others don't, such as accompanying
musicians. You could find the grammar for the free text, write a
parser for it (or download one), and interpret the parsed data,
but simply sharing the set of definitions is more straightforward.
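
For what the shared-definition approach might look like on the consuming side, here is a minimal sketch using Python's standard xml.etree.ElementTree. The attribute values are quoted so the fragment is well-formed, the ".." placeholders are kept as elided in the post, and the "Musician" tag is a hypothetical name for the optional data mentioned above.

```python
import xml.etree.ElementTree as ET

# The album from the post, with attribute values quoted so it parses.
ALBUM = """<Album>
  <Artist> Stevie Wonder </Artist>
  <Title> Innervisions </Title>
  <Producer> .. </Producer>
  <Track number="1" name=".."/>
  <Track number="2" name=".."/>
  <Price> £5</Price>
</Album>"""

def summarize(xml_text):
    """Pull out the fields an agent would compare across stores."""
    album = ET.fromstring(xml_text)
    return {
        "artist": album.findtext("Artist").strip(),
        "title": album.findtext("Title").strip(),
        "price": album.findtext("Price").strip(),
        "tracks": [int(t.get("number")) for t in album.findall("Track")],
        # Optional data (hypothetical "Musician" tag) is simply absent here.
        "musicians": [m.text for m in album.findall("Musician")],
    }

info = summarize(ALBUM)
print(info)
```

Missing optional elements come back as an empty list rather than shifting the positions of the other fields, which is the advantage being claimed over positional free text.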

Oliver Wong · Apr 5, 2006

Chris Uppal said:
But it doesn't. The meaning comes from the /interpretation/ of the data, not
from its transmission form. The parties sharing data must come to an agreement
about the meaning before they can share information. Once they have done that,
deciding on a shared format is pretty trivial whether they use XML, ASN.1,
YAML, CSV, or a custom format.

[In this post, I will group "XML", "ASN.1", "YAML", and "CSV with
headers" all under a single group which I will call "XML"; basically, this
"XML" group means data with metadata tags. As for "CSV without headers" and
"custom format", I'm going to group them together as "typical binary file".]

I'd say it's somewhere in between Timbo's and Chris' claims [with the
distortion of Chris' claim as described above]. If you plonked a typical
"binary" file onto my desktop (e.g. perhaps ripping a random file from a
Playstation DVD), and told me to try to interpret it, I could get out my hex
editor, and look around for human-readable strings, and from there maybe
look for end-of-string markers, or some sort of length-of-string headers,
and then from there try to figure out markers for other datatypes, but I
probably wouldn't get very far.
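
The first step of that hex-editor hunt can be sketched in a few lines of Python, in the spirit of the Unix `strings` tool. The byte blob below is made up for illustration; it is not taken from any real file.

```python
import re

def printable_runs(data, min_len=4):
    """Return runs of at least min_len printable ASCII bytes."""
    pattern = rb"[\x20-\x7e]{%d,}" % min_len
    return [m.decode("ascii") for m in re.findall(pattern, data)]

# Made-up blob: a length-prefixed string, some binary noise,
# a NUL-terminated string, and a run too short to report.
blob = b"\x05Hello\x00\xff\x10\x02PLAYER_NAME\x00\x01\x02ab\x00"
print(printable_runs(blob))
```

This finds the readable strings but says nothing about what the bytes around them mean, which is where the guessing starts.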

Give me a typical XML file though, and I could probably come up with an
interpretation that is near the original, depending on how the elements and
attributes are named. If the file contains a reference to a DTD or XSD,
then I could navigate over to that URL and gain even more information.

- Oliver

Timbo · Apr 5, 2006

Chris said:
Timbo wrote:

But it doesn't. The meaning comes from the /interpretation/ of the data, not
from its transmission form. The parties sharing data must come to an agreement
about the meaning before they can share information.

??? Which was exactly what I said in the sentence after the one
you quoted!

In hindsight, MEANING wasn't the correct word...
and I'm not sure of what IS the correct word...

Once they have done that,
deciding on a shared format is pretty trivial whether they use XML, ASN.1,
YAML, CSV, or a custom format.

Sure, you can send it in a CSV format, but to keep the meta-data,
then it would be:
FirstName=John, LastName=Smith, Phone=55555, etc,

where you basically have the tags in the CSV, and you are then
facing the same problems as the original poster was complaining
about. It's not the syntax of XML that is useful (frankly, I find
it tediously difficult to follow when I am forced to), it's the
fact that it provides an easy way to store meta-data, and there
are lots of nice tools to support this. It's this meta-information
that the original poster does not like.
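
A minimal sketch of reading such a tagged line in Python, assuming (and this is an assumption, not something the format guarantees) that neither "," nor "=" ever appears inside a value:

```python
def parse_tagged_line(line):
    """Split 'Key=Value, Key=Value' into a dict.

    Assumes no ',' or '=' inside values; there is no escaping here.
    """
    record = {}
    for part in line.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate a trailing comma
        key, value = part.split("=", 1)
        record[key.strip()] = value.strip()
    return record

record = parse_tagged_line("FirstName=John, LastName=Smith, Phone=55555,")
print(record)
```

The metadata is carried on every record, just as with tags, so the size overhead is of the same kind.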

Timbo · Apr 5, 2006

Steve said:
Agreed, but if I have to sit through one more PowerPoint presentation
where the presenter throws up a bunch of slides full of XML as if that's
conveying useful information to the audience, I'm going to scream. XML
is *NOT* the appropriate tool for this!

AAAGGGHHH!!! Yes!! I hate it when people put XML into the
research/technical papers and presentations! Unless of course the
paper/presentation is actually about XML, then I guess it could be
quite necessary

XML is not a human-readable format.

Oliver Wong · Apr 5, 2006

Steve Wampler said:
Really? I wouldn't have thought so. What makes you think 'f' stands
for 'field'? Maybe these are four new flavours of Ben&Jerry's ice cream.
(Not that I'd buy any of them...)

If the 4 elements were 4 of the same things, they'd have the same name.
So if they were all flavors, the document should have looked something like:

<f>John</f>
<f>Smith</f>
<f>5555555</f>
<f>37 Finch Ave.</f>

Then I'd say we have 4 records, each record containing 1 field, which
can be an arbitrary string (semantically, the field might represent
Ben&Jerry ice creams).

Since each element had a different name, I can conclude that this is 1
record with 4 fields. Perhaps each field represents a flavor of ice cream
(e.g. this is somebody's top 4 favorite ice creams, or these are the 4 most
profitable flavors, or these are, in order of submission, 4 flavors being
requested by customers etc.)

BTW, the fact that the tag names contained an 'f' is irrelevant to my
calling them fields. The document could have been

<boo>John</boo>
<bar>Smith</bar>
<buntz>5555555</buntz>
<batz>37 Finch Ave.</batz>

And I would have still come to the conclusion that we're dealing with a
single record with 4 fields.
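
The reading rule I'm applying to both fragments above can be sketched in Python. The `<record>` wrapper is only added so the sibling elements parse as one document, and the rule itself is a convention of mine, not something the parser enforces.

```python
import xml.etree.ElementTree as ET

def shape(fragment):
    """Return (number of distinct tag names, number of elements)."""
    root = ET.fromstring("<record>" + fragment + "</record>")
    tags = [child.tag for child in root]
    return len(set(tags)), len(tags)

same_names = "<f>John</f><f>Smith</f><f>5555555</f><f>37 Finch Ave.</f>"
odd_names = ("<boo>John</boo><bar>Smith</bar>"
             "<buntz>5555555</buntz><batz>37 Finch Ave.</batz>")

print(shape(same_names))  # one name repeated: read as four like items
print(shape(odd_names))   # four distinct names: read as one record, four fields
```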

The point is that the tag names are, ultimately, just strings. We might
think we understand what they mean (and can be right a high percentage of
the time if the strings are well chosen), but in the end, they mean
whatever the code at each end defines the semantics (not the syntax)
to be. That code *still* has to agree at both ends, just as it does
with "John,Smith,5555555,37 Finch Ave.". I haven't seen anything in XML
that does more than provide a guarantee that the syntax is right.

Later on in the thread, someone mentions hierarchy, and you respond that
we've had hierarchy before XML. Well, we had syntax checking before XML too.
XML doesn't give us anything new in that sense. It just gives us a "better"
way of doing what we've been previously doing, where "better" depends on the
problem you're trying to solve.

- Oliver

Bent C Dalager · Apr 5, 2006

Thing is, I doubt whether that is true. There seems to be an XML mindset that
can be summed up as "don't reinvent the wheel" (to be charitable) or "use
pre-existing work, no matter how complex it is" (to be uncharitable). XML
itself inherits all sorts of unwanted complexity from SGML. The applications
of XML tend to want to use XML as metadata. Then they start to define stuff in
terms of other XML languages, or using other XML languages. The end result is
/seriously/ complicated.

You are referring to the use of namespaces, and importing a namespace
someone else made instead of making your own tags for the same stuff?
If so, I would agree that this leads to added complexity. It tends to
force you to have to relate to a number of tags that are unnecessary
for the application at hand but which happened to be inherited from
the external namespace, and the organization of that namespace may not
be optimal for the use it is getting put to in the derived
application.

In my opinion there's a badly inverted pyramid at work. In normal situations
you build more complex systems on less complex ones. That doesn't seem to
apply to the XML world. It builds complex systems on top of even more complex
systems.

I don't know. I have yet to write a Swing application that is more
complex than Swing.

A year or two back, I got the idea that RDF would be suitable for a very small
project of mine. So I started looking into RDF. It was a private project and
I didn't want to spend more than a few days coding. By the time I realised
that I wasn't going to find the bottom of the RDF tar-pit, I'd already spent
that "few days"...

I haven't used RDF, but I would agree that it is quite possible to
completely bollox up an XML application. When defining XML
applications for my own use, I always find that it is a serious
mistake to try and be clever about it.

There are an enormous number of advanced features in XML (mostly
inherited from SGML, as you point out) that you really don't want to
be using. It will just end up confusing both yourself and any other
developers who have to relate to your XML application.

This, incidentally, is why I tend to shake my head at any Microsoft PR
person that says such things as "it is defined in XML and therefore it
is an open format". Along this particular axis, XML is more like Perl
than it is like Java. It may be incredibly useful and powerful, but it
also supports arbitrarily horrid levels of obfuscation. Whether or not
you can get any sense out of an XML application is entirely up to
whoever wrote it.

Cheers
Bent D

Steve Wampler · Apr 5, 2006

Oliver said:
If the 4 elements were 4 of the same things, they'd have the same
name. So if they were all flavors, the document should have looked

You're attaching semantics (note the 'should'). There is nothing in
XML that prevents someone from using a unique tag for every entry.
Granted that's not accepted convention (and a bad idea, to boot), but you
have to make some semantic assumptions to get any interpretation out of XML.

Later on in the thread, someone mentions hierarchy, and you respond
that we've had hierarchy before XML. Well, we had syntax checking before
XML too. XML doesn't give us anything new in that sense. It just gives
us a "better" way of doing what we've been previously doing, where
"better" depends on the problem you're trying to solve.

Oh, I agree with that. The point I was responding to was the statement
that seemed to imply hierarchy was not syntax, and that it was difficult
to represent hierarchy without XML. The difficulty is that there are
too many ways to do so (no standardization) - that's XML's real
contribution, to me - even though it's flawed, it is nearly ubiquitous.
(Although there are still too many ways to represent the same data in
XML as well - which is probably true of any "sufficiently powerful" syntax -
at least it's possible to always parse the XML. Making sense of the
result is another matter).

The fact that it is also (somewhat) self-defining in syntax is also useful
in some contexts, but not something I find *overwhelmingly* valuable.

James McGill · Apr 5, 2006

You are referring to the use of namespaces, and importing a namespace
someone else made instead of making your own tags for the same stuff?
If so, I would agree that this leads to added complexity.

Here's a case in point, from my corner of the real world:

The "Any" element that can go in a RFC-2518 "DAV" Multistatus.
It turns out to be quite difficult to bind this kind of XSD to
Objects, although, with a little work, Castor handles it just fine.
But you basically have a schema that says "put anything here", and in
order to implement that, you have to compromise and specify what
"anything" may consist of.

Another problem that came to light from that experience was, once
Microsoft interprets a spec wrong, that wrong interpretation becomes the
spec, no matter what anyone else says. Even a very well designed, fully
specified XML schema is no guarantee of a successful interface
constraint!

Steve Wampler · Apr 5, 2006

Timbo said:
Ok, so say you are writing an application that deploys an agent to find
you the best prices for CDs on the web. If you share the same
ontological definition of CD attributes, you could have the following
album embedded in a webpage:

<Album>
<Artist> Stevie Wonder </Artist>
<Title> Innervisions </Title>
<Producer> .. </Producer>
<Track number="1" name=".."/>
<Track number="2" name=".."/>
... etc..
<Price> £5</Price>
</Album>

Compare that to the text:

Stevie Wonder, Innervisions, 1: ..., 2: ..., £5

You can see that clearly, any online CD store that follows the XML
definition in the first one (which could be defined in a schema) would
be easier to browse than one that has free text, especially if some CDs
have data that others don't, such as accompanying musicians. You could
find the grammar for the free text, write a parser for it (or download
one), and interpret the parsed data, but simply sharing the set of
definitions is more straightforward.

Hmmm, I, as a human, find the second form *much* easier to browse. I can pick
out the actual content *much* faster. Granted, I might prefer something like:

Stevie Wonder: Innervisions ($9.25)
1: ....
2: ....
3: ....

but that would depend on whether I'm more interested in the artist and album or
the details of the album content. (Great price, by the way!)

Of course, you're talking about computer handling of the data, where your points
are more valid. That's *still* syntax though.

Oliver Wong · Apr 5, 2006

Steve Wampler said:
Hmmm, I, as a human, find the second form *much* easier to browse. I can pick
out the actual content *much* faster. Granted, I might prefer something like:

Stevie Wonder: Innervisions ($9.25)
1: ....
2: ....
3: ....

but that would depend on whether I'm more interested in the artist and album or
the details of the album content. (Great price, by the way!)

Of course, you're talking about computer handling of the data, where your points
are more valid. That's *still* syntax though.

I find Timbo's XML version as easy to read as Timbo's CSV version.
However, I do find Steve's "custom" version easier to read over the other
two, as a human.

However, another nice thing about XML over the other two formats is that
there is a standardized escaping mechanism. Artists are... well...
artistic... and they sometimes do crazy things. In CSV, or the custom
format, how do you distinguish between an album whose name is the empty
string, and an album whose name is the single space character? What if the
album contains a colon in it? What if the artist name contains a colon in
it? What if the album name contains an open-parenthesis and dollar sign in
it, but no close-parenthesis? Etc.

As purely digital music becomes more popular (e.g. songs existing only
as OGG or MP3 files, and no physical albums, so no cover art necessary),
you could have tech-savvy artists define the names of their tracks to be the
newline character for some specific platform, for example. Maybe I'll go
write a song right now whose name is the value of the Java literal String
expression "\u0000\r\n\u0008\r\n\n". For clarity, the name of my song is 7
characters long, and is not intended to be pronounced (there will be no
lyrics in the song).

With XML, it's possible to express unambiguously any possible string of
characters (using, e.g., entity-references). With CSV or the custom format,
you'd have to invent an escaping-system, and then I, as a human, would have
to learn about your escaping system to either be able to read the data
myself, or to implement a program which can parse the data.
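
A minimal round-trip sketch in Python, using the standard xml.etree.ElementTree, shows the built-in escaping at work. The awkward title is invented for illustration. The sketch sticks to printable characters: XML 1.0 excludes U+0000 and most other control characters even as character references, so the seven-character title above is outside what it covers.

```python
import xml.etree.ElementTree as ET

def round_trip(title):
    """Serialize a title into a <Title> element and parse it back."""
    element = ET.Element("Title")
    element.text = title
    serialized = ET.tostring(element, encoding="unicode")
    return serialized, ET.fromstring(serialized).text

# Invented title with a colon, markup characters, and an unclosed parenthesis.
awkward = 'Songs: <Vol. 1> & ($5 "live"'
serialized, recovered = round_trip(awkward)
print(serialized)
print(recovered == awkward)

# A single space survives as text; an empty title comes back as None.
print(repr(round_trip(" ")[1]), repr(round_trip("")[1]))
```

No escaping rules had to be invented or documented for this; the serializer and parser already agree on them.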

- Oliver

Oliver Wong · Apr 5, 2006

Steve Wampler said:
You're attaching semantics (note the 'should'). There is nothing in
XML that prevents someone from using a unique tag for every entry.
Granted that's not accepted convention (and a bad idea, to boot), but you
have to make some semantic assumptions to get any interpretation out of
XML.

Yes, but that'd be abusing XML in the same sense that there's no reason
why you couldn't "use" CSV in the following manner:

<ExampleCSVDocument>
,,,,,,
,,
,,,,,,,,
,,,,
,,
,

,,,,,,
</ExampleCSVDocument>

Where I'm representing a sequence of integers such that each row (which
is equivalent to a line in CSV) is an integer, and the number of fields
(which is equal to the number of commas plus one) is the value of that
integer. I could represent ASCII (or even Unicode) text that way, but it'd
be breaking the unenforced-but-understood conventions of CSV.

In fact, I could further obfuscate the above document by putting in
random content in between the commas, and defining that you should just
ignore such content, and only count the number of commas.
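
For the curious, decoding that scheme takes one line of Python. The sketch below rebuilds the rows from their comma counts as read off the example, and treats the blank line as a row with a single empty field (my assumption about what that line means).

```python
def decode_comma_rows(document):
    """Read each line as one integer: the number of commas plus one."""
    return [line.count(",") + 1 for line in document.split("\n")]

# Comma counts per row, read off the example document above.
ROWS = "\n".join("," * n for n in (6, 2, 8, 4, 2, 1, 0, 6))

values = decode_comma_rows(ROWS)
print(values)

# Junk between the commas does not change the decoded value.
print(decode_comma_rows("junk,more junk,x"))
```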

The difficulty is that there are
too many ways to do so (no standardization) - that's XML's real
contribution, to me - even though it's flawed, it is nearly ubiquitous.

I agree. The great thing about XML is that you can pretty much "accept"
them no matter what platform you're running on (as opposed to, say, a
Microsoft Word document). That's why I sometimes write documents in XHTML,
rather than Microsoft Word, as I want to make it readable by the widest
audience possible.

As to the "too many ways" issue, I've mentioned elsewhere in this thread
the "stack/layer of services" view. TCP/IP can be used for so many
applications: e-mail, file transfer, instant messaging, etc. How can we
make sense of all these uses? Well, the applications don't directly use
TCP/IP, but rather they use services (e.g. FTP) that use TCP/IP. And then
you could build a service on top of FTP to simulate a shared file system,
and so on.

So to me XML is one layer. You can then build layers on top of that
(e.g. XHTML, RSS, SOAP), and then build layers on top of those, and so on.
You can also put XML over other layers (e.g. XML->gzip->FTP->TCP/IP to send
XML documents between computers while still using a reasonable amount of bandwidth).
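
The XML-over-gzip part of that stack can be sketched in Python with the standard gzip module. The transport layers (FTP, TCP/IP) are left out, and the repetitive catalog document is hypothetical.

```python
import gzip
import xml.etree.ElementTree as ET

def pack(root):
    """XML layer, then compression layer: serialize to bytes, then gzip."""
    return gzip.compress(ET.tostring(root, encoding="utf-8"))

def unpack(payload):
    """Undo the layers in reverse order."""
    return ET.fromstring(gzip.decompress(payload))

# Hypothetical, repetitive document; repeated tag names compress well.
catalog = ET.Element("Catalog")
for n in range(200):
    ET.SubElement(catalog, "Track", number=str(n), name="..")

raw = ET.tostring(catalog, encoding="utf-8")
packed = pack(catalog)
print(len(raw), len(packed))

restored = unpack(packed)
print(len(restored.findall("Track")))
```

Each layer only needs to know about the one directly beneath it, which is the point of the stack view.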

The fact that it is also (somewhat) self-defining in syntax is also useful
in some contexts, but not something I find *overwhelmingly* valuable.

I like XML's linking to DTDs or XSDs. When you have this strange XML
file, and you think to yourself "Where can I find out more information on
the format?", you have a URL telling you exactly where to go. This is
better, IMHO, than the previous practice of relying on file extensions to
define what kind of data is in the file, and then using a site like
http://filext.com/ to find out more about those kinds of files.
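
As a small sketch of following that link programmatically, the Python below reads the xsi:noNamespaceSchemaLocation attribute from a document's root element. The document and the example.org URL are placeholders, not a real schema.

```python
import xml.etree.ElementTree as ET

XSI = "http://www.w3.org/2001/XMLSchema-instance"

# Hypothetical document; the example.org URL is a placeholder.
DOC = """<Album xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:noNamespaceSchemaLocation="http://example.org/album.xsd">
  <Title>Innervisions</Title>
</Album>"""

def schema_hint(xml_text):
    """Return the XSD location the document points at, or None."""
    root = ET.fromstring(xml_text)
    # ElementTree spells namespaced attribute names as "{uri}local".
    return root.get("{%s}noNamespaceSchemaLocation" % XSI)

print(schema_hint(DOC))
print(schema_hint("<Album/>"))
```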

- Oliver

Oliver Wong · Apr 5, 2006

Timbo said:
AAAGGGHHH!!! Yes!! I hate it when people put XML into the
research/technical papers and presentations! Unless of course the
paper/presentation is actually about XML, then I guess it could be quite
necessary

I agree. Another appropriate use might be if the presentation is about
some piece of software which USES XML, and the presenter is just showing an
example document to give an idea of what kind of data will be present. E.g.
someone explaining why all blogging software should implement RSS feeds (I'm
looking at you, Xanga).

XML is not a human-readable format.

I disagree.

- Oliver

James McGill · Apr 5, 2006

I agree. Another appropriate use might be if the presentation is about
some piece of software which USES XML, and the presenter is just showing an
example document to give an idea of what kind of data will be present.

Even then, I usually say something like "the data formats are defined in
an XML Schema document which is in CVS." (My product has database
bindings, configs, and messaging, generated by and bound to XSD).

If I'm presenting something from this, I normally show the data as the
result of a transform into something presentable. But then, my audience
is always closed, consisting only of people who are already on the same
page and are as close to the mechanics of the product as I am.

James McGill · Apr 5, 2006

Remember that we are talking about a government here.

The Canadian government, which I've been led to understand is the most
progressive on Earth, etc.

Oliver Wong · Apr 5, 2006

James McGill said:
Even then, I usually say something like "the data formats are defined in
an XML Schema document which is in CVS." (My product has database
bindings, configs, and messaging, generated by and bound to XSD).

If I'm presenting something from this, I normally show the data as the
result of a transform into something presentable. But then, my audience
is always closed, consisting only of people who are already on the same
page and are as close to the mechanics of the product as I am.

Yeah, this works if the people you're speaking to have access to (and
know how to use) CVS. I was envisioning a situation where you're trying to
convince a bunch of people to adopt a new technology, and so the burden is
on you to provide all the relevant information. E.g. you're in a room with a
bunch of business people, and you show them an example RSS document, and
essentially say "See? It's so easy, even you, who had no training in
computer science, can figure out how to write an RSS document if you really
wanted to."

- Oliver

Timo Stamm · Apr 5, 2006

Steve said:
I haven't seen anything in XML
that does more than provide a guarantee that the syntax is right.

Have a look at XSDs.

Timo

Timo Stamm · Apr 5, 2006

Steve said:
There is nothing in XML that prevents someone from using a unique
tag for every entry.

Of course there is. There are various ways to define schemas for XML
documents:

http://en.wikipedia.org/wiki/XML_schema#XML_schema_languages
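
To give a feel for the kind of constraint a schema expresses, here is a hand-rolled Python stand-in. It is not an XSD or DTD validator (Python's standard library does not include one); the allowed-tag set is a hypothetical rule that a real schema language would state declaratively.

```python
import xml.etree.ElementTree as ET

# Hypothetical rule standing in for a schema: allowed children of <Person>.
ALLOWED = {"FirstName", "LastName", "PhoneNumber"}

def unexpected_tags(xml_text, allowed=ALLOWED):
    """List child tags of the root that the rule set does not permit."""
    root = ET.fromstring(xml_text)
    return [child.tag for child in root if child.tag not in allowed]

good = ("<Person><FirstName>John</FirstName>"
        "<PhoneNumber>5555555</PhoneNumber></Person>")
bad = "<Person><boo>John</boo><bar>Smith</bar></Person>"

print(unexpected_tags(good))
print(unexpected_tags(bad))
```

A document that invents a unique tag for every entry fails such a check, which is the sense in which a schema can rule that practice out.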

Timo

Andrew McDonagh · Apr 5, 2006

Joe said:
Exactly. Any data set containing 30 million records would be grossly
inefficient in one single file, whether it be XML or otherwise.

Besides, the subject is about large XML files vs flat files.

Not XML vs RDBMSs.

If we were talking about using the XML file as a database, then they'd have a
point (small or large file). Relational databases won over Hierarchical
databases years ago for many good reasons.

grasp06110 · Apr 5, 2006

What is your take on JSON?
