XML Not good for Big Files (vs Flat Files)

Chris Uppal

Monique Y. Mudama wrote:

[about ASN.1]
Except that, apparently, it's not terribly well known or supported.
That does make a difference. One of the selling points of XML is that
it can allow diverse groups to share data.

Um, it is rather /widely/ used. Do you remember when someone discovered a
bunch of exploitable vulnerabilities in a commonly used ASN.1 related library
(I think it might have been a code generator that produced vulnerable code).
The list of affected products included just about every vendor of
network-related kit.

-- chris
 
Chris Uppal

Monique said:
I had a mental image of a toddler, er, toddling along. No idea if
that's actually what was meant. In the context of my brain, it meant
"so easy a toddler could do it."

The word's common in British English. I don't know about other
dialects/flavours.

The word "doddle" does derive from "toddle", according to the OED, where
"toddle" means the halting walk of an infant or elderly/infirm person. A
doddle, however, is just something that is easy -- as the OED puts it: "a
'walk-over'".

-- chris
 
Joe Attardi

Also, isn't it likely that the file would be split up?
Exactly. Any data set containing 30 million records would be grossly
inefficient in one single file, whether it be XML or otherwise.
 
bugbear

Homer said:
John,Smith,5555555,37 Finch Ave.

Is now:

<FirstName>John</FirstName>
<LastName>Smith</LastName>
<PhoneNum>5555555</PhoneNum>
<Address>37 Finch Ave.</Address>

In the first example is 5555555 a phone number, or
part of the address?

And, w.r.t repeating tags; 1 word. gzip.
Several applications simply use gzip'd XML
to get a good compromise.

gzip (and other compressors) are rather good
at crunching off the kind of trivial
repetition you object to.

BugBear
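
A rough way to see the effect of gzip on repeated tags is to build a batch of element-style records and compress them in memory. This is only a sketch: the `GzipXmlDemo` class, the `<PersonList>`/`<Person>` wrappers and the generated phone numbers are made up for illustration, and synthetic records this uniform will compress far better than real data would.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

class GzipXmlDemo {

    /** Builds `count` element-style records (hypothetical wrapper tags, synthetic data). */
    static String buildXml(int count) {
        StringBuilder sb = new StringBuilder("<PersonList>\n");
        for (int i = 0; i < count; i++) {
            sb.append("<Person>")
              .append("<FirstName>John</FirstName>")
              .append("<LastName>Smith</LastName>")
              .append("<PhoneNum>").append(5550000 + i).append("</PhoneNum>")
              .append("<Address>37 Finch Ave.</Address>")
              .append("</Person>\n");
        }
        return sb.append("</PersonList>\n").toString();
    }

    /** Compresses a byte array with the JDK's gzip stream. */
    static byte[] gzip(byte[] raw) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(buffer)) {
            gz.write(raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // The gzip stream is closed here, so the trailer has been written.
        return buffer.toByteArray();
    }

    public static void main(String[] args) {
        byte[] raw = buildXml(10_000).getBytes(StandardCharsets.UTF_8);
        byte[] packed = gzip(raw);
        System.out.println("raw bytes:     " + raw.length);
        System.out.println("gzipped bytes: " + packed.length);
    }
}
```

The same wrapping works on a file or socket stream, so the XML-producing code does not need to know it is being compressed.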
 
Monique Y. Mudama

Monique Y. Mudama wrote:

[about ASN.1]
Except that, apparently, it's not terribly well known or supported.
That does make a difference. One of the selling points of XML is
that it can allow diverse groups to share data.

Um, it is rather /widely/ used. Do you remember when someone
discovered a bunch of exploitable vulnerabilities in a commonly used
ASN.1 related library (I think it might have been a code generator
that produced vulnerable code). The list of affected products
included just about every vendor of network-related kit.

*ducks*

Okay, I guess it is widely supported. I just haven't happened to have
come across anything in my development work that ever made use of it
(that I know of). I shouldn't have generalized that to the rest of
the world.
 
Monique Y. Mudama

Remember that we are talking about a government here. Being only a
decade behind the times is damned impressive !

Now, now. In 1999 I worked on a US govt project (I think it was DoD, or
maybe DISA) to create an XML repository to share across govt branches.

I also spent 1998 through erm, a couple of years ago working on Java
systems for some defense related stuff. I think when we started we
were using 1.1.7, and it did take a looooong time to convince the
customer to upgrade, but after that it wasn't too hard to keep moving.
I remember getting bitten by glob imports + that new List class,
engendering a hatred of glob imports that continues to this day.

Some govt customers are very into new technology (almost to the point
of silliness -- they want to reimplement in the new stuff even if
there's no direct benefit and resources would be better spent
improving the rest of the app).
 
Homer

That's great. Put tons of repeating tags inside the file and make it
huge and now everybody is saying how to make it small with
Gzip/Binary,...

Third field (between delimiters; whatever it is) is phone number. Any
file has File Spec Document (unless you XML lovers have replaced it with
some XML equivalent).
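
The positional approach described above might look like the following in Java. It is a minimal sketch under assumptions: the `FlatFileDemo` class and the field-index constants stand in for whatever the File Spec Document actually says, and the naive `split` would break if a field (an address, say) ever contained a comma.

```java
class FlatFileDemo {

    // Field positions as an agreed file spec might define them (hypothetical).
    static final int FIRST_NAME = 0;
    static final int LAST_NAME = 1;
    static final int PHONE_NUM = 2;
    static final int ADDRESS = 3;
    static final int FIELD_COUNT = 4;

    /** Splits one record on commas and checks it has the agreed number of fields. */
    static String[] parseRecord(String line) {
        // Limit -1 keeps trailing empty fields instead of silently dropping them.
        String[] fields = line.split(",", -1);
        if (fields.length != FIELD_COUNT) {
            throw new IllegalArgumentException(
                "expected " + FIELD_COUNT + " fields but got " + fields.length);
        }
        return fields;
    }

    public static void main(String[] args) {
        String[] record = parseRecord("John,Smith,5555555,37 Finch Ave.");
        // The meaning of position 2 comes from the spec, not from the data.
        System.out.println("phone number: " + record[PHONE_NUM]);
    }
}
```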

When the sender and receiver are agreed on format there is no need to
repeat labels. Like what you write on postal envelope. Or you told your
wife your name is John 20 years ago. No need to wear a name tag just in
case you change your name (if you change your name tell her one more
time; sending File Spec Doc to receiver)

I am still saying I am 100% with you all that IF you are sending data
in small volume and/or receiver doesn't know about the file format
XML is the best solution. But use it as a tool to fix any problems is
going too far.

Homer
 
Timbo

Homer said:
I guess these responses are proving my point. You all know that the
best solution for transferring huge files between two parties is a simple
flat file whose format both sender and receiver have agreed upon, sent
over a secure line. But you still defend adding tons of tags to a file
whose format both sender and receiver already know.

My guess is that you don't really understand either my post, or
XML. It's not the FORMAT of XML, it's the fact that it contains
MEANING. So, if the sender and receiver have a shared ontology
that says that FirstName is someone's first name, then the data
<FirstName>John</FirstName> is more than just some text with the
value "John"... it is saying that "John" is his first name. So
rather than just having raw data, you have information that is
useful to the receiver. Moreover, for a third party to use this
information, you need only give them the shared definitions,
rather than give them both the format and the meaning.
 
Oliver Wong

Steve Wampler said:
No problem:

<f1>John</f1>
<f2>Smith</f2>
<f3>5555555</f3>
<f4>37 Finch Ave.</f4>

There, that should make people happy :)
(Of course, given this group, maybe the tags should be in Klingon...)

Well, at least with this notation, I wouldn't have made my initial
mistake of thinking I was dealing with 4 records which seemed to be
arbitrary strings.

Given the tag names, I can see I am dealing with a single record with 4
fields.

So we're making progress here, but perhaps the tag names could have been
better chosen.

And if there were an XSD along with this, I could check whether f3 was
purely numeric, or if it could contain arbitrary string data as well.

- Oliver
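
For what the XSD check on f3 might look like, here is a sketch using the JDK's built-in `javax.xml.validation` API. The schema itself is an assumption: no XSD was posted in the thread, so the `<record>` wrapper element and the choice of `xs:integer` for f3 are invented for illustration, as is the `XsdCheckDemo` class.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.SAXException;

class XsdCheckDemo {

    // Hypothetical schema: one <record> holding f1..f4, with f3 restricted to integers.
    static final String XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
      + "<xs:element name='record'><xs:complexType><xs:sequence>"
      + "<xs:element name='f1' type='xs:string'/>"
      + "<xs:element name='f2' type='xs:string'/>"
      + "<xs:element name='f3' type='xs:integer'/>"
      + "<xs:element name='f4' type='xs:string'/>"
      + "</xs:sequence></xs:complexType></xs:element>"
      + "</xs:schema>";

    /** Compiles the schema; a failure here means the XSD itself is broken. */
    static Schema compileSchema() {
        try {
            SchemaFactory factory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
            return factory.newSchema(new StreamSource(new StringReader(XSD)));
        } catch (SAXException e) {
            throw new IllegalStateException("schema did not compile", e);
        }
    }

    /** Returns true if the document is well-formed and satisfies the schema. */
    static boolean isValid(String xml) {
        Validator validator = compileSchema().newValidator();
        try {
            validator.validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // malformed XML or a schema violation
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String good = "<record><f1>John</f1><f2>Smith</f2><f3>5555555</f3><f4>37 Finch Ave.</f4></record>";
        String bad = "<record><f1>John</f1><f2>Smith</f2><f3>555-5555</f3><f4>37 Finch Ave.</f4></record>";
        System.out.println("numeric f3 valid?     " + isValid(good));
        System.out.println("non-numeric f3 valid? " + isValid(bad));
    }
}
```

Note that the schema only constrains the syntax of f3; whether it is a phone number or an ID is still a matter of agreement between the parties.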
 
Timbo

Homer said:
Or you told your
wife your name is John 20 years ago. No need to wear a name tag just in
case you change your name (if you change your name tell her one more
time; sending File Spec Doc to receiver)

This is a good example. People don't just usually walk up to a
group of people and say "John"... they say something like "My name
is John". They identify what the piece of data "John" actually
means using a shared definition for "My name is".

I am still saying I am 100% with you all that IF you are sending data
in small volume and/or receiver doesn't know about the file format
XML is the best solution. But use it as a tool to fix any problems is
going too far.

Yeah, I don't really know what you mean by that last sentence. I
suspect you are implying that people who like XML tend to see it
as a solution to everything? That's certainly true of every
technology.

As someone who encounters a lot of AI stuff in my job, I can
really see the value of associating values of things with what
they are meant to represent. XML is one way to help with this. I'm
not sure what the Canadian government is doing with the
information that it is transferring, but I can see the advantage
of tagging the kind of information that they would be using. If it
was parsing raw data into XML simply for the purpose of backing it
up over a network or something, that would be odd, but I'm
guessing they are doing other things with it too.
 
Bent C Dalager

That's great. Put tons of repeating tags inside the file and make it
huge and now everybody is saying how to make it small with
Gzip/Binary,...

Third field (between delimiters; whatever it is) is phone number. Any
file has File Spec Document (unless you XML lovers have replaced it with
some XML equivalent).

When the sender and receiver are agreed on format there is no need to
repeat labels.

XML isn't particularly useful for the original sender and receiver.
They would probably be better off using a binary format. It is useful
for the third party who wants his product to interact or compete with
the software used by sender and receiver and therefore needs to
reverse engineer the protocol being used between them. In this
context, a high level of protocol redundancy is extremely useful since
it makes it reasonably easy for a human to work out what is going on
so that he can replicate it.

This is part of the same philosophy that made the Internet so big in
the first place: simple protocols that anyone could understand and
hook into. SMTP isn't a very good protocol by any stretch of the
imagination, but it is _simple_ and you can very easily hook into it
to make it do the things _you_ want it to do. If SMTP had been
ASN.1-based, chances are X.400 or something would have won the email
protocol wars because only professionals would have bothered extending
SMTP or creating cheap (free) MTAs, mail clients, etc.

XML may be a resource hog, it may be absolutely preposterous from an
information theory standpoint and it may have accumulated a shedload
of idiosyncrasies over time, but it does help keep technology and
protocols accessible to hobbyists and starting programmers. This is
highly useful in itself and might very well be enough to justify its
widespread adoption.

I am still saying I am 100% with you all that IF you are sending data
in small volume and/or receiver doesn't know about the file format
XML is the best solution. But use it as a tool to fix any problems is
going too far.

I tend to bring up my "XML-based streaming video" horror scenario in
these debates just to point out that XML should be used with some
caution:

<video-frame number="1654392">
<line number="1">
<pixel number="1">
<colour red="0" green="14" blue="200"/>
</pixel>
<pixel number="2">
<colour red="0" green="13" blue="198"/>
</pixel>
<pixel number="3">
<colour red="3" green="12" blue="197"/>
</pixel>
<!-- more pixels . . -->
</line>
<!-- more lines ... -->
</video-frame>

_Now_ we're talking broadband :)

Cheers
Bent D
 
Chris Uppal

Timbo said:
My guess is that you don't really understand either my post, or
XML. It's not the FORMAT of XML, it's the fact that it contains
MEANING.

But it doesn't. The meaning comes from the /interpretation/ of the data, not
from its transmission form. The parties sharing data must come to an agreement
about the meaning before they can share information. Once they have done that,
deciding on a shared format is pretty trivial whether they use XML, ASN.1,
YAML, CSV, or a custom format.

-- chris
 
Steve Wampler

Oliver said:
Well, at least with this notation, I wouldn't have made my initial
mistake of thinking I was dealing with 4 records which seemed to be
arbitrary strings.

Given the tag names, I can see I am dealing with a single record with
4 fields.

Really? I wouldn't have thought so. What makes you think 'f' stands
for 'field'? Maybe these are four new flavours of Ben&Jerry's ice cream.
(Not that I'd buy any of them...)

The point is that the tag names are, ultimately, just strings. We might
think we understand what they mean (and can be right a high percentage of
the time if the strings are well chosen), but in the end, they mean
whatever the code at each end defines the semantics (not the syntax)
to be. That code *still* has to agree at both ends, just as it does
with "John,Smith,5555555,37 Finch Ave.". I haven't seen anything in XML
that does more than provide a guarantee that the syntax is right.
 
Joe Attardi

I am still saying I am 100% with you all that IF you are sending data
in small volume and/or receiver doesn't know about the file format
XML is the best solution. But use it as a tool to fix any problems is
going too far.

I do agree with you on this one! XML is definitely not a catch-all
solution for every problem. Using it to send 30 million records is
probably not a good use for it.

But, you are being too harsh on XML, accusing people of using it
because it's "the cool thing" or because it's new and a novelty (both
of which are false, by the way).

_For applicable problems_, XML is extremely useful because of its
features, not because it is a "cool" technology.

What if the XML data needs to be converted to some other format? XSLT
to the rescue! I can use an XSL stylesheet to quickly convert my XML
data file to your flat file comma-delimited format.
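
As an illustration of the XSLT route, the JDK's `javax.xml.transform` API can apply a small stylesheet that flattens records into comma-delimited lines. Everything specific here is an assumption made for the sketch: the `XsltToCsvDemo` class, the `<PersonList>`/`<Person>` wrappers around the element-style record, and the stylesheet, which does no quoting of commas inside field values.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class XsltToCsvDemo {

    // Hypothetical input: the record from the thread inside assumed wrapper elements.
    static final String SAMPLE_XML =
        "<PersonList><Person>"
      + "<FirstName>John</FirstName><LastName>Smith</LastName>"
      + "<PhoneNum>5555555</PhoneNum><Address>37 Finch Ave.</Address>"
      + "</Person></PersonList>";

    // XSLT 1.0 stylesheet: one text line per Person, fields separated by commas.
    static final String XSL =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output method='text'/>"
      + "<xsl:template match='/PersonList'>"
      + "<xsl:for-each select='Person'>"
      + "<xsl:value-of select='FirstName'/><xsl:text>,</xsl:text>"
      + "<xsl:value-of select='LastName'/><xsl:text>,</xsl:text>"
      + "<xsl:value-of select='PhoneNum'/><xsl:text>,</xsl:text>"
      + "<xsl:value-of select='Address'/><xsl:text>&#10;</xsl:text>"
      + "</xsl:for-each>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    /** Applies the stylesheet to the given XML and returns the text output. */
    static String toCsv(String xml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(XSL)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("transformation failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.print(toCsv(SAMPLE_XML));
    }
}
```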

Then there's XML data binding tools like JAXB, Castor, etc. Give it an
XSD, and *poof* it generates a set of Java classes around it. Now I can
load my XML data in a program and not worry about parsing it with
DOM/SAX or searching for data using XPath.
 
Joe Attardi

Homer said:
That's great. Put tons of repeating tags inside the file and make it
huge and now everybody is saying how to make it small with
Gzip/Binary,...

Third field (between delimiters; whatever it is) is phone number. Any
file has File Spec Document (unless you XML lovers have replaced it with
some XML equivalent).

One thing nobody's really mentioned much yet is attributes. They are
just as descriptive as elements, and your example does tend to overuse
the sub-elements. Consider this:

Instead of:
<FirstName>John</FirstName>
<LastName>Smith</LastName>
<PhoneNum>5555555</PhoneNum>
<Address>37 Finch Ave.</Address>

what about,

<PersonList>
<Person firstName="John" lastName="Smith" phoneNum="5555555"
address="37 Finch Ave." />
</PersonList>

You've cut down on duplicated text by half (since things like
firstname, lastname, etc. are now attributes and therefore don't need
closing tags).
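
To put a number on the saving for this one record, the two forms can simply be measured. This is a small sketch, not a benchmark: the `MarkupOverheadDemo` class is made up, and the element form is wrapped in an assumed `<Person>` element so that both strings delimit exactly one record.

```java
import java.nio.charset.StandardCharsets;

class MarkupOverheadDemo {

    // Element style, wrapped in <Person> (an assumption) so the record has a boundary.
    static final String ELEMENT_FORM =
        "<Person>"
      + "<FirstName>John</FirstName>"
      + "<LastName>Smith</LastName>"
      + "<PhoneNum>5555555</PhoneNum>"
      + "<Address>37 Finch Ave.</Address>"
      + "</Person>";

    // Attribute style, as in the post above.
    static final String ATTRIBUTE_FORM =
        "<Person firstName=\"John\" lastName=\"Smith\" phoneNum=\"5555555\" "
      + "address=\"37 Finch Ave.\"/>";

    /** Size of the string once encoded as UTF-8, which is what goes over the wire. */
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        System.out.println("element form bytes:   " + utf8Length(ELEMENT_FORM));
        System.out.println("attribute form bytes: " + utf8Length(ATTRIBUTE_FORM));
    }
}
```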
 
Oliver Wong

Roedy Green said:
There are ways now given an XML schema to create the equivalent binary
ASN.1 that can be decoded up to 100 times faster than the original
XML. Given the incompetence of the W3C in designing XML, I would not
entrust them to produce a binary equivalent. Let's just stick with
ASN.1. Unless it had built-in dictionary compression, it is not going
to be sufficiently better than ASN.1 to warrant a competing format.

Alternatively, you could take a "stack of services" view, as one
typically does with networking protocols, and just see XML as one of the
higher level services. It's a good way to serialize simple state-only (i.e.
no behaviour) objects to a string. Parsing the XML would be equivalent to:

ArrayList<Person> persons = new ArrayList<Person>();
Person p = new Person();
p.setFirstName("John");
p.setLastName("Smith");
p.setPhoneNum("5555555"); // telephone, ID, or whatever this field turns out to be
p.setAddress("37 Finch Ave.");
persons.add(p);
//And so on for all the other records.
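
For comparison, here is roughly what the parsing side could look like with the JDK's DOM parser. Treat it as a sketch under assumptions: the immutable `Person` class, the `<PersonList>`/`<Person>` wrappers and the element names are illustrative choices, and a file with millions of records would call for a streaming parser rather than DOM.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class XmlToObjectsDemo {

    /** Simple state-only object (hypothetical shape). */
    static final class Person {
        final String firstName, lastName, phoneNum, address;
        Person(String firstName, String lastName, String phoneNum, String address) {
            this.firstName = firstName;
            this.lastName = lastName;
            this.phoneNum = phoneNum;
            this.address = address;
        }
    }

    /** Reads the text of the first descendant element with the given tag name. */
    static String text(Element parent, String tag) {
        return parent.getElementsByTagName(tag).item(0).getTextContent();
    }

    /** Turns every <Person> element in the document into a Person object. */
    static List<Person> parse(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            NodeList nodes = doc.getElementsByTagName("Person");
            List<Person> persons = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                Element e = (Element) nodes.item(i);
                persons.add(new Person(text(e, "FirstName"), text(e, "LastName"),
                                       text(e, "PhoneNum"), text(e, "Address")));
            }
            return persons;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<PersonList><Person><FirstName>John</FirstName><LastName>Smith</LastName>"
                   + "<PhoneNum>5555555</PhoneNum><Address>37 Finch Ave.</Address></Person></PersonList>";
        for (Person p : parse(xml)) {
            System.out.println(p.firstName + " " + p.lastName + ", " + p.phoneNum + ", " + p.address);
        }
    }
}
```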

If the "raw" XML takes too long to transfer over the network, use a
separate compression service as a layer beneath the XML. That way, as
compression technology improves, we can swap out the underlying service
without changing any of the code that deals with the XML layer.

- Oliver
 
Joe Attardi

Steve said:
I haven't seen anything in XML
that does more than provide a guarantee that the syntax is right.

Hierarchical data, dude. What if someone has more than one phone
number? With the comma-delimited flat file approach, it's not readily
apparent how you could implement that.

<Person>
<PhoneNumber>...</PhoneNumber>
<PhoneNumber>...</PhoneNumber>
....
</Person>

we can have as many PhoneNumbers as we want that are associated with a
person, and because it's all hierarchical we can just walk up the
hierarchy to see who these PhoneNumbers belong to.
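
Walking up the hierarchy can be sketched with the JDK's DOM API: start from each PhoneNumber element and ask for its parent. The `PhoneOwnerDemo` class, the `name` attribute on `<Person>` and the sample people and numbers are all invented for the example.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class PhoneOwnerDemo {

    // Made-up sample data; the "name" attribute is an assumption for this sketch.
    static final String SAMPLE_XML =
        "<PersonList>"
      + "<Person name='John Smith'>"
      + "<PhoneNumber>5555555</PhoneNumber><PhoneNumber>5556666</PhoneNumber>"
      + "</Person>"
      + "<Person name='Jane Doe'>"
      + "<PhoneNumber>5557777</PhoneNumber>"
      + "</Person>"
      + "</PersonList>";

    /** Groups every PhoneNumber by the name of the Person element that contains it. */
    static Map<String, List<String>> phonesByOwner(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            NodeList phones = doc.getElementsByTagName("PhoneNumber");
            Map<String, List<String>> result = new LinkedHashMap<>();
            for (int i = 0; i < phones.getLength(); i++) {
                Element phone = (Element) phones.item(i);
                // Walk one level up: the parent of a PhoneNumber is its Person.
                Element owner = (Element) phone.getParentNode();
                result.computeIfAbsent(owner.getAttribute("name"), k -> new ArrayList<>())
                      .add(phone.getTextContent());
            }
            return result;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        phonesByOwner(SAMPLE_XML).forEach(
            (owner, numbers) -> System.out.println(owner + ": " + numbers));
    }
}
```

A flat file can carry the same data, for instance with one line per phone number that repeats the person's key, but that convention has to be written into the file spec rather than being visible in the structure.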
 
Chris Uppal

Bent said:
XML may be a resource hog, it may be absolutely preposterous from an
information theory standpoint and it may have accumulated a shedload
of idiosyncrasies over time, but it does help keep technology and
protocols accessible to hobbyists and starting programmers.

Thing is, I doubt whether that is true. There seems to be an XML mindset that
can be summed up as "don't reinvent the wheel" (to be charitable) or "use
pre-existing work, no matter how complex it is" (to be uncharitable). XML
itself inherits all sorts of unwanted complexity from SGML. The applications
of XML tend to want to use XML as metadata. Then they start to define stuff in
terms of other XML languages, or using other XML languages. The end result is
/seriously/ complicated.

In my opinion there's a badly inverted pyramid at work. In normal situations
you build more complex systems on less complex ones. That doesn't seem to
apply to the XML world. It builds complex systems on top of even more complex
systems.

A year or two back, I got the idea that RDF would be suitable for a very small
project of mine. So I started looking into RDF. It was a private project and
I didn't want to spend more than a few days coding. By the time I realised
that I wasn't going to find the bottom of the RDF tar-pit, I'd already spent
that "few days"...

"Proper" XML (i.e. used semantically, not just as an unbelievably clunky and
ineffective file format) is not -- in my very limited experience -- accessible
to someone who can't spend /lots/ of time on it upfront, and continue to spend
lots of time on it thereafter.

-- chris
 
Steve Wampler

Joe said:
I do agree with you on this one! XML is definitely not a catch-all
solution for every problem. Using it to send 30 million records is
probably not a good use for it.

But, you are being too harsh on XML, accusing people of using it
because it's "the cool thing" or because it's new and a novelty (both
of which are false, by the way).

Agreed, but if I have to sit through one more PowerPoint presentation
where the presenter throws up a bunch of slides full of XML as if that's
conveying useful information to the audience, I'm going to scream. XML
is *NOT* the appropriate tool for this! There are *much better* ways to
present data to humans and these presenters are clearly showing XML
because it's "the cool thing" - or immensely lazy - which makes me
wonder about the quality of the code they write...
 
