XML Not good for Big Files (vs Flat Files)

Homer

I am a little bit tired of the obsession people have with XML and XML
technology. Please share your thoughts and let me know if I am thinking
about this the wrong way. I believe some people are overusing XML all over
the place. Nowadays the Canadian Government is pushing XML on its
organizations as the standard for data/file transfer. Huge files moving
between companies now include tons of XML tags repeated all over the file,
slowing down networks and crashing applications because of their size.

I am not objecting to the whole technology. I know the advantages of XML
and use it all the time for config files and our web-oriented
applications, but using it as the standard for moving big files is going
too far. Here is an example:

John,Smith,5555555,37 Finch Ave.

Is now:

<FirstName>John</FirstName>
<LastName>Smith</LastName>
<PhoneNum>5555555</PhoneNum>
<Address>37 Finch Ave.</Address>

And Tags are repeating and repeating:

<FirstName>....</FirstName>
<LastName>....</LastName>
<PhoneNum>....</PhoneNum>
<Address>....</Address>

<FirstName>....</FirstName>
<LastName>....</LastName>
<PhoneNum>....</PhoneNum>
<Address>....</Address>
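
A rough way to put a number on the overhead for the sample record is to count bytes. This is only a minimal sketch in Python, assuming UTF-8 encoding and one element per line; it illustrates the arithmetic and is not a measurement of any real transfer file:

```python
# Byte counts for the single sample record shown in this post.
flat = "John,Smith,5555555,37 Finch Ave.\n"
tagged = (
    "<FirstName>John</FirstName>\n"
    "<LastName>Smith</LastName>\n"
    "<PhoneNum>5555555</PhoneNum>\n"
    "<Address>37 Finch Ave.</Address>\n"
)

flat_size = len(flat.encode("utf-8"))
tagged_size = len(tagged.encode("utf-8"))
ratio = tagged_size / flat_size

print(f"flat: {flat_size} bytes, tagged: {tagged_size} bytes, ratio: {ratio:.1f}x")
```

With short field values like these, the tag names dominate the size; longer values would bring the ratio down.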


Please let me know what you think.


Regards,

Homer
 
James McGill

And Tags are repeating and repeating:

XML markup does tend to bloat the data.

I personally believe you should use serializable objects that can be
represented according to an XML schema when that's appropriate, but that
also can be serialized into a tightly packed format when that is
appropriate as well. So I should be able to marshal/unmarshal the
serialized object to and from XML, but I should also be able to stream
that object without marshalling it -- and the other end should be able
to unmarshal to xml, validate according to the schema, etc.

Likewise, database bindings should be informed by the xml schema, but
the XML markup shouldn't be what you store in the db.
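
One way to read this idea is a single object with two interchangeable serializations. The following is a minimal sketch in Python, not a definitive design: the `Person` class, the `<Person>` wrapper element, and the length-prefixed packed layout are all assumptions made for illustration.

```python
import struct
import xml.etree.ElementTree as ET
from dataclasses import dataclass

# Element names follow the example in this thread; the <Person> wrapper is assumed.
FIELDS = (("first_name", "FirstName"), ("last_name", "LastName"),
          ("phone", "PhoneNum"), ("address", "Address"))

@dataclass
class Person:
    first_name: str
    last_name: str
    phone: str
    address: str

    def to_xml(self) -> bytes:
        root = ET.Element("Person")
        for attr, tag in FIELDS:
            ET.SubElement(root, tag).text = getattr(self, attr)
        return ET.tostring(root)

    @classmethod
    def from_xml(cls, data: bytes) -> "Person":
        root = ET.fromstring(data)
        return cls(*(root.findtext(tag) for _, tag in FIELDS))

    def to_packed(self) -> bytes:
        # Each field: 2-byte big-endian length, then the UTF-8 bytes.
        out = b""
        for attr, _ in FIELDS:
            raw = getattr(self, attr).encode("utf-8")
            out += struct.pack(">H", len(raw)) + raw
        return out

    @classmethod
    def from_packed(cls, data: bytes) -> "Person":
        values, pos = [], 0
        for _ in FIELDS:
            (length,) = struct.unpack_from(">H", data, pos)
            pos += 2
            values.append(data[pos:pos + length].decode("utf-8"))
            pos += length
        return cls(*values)

person = Person("John", "Smith", "5555555", "37 Finch Ave.")
print(len(person.to_xml()), "bytes as XML;", len(person.to_packed()), "bytes packed")
```

The receiving end can unpack the compact form and still call `to_xml()` when a schema-checked XML view is needed.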
 
mtp

Homer said:
I am a little bit tired of the obsession people have with XML and XML
technology. Please share your thoughts and let me know if I am thinking
about this the wrong way. I believe some people are overusing XML all over
the place. Nowadays the Canadian Government is pushing XML on its
organizations as the standard for data/file transfer. Huge files moving
between companies now include tons of XML tags repeated all over the file,
slowing down networks and crashing applications because of their size.

You can use indexing, binary XML, or compression.

Maybe one of the computing service providers wanted more money for its
services on this big project?

Maybe everybody thinks "newer is better"?
 
cherukan

Yes, that does seem like a network killer. It depends on what the
intended use of the file is on the other end. If the client receiving
it *has to* use XML, certain optimizations can be done for
just the transfer part...

<header>
<firstName>A15</firstName>
<lastName>A15</lastName>
<phone>A10</phone>
<address>A10</address>
</header>
<data>
<![CDATA[
fixed-width data goes here
]]>
</data>

OR

<header>
<fieldSeparator>;</fieldSeparator>
<field>firstName</field>
<field>lastName</field>
<field>phone</field>
<field>address</field>
</header>
<data>
<![CDATA[
delimited data goes here
]]>
</data>

OR a combination of the above.

In short, XML should be preferred only if documentation and
discoverability are more important than performance.
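
On the receiving side, the delimited variant could be read roughly as follows. This is a sketch under assumptions: the `<file>` root element and the two sample rows are invented for illustration, and the data block uses standard `<![CDATA[ ... ]]>` syntax.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: small XML header, bulk data as delimited text.
DOC = """<file>
<header>
<fieldSeparator>;</fieldSeparator>
<field>firstName</field>
<field>lastName</field>
<field>phone</field>
<field>address</field>
</header>
<data><![CDATA[
John;Smith;5555555;37 Finch Ave.
Jane;Doe;5551234;12 Main St.
]]></data>
</file>"""

def parse_hybrid(doc):
    """Return one dict per data line, keyed by the field names in the header."""
    root = ET.fromstring(doc)
    header = root.find("header")
    separator = header.findtext("fieldSeparator")
    fields = [field.text for field in header.findall("field")]
    rows = []
    for line in root.findtext("data").splitlines():
        if line.strip():  # skip the blank lines around the CDATA body
            rows.append(dict(zip(fields, line.split(separator))))
    return rows

records = parse_hybrid(DOC)
print(records[0])
```

The tag names appear once in the header instead of once per record, so the per-record cost is the same as a flat file.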
 
RC

Homer wrote:

Please let me know what you think.

XML was never designed to replace a database server.

You can use an XML file to transfer a portion of the data
from a database,
e.g.

SELECT lastname, firstname, phonenumber, address
FROM phonebook
WHERE state = 'NY' AND city = 'somewhere';

With a flat file like this:

William|John|12345678|84 5th Ave

I don't know which column is the last name and which is the first name.
Is the 3rd column a person ID or a phone number?

You need to let the programmers know which column is which.

If someone later changes the flat file format to

85 5th Ave|John|William|12345678

then your database will be incorrect after the next update.
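
The failure mode can be sketched in a few lines of Python. The positional parser below accepts the reordered line without any error and simply stores the wrong values. The second half assumes a header row naming the columns, which is not part of the flat file shown here, to show how named fields survive a reordering.

```python
import csv
import io

def parse_positional(line):
    # Column order is hard-coded: last name, first name, phone, address.
    last, first, phone, address = line.split("|")
    return {"last": last, "first": first, "phone": phone, "address": address}

old_format = "William|John|12345678|84 5th Ave"
new_format = "85 5th Ave|John|William|12345678"

good = parse_positional(old_format)
bad = parse_positional(new_format)  # no exception, but the fields are scrambled
print(bad["last"])                  # the address ends up in the last-name field

def parse_named(text):
    # Assumption: the first line names the columns.
    return list(csv.DictReader(io.StringIO(text), delimiter="|"))

reordered = "address|first|last|phone\n85 5th Ave|John|William|12345678\n"
rows = parse_named(reordered)
print(rows[0]["last"])
```

XML tags play the same role as the header row here, only repeated on every record.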


True, XML creates large files.
But it makes our lives easier.

You can make up your own tags,
<lastName> or <Last_Name>, etc.
The tags can be in English, Spanish, French, Russian, Japanese, etc.
 
James McGill

OR a combination of the above.

You're almost touching on the big problem: Misconception of what it
means to be "standard".

XML has (several) standardized markup frameworks, but it is silent as to
content or utilization. It is ridiculous for a government entity to
demand that "XML" be "the standard" for data interchange. They need to
bless certain schemas if that's their goal, but it also needs to be
abstract enough that systems can be designed efficiently.

In your examples, the designers can claim that they are using "XML", and
therefore "are standardized" on it, but the three examples we've seen so
far are not at all interchangeable...
 
Timbo

Homer said:
John,Smith,5555555,37 Finch Ave.

Is now:

<FirstName>John</FirstName>
<LastName>Smith</LastName>
<PhoneNum>5555555</PhoneNum>
<Address>37 Finch Ave.</Address>

It's true that the XML data in your example is bulky, but what it
has that the unstructured version doesn't have is meta-level information,
such as "John" being someone's first name. If the parties involved
(i.e. the sender and receiver of this information) have an
agreement as to the meaning of "FirstName", then they are sharing
more than just text... it has some implicit meaning. If you send
it unstructured, then the receiver has to know how to parse the
data into this agreed meaning, which means it needs to know the
format of the data.

Then, on the other hand, if the data is just stored in a database
or something with no definition of what the tags mean, then I
agree with you... using XML is of little use.
 
Oliver Wong

If your complaint is file size during network transfer, compress the
file before sending it.

If your complaint is file size during parsing, use SAX instead of DOM,
and don't keep the whole file in memory at once.
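
As an illustration of the SAX approach, here is a minimal sketch using Python's standard `xml.sax` module. The `<People>` and `<Record>` wrapper elements are assumptions, since the fragments in this thread have no enclosing elements. The handler only ever holds one record in memory.

```python
import xml.sax

FIELD_TAGS = ("FirstName", "LastName", "PhoneNum", "Address")

class PhoneBookHandler(xml.sax.ContentHandler):
    """Counts records while keeping only the current one in memory."""

    def __init__(self):
        super().__init__()
        self.record_count = 0
        self.current = {}
        self._field = None
        self._text = []

    def startElement(self, name, attrs):
        if name == "Record":
            self.current = {}
        elif name in FIELD_TAGS:
            self._field = name
            self._text = []

    def characters(self, content):
        # May be called several times per element, so collect the pieces.
        if self._field is not None:
            self._text.append(content)

    def endElement(self, name):
        if name == self._field:
            self.current[name] = "".join(self._text)
            self._field = None
        elif name == "Record":
            self.record_count += 1  # a real job would process self.current here

sample = b"""<People>
<Record><FirstName>John</FirstName><LastName>Smith</LastName>
<PhoneNum>5555555</PhoneNum><Address>37 Finch Ave.</Address></Record>
<Record><FirstName>Jane</FirstName><LastName>Doe</LastName>
<PhoneNum>5551234</PhoneNum><Address>12 Main St.</Address></Record>
</People>"""

handler = PhoneBookHandler()
xml.sax.parseString(sample, handler)
print(handler.record_count, "records")
```

For a large file on disk, `xml.sax.parse(path, handler)` streams the input in the same way; the path would be whatever file you received.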

Use the right tool for the job. If you've got a better tool than XML for
whatever problem you're trying to solve, then use it. But if the problem is
"The government requires me to use XML", then I can't think of a better tool
than XML to solve that particular problem (except maybe emigration ;)).

- Oliver
 
Lasse Reichstein Nielsen

Homer said:
I am a little bit tired of this obsession people have with XML and XML
technology.

Hear hear!
Seems some people think XML is the solution to all problems.
I'd rather classify it as the lowest common denominator for exchanging
tree-structured data - and definitely not something fit for humans to
read or write directly.

John,Smith,5555555,37 Finch Ave.

Is now:

<FirstName>John</FirstName>
<LastName>Smith</LastName>
<PhoneNum>5555555</PhoneNum>
<Address>37 Finch Ave.</Address>

And Tags are repeating and repeating:
Please let me know what you think.

Apart from what everybody else has said, zipping such a file
should yield a *very* high compression factor.
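
The claim is easy to check with a small experiment. The following is a sketch in Python using the standard `gzip` module; the 1000 generated records are synthetic and very regular, so real data would likely compress less well than this.

```python
import gzip

TAGS = ("FirstName", "LastName", "PhoneNum", "Address")

def make_rows(count):
    # Synthetic records; real names and addresses would be less regular.
    return [(f"First{i}", f"Last{i}", str(5550000 + i), f"{i} Finch Ave.")
            for i in range(count)]

def as_flat(rows):
    return "".join(",".join(row) + "\n" for row in rows).encode("utf-8")

def as_xml(rows):
    lines = []
    for row in rows:
        for tag, value in zip(TAGS, row):
            lines.append(f"<{tag}>{value}</{tag}>\n")
    return "".join(lines).encode("utf-8")

rows = make_rows(1000)
flat, tagged = as_flat(rows), as_xml(rows)
flat_gz, tagged_gz = gzip.compress(flat), gzip.compress(tagged)

for label, raw, packed in (("flat", flat, flat_gz), ("xml", tagged, tagged_gz)):
    print(f"{label}: {len(raw)} bytes raw, {len(packed)} bytes gzipped")
```

The repeated tag names are exactly the kind of redundancy the compressor removes, so the gap between the two formats shrinks considerably after compression.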

/L
 
Joe Attardi

John,Smith,5555555,37 Finch Ave.
Is now:

<FirstName>John</FirstName>
<LastName>Smith</LastName>
<PhoneNum>5555555</PhoneNum>
<Address>37 Finch Ave.</Address>

Yes, but now we know what all the data means. Your example is quite
clear, but what about this one:

Lawrence,David,Maynard,MA

Could mean several things:
(1) Lawrence David lives in Maynard, MA.
(2) David Lawrence lives in Maynard, MA.
(3) David Maynard lives in Lawrence, MA.
(4) Maynard David lives in Lawrence, MA.
etc. You see where I'm going with this.

Whereas
<FirstName>Lawrence</FirstName>
<LastName>David</LastName>
<City>Maynard</City>
<State>MA</State>

leaves no question.

Yes, we as humans know intuitively that city and state go together. But
for an application using this data, there has to be some specification
defined and all systems that use it must be aware of it.
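
For an application, the practical effect is that fields are looked up by name rather than by position. A minimal sketch in Python, assuming a `<Person>` wrapper element that is not in the fragment above:

```python
import xml.etree.ElementTree as ET

def read_person(xml_text):
    """Map each child element's tag to its text, whatever order they come in."""
    root = ET.fromstring(xml_text)
    return {child.tag: child.text for child in root}

original = (
    "<Person><FirstName>Lawrence</FirstName><LastName>David</LastName>"
    "<City>Maynard</City><State>MA</State></Person>"
)
reordered = (
    "<Person><State>MA</State><City>Maynard</City>"
    "<LastName>David</LastName><FirstName>Lawrence</FirstName></Person>"
)

first = read_person(original)
second = read_person(reordered)
print(first["City"], second["City"])
```

Both documents yield the same mapping, so a sender that reorders the elements does not break this reader.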
 
Homer

I guess these responses are proving my point. You all know that the
best solution for transferring huge files between two parties is a simple
flat file, in a format that both sender and receiver have agreed upon, sent
over a secure line. But you still defend adding tons of tags to a file
whose format both sender and receiver already know. I believe lots
of people are using XML because it's cool and new. And these people
give advice to companies and organizations.

Some points about your suggestions:

1- Marshalling/Object Stream: Too advanced for places like government.
2- Mixed XML/Raw Data: Then what is the point of having XML at the
top? Unless you are sending the file to an unknown place that doesn't
know what it is getting.
3- Compression: There is no good standard for compression (Unix is not
really ZIP friendly unless you add some open source or buy a Zip product)
and the mainframe is another story. Even for Windows you need to buy the
product (or use open source, which most companies don't like). Also, why
triple the file size and then compress it?


Let me give you another example of coolness (sorry, it's a bit off
topic, but it's about coolness):

I got a job at a telecommunications company (cell phones) to convert their
code from C to C++ because OO was so cool in those days, even though the
application was working with no problems.
I did my job, converting the code and building a class library for one year,
and left the company.

One year later they hired a bunch of other people to come and convert the
whole thing to Java because Java was the Best.

3 years later they hired me again to convert everything again to J2EE
because J2EE is (guess what) the Best.


Regards,

Homer
 
James McGill

I believe lots
of people are using XML because it's cool and new.

It's anything but "cool". And as for it being "new", XML isn't old
enough to vote, but SGML is. If you aren't seeing the benefits of
logical structure and validation, standardized processing, etc.,
that may be because you aren't exploiting those things in your
application.

One of your complaints is directly counter to an explicit design goal,
from the beginning of the XML spec: "Terseness in XML markup is of
minimal importance."

XML markup is deliberately intended to favor clarity over conciseness.

But most of your complaint seems to derive from the fact that you work
in a bureaucratic government situation, where you have no authority to
make decisions, and where there is a limited backchannel for your
recommendations. That is unfortunate, but isn't it a choice you made
when you went to work for a government?

I've always been led to believe that the Canadian government is a
prototype of efficiency and reason, one that should make Americans feel
ashamed. Are you suggesting that it too may be clogged with
bureaucratic nonsense? I would be shocked to hear that!
 
Homer

Very good guess, but no, I don't work for the government. All I am saying
is that in these cases sender and receiver both know the file format by
heart. They know and their applications know. That's how they were
moving files in the past, and if they want to establish a new file transfer
they will let each other know about the upcoming file format for sure.
There is no reason to send the file format along with each file every
time they have a file transfer (unless you wear a name tag in your
home so your family knows your name).
 
James McGill

All I am saying
is in these cases sender and receiver both knows the file format by
heart. They know and their application knows.

The interesting thing with XML is that in its case, the *document*
knows. In a well-designed system, the DTD can change and applications
can cope.

There is no reason to send the file format along with each file every
time they have a file transfer

But you aren't sending the file format. You're sending a notice with a
URI that locates the format (schema, DTD, etc.), and then sending data
that's marked up according to that format.

(unless you are wearing name tag in your
home so your family know your name).

Or like wearing a badge at a workplace, perhaps?
 
Martin Gregorie

Homer said:
I guess these responses are proving of my point. You know all that the
best solution for transferring huge files between two parties is simple
flat file that both sender/receiver have agreed upon file format and
using secure line. But you still defend adding tons of tags to a file
that both sender/receiver are familiar with the format. I believe lots
of people are using XML because it's cool and new. And these people
give advise to companies and organizations.

Here's another thought: use ASN.1 encoding. Have a look here
<http://asn1.elibel.tm.fr/> if you haven't heard of it.

It does virtually everything XML does in terms of tagged fields and the
ability to completely omit optional fields and structures, but it uses
binary tags and can encapsulate binary data. Like XML, you can take a
data description (written in BNF notation) and use it to generate file
encoders and decoders, or you can write fast interpretive decoders (as I
have). It's a standard in the telecoms industry, where it's routinely used
to transfer multi-megabyte files as well as individual short messages.
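
To give a feel for binary tagged fields, here is a simplified tag-length-value sketch in Python. It is only loosely modeled on the layout used by ASN.1 encodings and is not a conforming encoder: the tag numbers are hypothetical, and both tags and lengths are limited to a single byte.

```python
# Hypothetical tag numbers for the fields used in this thread.
TAG_FIRST_NAME, TAG_LAST_NAME, TAG_PHONE, TAG_ADDRESS = 1, 2, 3, 4

def encode(fields):
    """Encode {tag: text} as tag byte, length byte, then the UTF-8 value."""
    out = bytearray()
    for tag, value in fields.items():
        raw = value.encode("utf-8")
        if len(raw) > 127:
            raise ValueError("this sketch only handles values up to 127 bytes")
        out += bytes([tag, len(raw)]) + raw
    return bytes(out)

def decode(data):
    """Walk the buffer field by field; absent optional fields are simply missing."""
    fields, pos = {}, 0
    while pos < len(data):
        tag, length = data[pos], data[pos + 1]
        fields[tag] = data[pos + 2:pos + 2 + length].decode("utf-8")
        pos += 2 + length
    return fields

# The optional phone field is omitted entirely and costs zero bytes.
record = {TAG_FIRST_NAME: "John", TAG_LAST_NAME: "Smith", TAG_ADDRESS: "37 Finch Ave."}
packed = encode(record)
print(len(packed), "bytes")
```

Each field costs two bytes of overhead here instead of a pair of textual tags, while the reader can still tell which field is which.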

Java ASN.1 schema compilers are available.

Translating a file between ASN.1 and XML should be a doddle: the site I
mentioned has a tool for doing just that.
 
Joe Attardi

Homer said:
I believe lots of people are using XML because it's cool and new. And these people
give advise to companies and organizations.

XML isn't new. It's been around almost ten years. The first working
draft for the XML spec was put together in November of 1996.

3- Compression: There is no good standard for compression (Unix is not
really ZIP friendly unless you add some opensource or buy Zip product)

Gzip? In fact IIRC, the gzip algorithm takes advantage of strings that
are repeated over and over (like the tag names), which helps its
compression.

(or use open source that most companies don't like).

That most companies don't like? I don't think you researched this much
before making this statement. Look how many of the huge players (Sun,
IBM, etc.) have strong support for open source. In addition, open
source is being adopted all over the place.

Let me give you another example of coolness (sorry, it's a bit off
the topic but it's about coolness):

It's not just because XML is "the cool thing". It's perfectly suited
for the exchange of data like this. The data describes itself!
 
Monique Y. Mudama

I guess these responses are proving of my point. You know all that
the best solution for transferring huge files between two parties is
simple flat file that both sender/receiver have agreed upon file
format and using secure line. But you still defend adding tons of
tags to a file that both sender/receiver are familiar with the
format.

I guess that you are wrong. I guess that the word "best" is meaningless
unless it is qualified by something. If you want a format that is best
at clarity, then flat files lose. I guess that you don't really
understand when to use XML, and that it doesn't really matter because
you don't have the authority to change things in the environment in
which it's causing you trouble, so you've developed a grudge against
XML rather than against whoever decided to use it inappropriately or
whoever decided to create an excessively verbose schema.

I believe lots of people are using XML because it's cool and
new. And these people give advise to companies and organizations.

XML isn't new enough to offer the glamour factor you think it has.
 
