XML Not good for Big Files (vs Flat Files)

Oliver Wong

Joe Attardi said:
Yes but, now we know what all the data means. Your example is quite
clear, but what about this one:

Lawrence,David,Maynard,MA

Ah, obviously a list of 4 arbitrary strings, i.e. (in SQL terms):

CREATE TABLE foo (
bar VARCHAR(255)
);

INSERT INTO foo VALUES ('Lawrence'),('David'),('Maynard'),('MA');

Could mean several things:
(1) Lawrence David lives in Maynard, MA.

Oops, okay, it's one record. Well, maybe it means:

Lawrence D. Maynard, who has a Masters in Arts. (Or perhaps it uses last
name first, i.e. David M. Lawrence, Masters in Arts).

Or maybe (s)he's a Medical Assistant? Or (s)he lives in Madagascar?
(2) David Lawrence lives in Maynard, MA
(3) David Maynard lives in Lawrence, MA
(4) Maynard David lives in Lawrence, MA
etc. You see where I'm going with this.

Hmm, looks like I was way off... Not being an American, I am not
familiar with American city names, nor American State abbreviations. If only
you had used XML!

- Oliver
 
Steve Wampler

Oliver said:
Hmm, looks like I was way off... Not being an American, I am not
familiar with American city names, nor American State abbreviations. If
only you had used XML!

No problem:

<f1>John</f1>
<f2>Smith</f2>
<f3>5555555</f3>
<f4>37 Finch Ave.</f4>

There, that should make people happy :)
(Of course, given this group, maybe the tags should be in Klingon...)
 
Roedy Green

<FirstName>John</FirstName>
<LastName>Smith</LastName>
<PhoneNum>5555555</PhoneNum>
<Address>37 Finch Ave.</Address>

Canada has a population of some 30 million. We are talking some
fairly fat files. Not ones you feed to Winzip.
 
Roedy Green

<FirstName>Lawrence</FirstName>
<LastName>David</LastName>
<City>Maynard</City>
<State>MA</State>
When you are transferring 30 million records, the level of detail you
need to specify is much deeper than that. The tags alone are not
really telling you anything important.

For an example, have a look at the spec of the tape of postal codes the
government puts out. There is a HUGE amount of information other
than just the field names you need to interpret the tape.
 
Roedy Green

John,Smith,5555555,37 Finch Ave.

There are now ways, given an XML schema, to create the equivalent binary
ASN.1 that can be decoded up to 100 times faster than the original
XML. Given the incompetence of the W3C in designing XML, I would not
entrust them to produce a binary equivalent. Let's just stick with
ASN.1. Unless it had built-in dictionary compression, it is not going
to be sufficiently better than ASN.1 to warrant a competing format.


http://asn1.elibel.tm.fr/xml/#schema-mapping

This standardized mapping takes as input any schema written in XML
Schema and produces an ASN.1 module containing a set of type
definitions in such a way that there is a one-to-one correspondence
between ASN.1 abstract values and valid XML instances.
ASN.1 standardized encoding rules such as DER (a canonical encoding
that allows digital signatures and encryption) or PER (to very
efficiently transmit data over a radio channel), or even specific
encoding rules that are described in ECN, can then be used.
One big benefit of using a binary encoding is speed. Decoding a binary
stream improves performance by a factor 100 or more. Another benefit
is size: a binary encoding may save up to 80% or even more relative to
corresponding XML text.
 
Alex Hunsley

RC said:
Homer wrote:



XML was never designed to replace a database server.

You can use an XML file to transfer a portion of the data
from a database.
i.e.

SELECT lastname,firstname,phonenumber,address
FROM phonebook
WHERE state = 'NY' AND city = 'somewhere';

A flat file like this

William|John|12345678|84 5th Ave

I don't know which column is the last name and which is the first name.
Is the 3rd column a person ID or a phone number?

That's what a header field would be for.
You need to let the programmers know which column is which.

Next time, if someone changes the flat file format to

85 5th Ave|John|William|12345678

then your database will be incorrect after the update.

Presumably the header field will reflect the change.
Yeah, it's an extra thing to go wrong, admittedly...
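
As a rough illustration of the header-field idea, here is a minimal Python sketch. The column names, the sample rows, and the `read_records` helper are assumptions made up for this example, not anything posted in the thread. It reads pipe-delimited text whose first line names the columns, so fields are looked up by name rather than by position.

```python
import csv
import io

# Hypothetical sample data: the same record in two column orders,
# each preceded by a header row that names the columns.
OLD_LAYOUT = (
    "lastname|firstname|phonenumber|address\n"
    "William|John|12345678|84 5th Ave\n"
)
NEW_LAYOUT = (
    "address|firstname|lastname|phonenumber\n"
    "84 5th Ave|John|William|12345678\n"
)


def read_records(text):
    """Parse pipe-delimited text, keying each field by its header name."""
    reader = csv.DictReader(io.StringIO(text), delimiter="|")
    return [dict(row) for row in reader]


old_rows = read_records(OLD_LAYOUT)
new_rows = read_records(NEW_LAYOUT)

# Fields are addressed by name, so reordering the columns does not
# change what the consumer sees.
print(old_rows == new_rows)
print(new_rows[0]["lastname"])
```

This only helps if the sender keeps the header in step with the data, which is the "extra thing to go wrong" mentioned above.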
 
James McGill

I meant to say BINARY XML is still in the "it might be a good idea"
stage.

In your world, the scenario of routinely "moving 30 million records"
might be more common than it is for others.

XML turns out to be quite a good fit for many situations. It's probably
totally inappropriate for the one the OP was complaining about, of
course.
 
Monique Y. Mudama

Presumably the header field will reflect the change. Yeah, it's an
extra thing to go wrong, admittedly...

Yeah ... the markup format is nice if partial data is considered
better than no data at all ...
 
Monique Y. Mudama

There are now ways, given an XML schema, to create the equivalent
binary ASN.1 that can be decoded up to 100 times faster than the
original XML. Given the incompetence of the W3C in designing XML,
I would not entrust them to produce a binary equivalent. Let's just
stick with ASN.1. Unless it had built-in dictionary compression, it
is not going to be sufficiently better than ASN.1 to warrant a
competing format.

Except that, apparently, it's not terribly well known or supported.
That does make a difference. One of the selling points of XML is that
it can allow diverse groups to share data.
 
Monique Y. Mudama

What part of the world does "doddle" derive from? It just means
"easy"?

I had a mental image of a toddler, er, toddling along. No idea if
that's actually what was meant. In the context of my brain, it meant
"so easy a toddler could do it."
 
Jon Martin Solaas

Homer said:
Very good guess but no, I don't work for government. All I am saying
is in these cases sender and receiver both know the file format by
heart. They know and their application knows. That's how they were
moving files in the past and if they want to establish a new file transfer
they will let each other know about the upcoming file format for sure.
There is no reason to send the file format along with each file every
time they have a file transfer (unless you are wearing a name tag in your
home so your family knows your name).

Of course, but in other cases, when the file format has to be
communicated, nobody knows it by heart, the data needs to be
hierarchical, the receiver needs to validate and perhaps transform to
another format, not to mention implement the apps to do so, XML
is useful. When a new file format is to be used, XSD comes in handy, and
also allows for automatic validation. In many organisations
misunderstandings occur, bugs are made and so on, so validation is nice.

XML was cool when I was a student 10 years ago. Now it's just convenient.

Maybe you should get out more. It's the people outside that don't know
your name :)
 
Jon Martin Solaas

Roedy said:
Canada has a population of some 30 million. We are talking some
fairly fat files. Not ones you feed to Winzip.

Why would anyone want to apply compression manually? Automate the rest
of the process and then use WinZip? It's hardly likely that the database
with all those records runs on a platform that can run WinZip :)

Also, isn't it likely that the file would be split up?
 
Dag Sunde

Monique Y. Mudama said:
Except that, apparently, it's not terribly well known or supported.
That does make a difference. One of the selling points of XML is that
it can allow diverse groups to share data.

Not terribly well known at all...

Are there parsers or en-/decoders for VB, Python, JavaScript and all the
other languages I frequently have to use to interpret data from other
systems?

When organizations like governments choose XML for data exchange, they don't
do it for the "coolness factor", but because they have the need to publish
data to 3rd parties not involved in the spec. at all.

I am frequently given the task of importing some government/large company
data into one app or another, and am very grateful each time I'm given
an XML format with (important!) a proper Schema file or DTD. With the
schema/DTD, I can make sure the data is valid and well formed, and
I can even automatically adapt to changes.
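
To make the "valid and well formed" distinction concrete, here is a small Python sketch using only the standard library. It is not real Schema or DTD validation, which needs a validating parser that the standard library does not provide. The `REQUIRED_FIELDS` tuple, the `check_document` helper, and the `<People>`/`<Person>` element names are hypothetical stand-ins for what a schema would declare.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for a schema: the child elements every
# <Person> is assumed to require.
REQUIRED_FIELDS = ("FirstName", "LastName", "City", "State")


def check_document(text):
    """Return a list of problems; an empty list means the checks passed."""
    try:
        # Well-formedness: the parser rejects mismatched or unclosed tags.
        root = ET.fromstring(text)
    except ET.ParseError as exc:
        return [f"not well formed: {exc}"]
    problems = []
    # A crude structural check, standing in for schema validation.
    for index, person in enumerate(root.iter("Person")):
        for field in REQUIRED_FIELDS:
            if person.find(field) is None:
                problems.append(f"Person #{index}: missing <{field}>")
    return problems


good = (
    "<People><Person><FirstName>Lawrence</FirstName>"
    "<LastName>David</LastName><City>Maynard</City>"
    "<State>MA</State></Person></People>"
)
missing_state = (
    "<People><Person><FirstName>Lawrence</FirstName>"
    "<LastName>David</LastName><City>Maynard</City></Person></People>"
)
broken = "<People><Person><FirstName>Lawrence</People>"

for name, doc in [("good", good), ("missing_state", missing_state), ("broken", broken)]:
    print(name, check_document(doc))
```

With a real schema the list of required fields would come from the published XSD or DTD instead of being hard-coded, which is what lets a consumer adapt when the publisher changes the format.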

My point is (I think :) that a government is seldom in a situation where
they have a single counterpart where they can agree upon a fixed, flat
format...
 
Peter.Kriens

Interesting, I agree with your conclusion but for opposite reasons :)

For computer-computer communications XML is quite good, though verbose.
If you have come to the scene in the last 5 years, you have no idea
how many issues there were sending files between computers. Character
encodings, format changes, field length differences, data types that were
impossible to transfer, nested data: I can assure you it was
usually hell. XML is not a good solution, but it is sufficient for
this purpose and has become the best because it has become a standard.
This has created a large market for tools that can easily interwork.
Today, when an XML file must be transformed because of version
mismatch, it is a trivial task.

The size problem is relatively easy to solve: zip it. In Java it is
trivial to zip the XML in a JAR or ZIP stream. This usually reduces
the size to 10% of the original. Obviously this trades off CPU cycles versus
bandwidth/storage so it should be used with care.
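
The size trade-off can be tried out with a short sketch. This one uses Python's `gzip` module rather than the Java ZIP streams mentioned above, and the record layout and the `flat_line`/`xml_line` helpers are made up for illustration. Synthetic records like these are far more repetitive than real data, so the ratio printed here should not be read as confirming the 10% figure.

```python
import gzip

N = 10_000  # number of hypothetical records


def flat_line(i):
    """One record as a comma-separated flat line."""
    return f"John{i},Smith{i},{5550000 + i},{i} Finch Ave.\n"


def xml_line(i):
    """The same record wrapped in descriptive tags."""
    return (
        f"<Person><FirstName>John{i}</FirstName>"
        f"<LastName>Smith{i}</LastName>"
        f"<PhoneNum>{5550000 + i}</PhoneNum>"
        f"<Address>{i} Finch Ave.</Address></Person>\n"
    )


flat = "".join(flat_line(i) for i in range(N)).encode("utf-8")
xml = (
    "<People>\n" + "".join(xml_line(i) for i in range(N)) + "</People>\n"
).encode("utf-8")

# Compress the XML and confirm it survives a round trip unchanged.
xml_gz = gzip.compress(xml)
round_trip_ok = gzip.decompress(xml_gz) == xml

print("flat bytes:       ", len(flat))
print("XML bytes:        ", len(xml))
print("gzipped XML bytes:", len(xml_gz))
print("round trip ok:    ", round_trip_ok)
```

Compression recovers much of the space the repeated tags cost, at the price of CPU time on both ends, which is the trade-off described above.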

The reason I think XML is bad is that lazy programmers have
standardized it for Human-Computer communication. Ant, Maven, WAR,
J2EE, XSLT, and too many others force humans to write XML, and we are
lousy at it. The verbosity hides the important elements making it very
difficult to understand without inspecting the code in detail. The sole
reason for this is because the programmer is too lazy (or, god forbid,
incompetent) to write a real grammar and parser for the task at hand.
The argument that we then all use the same language is wrong. XML is
used as a meta language, the real language is still effectively hidden
in its tags and attributes. Worse, often attributes introduce an
additional language (XPath, for example). This means the burden is put on
the user and not the computer, and imho that is fundamentally wrong.
I'd like my time optimized, not the computer's.
 
Chris Uppal

Steve said:
No problem:

<f1>John</f1>
<f2>Smith</f2>
<f3>5555555</f3>
<f4>37 Finch Ave.</f4>

There, that should make people happy :)

Slightly OT, but I believe that the Best Practise for handling addresses is
to just have line1, line2, line3 and so on, rather than trying to identify the
"meaning" of each line. There is much less consistency across address formats
than most programmers (or schema designers) realise. So an XML format like
yours might be the best you can (or should) do.

-- chris
 
Chris Uppal

Martin said:
Here's another thought: use ASN.1 encoding. Have a look here
<http://asn1.elibel.tm.fr/> if you haven't heard of it.

I can't understand why something as simple as data exchange (not /information/
exchange which is vastly more difficult) should require nine standards
documents which between them add up to book length. Nor why it should require
a book written about it. Why do people have to make things so /complicated/ ?

XML is, if anything, even worse.

Even YAML is way too complicated, albeit not in the same league as ASN.1 or
XML.

-- chris
 
Chris Uppal

Monique said:
XML isn't new enough to offer the glamour factor you think it has.

Remember that we are talking about a government here. Being only a decade
behind the times is damned impressive !

-- chris
 
