XML Not good for Big Files (vs Flat Files)


Oliver Wong

Chris Uppal said:
Doesn't the same thing apply to XML ?

I use XML "just for fun", in the sense that I've used it in situations
where my boss isn't paying me to use it (including the situations where I'm
my own boss). See many of my postings to this newsgroup for example. I'll
often use "xml-like" syntax to show what's Java code versus what's prose.

- Oliver
 

Oliver Wong

Chris Uppal said:
All the file formats I can think of have well-defined escape mechanisms (in
CSV, unfortunately, you have a choice of about 10 and it's difficult to be sure
that all parties are agreed on which is in use).

So to me, this means that CSV does NOT have well-defined escape
mechanisms. That is, if your requirements are "support an 'export to CSV'
functionality", it wouldn't be unusual to forbid "crazy things" appearing in
your document model (or else to just not worry about it and let bugs
creep into that functionality). With XML, that's much rarer, since a lot of
XML APIs will automatically handle the escaping for you.
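
To make "the API handles the escaping for you" concrete, here is a minimal sketch using the StAX writer that ships with the JDK (javax.xml.stream). The class name XmlEscapeDemo and the element name "field" are made up for this example; the point is only that the caller passes raw text and never writes an entity by hand.

```java
import java.io.StringWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

// Hypothetical demo class, not taken from any post in this thread.
final class XmlEscapeDemo {

    // Wraps the raw value in a <field> element ("field" is an assumed name).
    static String toXml(String value) {
        StringWriter out = new StringWriter();
        try {
            XMLStreamWriter w = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
            w.writeStartElement("field");
            w.writeCharacters(value); // the writer escapes '<' and '&' itself
            w.writeEndElement();
            w.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Raw text containing characters that are markup in XML.
        System.out.println(toXml("a < b && c"));
    }
}
```

A hand-rolled "export to CSV" has no equivalent of writeCharacters unless somebody writes one, which is roughly the difference being described.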
XML has one too. That's
hardly an advantage for XML (especially when its mechanism is so crappy).

What the world needed, but didn't get, was a well-designed, standardised[*]
escape mechanism which could be used in almost any file format....

([*] if only by convention)

Is this even possible? Wouldn't the escaping mechanism depend on what
the punctuations of the file format are?

- Oliver
 

Oliver Wong

Roedy Green said:
it is a bit tighter than XML, but it has no schemas.

The thing that replaces/augments XML needs a compact, validated, fast
to parse format.

If I were designing XML there would be a central repository of
official field names that would have validation guarantees on them as
well.

e.g. ZIP4 - with a URL of a service that would validate that a ZIP was
a legal number.

ZIP4vSTATE -- with a URL of a service/class you can download that
would validate ZIP cross state. the zip here is guaranteed to belong
to a nearby STATE field.

PHONEUS -- validates that an American phone number actually exists
in a directory.

There are names for various levels of validation guaranteed.

The biggest problem in importing data from other sources is cleaning
up the crud, or making invalid assumptions about how clean it is.
Identifying the field names or extracting the data is only about 5% of
the problem.

The problem with these "standard fields" is that they may change over
time. In Quebec (possibly in the rest of North America, but I don't know),
we're changing from a 7 digit phone number to a 10 digit phone number. That
is, the area code is now considered required, and no longer implied as being
the same as the dialer's area code. Didn't phone numbers in the US use to
contain letters in them? See http://ourwebhome.com/TENP/TENproject.html

- Oliver
 

Mark Thornton

Chris said:
Oliver Wong wrote:




Difficult to see how this is an advantage for production purposes.

Some data suppliers change their format very regularly. Using XML gives
fewer surprises of this kind and it is then easier to guess the meaning
of a change and easier to ignore irrelevant changes.

I get geographic mapping information in a variety of formats. Although
bulky, the XML based data is the easiest to use. The bulk is usually
dealt with by compression, which in the case of gzip is trivial to
handle in Java.
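
As a rough illustration of how little code the gzip step takes, here is a sketch using java.util.zip from the standard library. The class name GzipXmlDemo and the sample point data are invented for the example; a real feed would be read from a file or socket rather than built in memory.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical demo class: gzip round trip for a bulky XML string.
final class GzipXmlDemo {

    static byte[] compress(String text) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(buf)) {
            gz.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // The stream is closed here, so the gzip trailer has been written.
        return buf.toByteArray();
    }

    static String decompress(byte[] data) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(data))) {
            byte[] chunk = new byte[4096];
            int n;
            while ((n = gz.read(chunk)) != -1) {
                out.write(chunk, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Made-up, highly repetitive XML, the kind of bulk that compresses well.
        StringBuilder sb = new StringBuilder("<points>");
        for (int i = 0; i < 1000; i++) {
            sb.append("<point x=\"1\" y=\"2\"/>");
        }
        String xml = sb.append("</points>").toString();
        byte[] packed = compress(xml);
        System.out.println(xml.length() + " chars -> " + packed.length + " bytes");
        System.out.println("round trip ok: " + decompress(packed).equals(xml));
    }
}
```

In practice the GZIPInputStream would be handed straight to the XML parser, so the uncompressed document never has to exist on disk.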

Mark Thornton
 

Oliver Wong

Stefan Ram said:
Say, "x" is a rose and belongs to Jack. The assertion is:

rose( x ) ^ owner( x, "Jack" )

This is written in XML as:

<rose owner="Jack" />
[...]
If a rose has two
owners, the following element is not allowed in XML:

<rose owner="Jack" owner="Jill" />

One is made to use representations such as the following:

<rose>
<owner>Jack</owner>
<owner>Jill</owner></rose>

Here the notion "element type" suggests that it is marked that
Jack is "an owner", in the sense that "owner" is supposed to
be the type (the kind) of Jack.

The intention of the author, however, is that "owner" is
supposed to give the /relation/ to the containing element
"rose". This is the natural field of application for
attributes, as the meaning of the word "attribute" outside of
XML makes clear, but it is not possible to use them for this
purpose in XML.

How about something like:

<rose id="x" ownedBy="Jack"/>
<rose id="x" ownedBy="Jill"/>

or

<ownership owned="rose" owner="Jack"/>
<ownership owned="rose" owner="Jill"/>

or

<Person id="Jack">
<belongings>
<rose id="x"/>
<!--Possibly other stuff-->
</belongings>
</Person>
<Person id="Jill">
<belongings>
<rose id="x"/>
<!--Possibly other stuff-->
</belongings>
</Person>

depending on what exactly is the main message being conveyed (i.e. the
different XML documents here all say the same thing, but they put emphasis on
different things: the roses, the persons, or the ownership-relationships
themselves).

- Oliver
 

Monique Y. Mudama

I don't use XML myself, but someone sent me this recently and it
might give you something to think about:

http://www.developer.com/xml/article.php/10929_3583081_1

/gordon

This quote from that link seems to address the complaint a lot of
people have about XML being used:

"But, if your data is not tree shaped, XML is not appropriate. A table
of temperature measurements taken on a 2D surface is perfectly happy
existing as a series of comma-separated values with column headings
"X," "Y," and "degrees celsius.""
 

Kent Paul Dolan

Monique Y. Mudama quoted:
"But, if your data is not tree shaped, XML is not appropriate. A table
of temperature measurements taken on a 2D surface is perfectly happy
existing as a series of comma-separated values with column headings
"X," "Y," and "degrees celsius.""

Except that a 2D table is trivial to represent as a tree,
and in an XML document. The appropriate example
is some dataset whose graph is a lattice, and even
then, there are linear, therefore tree-shaped, and
therefore XML representable, notations for
representing graphs of any shape.

xanthian.
 

Oliver Wong

Kent Paul Dolan said:
Monique Y. Mudama quoted:


Except that a 2D table is trivial to represent as a tree,
and in an XML document. The appropriate example
is some dataset whose graph is a lattice, and even
then, there are linear, therefore tree-shaped, and
therefore XML representable, notations for
representing graphs of any shape.

Any form of information that can be expressed on computer (and I use
"computer" in the generic, Turing Machine, sense) can "trivially" be
represented in an XML document, e.g.:

<binaryData>00101110110110101000101</binaryData>

The question is whether XML is the "best" way to represent that data.
For tabular data without fancy characters (i.e. only alphanumerics, no need
for content escaping), I admit "CSV with headers" looks very attractive.

- Oliver
 

Roedy Green

I have, and it isn't the most pleasant of
experiences. Even though the standard
is perfectly well defined, it is simply
beyond the comprehension of programmers
who are not entirely comfortable with math
and with hex/binary representation, a big
fraction of all programmers.

And parsers are over the head of most Java programmers. But it does
not matter. You don't interact with ASN.1 at the bit level any more
than you interact with XML at the character level.

Now that the two are interconvertible, it should be possible to
create libraries where the actual format is transparent.
 

Roedy Green

Didn't phone numbers in the US use to
contain letters in them? See http://ourwebhome.com/TENP/TENproject.html

Yes. That is why you would need to have different names for fields
validated to different standards, or perhaps standards with a version
number.

One of the problems with managing such data is knowing which
transforms to apply to bring it up to date.

For example there was a major reorg here in BC when the province split
into 604 and 250 area codes. All the phone area codes had to be
updated using a table of exchange prefixes. But it was a one shot
conversion. After a while the table was no longer valid as exchanges
started to appear in both area codes.
 

Roedy Green

The question is whether XML is the "best" way to represent that data.
For tabular data without fancy characters (i.e. only alphanumerics, no need
for content escaping), I admit "CSV with headers" looks very attractive.

If you write your CSV file in UTF-8, the " convention handles
everything but control chars. In practice CSV is limited to printable
strings. It makes for a lot more readable and compact file than the
equivalent XML IF you don't need the tree structure.
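
For readers who have not met the " convention, here is a small sketch of one common variant: a field is wrapped in double quotes when it contains a comma, a quote, or a line break, and embedded quotes are doubled. The class CsvQuoteDemo and its method names are invented for this example, and other dialects (semicolon separators, single quotes) would need different rules.

```java
// Hypothetical helper illustrating the double-quote convention for CSV fields.
final class CsvQuoteDemo {

    // Quotes a single field only when it contains a character that would
    // otherwise be read as CSV punctuation.
    static String quote(String field) {
        boolean needsQuoting = field.indexOf(',') >= 0
                || field.indexOf('"') >= 0
                || field.indexOf('\n') >= 0
                || field.indexOf('\r') >= 0;
        if (!needsQuoting) {
            return field;
        }
        // Embedded quotes are doubled, then the whole field is wrapped.
        return "\"" + field.replace("\"", "\"\"") + "\"";
    }

    // Joins already-raw field values into one comma-separated line.
    static String row(String... fields) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < fields.length; i++) {
            if (i > 0) {
                sb.append(',');
            }
            sb.append(quote(fields[i]));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // prints: x,"say ""hi""","a,b"
        System.out.println(row("x", "say \"hi\"", "a,b"));
    }
}
```

Note that this variant permits line breaks inside quoted fields, which is one of the points on which CSV readers disagree.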

I wrote a program called CSVAlign that aligns them in columns,
(numbers right justified) which makes them as easy to read and
proofread as spreadsheets.

CSV has the advantage you can import it into spreadsheets or most SQL
engines.

The ugliest part of CSV is the lack of an official standard. Swedes
use semicolon instead of comma as the separator. Some SQL engines
use ' instead of ". Some schemes allow multiline fields, others do
not.

Nobody decided for certain what the comment delimiter character is, if
any.
 

Roedy Green

For example there was a major reorg here in BC when the province split
into 604 and 250 area codes. All the phone area codes had to be
updated using a table of exchange prefixes. But it was a one shot
conversion. After a while the table was no longer valid as exchanges
started to appear in both area codes.

Because that was not handled by some standard class, thousands of BC
businesses rolled their own solutions, many by the technique of
waiting for subscribers to send in updates, or manually retyping all
the phone numbers. Businesses outside BC would simply have allowed
their databases to go stale.

If you centralise data validation code, you only have to do it once
and you ensure it is done properly.
 

Stefan Ram

Oliver Wong said:
<rose id="x" ownedBy="Jack"/>
<rose id="x" ownedBy="Jill"/>

While your suggestion might be possible for Prolog-like
databases of assertions, it might be difficult to apply
it to text markup, where one actually would like to write:

<p>He met
<span class="name" class="person">Peter Miller</span> in
<span class="name" class="town">London</span>.</p>

It could be written in XML as:

<p>He met
<span id="563">Peter Miller</span> in
<span id="564">London</span>.</p>
<attribute idref="563" class="name"/>
<attribute idref="563" class="person"/>
<attribute idref="564" class="name"/>
<attribute idref="564" class="town"/>

But this looks as if it might be more difficult to maintain.

NB: If "id" was declared as an »ID attribute« in the DTD, then
<rose id="x" ownedBy="Jack"/>
<rose id="x" ownedBy="Jill"/>

might not be valid XML, because in XML »ID values must
uniquely identify the elements which bear them« is a validity
constraint. But here, »id« might be declared as an »IDREF
attribute«.
depending on what exactly is the main message being conveyed
(i.e. the different XML documents here all say the same thing,
but they put emphasis on different things: the roses, the
persons, or the ownership-relationships themselves).

... and some of these choices then will be restricted by the
restrictions of XML. For example, when one wants to put
emphasis on the roses by mapping each rose to an XML element,
some of the restrictions mentioned in my previous post apply.
 

Dag Sunde

CSV has the advantage you can import it into spreadsheets or most SQL
engines.

The ugliest part of CSV is the lack of an official standard. Swedes
use semicolon instead of comma as the separator. Some SQL engines
use ' instead of ". Some schemes allow multiline fields, others do
not.

So do Norwegians, but then we don't call it CSV, but SDV.
(Semicolon Delimited Values)...
Nobody decided for certain what the comment delimiter character is, if
any.

No, I often mix ; and | and crlf when I need a compact 3 level deep
structure. (But only internally)

:)
 

Thomas Weidenfeller

Roedy said:
This standardized mapping takes as input any schema written in XML
Schema and produces an ASN.1 module containing a set of type
definitions in such a way that there is a one-to-one correspondence
between ASN.1 abstract values and valid XML instances.
ASN.1 standardized encoding rules such as DER (a canonical encoding
that allows digital signatures and encryption) or PER (to very
efficiently transmit data over a radio channel), or even specific
encoding rules that are described in ECN, can then be used.
One big benefit of using a binary encoding is speed. Decoding a binary
stream improves performance by a factor 100 or more. Another benefit
is size: a binary encoding may save up to 80% or even more relative to
corresponding XML text.

Be careful with ASN.1. That one is not for the faint of heart. ASN.1
grammar is difficult to parse. Some ASN.1 compilers need up to twelve
passes over the code to get something out. And this often excludes
support for ASN.1 macros - which are not macros, but a feature which
changes ASN.1 syntax during compilation.

BER, the basic encoding rules are used very often, but they are also
very inefficient for a binary format. Decoding ASN.1 BER is also not
much fun. Usage of IMPLICIT in the ASN.1 often creates different data
than what one would expect.

/Thomas
 

Chris Uppal

Oliver said:
Is this even possible? Wouldn't the escaping mechanism depend on what
the punctuations of the file format are?

I don't see why not. There are several broad categories of encoding[*]
techniques.

([*] don't take the word "encoding" to imply that the format is not normally
readable.)

One simply requires that the text format is self-delimiting and that /any/ text
should be interpreted according to the rules of the encoding. So the syntax of
the context is irrelevant. E.g. a length prefix, or a strong quoting
convention like the 'xyz' strings in Unix Bourne shell and its derivatives.
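
A sketch of the length-prefix idea, under assumptions of my own choosing: the length is written in decimal, followed by a colon, followed by the text itself. The class LengthPrefixDemo and that exact layout are illustrative only. Because the decoder counts characters instead of scanning for a terminator, the payload may contain any punctuation the surrounding format uses.

```java
// Hypothetical illustration of a self-delimiting, length-prefixed string.
final class LengthPrefixDemo {

    // Layout assumed for this sketch: <decimal length> ':' <text>
    // The length counts Java chars (UTF-16 code units), not bytes.
    static String encode(String text) {
        return text.length() + ":" + text;
    }

    // Decodes the one encoded string at the start of input and ignores
    // whatever follows it, since the prefix says exactly where it ends.
    static String decode(String input) {
        int colon = input.indexOf(':');
        if (colon < 0) {
            throw new IllegalArgumentException("missing length prefix");
        }
        int length = Integer.parseInt(input.substring(0, colon));
        int start = colon + 1;
        if (length < 0 || start + length > input.length()) {
            throw new IllegalArgumentException("bad or truncated length");
        }
        return input.substring(start, start + length);
    }

    public static void main(String[] args) {
        String raw = "a,b\"c:d";
        String encoded = encode(raw);
        System.out.println(encoded);
        // Trailing data after the payload does not confuse the decoder.
        System.out.println(decode(encoded + ",next field").equals(raw));
    }
}
```

The price is that such text is awkward to edit by hand, since changing the payload means recounting the prefix.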

Another possibility is similar, but the encoding is parameterised. For instance
a C-like escape mechanism could be parameterised on
the Start character (defaults to ")
the End character (defaults to Start)
the Escape character (defaults to \)
the range of characters that need to be escaped (defaults to End and Escape
itself).
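
The parameterised scheme above could be sketched roughly as follows. The class ParamEscape is hypothetical, and for brevity it fixes the set of escaped characters to the default (End and Escape) instead of taking it as a fourth parameter.

```java
// Hypothetical sketch of a C-like escape mechanism parameterised on its
// Start, End and Escape characters.
final class ParamEscape {
    private final char start;
    private final char end;
    private final char escape;

    ParamEscape(char start, char end, char escape) {
        this.start = start;
        this.end = end;
        this.escape = escape;
    }

    // The defaults listed above: Start is ", End equals Start, Escape is \.
    static ParamEscape defaults() {
        return new ParamEscape('"', '"', '\\');
    }

    String encode(String text) {
        StringBuilder sb = new StringBuilder().append(start);
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == end || c == escape) {
                sb.append(escape); // only End and Escape need protecting
            }
            sb.append(c);
        }
        return sb.append(end).toString();
    }

    String decode(String encoded) {
        int last = encoded.length() - 1;
        if (last < 1 || encoded.charAt(0) != start || encoded.charAt(last) != end) {
            throw new IllegalArgumentException("missing delimiters");
        }
        StringBuilder sb = new StringBuilder();
        for (int i = 1; i < last; i++) {
            char c = encoded.charAt(i);
            if (c == escape) {
                i++;
                if (i >= last) {
                    throw new IllegalArgumentException("dangling escape");
                }
                c = encoded.charAt(i); // take the escaped character literally
            } else if (c == end) {
                throw new IllegalArgumentException("unescaped end character");
            }
            sb.append(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        ParamEscape cLike = ParamEscape.defaults();
        System.out.println(cLike.encode("say \"hi\""));
        // The same code serves a format with different punctuation.
        ParamEscape angle = new ParamEscape('<', '>', '%');
        System.out.println(angle.encode("a>b"));
    }
}
```

The same class can then be reused by any format that can say what its three characters are, which is the kind of shared convention being wished for here.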

Another set of possibilities are like URL-encoding or the numerical character
entities in XML/HTML (I may have the name wrong, I mean things like &#2345; but
not &amp;). In this case the mechanism is necessarily parameterised on the
surrounding format, since that determines what /has/ to be escaped.

And so on. My point is that it /could/ have been done (a "best practise" RFC
perhaps). Sad that it was not...

-- chris
 

Chris Uppal

Oliver said:
The question is whether XML is the "best" way to represent that data.
For tabular data without fancy characters (i.e. only alphanumerics, no
need for content escaping), I admit "CSV with headers" looks very
attractive.

Or even CSV without headers but with an XML description of the columns (and
applicable quoting conventions ;-).

-- chris
 

James McGill

I use XML "just for fun", in the sense that [...]

And I thought /I/ was strange !

Well, my point was that I use XML schema for things like configuring
games, communication between online game clients, the save game format,
the parameters of the model, etc. Strictly for fun. I know that ASN.1
(for example) offers some very formal grammars that happen to be
accepted as industry standards; but I am quite certain that it's
anything but a pleasant framework to design with. But I'm biased, since
pretty much all my messages are a few Kilobytes, and really, no amount
of bloat that results from the markup is going to make enough difference
that it overtakes RPC over HTTP or File IO as the limiting factor.

To be fair, the discussion of ASN.1 started in response to a proposition
to use XML for a degenerate case where it's probably not the appropriate
markup encoding to use.

Also, it's quite likely that when someone's golden hammer fails, he
might be tempted to reinvent the wheel (badly), rather than use a
different hammer for that problem. And that's why an amateur might need
to be nudged in the direction of another alternative that he might never
have heard about otherwise. I can respect that.

Now somebody is going to come out of the woodwork claiming that yacc is
fun.
 
