newbie question on how to format xml data in logical/common way

Guest

Hello. In my current web development project I want to keep some data
in XML form, but as an XML newbie I am not sure of the best way to
organize it. The data is a list of business categories (business
fields) like this:

Raw Material, Forestry and Agricultural Products, Environmental Services (category A)
+--- Raw Materials (Mining incl.) (category A01)
     +- Ores (category A0101)
     +- Coal (category A0102)
     +- Minerals, Precious Stones (category A0103)
     +- ....
+--- Food, Agriculture (category A02)
     +- ....
+--- ....
Processing Industry
+--- ....
Service and Trade
+--- ....

Basically, the data has 3 levels, each level has a code representing it,
and each string has 2 language versions (only one language version is
given in the above example). I am considering several ways to keep
this data in XML:

idea A:

<category english="Raw Material, Forestry and Agricultural Products, Environmental Services"
          chinese="原材料、林产品和农产品、环境服务">
  <category english="Raw Materials (Mining incl.)"
            chinese="原材料（包括采矿业）">
    <category id="a0101"><english>Ores</english><chinese>矿石</chinese>...</category>
    ...
  </category>
  ...
</category>

idea B:
<category lang="english" id="a">
  <name>Raw Material, Forestry and Agricultural Products, Environmental Services</name>
  <category id="a01">
    <name>Raw Materials (Mining incl.)</name>
    <category id="a0101">Ores</category>
    ....
    ....
  </category>
  .....
</category>
<category lang="english" id="b">....</category>
<category lang="english" id="c">....</category>
<category lang="chinese" id="a">....</category>
<category lang="chinese" id="b">....</category>
<category lang="chinese" id="c">....</category>

Anyway, there are many ways to store data in XML, and you can see from
my example ideas that I am almost completely in the dark about how to
organize data well in XML. Unlike with a relational database, XML
leaves us many possibilities. I also want to store the data in a format
that is logically sound in itself, rather than one that just follows the
logic of the application that reads it, so that the data still makes
sense even if the application changes in the future.

What do you think?
 
Magnus Henriksson

What about:

<categories>
  <category code="A">
    <name xml:lang="en">Raw Material, Forestry and Agricultural Products, Environmental Services</name>
    <name xml:lang="zh">原材料、林产品和农产品、环境服务</name>
    <category code="A01">
      <name xml:lang="en">Raw Materials (Mining incl.)</name>
      <name xml:lang="zh">原材料（包括采矿业）</name>
      <category code="A0101">
        <name xml:lang="en">Ores</name>
        <name xml:lang="zh">矿石</name>
      </category>
      <category code="A0102">
        <name xml:lang="en">Coal</name>
        <name xml:lang="zh">????</name>
      </category>
      .
      .
      .
    </category>
    .
    .
    .
  </category>
  <category code="B">
    .
    .
    .
  </category>
  .
  .
  .
</categories>


// Magnus
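
A minimal sketch of how the layout above could be read, assuming Python's standard xml.etree.ElementTree module (the thread does not name a language or parser). The function name category_name and the abbreviated sample data are illustrative only.

```python
import xml.etree.ElementTree as ET

# xml:lang lives in the predefined XML namespace; ElementTree exposes the
# attribute under this expanded name.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Abbreviated sample in the proposed layout (only two levels shown).
SAMPLE = """
<categories>
  <category code="A">
    <name xml:lang="en">Raw Material, Forestry and Agricultural Products, Environmental Services</name>
    <name xml:lang="zh">原材料、林产品和农产品、环境服务</name>
    <category code="A01">
      <name xml:lang="en">Raw Materials (Mining incl.)</name>
      <name xml:lang="zh">原材料（包括采矿业）</name>
    </category>
  </category>
</categories>
"""


def category_name(root, code, lang):
    """Return the name of the category with this code in this language, or None."""
    for category in root.iter("category"):
        if category.get("code") == code:
            # findall("name") only sees direct children, so names of nested
            # subcategories are not picked up by mistake.
            for name in category.findall("name"):
                if name.get(XML_LANG) == lang:
                    return name.text
    return None


root = ET.fromstring(SAMPLE)
print(category_name(root, "A01", "en"))  # Raw Materials (Mining incl.)
```

Because both language versions sit side by side under one category, adding a third language would only mean adding another name element per category; the hierarchy itself stays untouched.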
 
p.lepin

Magnus said:
What about:

<categories>
  <category code="A">
    <name xml:lang="en">Raw Material, Forestry and Agricultural Products, Environmental Services</name>
    <name xml:lang="zh">原材料、林产品和农产品、环境服务</name>
    <category code="A01">
      <name xml:lang="en">Raw Materials (Mining incl.)</name>
      <name xml:lang="zh">原材料（包括采矿业）</name>
      <category code="A0101">
        <name xml:lang="en">Ores</name>
        <name xml:lang="zh">矿石</name>
      </category>
      <category code="A0102">
        <name xml:lang="en">Coal</name>
        <name xml:lang="zh">????</name>
      </category>
    </category>
  </category>
</categories>

What's the point in keeping the category codes in XML? Even
if the OP oversimplified a bit, and actual codes might be
a bit more inconsistent than what we're seeing in this
example, there's definitely no need to store the *full*
category code for each subcategory -- that looks like
unnecessary information and a possible cause for data
integrity issues. This looks better to me:

<categories>
  <category code="A">
    <category code="01">
      <category code="01"/>
      <category code="02"/>
    </category>
  </category>
</categories>

Use 'real' IDs to refer to elements if that's desirable
(codes are to be human-readable as far as I can tell).

There *might* be a problem with this approach: I'm not sure
whether you can specify code uniqueness among siblings
using XML Schema. If that's not possible, and if you want
to validate your XML documents using XML Schema, then the
first approach is probably better.
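
A minimal sketch of what the relative-code layout implies for a consumer, assuming Python's standard xml.etree.ElementTree: the full code has to be rebuilt by concatenating the codes of the ancestors, and sibling uniqueness is checked here in plain Python rather than by a schema. The function names full_codes and duplicate_sibling_codes are illustrative only.

```python
import xml.etree.ElementTree as ET

SAMPLE = """
<categories>
  <category code="A">
    <category code="01">
      <category code="01"/>
      <category code="02"/>
    </category>
  </category>
</categories>
"""


def full_codes(parent, prefix=""):
    """Yield the full code of every category below parent, in document order."""
    for child in parent.findall("category"):
        code = prefix + child.get("code")
        yield code
        # The full code of this category is the prefix for its children.
        yield from full_codes(child, code)


def duplicate_sibling_codes(parent, prefix=""):
    """Return the full codes that occur more than once among the same siblings."""
    seen, duplicates = set(), []
    for child in parent.findall("category"):
        code = prefix + child.get("code")
        if code in seen:
            duplicates.append(code)
        seen.add(code)
        duplicates.extend(duplicate_sibling_codes(child, code))
    return duplicates


root = ET.fromstring(SAMPLE)
print(list(full_codes(root)))         # ['A', 'A01', 'A0101', 'A0102']
print(duplicate_sibling_codes(root))  # []
```

Note that this only works while every full code really is the concatenation of its ancestors' codes, which is the consistency question raised in the thread.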
 
Magnus Henriksson

p.lepin said:
What's the point in keeping the category codes in XML? Even
if the OP oversimplified a bit, and actual codes might be
a bit more inconsistent than what we're seeing in this
example, there's definitely no need to store the *full*
category code for each subcategory -- that looks like
unnecessary information and a possible cause for data
integrity issues. This looks better to me:

<categories>
  <category code="A">
    <category code="01">
      <category code="01"/>
      <category code="02"/>
    </category>
  </category>
</categories>

By having the full code on each category, lookups are much simpler to
implement and will be faster. Also, depending on how inconsistent the
codes really are, your approach might not work at all.

p.lepin said:
Use 'real' IDs to refer to elements if that's desirable
(codes are to be human-readable as far as I can tell).

I would think that the codes are not to be human readable. First, the
string 'A0101' does not tell me (as a human) anything, unless I'm very
familiar with the domain. Second, providing descriptions (in
different languages, no less) seems to validate that assumption.

p.lepin said:
There *might* be a problem with this approach: I'm not sure
whether you can specify code uniqueness among siblings
using XML Schema. If that's not possible, and if you want
to validate your XMLs using XML Schema, then the first
approach is probably better.

IDs are not scoped; they have to be unique within the document.

By having the full code on all categories, we can specify (in any schema
language) that the value of the code attribute is of type ID.

// Magnus
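
A minimal sketch of the "full code on every category" idea, assuming Python's standard xml.etree.ElementTree. It mimics what an ID-typed attribute would give you (document-wide uniqueness plus direct lookup) with an ordinary dictionary; it does not perform schema validation, and build_index is an illustrative name.

```python
import xml.etree.ElementTree as ET

SAMPLE = """
<categories>
  <category code="A">
    <category code="A01">
      <category code="A0101"/>
      <category code="A0102"/>
    </category>
  </category>
  <category code="B"/>
</categories>
"""


def build_index(root):
    """Map each full code to its element; fail if a code is repeated anywhere."""
    index = {}
    for category in root.iter("category"):
        code = category.get("code")
        if code in index:
            # Like an ID, a full code must be unique in the whole document.
            raise ValueError(f"duplicate category code: {code}")
        index[code] = category
    return index


index = build_index(ET.fromstring(SAMPLE))
print(sorted(index))  # ['A', 'A01', 'A0101', 'A0102', 'B']
```

After the one-time pass that builds the index, each lookup is a single dictionary access, which is the simplicity argument made above.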
 
Guest

On Thu, 19 Oct 2006 11:54:44 +0200, Magnus Henriksson wrote:
By having the full code on each category, lookups are much simpler to
implement and will be faster. Also, depending on how inconsistent the
codes really are, your approach might not work at all.

I actually thought a bit about it before posting the question. The codes
are 100% consistent at the moment, but they are used in multiple
databases, so if a category is removed in the future, chances are they
will simply stop using its code, which would leave the codes
inconsistent.

Magnus Henriksson wrote:
I would think that the codes are not to be human readable. First, the
string 'A0101' does not tell me (as a human) anything, unless I'm very
familiar with the domain. Second, providing descriptions (in
different languages, no less) seems to validate that assumption.

IDs are not scoped; they have to be unique within the document.

As far as I can recall it's not possible to define an id of type ID that
starts with a number; it must start with an English alphabet letter,
possibly followed by digits. Using ID would let me use
getElementById, otherwise finding an element would require using xpath or
other technologies that I am not capable of using...
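
A minimal sketch showing that an attribute lookup needs very little XPath, assuming Python's standard xml.etree.ElementTree (which supports a small XPath subset) and the full-code layout proposed earlier in the thread. The name find_by_code and the sample data are illustrative only.

```python
import xml.etree.ElementTree as ET

SAMPLE = """
<categories>
  <category code="A">
    <name xml:lang="en">Raw Material, Forestry and Agricultural Products, Environmental Services</name>
    <category code="A01">
      <name xml:lang="en">Raw Materials (Mining incl.)</name>
      <category code="A0101">
        <name xml:lang="en">Ores</name>
      </category>
    </category>
  </category>
</categories>
"""


def find_by_code(root, code):
    """Return the first category whose code attribute equals code, or None."""
    # Restrict codes to letters and digits so that quotes cannot break
    # the path expression built below.
    if not code.isalnum():
        raise ValueError("code must be alphanumeric")
    # ".//category" means: any category element at any depth below root.
    return root.find(f".//category[@code='{code}']")


root = ET.fromstring(SAMPLE)
hit = find_by_code(root, "A0101")
print(hit.findtext("name"))  # Ores
```

The path expression is the only XPath involved; the rest is ordinary attribute and child access.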
 
p.lepin

Magnus said:
By having the full code on each category, lookups are
much simpler to implement and will be faster.

Certainly. But move/copy/link operations become a bit
trickier if you have to maintain sanity for codes, and
lookups are still easy with either XPath or real IDs.

Magnus said:
Also, depending on how inconsistent the codes really are,
your approach might not work at all.

Certainly. But that might be a reason to actually rethink
the data model, too.

Magnus said:
I would think that the codes are not to be human
readable. First, the string 'A0101' does not tell me (as
a human) anything, unless I'm very familiar with the
domain. Second, providing descriptions (in different
languages, no less) seems to validate that assumption.

That depends on details of the project the OP is working
on. From my practice, though, I can tell you that I had
no idea what 'AMS' or 'LOS' or 'SFO' was when I started
working on my current project, but my users were constantly
using those IDs to refer to Schiphol in Amsterdam, Murtala
Muhammed in Lagos and San Francisco Intl Airport. Anyway,
if the codes *are* the IDs, and are not to be human
readable, there's no reason to construct them in such a way
as to contain the information on the node's position
relative to other nodes. That information already *is* in
the XML, and there are tools specifically designed to
retrieve, alter or filter by that information.

Magnus said:
IDs are not scoped; they have to be unique within the
document.

I've never used XML Schema, but it would seem to me that
section 3.11 in XML Schema Part 1
(http://www.w3.org/TR/xmlschema-1/) is describing precisely
what I had in mind. Of course, just as usual, W3C specs
drive me mad if I peruse them for longer than three minutes,
so I might be completely wrong here.

Anyway, I actually argued *against* using code as an ID,
because from my experience and in my very humble opinion
grabbing something in the data model that looks like an ID
and making it an ID is not a very good idea. YMMV.

Magnus said:
By having the full code on all categories, we can specify
(in any schema language) that the value of the code
attribute is of type ID.

Of course. But the question is not whether we can do that
or not, but whether it is the Right Thing to do. Overall
I'd say this is getting too far-fetched simply because we
*don't* know what the OP needs. There might be constraints
in his project that make all of my points hogwash, and
all of yours the only sane way to go. Or vice versa. Unless
the OP decides to elaborate, we may never know.
 
Magnus Henriksson

p.lepin said:
I'd say this is getting too far-fetched simply because we
*don't* know what the OP needs. There might be constraints
in his project that make all of my points hogwash, and
all of yours the only sane way to go. Or vice versa. Unless
the OP decides to elaborate, we may never know.

That's true. I made several assumptions on how this format would be used
and how it would evolve. Well, more like wild guesses than assumptions,
really.

Anyway, I would like to give the following advice to the OP:

1) Only have one hierarchy of categories; do not have one hierarchy for
each language.

2) Keep the category code in an attribute, formatted in a way that makes
sense to you. Consider what will happen when categories are added and
removed.

3) Keep the human readable descriptions in elements.


// Magnus
 
Joe Kesselman

张韡武 said:
getElementById, otherwise finding an element would require using xpath or
other technologies that I am not capable of using...

.... or doing a simple tree walk, which I presume you're capable of writing.

Canned solutions are great, but they aren't the only ones.
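
A minimal sketch of such a tree walk, assuming Python's standard xml.etree.ElementTree and full codes stored in a code attribute as proposed earlier in the thread. The name find_path is illustrative; besides locating the element, it reports the chain of codes leading to it.

```python
import xml.etree.ElementTree as ET

SAMPLE = """
<categories>
  <category code="A">
    <name xml:lang="en">Raw Material, Forestry and Agricultural Products, Environmental Services</name>
    <category code="A01">
      <category code="A0101"/>
      <category code="A0102"/>
    </category>
  </category>
  <category code="B"/>
</categories>
"""


def find_path(element, code, trail=()):
    """Depth-first walk; return the codes from the top level down to the match, or None."""
    for child in element:
        if child.tag != "category":
            continue  # skip <name> and any other non-category children
        here = trail + (child.get("code"),)
        if child.get("code") == code:
            return list(here)
        found = find_path(child, code, here)
        # Compare against None explicitly: an empty result must not be
        # confused with "not found".
        if found is not None:
            return found
    return None


root = ET.fromstring(SAMPLE)
print(find_path(root, "A0102"))  # ['A', 'A01', 'A0102']
```

The walk relies only on the nesting of the elements, so it works the same whether the codes are consistent or not.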
 
Guest

On Thu, 19 Oct 2006 14:11:21 +0200, Magnus Henriksson wrote:
Anyway, I would like to give the following advice to the OP:

1) Only have one hierarchy of categories; do not have one hierarchy for
each language.

2) Keep the category code in an attribute, formatted in a way that makes
sense to you. Consider what will happen when categories are added and
removed.

3) Keep the human readable descriptions in elements.

Yes, those 3 guidelines make a lot of sense, and through this conversation I
really got the knowledge I wanted. Thank you.
 
