i18n'ed Character Set in DBMS and tables

Albretch

. Can you define the character set for particular tables instead of
entire databases?
. Which DBMSs would let you do that?
. How do you store in a DBMS i18n'ed user input coming over the web
(basically from everywhere) and properly serve it back to users?
. Can you point me to info on this?

I would prefer to use Java/JDBC drivers.
 
Paul Lutus

Albretch said:
. Can you define the character set for particular tables instead of
entire databases?

Yes, some of them let you, but since you have cross-posted so widely, you
need to choose a single newsgroup and ask again for a specific answer.

. Which DBMSs would let you do that?

That depends.

. How do you store in a DBMS i18n'ed user input coming over the web
(basically from everywhere) and properly serve it back to users?

Unicode. Choose one newsgroup, make one post.
 
Mark Yudkin

Albretch said:
. Can you define the character set for particular tables instead of
entire databases?

That depends on the DBMS.

. Which DBMSs would let you do that?

Various. With such a massive crosspost, there's little point in anybody
answering with any specific set of DBMSs.

. How do you store in a DBMS i18n'ed user input coming over the web
(basically from everywhere) and properly serve it back to users?

You don't. I18n is designed to permit you to SELECT ONE SPECIFIC language
and process that correctly. To store and retrieve "ALL" languages, you use
Unicode.

. Can you point me to info on this?

www.unicode.org

I would prefer to use Java/JDBC drivers.

Many databases support these APIs.
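
For the Java/JDBC route, here is a minimal sketch of the "store and
retrieve it as Unicode" advice. The connection URL, table and column names
are illustrative assumptions only; the columns are presumed to be declared
with a Unicode character set (UTF-8, NVARCHAR or similar) so the driver can
convert to and from Java's internal Unicode strings.

import java.sql.*;

public class UnicodeRoundTrip {
    public static void main(String[] args) throws SQLException {
        // Hypothetical connection URL; any JDBC driver that transfers
        // strings as Unicode (java.lang.String is UTF-16) will do.
        try (Connection con = DriverManager.getConnection(
                "jdbc:yourdb://localhost/demo", "user", "pass")) {

            // Store user input exactly as received (Java Strings are Unicode).
            try (PreparedStatement ins = con.prepareStatement(
                    "INSERT INTO names (sr_name, f_name) VALUES (?, ?)")) {
                ins.setString(1, "김");     // Korean surname
                ins.setString(2, "Jomo");   // Swahili given name
                ins.executeUpdate();
            }

            // Serve it back: the driver converts from the column encoding
            // (e.g. UTF-8 / NVARCHAR) back to Java's internal Unicode.
            try (Statement st = con.createStatement();
                 ResultSet rs = st.executeQuery(
                         "SELECT sr_name, f_name FROM names")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1) + " " + rs.getString(2));
                }
            }
        }
    }
}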
 
Albretch

Mark Yudkin said:
You don't. I18n is designed to permit you to SELECT ONE SPECIFIC language
and process that correctly. To store and retrieve "ALL" languages, you use
Unicode.
Well, I see 'issues' right there. I think your approach is wrong 'by design
and implementation', and that is what I am trying to avoid.

Let's say you can set the character set and collation all the way down to
the column. Now, if you 'use Unicode to store and retrieve "ALL" languages'
(as you suggest), then, since per the ANSI SQL spec you can naturally only
set a character set and a related collation on a column (and AFAIK - and I
can see why - you cannot specify or change the collation on the fly as you
run a 'select' query with an 'order by' clause), you end up with a single
collation for every language stored in that column.

That would only work if, in Unicode (which by the way I see as a good
technical example of waste, trying to keep extensively ASCII'ing all
natural languages - the weakest/silliest 'standard' I know of), no
per-language collation were necessary because there were a 1-to-1 map
between the character set and the collation order for all languages, which
to me sounds really unnatural.

Say you have Korean names and Swahili ones in a table. Do people 'use
Unicode to store and retrieve "ALL" languages', keep an extra column
specifying the character set, and then 'SELECT ONE SPECIFIC language and
process that correctly', à la:

SELECT SrName, FName FROM NamesTable
WHERE CHAR_SET_Col = 'Korean_CHAR_SET'
ORDER BY SrName, FName;

and/or

SELECT SrName, FName FROM NamesTable
WHERE CHAR_SET_Col = 'Swahili_CHAR_SET'
ORDER BY SrName, FName;

This would be -way- slower than having two tables whose collation-sensitive
columns are set to the correct charset + collation pair, keeping an index
on them and periodically physically sorting them.

Or?
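
For reference, a sketch of the column-level character set / collation
declaration being hypothesized above, written as MySQL-style DDL issued
through JDBC. The connection URL is hypothetical and collation names differ
per DBMS and version. Note that at least some DBMSs also accept a COLLATE
override directly inside the ORDER BY clause, so the collation need not be
fixed once and for all at column-definition time.

import java.sql.*;

public class PerColumnCollation {
    public static void main(String[] args) throws SQLException {
        try (Connection con = DriverManager.getConnection(
                "jdbc:mysql://localhost/demo", "user", "pass");
             Statement st = con.createStatement()) {

            // Character set and collation declared per column (MySQL-style DDL).
            st.execute("CREATE TABLE NamesTable ("
                     + " SrName VARCHAR(100) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci,"
                     + " FName  VARCHAR(100) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci,"
                     + " CHAR_SET_Col VARCHAR(32))");

            // A per-language query; the collation is overridden per query
            // here, in DBMSs that accept COLLATE inside ORDER BY.
            try (ResultSet rs = st.executeQuery(
                    "SELECT SrName, FName FROM NamesTable"
                  + " WHERE CHAR_SET_Col = 'Korean_CHAR_SET'"
                  + " ORDER BY SrName COLLATE utf8mb4_swedish_ci, FName")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1) + ", " + rs.getString(2));
                }
            }
        }
    }
}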
 
Mark Yudkin

Or?

I18n is about Internationalization = I(18 chars)n. Even the name by itself
should make it obvious that it's not about multilingual text (German:
Internationalisierung = I19g?). See the details at
www.w3.org/International/. I18n is about processing, not about storage
representation. Unicode is about multilingual storage (www.unicode.org).
Much of I18n applies to Unicode storage, including the issues involved in
choosing one encoding over another (www.w3.org/TR/i18n-html-tech-char/).

If you need to store Korean, you have to use some form of two-byte
representation. DBCS is the older mechanism and Unicode the standardized
technique; both are supported by DB2 (DBCS columns are called GRAPHIC).

Issues of multilingual collation are linguistic, not representational.
Knowledge of the language is essential. German sorts ö equivalently to o,
whereas Swedish sorts it after z, and English writers tend to drop the
umlaut completely, leading to nonsensical pronunciation (schon and schön
are two different words with different meanings). Collation order is an
intractable problem: consider how to sort a table containing both German
and Swedish names - any choice you make will be totally wrong for the other
nationality, and hence you need to consider not the origin of the data but
the origin of the consumer! On the other hand, Swahili is correctly sorted
using the general Unicode sort.

You need to read up on both I18n and Unicode and understand what each is
about and how they complement each other. I've given you URLs above.

You also need to do some thinking about your problem and understand its
ramifications. Getting hold of people with experience in other languages,
including complex scripts (Arabic, Chinese, Japanese, Korean, Thai) and
other alphabets (Hebrew, Russian), as well as the differing rules that
apply when using accented Latin characters (most continental European
languages, Vietnamese), will undoubtedly help you understand the
consequences of multilingual databases. You'll also learn that issues such
as sorting (your "way slower" comment) are in many ways the least of your
problems. Fortunately, most systems provide highly efficient processing of
a range of common languages, but not all languages. Neither DB2 nor SQL
Server, for example, has native Swahili support (your example); Windows XP
introduced Swahili support, while Windows 2000 didn't offer it (other than
through Unicode).
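
To make the "origin of the consumer" point concrete, here is a small,
self-contained Java sketch (the names are invented for illustration):
java.text.Collator applies the language-specific order of whichever locale
the reader supplies, so the same data can be sorted one way for a German
user and another way for a Swedish one.

import java.text.Collator;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public class ConsumerLocaleSort {
    public static void main(String[] args) {
        // Same data, sorted once for a German reader and once for a Swedish
        // reader: the collation follows the consumer, not the data's origin.
        List<String> names = Arrays.asList("Öberg", "Oberg", "Zimmer");

        names.sort(Collator.getInstance(Locale.GERMAN));
        System.out.println("German order:  " + names);   // ö sorts like o

        names.sort(Collator.getInstance(new Locale("sv", "SE")));
        System.out.println("Swedish order: " + names);   // ö sorts after z
    }
}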
 
Albretch

I18n is about processing, not about storage representation
AM: Yeah, sure! But ultimately storage, representation, re-encoding . . .
are all part of the functional pipeline of 'processing'. These are all
naturally related as you serve the data to clients.
consider the problem of how to sort a table containing both German and Swedish names - any choice you make will be totally wrong to the other nationality, and hence you need to consider not the origin of the data but the origin of the consumer!
AM: This is what I think is exactly wrong; you do not store German and
Swedish names together in the same column of a table!
Say someone hits your site and you learn from the language headers her
browser sends that she prefers Hungarian, and you have a large database
with foreign-name data from a wild array of nationalities/character sets
(like the immigrant database at Ellis Island). Now, say these guys at Ellis
Island, for 'let-me-know-about-your-life' reasons, decide to include the
original names in their own languages (an incredible lot of people's names
were changed to the 'gringo' 'Mary' and 'John').
Would you design a monster table with all the names and set the column
character set to Unicode, or have different tables for the different
charset + collation pairs plus a primary/foreign key design of the
database, making your Hungarian user's life faster/easier?
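
As an aside on the "language from the browser headers" step in this
scenario: in a Java web tier the preferred language would typically be read
from the Accept-Language header via the Servlet API, roughly as below (a
sketch only; the class name is invented for illustration).

import java.io.IOException;
import java.util.Locale;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class PreferredLanguageServlet extends HttpServlet {
    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp)
            throws IOException {
        // The container parses the Accept-Language header into a Locale;
        // "hu" would come back for a browser preferring Hungarian.
        Locale preferred = req.getLocale();
        resp.setContentType("text/plain; charset=UTF-8");
        resp.getWriter().println("Preferred language: " + preferred.getLanguage());
    }
}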

If you read into daddy E. F. Codd's well-defined 'normal forms' of
relations and 'normalization', you should not mix 'different types of data'
in the same column (and the charset + collation makes these data differ,
doesn't it?).

And I do think all 'theories' are 'impractical' until proven otherwise.
. . . Getting hold of people with experience in other languages . . .
AM: I do have 'experience in other languages', three of them, although
they are all western ones :-(
I think this problem can be reduced to a simple mathematical one. You don't
need to know many people to understand this problem.
You'll also learn that issues such as sorting (your "way slower" comment) are in many ways the least of your problems.
AM: . . . 'least of your problems' . . . even when you have to set indexes
on these types of columns, since they are 'free text' searchable ones?
For me, a developer, "way slower" means anything that runs more than 30%
slower. Being ready/able to serve three 'customers' instead of two makes a
difference in business as well as in life. DBMS issues (and the related I/O
ones) are the number one performance issue in large DBMS-based software
development.
 
Mark Yudkin

How many of your 3 languages do you speak, read and write fluently? How many
of these use non-Latin characters? Are you aware of how many languages there
are in the world? Have you considered that some languages have multiple
writing systems, even going so far as to use different alphabets, in
different locales? What about users whose language and locale don't mix?
Here I am, posting in English, and living in the Swiss German locale. My
keyboard is a Swiss German one, my Windows is an English one. Swiss German
doesn't even use the same alphabetic characters for writing as Germany does
for "German German".

Also, did you actually understand Codd's normalization? Why are you
confusing type and interpretation?

There is absolutely no way I would use a separate table for each (language,
locale) combination. And I'm speaking as somebody who develops software that
supports 4 languages (simultaneously) for a living (the three main national
languages of this country: German, French and Italian, plus English), and
stores additional languages in the database (international financial and
economic data from OECD, BIS, World Bank, etc.).

You want to support all languages, locales and scripts. But you don't appear
to have the faintest idea of the problems involved.

 
P.Hill

Mark said:
And I'm speaking as somebody who develops software that
supports 4 languages (simultaneously) for a living (the three main national
languages of this country: German, French and Italian, plus English),

What, no Rumantsch?

-Paul
 
Albretch

Usually I stop paying attention to people when they start getting
personal. However I think our talk has been constructive for the most
part.
How many of your 3 languages do you speak, read and write fluently?
AM: I would say all three of them. Spanish is my mother tongue; I studied
in Germany, graduating with a Master's in Math/Physics, and I have lived in
the US for ten years.
How many of these use non-Latin characters?
AM: Do you mean Latin-1/ISO-8859-1? Spanish and German use a few.
Are you aware of how many languages there are in the world?
AM: Pretty much, if 'a whole lot' would qualify as an answer to you
:)
Have you considered that some languages have multiple writing systems, even going so far as to use different alphabets, in
different locales?
AM: Yes, I have.
What about users whose language and locale don't mix?
AM: What do you f*ck&ng mean? Are you using the terms 'language' and
'locale' loosely, as a free-speech kind of thing, or as defined in
technical standards?

The Java API did a fine job of functionally describing both terms:

http://java.sun.com/j2se/1.5.0/docs/api/java/util/Locale.html

If you understand Java/OOP, the fact that there is no Locale constructor
without a specified language should tell you something.

The language argument is a valid ISO Language Code. These codes are
the lower-case, two-letter codes as defined by ISO-639. You can find a
full list of these codes at a number of sites, such as:
http://www.loc.gov/standards/iso639-2/englangn.html
The country argument is a valid ISO Country Code. These codes are the
upper-case, two-letter codes as defined by ISO-3166. You can find a
full list of these codes at a number of sites, such as:
http://www.iso.ch/iso/en/prods-services/iso3166ma/02iso-3166-code-lists/list-en1.html
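A short illustration of the Locale API being cited here: every constructor
takes the ISO-639 language code first, and the ISO-3166 country code is
optional.

import java.util.Locale;

public class LocaleExamples {
    public static void main(String[] args) {
        // Language only (ISO-639 code), and language + country (ISO-3166 code).
        Locale hungarian   = new Locale("hu");
        Locale swissGerman = new Locale("de", "CH");

        System.out.println(hungarian + " -> "
                + hungarian.getDisplayLanguage(Locale.ENGLISH));
        System.out.println(swissGerman + " -> "
                + swissGerman.getDisplayName(Locale.ENGLISH));
    }
}
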
Here I am, posting in English, and living in the Swiss German locale. My
keyboard is a Swiss German one, my Windows is an English one.
AM: . . . and your browser settings are?

Swiss German
don't even use the same alphabetic characters for writing as Germany does
for "German German".
AM: The differences are very minimal indeed. I have spoken with Swiss
German people and we have understood each other 'einwandfrei'. I would even
dare to say American and British English differ more, and people use both
without paying much attention to the differences.
Also, did you actually understand Codd's normalization? Why are you
confusing type and interpretation?
AM: . . . because it affects the sort order (even down to the physical
level) and how fast the table is scanned by select statements that include
these columns in the 'order by' clause.

Dr. Codd's Rule #1: All information in a relational database is
represented explicitly at the logical level in exactly one way: by
values in tables. And, the data in each field is assumed to be atomic;
that is, the smallest bit of useful information -- a single value.

Think about the phrase "the smallest bit of useful information" in his
statement . . . and you will understand what I mean.
There is absolutely no way I would use a separate table for each (language,
locale) combination. And I'm speaking as somebody who develops software that
supports 4 languages (simultaneously) for a living (the three main national
languages of this country: German, French and Italian, plus English), and
stores additional languages in the database (international financial and
economic data from OECD, BIS, World Bank, etc.).

AM: Wow! If you paid me to and/or I had the time, I would technically prove
my point to you, but since you are the one who gains something by
understanding it, and I think you are quite capable of showing it to
yourself (if you 'want to' see it), I will leave it to you as 'homework'.
 
steve


Look, it is very simple.

If your app is ONLY EVER going to use a language that can be encoded using
ASCII and single-byte data, then that is fine.

If you are going to support multiple languages then you MUST use some sort
of multi-byte system, 2-3 bytes per character, FOR ALL DATA.

So yes, you have one multi-byte-enabled table, and you put all the shit in
one table.

Let's consider for a moment how you would handle your system of multiple
tables. Say your application deals with single-byte and multi-byte data;
how, for example, are you going to:

1. split & concatenate strings? Is it 1 byte per character, or 2, or 3?
2. check the length of a string? Is it 200 bytes for a single line of
address data, or is it 400, or 600? (See the byte-count sketch below.)
3. build your SQL strings to select from the different tables?
4. handle a user putting English & Chinese characters together on the same
line of data? (In China & Asia we mix Asian glyphs & Roman together.) Which
table does the resulting string go in?

Are you going to check which country the user is in and make dangerous
assumptions (origin of the consumer?) - say I'm in England, using Chinese
Windows? Or are you going to scan the string & see if you can stick it in a
single-byte table or a multi-byte table?

You are going to bury yourself in the logistics of trying to deal with
different encoding lengths.
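
A tiny Java illustration of the byte-count point in item 2 above: once
multi-byte characters appear, the character count and the encoded byte
count of the same string diverge, so a fixed byte budget is no longer a
character budget (the sample address is made up).

import java.nio.charset.StandardCharsets;

public class LengthCheck {
    public static void main(String[] args) {
        // Mixed Latin and Chinese text on one "line of data".
        String address = "42 Nanjing Road 南京路";

        // Character count vs. encoded byte count differ once multi-byte
        // characters are involved.
        System.out.println("chars: " + address.length());
        System.out.println("UTF-8 bytes: "
                + address.getBytes(StandardCharsets.UTF_8).length);
    }
}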

Using one table is NOT slower; tables can be partitioned (well, in Oracle
they can), so you could store the data in one table, then partition the
table on a "language column".

The only time I have found using multiple tables a good idea is when I have
two clearly defined languages, English & Chinese, and I have to store the
same address in both forms (one for the English staff, one for the Asian
staff) AND not every English address has a Chinese equivalent (or vice
versa, depending on your point of view).

But I still have to have two sets of duplicate edit screens (luckily it is
only 10 data fields), and that is such a pain that I am trying to reduce it
back to one set of screens.


You need to get over your fixation with "how big the data is" & "how slow
is my database going to be". Storage is just too cheap these days, and
top-end databases are so optimized these days that they spend a large part
of their time "asleep".
 
Mark Yudkin

You may speak 3 western European languages, but all three of these can be
encoded within a single "Western European" (Latin-1) character set, and do not
have seriously conflicting sort orders. Mixing German and Hebrew would be
somewhat messier, although rather common - even within a single document. Of
course, bidirectionality adds yet another complexity that you haven't
considered.

We use the 2 character ISO country code (ISO 3166); that's what is used
generally by international financial reporting. We also use GESMES
(UN/EDIFACT) (www.unece.org, cf.
http://www.unece.org/trade/untdid/d99b/trmd/gesmes_c.htm), but that's less
concerned with languages, and more with data exchange. Locales, as we use
them, are those from I18n (http://www.w3.org/International/). The "problem"
(e.g. in Java's model) is that just because a user lives in some country /
locale, he does not necessarily use the language(s) defined for these, hence
he is forced to lie about his locale in order to get the desired language.
Microsoft, BTW, worked this out, and fixed the problem in Windows 2000. My
IE6 browser is set up for "German (Switzerland) [de-ch]" by default, with
the browser language set to English. There is no "English (Switzerland)
[en-ch]". Except for a few sites which seem to believe that I have to be
presented with the language of "my locale", I have no problems (Google uses
it as a default, but lets me override it, saving my configuration).

Fortunately, you don't work with me, so I have no need to explain that
mixed-language documents preclude separate tables, even if the underlying
design of vertically partitioning information without maintaining the
partitioning key were not already totally wrong. It will be your boss's
problem to clean up the chaos you leave behind.

Since this conversation is a waste of time, this is my last response.

Dr Mark Yudkin

 
Albretch

I am/was posting to the general public.

Now I see what you meant, when you said:
. . . The "problem"
(e.g. in Java's model) is that just because a user lives in some country /
locale, he does not necessarily use the language(s) defined for these, hence
he is forced to lie about his locale in order to get the desired language.
Microsoft, BTW, worked this out, and fixed the problem in Windows 2000.

Nothing written in the 'standards', or in the rationale behind them (as I
understand it), says or implies that a user would be 'lying' about their
language-locale pair settings if they do not geographically reflect the
place where they physically are.
 
