Ruby, Unicode - ever?

dseverin · Jan 9, 2006

Well, as far as I could search the web, since about 2001 or even earlier,
once in a while the question appears: why does Ruby not support Unicode???
Why can't ruby use at least ICU libs?
(current state of UTF8 in Ruby, even with regexps, is too far away from
proper Unicode support, don't try to cheat me, that it's OK and enough,
it is not!)

And usual answer is (for years!): m17n will be in Ruby 2.0 (Rite) as
Unicode can't handle enough chars and Han unification is unacceptable.

But...

As for me, there are two big problems:
1. Ruby String class in current state is TOO MUCH OVERLOADED : it mixes
byte-array and character-text string behaviour at the same time. That is
definitely and absolutely wrong design decision. These are different
paradigms, which must not be mixed ever.

2. My impression about rite m17n is that for each string it will be
possible to set different encoding. I don't get it. As for byte array -
encoding is senseless - this is plain bit stream. And for text - how
will one compare/regexp/search using strings in different encodings???
(BTW, Unicode codepoint space is 2^21 - but do we really have over
million of *different* characters?) What is the sense to create
text-handling support code for all that multitude of encodings? (look in
oniguruma - each encoding plugin sets own procedures and char properties
to deal with multibyte encodings)

Well, I think, String class must be REMOVED from Rite.
Instead, two incompatible classes must be introduced: ByteArray and Text
with well-separated semantics and behaviour. Else it will never end but
eventually crash into crap ruins someday...

Austin Ziegler · Jan 9, 2006

Well, as far as I could search the web, since about 2001 or even
earlier, once in a while the question appears: why does Ruby not support
Unicode???

Ruby *does* support Unicode. It just doesn't treat it specially.

Why can't ruby use at least ICU libs?

It could, if you wrote a wrapper for them.

(current state of UTF8 in Ruby, even with regexps, is too far away
from proper Unicode support, don't try to cheat me, that it's OK and
enough, it is not!)

For 99% of cases, in fact, it *is* sufficient. What do you think is
missing?

And usual answer is (for years!): m17n will be in Ruby 2.0 (Rite) as
Unicode can't handle enough chars and Han unification is unacceptable.

That is not correct. m17n strings will be in Ruby 2.0, but it is not
because of "enough chars" (which wouldn't be true in any case) or Han
unification. It is mostly because of legacy data.

As for me, there are two big problems:
1. Ruby String class in current state is TOO MUCH OVERLOADED : it
mixes byte-array and character-text string behaviour at the same time.
That is definitely and absolutely wrong design decision. These are
different paradigms, which must not be mixed ever.

Sorry, but I don't actually agree. There's very little evidence that the
Ruby String mixes byte array and character string behaviour in a way
that matters *most of the time*. The only time it matters is when you
want to do str[0] and get just the first *character*, and you quickly
learn to do str[0, 1] instead. That is something that will be changing
with m17n strings, but it won't be a big deal.
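The indexing difference described above can be sketched as follows. This is a minimal illustration that assumes Ruby 1.9 or later (strings that carry an encoding), not the 1.8 interpreter under discussion, where `str[0]` returned an integer byte value. The sample word is arbitrary.

```ruby
# Assumption: Ruby 1.9+ with a UTF-8 source file.
text = "ódio"                     # "ó" occupies two bytes in UTF-8

first_char  = text[0]             # character-based indexing
first_slice = text[0, 1]          # the idiom mentioned in the post
first_byte  = text.getbyte(0)     # what byte-based indexing would see

puts first_char.inspect           # "ó"
puts first_slice.inspect          # "ó"
puts first_byte                   # 195 (0xC3, the lead byte of "ó")
puts "#{text.length} characters, #{text.bytesize} bytes"
```

With encoding-aware strings the two indexing forms agree, and byte access is an explicit, separate call.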

2. My impression about rite m17n is that for each string it will be
possible to set different encoding. I don't get it.

That would suggest that you really haven't done a lot of looking at
character set issues overall. Those of us who *do* have to deal with
legacy encodings *will* appreciate this.

As for byte array - encoding is senseless - this is plain bit stream.

And a String without an encoding will be treated just as a byte vector.
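As a rough illustration of the byte-vector idea, later Ruby versions let the same bytes be tagged either as text or as binary. The sketch below assumes Ruby 2.0 or later (for `String#b`); it is not code from the thread.

```ruby
# Assumption: Ruby 2.0+; String#b returns a copy tagged as binary.
text  = "Größe"             # UTF-8 text: 5 characters
bytes = text.b              # same bytes, no character interpretation

puts text.encoding
puts bytes.encoding
puts text.length            # counts characters
puts bytes.length           # counts bytes
puts text.bytes == bytes.bytes   # the underlying data is identical
```

The only thing that changes between the two objects is the encoding tag, which decides whether operations count characters or bytes.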

And for text - how will one compare/regexp/search using strings in
different encodings???

Generally, one wouldn't want to. However, I'm sure that it would be
possible to upconvert or downconvert as appropriate for comparison. If
you have something in EUC-JP and need to compare it against SJIS, you
can convert from one to the other or convert both to UTF-16 for
comparison.
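The convert-then-compare approach might look like this. It is a sketch that assumes Ruby 1.9 or later, where `String#encode` and the bundled EUC-JP and Shift_JIS transcoders are available. The sample text and the choice of UTF-16LE as the common encoding are illustrative assumptions.

```ruby
# Assumption: Ruby 1.9+ with the standard Japanese transcoders.
utf8 = "日本語"
euc  = utf8.encode("EUC-JP")       # stand-in for legacy EUC-JP data
sjis = utf8.encode("Shift_JIS")    # stand-in for legacy SJIS data

# The raw bytes differ, so a direct comparison fails.
puts euc.bytes == sjis.bytes       # false
puts euc == sjis                   # false

# Convert both to one common encoding before comparing.
common_a = euc.encode("UTF-16LE")
common_b = sjis.encode("UTF-16LE")
puts common_a == common_b          # true
```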

(BTW, Unicode codepoint space is 2^21 - but do we really have over
million of *different* characters?) What is the sense to create
text-handling support code for all that multitude of encodings? (look
in oniguruma - each encoding plugin sets own procedures and char
properties to deal with multibyte encodings)

*shrug* Welcome to the real world of encoding hell where we have to deal
with legacy data.

Well, I think, String class must be REMOVED from Rite. Instead, two
incompatible classes must be introduced: ByteArray and Text with
well-separated semantics and behaviour. Else it will never end but
eventually crash into crap ruins someday...

You're welcome to submit an RCR on it. I am 99.999% certain it will be
shot down, though.

I would certainly oppose it. There are things that I disagree with Matz
on the design of Ruby 2.0 -- and have told him so in discussions. The
m17n String, however, is one where I more than agree with him. It's a
much better solution than I think you will find in most other languages.
Especially since, for most purposes, you as a Ruby programmer won't care
one way or another.

-austin

David Vallner · Jan 11, 2006

Austin said:
For 99% of cases, in fact, it *is* sufficient. What do you think is
missing?

How would the regexp engine match multibyte UTF8 characters that have
what is ASCII whitespace as one of the lower bytes? Or how would
/\w{2,4}/ react to a single three-byte UTF-8 character? I didn't yet
stumble upon this in the rather spartan kcode documentation, does the
UTF8 support for Japanese input cater for these perks?

You're welcome to submit an RCR on it. I am 99.999% certain it will be
shot down, though.

Shot, hung, drawn, and quartered probably. The slight ambiguity of
strings might be baffling for people with a Java or similar background,
but Dmitry, if your arguments are to hold water, I want you to give
examples or real-life code where it isn't possible to tell when a String
is used to store binary data, and when it is storing text, in a
situation where this distinction is necessary to process the string.
Otherwise, carry on rambling emptily into a Notepad window.

As for the legacy encoding support, I -wish- I saw that more often. Try
getting any work done on an English Windows XP, with Slovak regional
settings for the odd ancient non-Unicode tool, and a German keyboard,
and you start wanting to access the encoding of the consoles real soon.

David Vallner

dseverin · Jan 11, 2006

Ok, I have to admit, that I'm wrong and just an ignorant idiot.

It is because of my Java experience and some annoying bugs in Rails and
Text::Format where developers didn't even mention that their code will
work only if your strings are pure ASCII (or single-byte encoded).

Fritz Heinrichmeyer · Jan 11, 2006

There are problems with string formatting when using non-single-byte
encodings, e.g.

print "%-40s\n" % name

where name contains german umlauts.

It is broken when german umlauts are utf8-coded (2 bytes). This was true
at least with 1.8.2. I went back to an 8 bit locale for this reason.

Will this work with 1.8.4 and utf8 locale?
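The misalignment comes from counting bytes rather than characters when padding. The sketch below reproduces the byte-counting arithmetic by hand and contrasts it with character-aware padding. It assumes Ruby 1.9 or later and says nothing about how 1.8.4 itself behaves. The name and the width of 10 are made up for illustration.

```ruby
# Assumption: Ruby 1.9+ with a UTF-8 source file.
name  = "Müller"                  # 6 characters, 7 bytes in UTF-8
width = 10

# Padding computed from the byte count comes out one column short.
by_bytes = name + " " * (width - name.bytesize)

# Padding computed from the character count fills the full width.
by_chars  = name.ljust(width)
formatted = format("%-10s", name)

puts "[#{by_bytes}]"
puts "[#{by_chars}]"
puts "[#{formatted}]"
```

Even character-aware padding only aligns output when every character occupies one terminal column, which does not hold for wide East Asian characters.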

Austin Ziegler · Jan 11, 2006

How would the regexp engine match multibyte UTF8 characters that have
what is ASCII whitespace as one of the lower bytes? Or how would
/\w{2,4}/ react to a single three-byte UTF-8 character? I didn't yet
stumble upon this in the rather spartan kcode documentation, does the
UTF8 support for Japanese input cater for these perks?

I donno. I suspect that if $KCODE = 'u', it will work rather
surprisingly well. I don't know to be honest, though.

-austin
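To make the quantifier question concrete, here is a sketch of how a character-aware match differs from a byte-wise one. It assumes Ruby 2.4 or later (for `Regexp#match?`) and does not reproduce `$KCODE`, which later Ruby versions dropped. `\p{L}` is used instead of `\w` because `\w` is ASCII-only in those versions.

```ruby
# Assumption: Ruby 2.4+ with a UTF-8 source file.
single = "日"                       # one character, three bytes

char_aware = /\A\p{L}{2,4}\z/       # two to four letters
byte_wise  = /\A.{2,4}\z/n          # two to four bytes (binary regexp)

puts char_aware.match?(single)      # false: only one character
puts char_aware.match?("日本")       # true: two characters
puts byte_wise.match?(single.b)     # true: three bytes satisfy {2,4}
```

A byte-oriented engine would accept the single character as a run of two to four units, which is the failure mode the question is worried about.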

Austin Ziegler · Jan 11, 2006

Ok, I have to admit, that I'm wrong and just an ignorant idiot.

Didn't say that. I want specific examples of why it isn't sufficient.

It is because of my Java experience and some annoying bugs in Rails and
Text::Format where developers didn't even mention that their code will
work only if your strings are pure ASCII (or single-byte encoded).

Ah. So you have a problem with Text::Format? Did you post a bug?
(Quick scan of my project. I see a feature request posted by "Nobody";
I will assume that's you. Personally, I would have considered it a
bug, but that's just me.)

I didn't mention that Text::Format is for single-byte strings because
I'm generally doing *console* output, where multibyte characters are
handled poorly.

I *think* that Text::Hyphen will handle UTF-8 hyphenation correctly,
but I'm honestly not sure. I do know that I converted a lot of the
hyphenation codes to UTF-8 instead of the godawful mess that is TeX
encoding.

-austin

David Vallner · Jan 11, 2006

Mark said:
Have you ever looked at how UTF-8 works? Any byte whose decimal value is under
128, anywhere in a UTF-8 string, can ONLY be a genuine occurrence of the ASCII
character with that code point. Any byte anywhere in a multibyte character
will always have the high bit set.

Ye gods, I'm being hopelessly stupid again - I -knew- that.. *goes off
to get brain examined*
Thanks for stopping me confusing people.

David Vallner
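The UTF-8 property quoted above is easy to check directly. This is a small sketch assuming Ruby 1.9 or later; the sample characters are arbitrary picks of different encoded lengths.

```ruby
# Assumption: Ruby 1.9+ with a UTF-8 source file.
samples = ["ó", "€", "日", "😀"]    # 2, 3, 3 and 4 bytes long

samples.each do |ch|
  hex = ch.bytes.map { |b| format("%02X", b) }.join(" ")
  puts "#{ch.bytesize} bytes: #{hex}"
end

# Every byte of a multibyte character has the high bit set,
# so none can be mistaken for ASCII whitespace (0x20, 0x09, ...).
all_high = samples.all? { |ch| ch.bytes.all? { |b| b >= 0x80 } }
puts all_high
```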

Jim Weirich · Jan 11, 2006

David said:
Ye gods, I'm being hopelessly stupid again - I -knew- that.. *goes off
to get brain examined*
Thanks for stopping me confusing people.

For those who didn't know, here is an interesting link on the history of
UTF-8: http://www.cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt

And, of course, Wikipedia has a good writeup as well:
http://en.wikipedia.org/wiki/UTF-8

-- Jim Weirich

Bruce D'Arcus · Jan 11, 2006

Austin said:
Didn't say that. I want specific examples of why it isn't sufficient.

I'd like to be able to do this and have it just work:

x = ["z", "a", "ó", "o", "b"]

puts x.sort.inspect

Maybe I'm missing something (and if I am, please tell me), but even
when using jcode, the multi-byte character always is last.

Bruce
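The behaviour Bruce describes follows from comparing strings by byte value. The sketch below shows that default and, for contrast, a crude accent-folding sort key. It assumes Ruby 2.2 or later (for `String#unicode_normalize`). The `fold` helper is hypothetical, and this is not the Unicode Collation Algorithm, only an approximation that happens to work for this list.

```ruby
# Assumption: Ruby 2.2+ with a UTF-8 source file.
x = ["z", "a", "ó", "o", "b"]

# Default sort compares bytes, so "ó" (0xC3 0xB3) lands after "z".
bytewise = x.sort
puts bytewise.inspect

# Hypothetical helper: decompose, then drop combining marks,
# so "ó" is keyed as "o". The original string breaks ties.
fold   = ->(s) { s.unicode_normalize(:nfd).gsub(/\p{Mn}/, "") }
folded = x.sort_by { |s| [fold.call(s), s] }
puts folded.inspect
```

Real language-specific ordering needs proper collation tables, which this key function does not attempt.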

Levin Alexander · Jan 11, 2006

Bruce said:
I'd like to be able to do this and have it just work:

x = ["z", "a", "ó", "o", "b"]
puts x.sort.inspect

Correct sorting of text is a quite tricky problem, because the correct
sort order will probably be language or application specific.

Maybe I'm missing something (and if I am, please tell me), but even
when using jcode, the multi-byte character always is last.

<http://www.unicode.org/reports/tr10/>

-Levin

Bruce D'Arcus · Jan 11, 2006

Levin said:
I'd like to be able to do this and have it just work:

x = ["z", "a", "ó", "o", "b"]
puts x.sort.inspect

Correct sorting of text is a quite tricky problem, because the correct
sort order will probably be language or application specific.

Maybe I'm missing something (and if I am, please tell me), but even
when using jcode, the multi-byte character always is last.

<http://www.unicode.org/reports/tr10/>

Yes, I'm aware of that, but I'd rather have a default collation used
that gets very close, than for Ruby not even try to sort extended
characters.

Indeed, this is how xpath/xslt 2.0 works; there's the required default
collation you cite above (which works fine for my use cases), and one
can plug-in alternate collations if needed.

Bruce

David Vallner · Jan 11, 2006

Bruce said:
Levin Alexander wrote:

I'd like to be able to do this and have it just work:

x = ["z", "a", "ó", "o", "b"]
puts x.sort.inspect

Correct sorting of text is a quite tricky problem, because the correct
sort order will probably be language or application specific.

Maybe I'm missing something (and if I am, please tell me), but even
when using jcode, the multi-byte character always is last.

<http://www.unicode.org/reports/tr10/>

Yes, I'm aware of that, but I'd rather have a default collation used
that gets very close, than for Ruby not even try to sort extended
characters.

Indeed, this is how xpath/xslt 2.0 works; there's the required default
collation you cite above (which works fine for my use cases), and one
can plug-in alternate collations if needed.

Bruce

Hmm. Would a change to String#<=> to respect that document when $KCODE
== 'u' break too much code?

David Vallner

Yohanes Santoso · Jan 11, 2006

Bruce D'Arcus said:
Levin said:

I'd like to be able to do this and have it just work:

x = ["z", "a", "ó", "o", "b"]
puts x.sort.inspect

Correct sorting of text is a quite tricky problem, because the correct
sort order will probably be language or application specific.

Maybe I'm missing something (and if I am, please tell me), but even
when using jcode, the multi-byte character always is last.

<http://www.unicode.org/reports/tr10/>

Yes, I'm aware of that, but I'd rather have a default collation used

How can you have a default collation if there is no default culture.
Collating by character binary value (what ruby does) is as neutral a
collation as it can get. Else, people would be shouting 'favouritism'.

that gets very close, than for Ruby not even try to sort extended
characters.

YS.

David Vallner · Jan 11, 2006

Yohanes said:
Levin Alexander wrote:

I'd like to be able to do this and have it just work:

x = ["z", "a", "ó", "o", "b"]
puts x.sort.inspect

Correct sorting of text is a quite tricky problem, because the correct
sort order will probably be language or application specific.

Maybe I'm missing something (and if I am, please tell me), but even
when using jcode, the multi-byte character always is last.

<http://www.unicode.org/reports/tr10/>

Yes, I'm aware of that, but I'd rather have a default collation used

How can you have a default collation if there is no default culture.
Collating by character binary value (what ruby does) is as neutral a
collation as it can get. Else, people would be shouting 'favouritism'.

That would be making the default behaviour one that is wrong for any
culture. That's not solving the problem, that's avoiding it.

David Vallner

dseverin · Jan 12, 2006

Levin said:
I'd like to be able to do this and have it just work:

x = ["z", "a", "ó", "o", "b"]
puts x.sort.inspect

Correct sorting of text is a quite tricky problem, because the correct
sort order will probably be language or application specific.

Maybe I'm missing something (and if I am, please tell me), but even
when using jcode, the multi-byte character always is last.

<http://www.unicode.org/reports/tr10/>

-Levin

I've just found fresh ICU4R project on ruby-forge (and currently play
with it), which seems to take care about Unicode issues including
collation.
Maybe it's what I need, need to play more

Some sorting fun:

require 'ustring'
["z", "a", "ó", "o", "b", "y"].collect {|s| s.to_u}.sort {|a,b|
UString::strcoll(a,b, "lv")}.collect {|v| v.to_s}
=> ["a", "b", "y", "o", "\303\263", "z"]
UString::strcoll("ÆSS".u, "aeß".u, "de", 0)
=> 0
"Iñtërnâtiônàlizætiøn".u.chars("").sort{|a,b| UString::strcoll(a,
b)}.uniq.join("|")
=> à|â|æ|ë|i|I|l|n|ñ|ô|ø|r|t|z
puts a = "Iñtërnâtiônàlizætiøn".u.upcase!
=> IÑTËRNÂTIÔNÀLIZÆTIØN
v = a.search("national".u, "en", 0, nil, nil)
=> [5..12]
puts a[v[0]]
=> NÂTIÔNÀL

Shot - Piotr Szotkowski · Jan 13, 2006

Hello.

Yohanes Santoso:

How can you have a default collation if there is no default culture.

Do what other languages do, ask the underlying operating
system for LC_COLLATE (or ask it to sort the data by itself).

Collating by character binary value (what ruby does) is as
neutral a collation as it can get. Else, people would be shouting
'favouritism'.

No, Ruby could implement a counterpart of MySQL's
utf8_general_ci and/or utf8_unicode_ci collations:
http://dev.mysql.com/doc/refman/5.0/en/charset-unicode-sets.html

Cheers,
-- Shot

Austin Ziegler · Jan 13, 2006

Yohanes Santoso:
Do what other languages do, ask the underlying operating
system for LC_COLLATE (or ask it to sort the data by itself).

Which is not really appropriate for all operating systems, and is one
of the *dumbest* things about POSIX.

No, Ruby could implement a counterpart of MySQL's
utf8_general_ci and/or utf8_unicode_ci collations:
http://dev.mysql.com/doc/refman/5.0/en/charset-unicode-sets.html

Given how much else MySQL gets wrong, why should this be trusted to be
anything close to right?

-austin

Shot - Piotr Szotkowski · Jan 13, 2006

Hello.

Austin Ziegler:

Which is not really appropriate for all operating
systems, and is one of the *dumbest* things about POSIX.

I'm not a language developer, but a programmer (doing a lot of i18n and
l10n work lately) and user, and to me using LC_COLLATE when available
seems much better than defaulting to binary collation 'just because'.
BTW: PostgreSQL defaults to system's LC_COLLATE if not told explicitly
on cluster init as well.

Given how much else MySQL gets wrong, why should
this be trusted to be anything close to right?

Have you actually followed the link? Quoting: The utf8_unicode_ci
collation is implemented according to the Unicode Collation
Algorithm (UCA) described at http://www.unicode.org/reports/tr10/
The collation uses the version-4.0.0 UCA weight keys:
http://www.unicode.org/Public/UCA/4.0.0/allkeys-4.0.0.txt

I won't start discussing MySQL's quality[1], but I don't see a reason
why Unicode's own collation algorithm shouldn't at least be considered.

[1] Question of the Week: Who in their right mind documents 'CREATE
USER' in a shell for version 4.1, but does not actually introduce
the command until 5.0?

Cheers,
-- Shot

Bruce D'Arcus · Jan 13, 2006

Austin Ziegler wrote:

[... snip ...]

Given how much else MySQL gets wrong, why should this be trusted to be
anything close to right?

I have no position on the technical details of any of this, but I can
tell you MySQL gets a lot closer to "right" sorting my list of names (a
few of which happen to start with multi-byte characters) than Ruby
does. I really hope this issue is going to be addressed.

Bruce
