Unicode/multibyte string support in Ruby1.9/Ruby summary?

David Garamond · Jan 15, 2005

If someone could summarize the recent Unicode/multibyte string
discussion on a wiki, that would be nice (and _very_ useful). It will
help programmers prepare their code for Unicode support and backward
compatibility in the future. Topics should include:

- how will strings be stored in memory (which will probably be different
between CRuby, JRuby, Ruby-on-Parrot, Ruby-on-dotnet, etc);

- how to check a string's charset, encoding;

- how to do various operations in the new multibyte string, especially
those which will be done differently compared to the classic string;

- what will happen to the classic string (e.g. will it perhaps be
renamed to ByteArray or something);

- comparison rules for cross-encoding and cross-charset strings;

- regexes;

- how will Ruby differ from Perl/Python/Java/PHP in Unicode/multibyte
string support (especially since Ruby is a pretty latecomer in the
Unicode scene);

Regards,
dave

Florian Gross · Jan 15, 2005

David said:
If someone could summarize the recent Unicode/multibyte string
discussion on a wiki, that would be nice (and _very_ useful). It will
help programmers prepare their code for Unicode support and backward
compatibility in the future. Topics should include:

Note that lots of this was recently discussed in [ruby-core:04146]. I'll
try to answer the questions as accurately as possible.

- how will strings be stored in memory (which will probably be different
between CRuby, JRuby, Ruby-on-Parrot, Ruby-on-dotnet, etc);

AFAIK just the raw bytes as before. (And UTF8 and so on can use multiple
bytes for one character.) Note that the RString record of Ruby will get
a new field for the encoding.
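
The difference between characters and stored bytes can be checked on a released Ruby. This is only a minimal sketch, assuming Ruby 1.9 or later; the variable names are illustrative.

```ruby
# Assumes Ruby 1.9 or later. "\u00e9" is e-acute written as an escape,
# so the source file itself stays pure ASCII.
s = "r\u00e9sum\u00e9"

char_count = s.length           # counts characters
byte_count = s.bytesize         # counts the raw bytes actually stored
raw_bytes  = "\u00e9".bytes.to_a # the UTF-8 bytes behind one character

puts "#{char_count} characters stored in #{byte_count} bytes"
puts "e-acute is stored as #{raw_bytes.inspect}"
```

The string has six characters but occupies eight bytes, because each accented letter takes two bytes in UTF-8.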

- how to check a string's charset, encoding;

String#encoding. It will return a String.
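
For reference, in the Ruby releases that eventually shipped (1.9 and later), String#encoding returns an Encoding object, and the name is available from it as a String. A minimal sketch under that assumption:

```ruby
# Assumes Ruby 1.9 or later; variable names are illustrative.
s = "caf\u00e9"                  # \u escapes always yield a UTF-8 string

enc      = s.encoding            # an Encoding object
enc_name = enc.name              # the encoding's name as a String
valid    = s.valid_encoding?     # true when the bytes are well-formed

puts "#{enc_name} (#{enc.class}), valid: #{valid}"
```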

- how to do various operations in the new multibyte string, especially
those which will be done differently compared to the classic string;

Just like before, AFAIK. E.g. String#downcase, String#gsub and so on.
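
A rough illustration of the familiar methods working per character rather than per byte, assuming Ruby 1.9 or later (the sample strings are made up for the sketch):

```ruby
# Assumes Ruby 1.9 or later; "\u00ef" is i-diaeresis, "\u00e9" is e-acute.
s = "na\u00efve caf\u00e9"

replaced = s.gsub("\u00e9", "e")    # substitutes a whole character
reversed = "caf\u00e9".reverse      # reverses characters, not bytes
position = s.index("caf")           # a character index, not a byte offset

puts replaced
puts position
```

Here index returns 6 even though "caf" starts at byte offset 7, since the two-byte character earlier in the string counts once.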

- what will happen to the classic string (e.g. will it perhaps be
renamed to ByteArray or something);

The String interface will remain the same. Strings will just gain
encoding facilities, but will remain largely backwards compatible AFAIK.

- comparison rules for cross-encoding and cross-charset strings;

Strings that have the same encoding and the same bytes are equivalent.
Strings that have ASCII compatible, but different encodings and only
ASCII characters are equivalent.
Everything else is different.

I think there will be ways for converting from one encoding to another
one, but I don't know the details.
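
These rules can be tried out on a released Ruby. A minimal sketch, assuming Ruby 1.9 or later, where String#encode is the conversion method that ended up shipping; the variable names are illustrative.

```ruby
# Assumes Ruby 1.9 or later.

# ASCII-only content in two ASCII-compatible encodings compares equal.
ascii_utf8   = "abc".encode("UTF-8")
ascii_latin1 = "abc".encode("ISO-8859-1")
same_ascii   = (ascii_utf8 == ascii_latin1)

# The same non-ASCII character in two encodings does not compare equal.
accent_utf8   = "\u00e9"
accent_latin1 = accent_utf8.encode("ISO-8859-1")
same_accent   = (accent_utf8 == accent_latin1)

# Identical non-ASCII bytes under different encoding labels also differ.
byte_latin1 = [0xE9].pack("C").force_encoding("ISO-8859-1")
byte_cp1252 = [0xE9].pack("C").force_encoding("Windows-1252")
same_bytes_other_label = (byte_latin1 == byte_cp1252)

# Converting back to a common encoding makes the strings comparable again.
round_trip = (accent_latin1.encode("UTF-8") == accent_utf8)

puts [same_ascii, same_accent, same_bytes_other_label, round_trip].inspect
```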

- regexes;

Regexp#encoding is introduced, matching uses similar rules as String
comparison.
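
As a sketch of how this behaves on a released Ruby (1.9 or later assumed): an ASCII-only pattern matches ASCII-only text in another ASCII-compatible encoding, while a pattern fixed to UTF-8 refuses non-ASCII text in a different encoding.

```ruby
# Assumes Ruby 1.9 or later; variable names are illustrative.
ascii_re  = /abc/         # ASCII-only pattern
accent_re = /\u00e9/      # the \u escape fixes this pattern to UTF-8

latin1_ascii  = "xabc".encode("ISO-8859-1")
latin1_accent = "\u00e9".encode("ISO-8859-1")

# ASCII-only pattern against ASCII-only text: matches across encodings.
ascii_match = (ascii_re =~ latin1_ascii)

# UTF-8 pattern against non-ASCII ISO-8859-1 text: raises instead of matching.
mismatch_raised =
  begin
    accent_re =~ latin1_accent
    false
  rescue Encoding::CompatibilityError
    true
  end

puts "ascii match at #{ascii_match}, mismatch raised: #{mismatch_raised}"
```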

- how will Ruby differ from Perl/Python/Java/PHP in Unicode/multibyte
string support (especially since Ruby is a pretty latecomer in the
Unicode scene);

I can't really do an in-depth comparison here, because I don't know the
other languages.

Note that str[0] will return a one-character String and that ?x will do
the same. There will be a new method like String#codepoint for getting
the underlying raw bytes. I think the one-character Strings can later
still be optimized fairly easily so that they can be immediate Objects.
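
In the Ruby releases that shipped later (1.9 and up), the corresponding methods are String#ord and String#codepoints for code points and String#bytes for the raw bytes. A minimal sketch under that assumption:

```ruby
# Assumes Ruby 1.9 or later; "\u00e9" is e-acute.
s = "\u00e9t\u00e9"

first          = s[0]                 # a one-character String
char_literal   = ?x                   # also a one-character String
code_point     = first.ord            # the character's Unicode code point
first_bytes    = first.bytes.to_a     # the raw bytes behind that character
all_codepoints = s.codepoints.to_a    # one code point per character

puts "#{first.class}, #{char_literal.inspect}, #{code_point}, #{first_bytes.inspect}"
```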

ts · Jan 15, 2005

F> AFAIK just the raw bytes as before. (And UTF8 and so on can use multiple
F> bytes for one character.) Note that the RString record of Ruby will get
F> a new field for the encoding.

Are you sure ? or I've not understood what you are trying to say.

Guy Decoux

Yukihiro Matsumoto · Jan 15, 2005

Hi,

In message "Re: Unicode/multibyte string support in Ruby1.9/Ruby summary?"

|F> AFAIK just the raw bytes as before. (And UTF8 and so on can use multiple
|F> bytes for one character.) Note that the RString record of Ruby will get
|F> a new field for the encoding.
|
| Are you sure ? or I've not understood what you are trying to say.

He's right, except that the encoding will be stored using the FL_USER
flags or an instance variable of the string.

matz.

David Garamond · Jan 15, 2005

Florian said:
David said:

If someone could summarize the recent Unicode/multibyte string
discussion on a wiki, that would be nice (and _very_ useful). It will
help programmers prepare their code for Unicode support and backward
compatibility in the future. Topics should include:

Note that lots of this was recently discussed in [ruby-core:04146]. I'll
try to answer the questions as accurately as possible.

Thanks for the answers, Florian. Yes I was following the thread on
ruby-core too, but forgot that this is ruby-talk.

I have created the first draft in RubyGarden:

http://www.rubygarden.org/ruby?UnicodeInRuby2

It's very raw and bare-bones (plus I'm an ASCII guy and totally clueless
regarding multibyte/Unicode). I invite people to improve on it.

Thanks.

Regards,
dave

ts · Jan 16, 2005

Y> He's right, except that the encoding will be stored using the FL_USER
Y> flags or an instance variable of the string.

My question was precisely about

"RString record of Ruby will get a new field"

i.e. I've read ruby_m17n

Guy Decoux

gabriele renzi · Jan 16, 2005

Florian Gross wrote:

David said:

If someone could summarize the recent Unicode/multibyte string
discussion on a wiki, that would be nice (and _very_ useful). It will
help programmers prepare their code for Unicode support and backward
compatibility in the future. Topics should include:

Note that lots of this was recently discussed in [ruby-core:04146]. I'll
try to answer the questions as accurately as possible.

- how will strings be stored in memory (which will probably be different
between CRuby, JRuby, Ruby-on-Parrot, Ruby-on-dotnet, etc);

AFAIK just the raw bytes as before. (And UTF8 and so on can use multiple
bytes for one character.) Note that the RString record of Ruby will get
a new field for the encoding.

- how to check a string's charset, encoding;

String#encoding. It will return a String.

- how to do various operations in the new multibyte string, especially
those which will be done differently compared to the classic string;

Just like before, AFAIK. E.g. String#downcase, String#gsub and so on.

- what will happen to the classic string (e.g. will it perhaps be
renamed to ByteArray or something);

The String interface will remain the same. Strings will just gain
encoding facilities, but will remain largely backwards compatible
AFAIK.

- comparison rules for cross-encoding and cross-charset strings;

Strings that have the same encoding and the same bytes are equivalent.
Strings that have ASCII compatible, but different encodings and only
ASCII characters are equivalent.
Everything else is different.

I think there will be ways for converting from one encoding to another
one, but I don't know the details.

- regexes;

Regexp#encoding is introduced, matching uses similar rules as String
comparison.

- how will Ruby differ from Perl/Python/Java/PHP in Unicode/multibyte
string support (especially since Ruby is a pretty latecomer in the
Unicode scene);

I can't really do an in-depth comparison here, because I don't know the
other languages.

Note that str[0] will return a one-character String and that ?x will do
the same. There will be a new method like String#codepoint for getting
the underlying raw bytes. I think the one-character Strings can later
still be optimized fairly easily so that they can be immediate Objects.

An addition and two questions: the encoding of the source file will be
indicated with the same approach as Python:
#!/usr/bin/ruby
# -*- coding: <encoding name> -*-

or a command-line option (maybe -K) or a compile-time configuration setting.
But I wonder: why can't we keep using $KCODE for this, instead of having
to use that ugly magic string?

Also, not that I am an expert, but is localization supposed to work?
I.e., are accented letters, which are common in European languages,
supposed to be capitalizable and such?
Isn't this related to a charset property of the string, distinct from
its encoding?
IIRC in Parrot-land a string is a <stream of
bytes>+<encoding>+<charset>+<language>; how come we only care
about one of these things?

Also, given that this seems like a huge amount of work, will it be spun
off into a proper independent libm17n library?

Yukihiro Matsumoto · Jan 16, 2005

Hi,

In message "Re: Unicode/multibyte string support in Ruby1.9/Ruby summary?"

|Y> He's right, except that the encoding will be stored using the FL_USER
|Y> flags or an instance variable of the string.
|
| My question was precisely about
|
| "RString record of Ruby will get a new field"
|
| i.e. I've read ruby_m17n

I know that you know. It's just for the rest of us.

matz.

nobu.nokada · Jan 16, 2005

Hi,

At Sun, 16 Jan 2005 20:41:08 +0900,
gabriele renzi wrote in [ruby-talk:126677]:

An addition and two questions: the encoding of the source file will be
indicated with the same approach as Python:
#!/usr/bin/ruby
# -*- coding: <encoding name> -*-

or a command-line option (maybe -K) or a compile-time configuration setting.
But I wonder: why can't we keep using $KCODE for this, instead of having
to use that ugly magic string?

Since encodings may vary per file, -K would not be enough.
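
The per-file behaviour can be observed on a released Ruby. The following is only a sketch, assuming Ruby 2.1 or later (for Tempfile.create); the helper source_encoding_for and the global $sketch_source_encoding are hypothetical names made up for the illustration. It writes two throwaway files with different magic comments and asks each one for its own __ENCODING__.

```ruby
require "tempfile"

# Hypothetical helper: load a throwaway file carrying the given magic
# comment and return the source encoding Ruby assigned to that file.
def source_encoding_for(coding_name)
  Tempfile.create(["magic_comment_sketch", ".rb"]) do |file|
    file.write("# -*- coding: #{coding_name} -*-\n")
    file.write("$sketch_source_encoding = __ENCODING__\n")
    file.flush
    load file.path   # each loaded file is parsed with its own encoding
  end
  $sketch_source_encoding
end

utf8_file   = source_encoding_for("utf-8")
latin1_file = source_encoding_for("iso-8859-1")

puts "first file:  #{utf8_file}"
puts "second file: #{latin1_file}"
```

Both files are loaded by the same process, yet each reports a different source encoding, which a single process-wide switch could not express.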
