Unicode/multibyte string support in Ruby1.9/Ruby summary?

David Garamond

If someone could summarize the recent Unicode/multibyte string
discussion on a wiki, that would be nice (and _very_ useful). It will
help programmers prepare their code for Unicode support and backward
compatibility in the future. Topics should include:

- how will strings be stored in memory (which will probably be different
between CRuby, JRuby, Ruby-on-Parrot, Ruby-on-dotnet, etc);

- how to check a string's charset, encoding;

- how to do various operations in the new multibyte string, especially
those which will be done differently compared to the classic string;

- what will happen to the classic string (e.g. will it perhaps be
renamed to ByteArray or something);

- comparison rules for cross-encoding and cross-charset strings;

- regexes;

- how will Ruby differ from Perl/Python/Java/PHP in Unicode/multibyte
string support (especially since Ruby is a pretty latecomer in the
Unicode scene);

Regards,
dave
 
Florian Gross

David said:
> If someone could summarize the recent Unicode/multibyte string
> discussion on a wiki, that would be nice (and _very_ useful). It will
> help programmers prepare their code for Unicode support and backward
> compatibility in the future. Topics should include:

Note that lots of this was recently discussed in [ruby-core:04146]. I'll
try to answer the questions as accurately as possible.
> - how will strings be stored in memory (which will probably be different
> between CRuby, JRuby, Ruby-on-Parrot, Ruby-on-dotnet, etc);

AFAIK just the raw bytes as before. (And UTF8 and so on can use multiple
bytes for one character.) Note that the RString record of Ruby will get
a new field for the encoding.
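
For readers on a released Ruby (1.9 or later), the "raw bytes plus an encoding tag" model can be observed from Ruby code. This is a minimal sketch against the released String API, which may differ from what was planned when this thread was written; the variable names are illustrative only.

```ruby
# Sketch against released Ruby (1.9+), not the 2005 design under discussion.
s = "caf\u00e9"              # "café"; \u escapes always yield a UTF-8 string

char_count = s.length        # counts characters
byte_count = s.bytesize      # counts stored bytes; U+00E9 needs two in UTF-8
raw_bytes  = s.bytes.to_a    # the underlying byte values

# Relabeling a copy as binary changes the interpretation, not the bytes.
binary = s.dup.force_encoding(Encoding::BINARY)

puts char_count              # 4
puts byte_count              # 5
puts binary.length           # 5, because each byte now counts as one character
```

The same five bytes are reported as four characters or five depending only on the encoding attached to the string.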
> - how to check a string's charset, encoding;

String#encoding. It will return a String.
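
A short sketch of checking a string's encoding, assuming a released Ruby (1.9 or later). Note that in released versions String#encoding returns an Encoding object rather than a String; the name is available through Encoding#name. The variable names here are illustrative.

```ruby
# Sketch against released Ruby (1.9+).
s = "abc".encode("UTF-8")

enc  = s.encoding            # an Encoding object in released Ruby
name = enc.name              # its name as a String

# A string can carry an encoding tag that its bytes do not satisfy.
invalid = "\xFF".dup.force_encoding(Encoding::UTF_8)

puts name                    # UTF-8
puts s.valid_encoding?       # true
puts invalid.valid_encoding? # false
```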
> - how to do various operations in the new multibyte string, especially
> those which will be done differently compared to the classic string;

Just like before, AFAIK. E.g. String#downcase, String#gsub and so on.
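
To illustrate "just like before": in released Ruby (1.9 or later) the familiar String methods work on characters rather than bytes. This is a sketch under that assumption; the sample text is made up for the example.

```ruby
# Sketch against released Ruby (1.9+): methods are character-aware.
s = "na\u00efve caf\u00e9"           # "naïve café"

reversed = s.reverse                 # reverses characters, not bytes
replaced = s.gsub("\u00e9", "e")     # replaces the two-byte character as a unit
position = s.index("c")              # a character index, not a byte offset
chars    = s.each_char.to_a          # one element per character

puts replaced                        # naïve cafe
puts position                        # 6
puts chars.length                    # 10
```

A byte-oriented implementation would report position 7 here, because the ï before it occupies two bytes.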
> - what will happen to the classic string (e.g. will it perhaps be
> renamed to ByteArray or something);

The String interface will remain the same. Strings will just gain
encoding facilities, but will remain largely backwards compatible AFAIK.
> - comparison rules for cross-encoding and cross-charset strings;

Strings that have the same encoding and the same bytes are equivalent.
Strings that have different but ASCII-compatible encodings and contain
only ASCII characters are equivalent.
Everything else is different.

I think there will be ways for converting from one encoding to another
one, but I don't know the details.
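
The comparison rules above can be sketched as follows, assuming a released Ruby (1.9 or later), where the conversion method turned out to be String#encode. The variable names are illustrative.

```ruby
# Sketch against released Ruby (1.9+).
utf8   = "caf\u00e9"                         # UTF-8, 5 bytes
latin1 = utf8.encode("ISO-8859-1")           # same text, 4 bytes

ascii_utf8   = "abc".encode("UTF-8")
ascii_latin1 = "abc".encode("ISO-8859-1")

# ASCII-only content in ASCII-compatible encodings compares equal.
same_ascii = (ascii_utf8 == ascii_latin1)

# Non-ASCII content in different encodings does not.
same_text = (utf8 == latin1)

# Converting back to a common encoding restores equality.
round_trip = (latin1.encode("UTF-8") == utf8)

puts same_ascii   # true
puts same_text    # false
puts round_trip   # true
```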
> - regexes;

Regexp#encoding is introduced; matching uses rules similar to those for
String comparison.
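
A sketch of how this behaves in released Ruby (1.9 or later): an ASCII-only pattern matches strings in any ASCII-compatible encoding, while a pattern fixed to one encoding raises Encoding::CompatibilityError against an incompatible string. Variable names are illustrative.

```ruby
# Sketch against released Ruby (1.9+).
latin1 = "caf\u00e9".encode("ISO-8859-1")

ascii_re = /caf/             # ASCII-only literal
utf8_re  = /caf\u00e9/       # the \u escape fixes this pattern to UTF-8

ascii_match = (ascii_re =~ latin1)   # matches at index 0

# Matching a UTF-8 pattern against a non-ASCII ISO-8859-1 string fails loudly.
mismatch = begin
  utf8_re.match(latin1)
  false
rescue Encoding::CompatibilityError
  true
end

puts ascii_re.encoding       # US-ASCII
puts utf8_re.encoding        # UTF-8
puts mismatch                # true
```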
> - how will Ruby differ from Perl/Python/Java/PHP in Unicode/multibyte
> string support (especially since Ruby is a pretty latecomer in the
> Unicode scene);

I can't really do an in-depth comparison here, because I don't know the
other languages.

Note that str[0] will return a one-character String and that ?x will do
the same. There will be a new method like String#codepoint for getting
the underlying raw bytes. I think the one-character Strings can later
still be optimized fairly easily so that they can be immediate Objects.
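
For comparison, a sketch of how this ended up in released Ruby (1.9 or later): str[0] and ?x do return one-character Strings, numeric code points come from String#ord and String#codepoints, and raw bytes from String#getbyte and String#bytes. These are the released method names, not necessarily the ones planned at the time of this thread.

```ruby
# Sketch against released Ruby (1.9+).
s = "\u00e9t\u00e9"          # "été"

first   = s[0]               # a one-character String, "é"
literal = ?x                 # also a one-character String, "x"

code       = first.ord           # code point of the first character
points     = s.codepoints.to_a   # code points of every character
first_byte = s.getbyte(0)        # first raw byte of the UTF-8 form

puts code                    # 233
puts first_byte              # 195
```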
 
ts

F> AFAIK just the raw bytes as before. (And UTF8 and so on can use multiple
F> bytes for one character.) Note that the RString record of Ruby will get
F> a new field for the encoding.

Are you sure? Or have I not understood what you are trying to say?


Guy Decoux
 
Yukihiro Matsumoto

Hi,

In message "Re: Unicode/multibyte string support in Ruby1.9/Ruby summary?"

|F> AFAIK just the raw bytes as before. (And UTF8 and so on can use multiple
|F> bytes for one character.) Note that the RString record of Ruby will get
|F> a new field for the encoding.
|
| Are you sure? Or have I not understood what you are trying to say?

He's right, except that the encoding will be stored using the FL_USER
flags or an instance variable of the string.

matz.
 
David Garamond

Florian said:
> David said:
>> If someone could summarize the recent Unicode/multibyte string
>> discussion on a wiki, that would be nice (and _very_ useful). It will
>> help programmers prepare their code for Unicode support and backward
>> compatibility in the future. Topics should include:
>
> Note that lots of this was recently discussed in [ruby-core:04146]. I'll
> try to answer the questions as accurately as possible.

Thanks for the answers, Florian. Yes I was following the thread on
ruby-core too, but forgot that this is ruby-talk.

I have created the first draft in RubyGarden:

http://www.rubygarden.org/ruby?UnicodeInRuby2

It's very raw and bare-bones (plus I'm an ASCII guy and totally clueless
regarding multibyte/Unicode). I invite people to improve on it.

Thanks.

Regards,
dave
 
ts

Y> He's right, except that the encoding will be stored using the FL_USER
Y> flags or an instance variable of the string.

My question was precisely about

"RString record of Ruby will get a new field"

i.e. I've read ruby_m17n :)


Guy Decoux
 
gabriele renzi

An addition and two questions: the encoding of the source file will be
indicated using the same approach as Python:
#!/usr/bin/ruby
# -*- coding: <encoding name> -*-

or by a command-line option (maybe -K) or a compile-time configuration
setting. But I wonder: why can't we keep using $KCODE for this instead
of having to use that ugly magic string?

Also, not that I am an expert, but is localization supposed to work?
I.e., are accented letters, which are common in European languages,
supposed to be capitalizable and so on?
Isn't this related to a charset property of the string, distinct from
its encoding?
IIRC in Parrot-land a string is a <stream of
bytes>+<encoding>+<charset>+<language>; how come we care about
just one of these things?
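
On the capitalization question, a sketch of what released Ruby does, assuming Ruby 2.4 or later (earlier releases only mapped ASCII letters). The sample word is made up for the example.

```ruby
# Sketch assuming Ruby 2.4+, where case mapping covers non-ASCII letters.
word = "\u00e9cole"                  # "école"

capitalized = word.capitalize        # maps the accented first letter too
ascii_only  = word.upcase(:ascii)    # :ascii restricts mapping to A-Z/a-z

puts capitalized                     # École
puts ascii_only                      # éCOLE
```

The string still carries only an encoding; there is no separate charset or language field on it in this model.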

Also, given that this seems like a huge amount of work... will it be
spun off into a proper independent libm17n library? :)
 
Yukihiro Matsumoto

Hi,

In message "Re: Unicode/multibyte string support in Ruby1.9/Ruby summary?"

|Y> He's right, except that the encoding will be stored using the FL_USER
|Y> flags or an instance variable of the string.
|
| My question was precisely about
|
| "RString record of Ruby will get a new field"
|
| i.e. I've read ruby_m17n :)

I know that you know. It's just for rest of us.

matz.
 
nobu.nokada

Hi,

At Sun, 16 Jan 2005 20:41:08 +0900,
gabriele renzi wrote in [ruby-talk:126677]:
> an addition and two questions: the encoding of the source file will be
> indicated with the same approach of python:
> #!/usr/bin/ruby
> # -*- coding: <encoding name> -*-
>
> or command line option (maybe -K ) or compile time configuration time.
> But I wonder: why can't we keep using $KCODE for this and have to use
> that ugly magic string?

Since encodings may vary per file, -K would not be enough.
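
The per-file point can be sketched in released Ruby (1.9 or later), where the magic comment sets each file's source encoding and `__ENCODING__` reports it. The helper `source_encoding_for` and the global `$sample_encoding` are hypothetical names invented for this example.

```ruby
require "tmpdir"

# Hypothetical helper: writes a tiny script carrying the given magic comment,
# loads it, and returns the source encoding that file reports for itself.
def source_encoding_for(coding_name)
  Dir.mktmpdir do |dir|
    path = File.join(dir, "sample.rb")
    File.write(path, "# -*- coding: #{coding_name} -*-\n" \
                     "$sample_encoding = __ENCODING__\n")
    load path
    $sample_encoding
  end
end

utf8_file   = source_encoding_for("utf-8")
latin1_file = source_encoding_for("iso-8859-1")

puts utf8_file     # UTF-8
puts latin1_file   # ISO-8859-1
```

Two files loaded by the same process report different source encodings, which a single process-wide switch could not express.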
 
