Unicode/multibyte string support in Ruby1.9/Ruby summary?

Discussion in 'Ruby' started by David Garamond, Jan 15, 2005.

  1. If someone could summarize the recent Unicode/multibyte string
    discussion on a wiki, that would be nice (and _very_ useful). It will
    help programmers prepare their code for Unicode support and backward
    compatibility in the future. Topics should include:

    - how will strings be stored in memory (which will probably differ
    between CRuby, JRuby, Ruby-on-Parrot, Ruby-on-dotnet, etc.);

    - how to check a string's charset, encoding;

    - how to do various operations on the new multibyte string, especially
    those which will be done differently compared to the classic string;

    - what will happen to the classic string (e.g. will it perhaps be
    renamed to ByteArray or something);

    - comparison rules for cross-encoding and cross-charset strings;

    - regexes;

    - how will Ruby differ from Perl/Python/Java/PHP in Unicode/multibyte
    string support (especially since Ruby is a fairly late arrival on the
    Unicode scene);

    Regards,
    dave
     
    David Garamond, Jan 15, 2005
    #1

  2. David Garamond wrote:

    > If someone could summarize the recent Unicode/multibyte string
    > discussion on a wiki, that would be nice (and _very_ useful). It will
    > help programmers prepare their code for Unicode support and backward
    > compatibility in the future. Topics should include:


    Note that lots of this was recently discussed in [ruby-core:04146]. I'll
    try to answer the questions as accurately as possible.

    > - how will strings be stored in memory (which will probably differ
    > between CRuby, JRuby, Ruby-on-Parrot, Ruby-on-dotnet, etc.);


    AFAIK just the raw bytes as before. (And UTF8 and so on can use multiple
    bytes for one character.) Note that the RString record of Ruby will get
    a new field for the encoding.

    > - how to check a string's charset, encoding;


    String#encoding. It will return a String.
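
    A note for later readers: as far as I know, the Ruby that was eventually
    released (1.9 and up) returns an Encoding object from String#encoding
    rather than a String. A minimal sketch of checking how a string is
    labelled; describe_encoding is a made-up helper name, not part of Ruby:

    ```ruby
    # Summarizes how a string is labelled (illustrative helper, not a Ruby API).
    def describe_encoding(str)
      {
        name:  str.encoding.name,    # e.g. "UTF-8"
        klass: str.encoding.class,   # Encoding in released Ruby, not String
        valid: str.valid_encoding?   # do the bytes form valid characters?
      }
    end

    # A literal containing a \u escape is always UTF-8.
    p describe_encoding("caf\u00e9")

    # The same character as a single Latin-1 byte, re-labelled accordingly.
    p describe_encoding("caf\xE9".dup.force_encoding("ISO-8859-1"))
    ```

    force_encoding only changes the label; the bytes stay as they were.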

    > - how to do various operations on the new multibyte string, especially
    > those which will be done differently compared to the classic string;


    Just like before, AFAIK. E.g. String#downcase, String#gsub and so on.

    > - what will happen to the classic string (e.g. will it perhaps be
    > renamed to ByteArray or something);


    The String interface will remain the same. Strings will just gain the
    encoding facilities, but will remain largely backwards compatible AFAIK.

    > - comparison rules for cross-encoding and cross-charset strings;


    Strings that have the same encoding and the same bytes are equivalent.
    Strings that have ASCII compatible, but different encodings and only
    ASCII characters are equivalent.
    Everything else is different.

    I think there will be ways for converting from one encoding to another
    one, but I don't know the details.
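
    The comparison rules above can be tried out in a released Ruby (1.9 or
    later). This is only a sketch: transcode and relabel are made-up helper
    names wrapping String#encode and String#force_encoding.

    ```ruby
    # Converts the characters to another encoding; the bytes may change.
    def transcode(str, to)
      str.encode(to)
    end

    # Keeps the bytes and changes only the encoding label.
    def relabel(str, to)
      str.dup.force_encoding(to)
    end

    utf8   = "caf\u00e9"                      # bytes 63 61 66 C3 A9
    latin1 = transcode(utf8, "ISO-8859-1")    # bytes 63 61 66 E9

    p "cafe" == transcode("cafe", "ISO-8859-1")  # => true  (ASCII only, ASCII-compatible encodings)
    p utf8 == latin1                             # => false (different bytes and encodings)
    p utf8 == transcode(latin1, "UTF-8")         # => true  (converted back, same encoding and bytes)
    p latin1 == relabel(latin1, "Windows-1252")  # => false (same bytes, different encodings)
    ```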

    > - regexes;


    Regexp#encoding will be introduced; matching uses rules similar to those
    for String comparison.
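
    A rough illustration of how this looks in a released Ruby, as far as I
    can tell: an ASCII-only regexp matches strings in any ASCII-compatible
    encoding, while a regexp fixed to UTF-8 refuses a non-ASCII Latin-1
    string. try_match is a made-up helper that turns the error into a symbol.

    ```ruby
    # Attempts a match and reports an encoding clash instead of raising.
    def try_match(regexp, str)
      regexp.match(str) ? true : false
    rescue Encoding::CompatibilityError
      :incompatible
    end

    latin1 = "caf\xE9".dup.force_encoding("ISO-8859-1")

    puts(/caf/.encoding)         # US-ASCII
    puts(/caf\u00e9/.encoding)   # UTF-8 (the \u escape fixes the encoding)

    p try_match(/caf/, latin1)                        # => true
    p try_match(/caf\u00e9/, latin1)                  # => :incompatible
    p try_match(/caf\u00e9/, latin1.encode("UTF-8"))  # => true
    ```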

    > - how will Ruby differ from Perl/Python/Java/PHP in Unicode/multibyte
    > string support (especially since Ruby is a fairly late arrival on the
    > Unicode scene);


    I can't really do an in-depth comparison here, because I don't know the
    other languages.

    Note that str[0] will return a one-character String and that ?x will do
    the same. There will be a new method like String#codepoint for getting
    the underlying raw bytes. I think the one-character Strings can later
    still be optimized fairly easily so that they can be immediate Objects.
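
    For what it's worth, this is roughly how it looks in released Ruby (1.9
    and later), where the methods ended up being called String#ord and
    String#bytes as far as I know. inspect_char is a made-up helper name.

    ```ruby
    # Returns the code point and the raw bytes of a one-character string.
    def inspect_char(ch)
      { codepoint: ch.ord, bytes: ch.bytes.to_a }
    end

    str = "\u00e9t\u00e9"   # three characters, five bytes in UTF-8

    p str[0]                # a one-character String, not an Integer
    p ?x                    # also a one-character String
    p inspect_char(str[0])  # code point 233, bytes 195 and 169
    p [str.length, str.bytesize]
    ```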
     
    Florian Gross, Jan 15, 2005
    #2

  3. >>>>> "F" == Florian Gross <> writes:

    F> AFAIK just the raw bytes as before. (And UTF8 and so on can use multiple
    F> bytes for one character.) Note that the RString record of Ruby will get
    F> a new field for the encoding.

    Are you sure? Or have I not understood what you are trying to say?


    Guy Decoux
     
    ts, Jan 15, 2005
    #3
  4. Hi,

    In message "Re: Unicode/multibyte string support in Ruby1.9/Ruby summary?"
    on Sun, 16 Jan 2005 02:58:20 +0900, ts <> writes:

    |F> AFAIK just the raw bytes as before. (And UTF8 and so on can use multiple
    |F> bytes for one character.) Note that the RString record of Ruby will get
    |F> a new field for the encoding.
    |
    | Are you sure? Or have I not understood what you are trying to say?

    He's right, except that the encoding will be stored using the FL_USER
    flags or an instance variable of the string.

    matz.
     
    Yukihiro Matsumoto, Jan 15, 2005
    #4
  5. Florian Gross wrote:
    > David Garamond wrote:
    >
    >> If someone could summarize the recent Unicode/multibyte string
    >> discussion on a wiki, that would be nice (and _very_ useful). It will
    >> help programmers prepare their code for Unicode support and backward
    >> compatibility in the future. Topics should include:

    >
    > Note that lots of this was recently discussed in [ruby-core:04146]. I'll
    > try to answer the questions as accurately as possible.


    Thanks for the answers, Florian. Yes I was following the thread on
    ruby-core too, but forgot that this is ruby-talk.

    I have created the first draft in RubyGarden:

    http://www.rubygarden.org/ruby?UnicodeInRuby2

    It's very raw and bare-bones (plus I'm an ASCII guy and totally clueless
    regarding multibyte/Unicode). I invite people to improve on it.

    Thanks.

    Regards,
    dave
     
    David Garamond, Jan 15, 2005
    #5
  6. >>>>> "Y" == Yukihiro Matsumoto <> writes:

    Y> He's right, except that the encoding will be stored using the FL_USER
    Y> flags or an instance variable of the string.

    My question was precisely about

    "RString record of Ruby will get a new field"

    i.e. I've read ruby_m17n :)


    Guy Decoux
     
    ts, Jan 16, 2005
    #6
  7. Florian Gross wrote:
    > David Garamond wrote:
    >
    >> If someone could summarize the recent Unicode/multibyte string
    >> discussion on a wiki, that would be nice (and _very_ useful). It will
    >> help programmers prepare their code for Unicode support and backward
    >> compatibility in the future. Topics should include:
    >
    > Note that lots of this was recently discussed in [ruby-core:04146]. I'll
    > try to answer the questions as accurately as possible.
    >
    >> - how will strings be stored in memory (which will probably differ
    >> between CRuby, JRuby, Ruby-on-Parrot, Ruby-on-dotnet, etc.);
    >
    > AFAIK just the raw bytes as before. (And UTF8 and so on can use multiple
    > bytes for one character.) Note that the RString record of Ruby will get
    > a new field for the encoding.
    >
    >> - how to check a string's charset, encoding;
    >
    > String#encoding. It will return a String.
    >
    >> - how to do various operations on the new multibyte string, especially
    >> those which will be done differently compared to the classic string;
    >
    > Just like before, AFAIK. E.g. String#downcase, String#gsub and so on.
    >
    >> - what will happen to the classic string (e.g. will it perhaps be
    >> renamed to ByteArray or something);
    >
    > The String interface will remain the same. Strings will just gain the
    > encoding facilities, but will remain largely backwards compatible
    > AFAIK.
    >
    >> - comparison rules for cross-encoding and cross-charset strings;
    >
    > Strings that have the same encoding and the same bytes are equivalent.
    > Strings that have ASCII compatible, but different encodings and only
    > ASCII characters are equivalent.
    > Everything else is different.
    >
    > I think there will be ways for converting from one encoding to another
    > one, but I don't know the details.
    >
    >> - regexes;
    >
    > Regexp#encoding will be introduced; matching uses rules similar to those
    > for String comparison.
    >
    >> - how will Ruby differ from Perl/Python/Java/PHP in Unicode/multibyte
    >> string support (especially since Ruby is a fairly late arrival on the
    >> Unicode scene);
    >
    > I can't really do an in-depth comparison here, because I don't know the
    > other languages.
    >
    > Note that str[0] will return a one-character String and that ?x will do
    > the same. There will be a new method like String#codepoint for getting
    > the underlying raw bytes. I think the one-character Strings can later
    > still be optimized fairly easily so that they can be immediate Objects.



    An addition and two questions: the encoding of the source file will be
    indicated the same way as in Python:
    #!/usr/bin/ruby
    # -*- coding: <encoding name> -*-

    or by a command-line option (maybe -K) or a compile-time configuration
    setting. But I wonder: why can't we keep using $KCODE for this instead of
    having to use that ugly magic string?

    Also, not that I am an expert, but is localization supposed to work?
    I.e., are accented letters, which are common in European languages,
    supposed to be capitalizable and such?
    Isn't this related to a charset property of the string, distinct from
    its encoding?
    IIRC in Parrot-land a string is a <stream of
    bytes>+<encoding>+<charset>+<language>; how come we care about just
    one of these things?

    Also, given that this seems like a huge amount of work, will it be spun
    off into a proper independent libm17n library? :)
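
    On the capitalization question, a small sketch of how case mapping
    behaves in much later releases, assuming Ruby 2.4 or newer (earlier
    versions only changed the case of ASCII letters, as far as I know).
    The language-specific part is passed as an option rather than stored
    on the string; shout is a made-up helper name.

    ```ruby
    # Upcases a string, optionally with a language-specific rule such as :turkic.
    def shout(str, *options)
      str.upcase(*options)
    end

    p shout("\u00e9t\u00e9")   # accented letters are capitalized
    p shout("i")               # plain "I"
    p shout("i", :turkic)      # dotted capital I (U+0130) under Turkic rules
    ```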
     
    gabriele renzi, Jan 16, 2005
    #7
  8. Hi,

    In message "Re: Unicode/multibyte string support in Ruby1.9/Ruby summary?"
    on Sun, 16 Jan 2005 19:59:30 +0900, ts <> writes:

    |Y> He's right, except that the encoding will be stored using the FL_USER
    |Y> flags or an instance variable of the string.
    |
    | My question was precisely about
    |
    | "RString record of Ruby will get a new field"
    |
    | i.e. I've read ruby_m17n :)

    I know that you know. It's just for the rest of us.

    matz.
     
    Yukihiro Matsumoto, Jan 16, 2005
    #8
  9. Hi,

    At Sun, 16 Jan 2005 20:41:08 +0900,
    gabriele renzi wrote in [ruby-talk:126677]:
    > An addition and two questions: the encoding of the source file will be
    > indicated the same way as in Python:
    > #!/usr/bin/ruby
    > # -*- coding: <encoding name> -*-
    >
    > or by a command-line option (maybe -K) or a compile-time configuration
    > setting. But I wonder: why can't we keep using $KCODE for this instead of
    > having to use that ugly magic string?


    Since encodings may vary from file to file, -K alone would not be enough.
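
    A small sketch of the per-file point, assuming a released Ruby (1.9 or
    later) where the magic comment and __ENCODING__ are available. The
    helper source_encoding_for, the file name sample.rb and the global
    $sample_encoding are all made up for the illustration.

    ```ruby
    require "tmpdir"

    # Writes a tiny script carrying the given magic comment, loads it, and
    # returns the source encoding Ruby used while parsing that file.
    def source_encoding_for(coding_name)
      Dir.mktmpdir do |dir|
        path = File.join(dir, "sample.rb")
        File.write(path, "# -*- coding: #{coding_name} -*-\n$sample_encoding = __ENCODING__\n")
        load path
      end
      $sample_encoding
    end

    puts source_encoding_for("iso-8859-1")   # ISO-8859-1
    puts source_encoding_for("utf-8")        # UTF-8
    ```

    Both files are loaded by the same process, yet each is parsed with its
    own encoding, which a single process-wide switch could not express.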

    --
    Nobu Nakada
     
    , Jan 16, 2005
    #9
