Converting to UCS-2 or UTF-16 for use by a C extension

Discussion in 'Ruby' started by Wincent Colaiuta, Jun 7, 2007.

  1. I'm working on a C extension that embeds an ANTLR parser, and I need
    to convert a Ruby input string into UCS-2 or possibly UTF-16 encoding.

    I've got a working implementation but I suspect that it is flawed and
    just wanted to ask if this is the right way to do it. The basic idea
    is as follows (in pseudo-code):

    // 1. unpack to array of UTF8 characters
    utf8 = input.unpack("C*");

    // 2. repack
    packed = utf8.pack("U*");

    // 3. convert using Iconv
    ucs2 = Iconv.iconv("UCS-2", "UTF-8", packed).first

    // 4. freeze
    ucs2.freeze

    // 5. get pointer, and length (in 16 bit words)
    pointer = StringValuePtr(ucs2); // this bit in C
    count = ucs2.length / 2;

    // 6. hand off to the parser...
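
    Steps 1 and 2 can be checked in isolation. The following is only a
    sketch (the variable names are illustrative): unpack("C*") returns
    byte values rather than characters, and pack("U*") then encodes each
    of those numbers as a separate code point. For input that is already
    UTF-8, the pair appears to double-encode the text rather than leave
    it unchanged.

```ruby
# Check what the unpack("C*") / pack("U*") round trip does to UTF-8 input.
euro = "\xE2\x82\xAC"              # the three UTF-8 bytes of the euro sign

bytes    = euro.unpack("C*")       # "C*" yields raw byte values
repacked = bytes.pack("U*")        # "U*" treats each value as a code point

p bytes                            # [226, 130, 172]
p repacked.unpack("C*")            # [195, 162, 194, 130, 194, 172]

# By contrast, "U*" on the original string decodes it as UTF-8.
p euro.unpack("U*")                # [8364], i.e. U+20AC
```

    If that reading is right, the repacked string holds U+00E2, U+0082
    and U+00AC instead of the single euro sign, so steps 1 and 2 would
    corrupt any non-ASCII input before Iconv sees it.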

    My doubts are basically as follows:

    - I'm doing the unpack/repack because I am not sure that my string is
    encoded internally as UTF-8... it *seems* to be, because if I type a
    string like "€" in irb then I can see that it's composed of three
    bytes in UTF-8 ("\342\202\254")

    - Is it in UTF-8 only because my system's locale is set that way?
    Might it be different on other people's machines? (And if so, how
    would I find out what the encoding is?)

    - In the case that the encoding is *not* UTF-8, does my "round-trip"
    unpack/pack trick actually get it into UTF-8? (I don't think it will!
    In which case the round-trip is a waste of time.)

    - And once I've got the String in UCS-2, does StringValuePtr give me
    access to the raw UCS-2 encoded data like I think it does? (seems to)

    - Does calling length on the UCS-2 encoded string always give the
    result in bytes? (I am almost certain that it does)

    - Is there some more elegant way to get an arbitrary Ruby string into
    UCS-2 so that it can be handed off to the C parser?
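
    One possible alternative, offered as a sketch and not as the
    established way: assuming the input really is UTF-8, unpack("U*")
    decodes it into code points, and pack("n*") writes them out as
    big-endian 16-bit units. The helper name utf8_to_utf16be is
    hypothetical, not part of any library. Code points above U+FFFF do
    not fit in UCS-2, so the sketch emits UTF-16 surrogate pairs for
    them.

```ruby
# Hypothetical helper (not from any library): encode UTF-8 text as
# big-endian UTF-16 and return the raw bytes as a frozen string.
def utf8_to_utf16be(utf8)
  units = []
  utf8.unpack("U*").each do |cp|       # "U*" decodes UTF-8 into code points
    if cp < 0x10000
      units << cp                      # BMP: one 16-bit unit, same as UCS-2
    else
      cp -= 0x10000                    # beyond the BMP: build a surrogate pair
      units << (0xD800 | (cp >> 10)) << (0xDC00 | (cp & 0x3FF))
    end
  end
  units.pack("n*").freeze              # "n" = 16-bit unsigned, big-endian
end

utf16 = utf8_to_utf16be("a\xE2\x82\xAC")   # "a" followed by the euro sign
count = utf16.length / 2                   # packed string is binary: length is bytes

p utf16.unpack("n*")                       # [97, 8364]
p count                                    # 2
```

    The byte order is a choice made here for illustration: "n*" is
    big-endian, "v*" is little-endian, and "S*" uses the machine's native
    order, which may be what a C parser reading 16-bit words expects.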

    Cheers,
    Wincent
     