Problem with String encoding when modifying it in C method

  • Thread starter Iñaki Baz Castillo

Iñaki Baz Castillo

Hi, I've added a method "multi_capitalize" to the String class. The
method is implemented in C and basically modifies the string:

"record-roUTE".multi_capitalize => "Record-Route"

The problem is that after the method execution, the new String has
ASCII-8BIT encoding, while the original string had UTF-8 (using Ruby
1.9.1).

--------------------------------------------------------------------------------
irb> hname = "record-rouTE-€"
"record-rouTE-€"

irb> hname.encoding
#<Encoding:UTF-8>

irb> hname2 = hname.multi_capitalize
"Record-Route-\xE2\x82\xAC" <------- !!!

irb> hname2.encoding
#<Encoding:ASCII-8BIT> <------- !!!

irb> hname2.force_encoding("utf-8")
"Record-Route-€"

irb> hname2.encoding
#<Encoding:UTF-8>
--------------------------------------------------------------------------------

What should I add to my C method to maintain the UTF-8 encoding
after the changes to the string?
Could I invoke the C "force_encoding()" function from the C code
before returning the modified string? How to invoke it?

Thanks a lot.


--
Iñaki Baz Castillo
 

Andre Nathan

Could I invoke the C "force_encoding()" function from the C code
before returning the modified string? How to invoke it?

You can call it as (untested):

rb_funcall(str, rb_intern("force_encoding"), 1, rb_str_new2("utf-8"));

I'm not sure how to make your multi-capitalize method do the right
thing, but maybe reading the source of rb_str_capitalize_bang in
string.c helps.

Best,
Andre
 

Iñaki Baz Castillo

On Friday 03 April 2009, Andre Nathan wrote:
You can call it as (untested):

rb_funcall(str, rb_intern("force_encoding"), 1, rb_str_new2("utf-8"));

I'm not sure how to make your multi-capitalize method do the right
thing, but maybe reading the source of rb_str_capitalize_bang in
string.c helps.

Thanks a lot, I will check it.

--
Iñaki Baz Castillo
 

Iñaki Baz Castillo

On Friday 03 April 2009, Iñaki Baz Castillo wrote:
On Friday 03 April 2009, Andre Nathan wrote:

Thanks a lot, I will check it.

Yes, rb_str_capitalize_bang handles a lot of stuff related to encoding:

c = rb_enc_codepoint(s, send, enc);
if (rb_enc_islower(c, enc)) {
    rb_enc_mbcput(rb_enc_toupper(c, enc), s, enc);
    modify = 1;
}
s += rb_enc_codelen(c, enc);

so this is the way :)
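
For comparison, the behaviour discussed in this thread can be sketched in plain Ruby, which may be handy as a reference when testing the C version. This is only a sketch under assumptions: the name multi_capitalize_reference is hypothetical, and the rule "capitalize every hyphen-separated part" is inferred from the single "record-roUTE" example above. It leans on String#capitalize, which in Ruby 1.9 is built on the rb_str_capitalize_bang code quoted above, so the result keeps the encoding of the receiver.

```ruby
# Hypothetical pure-Ruby reference for the C method discussed in the thread.
# Assumption: every run of non-hyphen characters is capitalized on its own.
def multi_capitalize_reference(str)
  # gsub returns a new string in the same encoding as str, and
  # String#capitalize is encoding-aware, so no relabeling is needed.
  str.gsub(/[^-]+/) { |part| part.capitalize }
end

original = "record-rouTE-\u20AC"   # "\u20AC" is the euro sign; the literal is UTF-8
result   = multi_capitalize_reference(original)

p result
p result.encoding == original.encoding
```

Comparing the output of the C method against such a reference, both for content and for encoding, would catch the ASCII-8BIT result reported at the start of the thread.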

Thanks a lot.

--
Iñaki Baz Castillo
 

KUBO Takehiro

Hi,

Hi, I've added a method "multi_capitalize" to String class. This
method is done in C and basically modifies the string:

"record-roUTE".multi_capitalize => "Record-Route"

The problem is that after the method execution, the new String has
ASCII-8BIT encoding, while the original string had UTF-8 (using Ruby
1.9.1).

rb_encoding *enc = rb_enc_get(original_string);

/* create a new string with the same encoding as the original string */
return rb_enc_str_new(char_pointer, length, enc);

rb_str_new() makes an ASCII-8BIT string.
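
The same idea can be illustrated at the Ruby level, without compiling an extension. A minimal sketch, assuming the helper name rebuild_with_encoding (hypothetical, not part of any API): Array#pack("C*") stands in for rb_str_new(), since it also yields an ASCII-8BIT string, and copying the encoding from the source string plays the role of rb_enc_get() followed by rb_enc_str_new().

```ruby
# Rebuild a string from raw bytes, then attach the encoding of `source`.
# Hypothetical helper for illustration only.
def rebuild_with_encoding(bytes, source)
  raw = bytes.pack("C*")               # raw bytes, tagged ASCII-8BIT
  raw.force_encoding(source.encoding)  # relabel only; the bytes are unchanged
end

original = "Record-Route-\u20AC"       # "\u20AC" is the euro sign; the literal is UTF-8
raw_copy = original.bytes.pack("C*")
copy     = rebuild_with_encoding(original.bytes, original)

p raw_copy.encoding == Encoding::ASCII_8BIT
p copy.encoding == original.encoding
p copy == original
```

Taking the encoding from the original string rather than hard-coding "utf-8" keeps the method usable for input in other encodings.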
 

Iñaki Baz Castillo

On Saturday 04 April 2009, KUBO Takehiro wrote:
Hi,

rb_encoding *enc = rb_enc_get(original_string);

/* create a new string with the same encoding as the original string */
return rb_enc_str_new(char_pointer, length, enc);

rb_str_new() makes an ASCII-8BIT string.

Thanks.

--
Iñaki Baz Castillo
 
