to_yaml and international characters

h3raLd · Oct 23, 2007

Hello,

I noticed some weird behavior when converting a string containing
international characters to YAML:

irb(main):002:0> 'test òùè'.to_yaml
=> "--- \"test \\x95\\x97\\x8A\"\n"
irb(main):003:0>

....but:

irb(main):001:0> 'test òùè'
=> "test \225\227\212"

Basically, the to_yaml method seems to use some strange hex escape
sequences which do not correspond to ANSI, UTF-8 or windows-1252...
The funny part is that when I load the same string from YAML, it is
displayed correctly in the console. This would be fine, except that
when I tried to save it to a file the international characters are not
displayed properly (or better, they are converted to the corresponding
ANSI/UTF-8 characters). What's going on here? What encoding does
to_yaml use to escape international characters?
According to the docs it should be UTF-8, but apparently it is not.

Ruby version: 1.8.6
OS: Windows XP

Any ideas?

Luis Parravicini · Oct 23, 2007

I noticed some weird behavior when converting a string containing
international characters to YAML:

irb(main):002:0> 'test òùè'.to_yaml
=> "--- \"test \\x95\\x97\\x8A\"\n"
irb(main):003:0>

...but:

irb(main):001:0> 'test òùè'
=> "test \225\227\212"

\225\227\212 is the same as \x95\x97\x8A, the former in octal, and the
latter in hex.

irb(main):002:0> 0x95.to_s(8)
=> "225"
irb(main):003:0> 0x97.to_s(8)
=> "227"
irb(main):004:0> 0x8a.to_s(8)
=> "212"

Bye

--
Luis Parravicini
http://ktulu.com.ar/blog/

h3raLd · Oct 23, 2007

\225\227\212 is the same as \x95\x97\x8A, the former in octal, and the
latter in hex.

irb(main):002:0> 0x95.to_s(8)
=> "225"
irb(main):003:0> 0x97.to_s(8)
=> "227"
irb(main):004:0> 0x8a.to_s(8)
=> "212"

Bye

Thanks a lot, this solves part of the mystery!

I figured out the other half, unfortunately: the reason why I can't
view the characters as ANSI or UTF-8 is that I'm typing them in a DOS
console, which means, unfortunately, "Code Page 437"
(http://en.wikipedia.org/wiki/Code_page_437).
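
For later readers, here is a minimal sketch of that conclusion. It assumes Ruby 1.9 or newer, where strings carry an encoding (the 1.8.6 used in this thread does not), and it assumes the console really was using CP437, which Ruby calls IBM437. The variable names are only illustrative.

```ruby
# The three bytes irb showed, written as hex escapes.
raw = "\x95\x97\x8A".b                 # binary string, no encoding assumed

# Assumption: the bytes were typed in a console using CP437 (IBM437).
as_cp437 = raw.dup.force_encoding("IBM437")
utf8     = as_cp437.encode("UTF-8")

puts utf8                              # the text as it was originally typed
p utf8.bytes.map { |b| "%02X" % b }    # its UTF-8 bytes, two per character
```

Read as CP437, the bytes come back as the accented letters from the first post, which is why neither an ANSI nor a UTF-8 viewer shows them correctly.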

Richard Conroy · Oct 23, 2007

Hello,

I noticed some weird behavior when converting a string containing
international characters to YAML:

irb(main):002:0> 'test òùè'.to_yaml
=> "--- \"test \\x95\\x97\\x8A\"\n"
irb(main):003:0>

IIRC the various YAML implementations in each language can choose
to output UTF-8, or unicode-escaped ASCII. I think a YAML implementation
has to be able to read either.
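
A small sketch of that point, assuming a current Ruby whose bundled YAML library is Psych (not the Syck library that shipped with 1.8.6): a string written out by to_yaml loads back unchanged, and a double-quoted scalar written with \u escapes loads as the same characters.

```ruby
require "yaml"

text = "test \u00F2\u00F9\u00E8"       # "test òùè" as a UTF-8 string

# Dumping and loading again should give back the same string,
# whichever form the emitter chooses to write.
round_trip = YAML.load(text.to_yaml)

# Single quotes keep the backslashes literal, so the YAML parser (not Ruby)
# is the one interpreting the \u escapes here.
from_escapes = YAML.load('--- "test \u00F2\u00F9\u00E8"')

puts round_trip == text
puts from_escapes == text
```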

Jamal Bengeloun · Oct 30, 2007

Sorry but I do not get it. Plus I am not sure it is only related to
YAML.

I am working on something similar and the only answers I can relate are
those in Python (such as:
http://www.reportlab.com/i18n/python_unicode_tutorial.html). I mean I
got so far as understanding that:

é gets translated to \202
è gets translated to \212
à gets translated to \205
ç gets translated to \207
â gets translated to \203
ê gets translated to \210
î gets translated to \214
ô gets translated to \223
û gets translated to \226
ä gets translated to \204
ë gets translated to \211
ï gets translated to \213
ö gets translated to \224
ù gets translated to \227

But why?

The app I am working on gets its data from different sources (yaml
files, dBaseIV files, MS Access files) and then produces xml files (via
builder).

When using print you get the original character. When using p, you get
the escaped equivalent.

And that's only the start of your problems! When trying to get those
characters into utf-8

é gets translated to \202 that then gets translated to ‚
è gets translated to \212 that then gets translated to Š
à gets translated to \205 that then gets translated to …
ç gets translated to \207 that then gets translated to ‡
â gets translated to \203 that then gets translated to ƒ
ê gets translated to \210 that then gets translated to ˆ
î gets translated to \214 that then gets translated to Œ
ô gets translated to \223 that then gets translated to “
û gets translated to \226 that then gets translated to –
ä gets translated to \204 that then gets translated to „
ë gets translated to \211 that then gets translated to ‰
ï gets translated to \213 that then gets translated to ‹
ö gets translated to \224 that then gets translated to ”
ù gets translated to \227 that then gets translated to —

Does someone have an explanation?

Does anyone know how to get those characters into the final xml files?

Any help would be greatly appreciated.

Jamal

Konrad Meyer · Oct 30, 2007

Quoth Jamal Bengeloun:

...

The app I am working on gets its data from different sources (yaml
files, dBaseIV files, MS Access files) and then produces xml files (via
builder).

When using print you get the original character. When using p, you get
the escaped equivalent.

And that's only the start of your problems! When trying to get those
characters into utf-8

...

Does someone have an explanation?

Does anyone know how to get those characters into the final xml files?

Any help would be greatly appreciated.

Jamal

In short, you're asking what the difference is between "\303\251", "é",
and "‚".

The first is an octal sequence embedded in a string (it happens to be the
same as UTF-8 'é'). The second is also UTF-8 'é'. These two are the same
string ("\303\251" == "é"). The last, '‚', is the HTML-escaped notation
for an 'é' (I'm trusting your email for the correct number here). That is,
literally "‚" != "é", but they should render the same to a browser
capable of displaying UTF-8.

HTH,
--
Konrad Meyer <[email protected]> http://konrad.sobertillnoon.com/


mortee · Oct 30, 2007

Jamal said:
Sorry but I do not get it. Plus I am not sure it is only related to
YAML.

I am working on something similar and the only answers I can relate are
those in Python (such as:
http://www.reportlab.com/i18n/python_unicode_tutorial.html). I mean I
got so far as understanding that:

é gets translated to \202
è gets translated to \212
à gets translated to \205
ç gets translated to \207
â gets translated to \203
ê gets translated to \210
î gets translated to \214
ô gets translated to \223
û gets translated to \226
ä gets translated to \204
ë gets translated to \211
ï gets translated to \213
ö gets translated to \224
ù gets translated to \227

But why?

I guess that your understanding is just wrong. I'm not really sure
where your program gets those accented chars that are translated to
those specific escaped octal sequences. But if you're specifying them in
string constants in your program, then it all depends on which
encoding your editor uses to display and save them.

For instance, I usually edit my scripts as UTF-8 text files, and I treat
my string constants that way too. In that case, if I put an é in a string
constant, it gets interpreted as \303\251, and not as \202. It's just
the octal representation of the byte(s) your editor displays as a
specific accented character.
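
To make the two byte sequences concrete, here is a short sketch. It assumes Ruby 1.9 or newer, and the `octal` helper is just an illustrative name that mimics the \nnn form Ruby 1.8 printed. IBM437 is Ruby's name for the DOS code page 437 mentioned earlier in the thread.

```ruby
# é, written as an escape so the encoding of this file doesn't matter.
e_acute = "\u00E9"

# Format each byte as a backslash plus three octal digits.
octal = lambda { |str| str.bytes.map { |b| "\\%03o" % b }.join }

puts octal.call(e_acute.encode("UTF-8"))    # \303\251
puts octal.call(e_acute.encode("IBM437"))   # \202
```

Same character, different bytes: which sequence ends up in the string depends only on the encoding of whatever produced it.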

mortee

Jamal Bengeloun · Oct 30, 2007

Probably. I am a beginner in ruby.

The program gets the accented characters from a dBaseIV file, a MS
Access File and some YAML files.

I use Komodo Edit as my editor and it does handle UTF-8 correctly.

I know! That's why I did not understand why I got \202. I do not know
which charset ruby uses to convert the characters. I tried iconv and
jcode but ended up with the same results. At first I thought it was
because of the library I used (builder for example). The only
explanation I found was on that python tutorial.

Thanks.

Jamal

Jamal Bengeloun · Oct 30, 2007

Thanks a lot for your help. I thought I will be going mad with this. I
thought it had something to do with ruby being C based (I saw something
on the internet about the difference between Python and JPython and the
accented characters were encoded in UTF-8 and not html escaped).

What if the end rendering engine is not a browser (I checked and you're
absolutely right, it does work in a browser)? How to get true UTF-8
encoded characters instead of HTML escaped ones? I am using builder to
generate XML files from the data I get.

Thanks a lot for your explanation (it really did enlighten me) and your
help.

Jamal

Jimmy Kofler · Oct 30, 2007

Jamal said:
Thanks a lot for your help. I thought I will be going mad with this. I
thought it had something to do with ruby being C based (I saw something
on the internet about the difference between Python and JPython and the
accented characters were encoded in UTF-8 and not html escaped).

What if the end rendering engine is not a browser (I checked and you're
absolutely right, it does work in a browser)? How to get true UTF-8
encoded characters instead of HTML escaped ones? I am using builder to
generate XML files from the data I get.

Thanks a lot for your explanation (it really did enlighten me) and your
help.

Jamal

It should be possible to convert CP437 -
http://en.wikipedia.org/wiki/Code_page_437 - to UTF-8 using iconv.

iconv -l | grep -i CP437 # => 437 CP437 IBM437 CSPC8CODEPAGE437

"How to get true UTF-8 encoded characters instead of HTML escaped ones?"

This should be doable with http://htmlentities.rubyforge.org .

(For a Ruby & UTF-8 snippet btw see
http://snippets.dzone.com/posts/show/4527 ).

Cheers,

j. k.

Konrad Meyer · Oct 30, 2007

Quoth Jamal Bengeloun:

Thanks a lot for your help. I thought I will be going mad with this. I
thought it had something to do with ruby being C based (I saw something
on the internet about the difference between Python and JPython and the
accented characters were encoded in UTF-8 and not html escaped).

What if the end rendering engine is not a browser (I checked and you're
absolutely right, it does work in a browser)? How to get true UTF-8
encoded characters instead of HTML escaped ones? I am using builder to
generate XML files from the data I get.

Thanks a lot for your explanation (it really did enlighten me) and your
help.

Jamal

If I'm not mistaken, HTML and XML encoding is the same. So you're good for
those chars.

HTH,
--
Konrad Meyer <[email protected]> http://konrad.sobertillnoon.com/


Jamal Bengeloun · Oct 30, 2007

Thanks, I am going to try with html entities.

However, I rechecked with my browsers and:

when the accented character comes from a YAML file, it is correctly HTML
encoded; however, when it comes from the dBase file, it again goes
through:

é gets translated to \202 that then gets translated to ‚
è gets translated to \212 that then gets translated to Š
à gets translated to \205 that then gets translated to …
ç gets translated to \207 that then gets translated to ‡
â gets translated to \203 that then gets translated to ƒ
ê gets translated to \210 that then gets translated to ˆ
î gets translated to \214 that then gets translated to Œ
ô gets translated to \223 that then gets translated to “
û gets translated to \226 that then gets translated to –
ä gets translated to \204 that then gets translated to „
ë gets translated to \211 that then gets translated to ‰
ï gets translated to \213 that then gets translated to ‹
ö gets translated to \224 that then gets translated to ”
ù gets translated to \227 that then gets translated to —

like the behavior seen on this page (Python behavior, however):
http://www.reportlab.com/i18n/python_unicode_tutorial.html

For example:

[dBase > XML] é gets translated to \202 that then gets translated to
‚ (single low-9 quotation mark)
[YAML > XML] é gets translated to é

Thanks for your help!

Jamal

mortee · Oct 30, 2007

Jamal said:
Probably. I am a beginner in ruby.

The program gets the accented characters from a dBaseIV file, a MS
Access File and some YAML files.

I use Komodo Edit as my editor and it does handle UTF-8 correctly.

I know! That's why I did not understand why I got \202. I do not know
which charset ruby uses to convert the characters.

When Ruby converts accented characters to \nnn, it doesn't use any
encoding. It just represents the verbatim non-ASCII bytes it sees in the
string it gets. Encoding/decoding happens mainly when you input accented
chars on your keyboard and they get converted to some byte (sequence)
to be stored in a string, and when those strings are displayed and
converted back to printable characters.

Problems arise when the displaying code interprets the same string
according to a different charset than what it was encoded according to.
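
A sketch of exactly this mismatch, assuming Ruby 1.9 or newer. It also assumes the dBase data was written in a DOS code page (IBM437 here) and was later read as Windows-1252; both are guesses based on the tables posted above.

```ruby
byte = "\x82".b     # the single byte that came out of the dBase file

# Same byte, two interpretations. The code pages are assumptions about
# what the producer and the consumer each had in mind.
as_dos     = byte.dup.force_encoding("IBM437").encode("UTF-8")
as_windows = byte.dup.force_encoding("Windows-1252").encode("UTF-8")

puts as_dos       # é
puts as_windows   # ‚ (U+201A, single low-9 quotation mark)
```

The byte never changes. Only the charset used to read it does, which is how an é from the DOS side turns into a low quotation mark on the Windows side.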

For example, when you puts a string in irb, it's your terminal's
current charset which determines how the bytes in the string are
actually displayed. In contrast, when you use p (or, for that matter,
inspect), non-ASCII characters get escaped as \nnn.
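
A minimal illustration of the difference, assuming Ruby 2.0 or newer for String#b. Note that newer Rubies escape such bytes in hex (\x82) where 1.8 printed octal (\202); the byte is the same either way.

```ruby
s = "caf\x82".b          # "caf" plus one non-ASCII byte, no encoding attached

shown_by_p = s.inspect   # what p prints: an escaped, ASCII-only form
raw_bytes  = s.bytes     # what puts would write to the terminal, untouched

puts shown_by_p          # "caf\x82"
p raw_bytes              # [99, 97, 102, 130]
```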

mortee

Jamal Bengeloun · Nov 13, 2007

Sorry for the delay,

You are totally right. In order to get what I want I used

formated_value = Iconv.new('UTF-8', 'CP850').iconv(input.to_s)

And... It worked.
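
For readers on newer Rubies: Iconv was removed from the standard library in Ruby 2.0, and String#encode covers the same conversion. The sketch below uses a made-up sample value in place of the real dBase field, so `input` is purely hypothetical.

```ruby
# Hypothetical input: bytes as they might come out of a CP850-encoded dBase field.
input = "fran\x87ais".b

# Counterpart of Iconv.new('UTF-8', 'CP850').iconv(input.to_s):
# encode(destination, source) reads the bytes as CP850 and re-encodes them.
formatted_value = input.encode("UTF-8", "CP850")

puts formatted_value
```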

But at the end I simply used the ADODB wrapper to open my dbf file and I
did not have any character encoding problems after that.

Thanks a lot.

Jamal Abdou-Karim Bengeloun

Jamal Bengeloun · Nov 13, 2007

I am not sure about that, I'll have to check. What I noticed though is
that builder converted the accented characters correctly when coming
from yaml files, but had problems (it did convert them but... See my
previous post) getting those coming from the dbf file to hit the target.

It seems that the problem was coming from the code page encoding.

Thanks

Jamal Abdou-Karim Bengeloun
