How to convert a string to binary and back in Ruby 1.9?

Joe

I'm using Ruby 1.9.1-p243 on Mac OS X 10.5.8.

I have this UTF-8 string that I want to turn into binary, and then
from binary into ISO-8859-1. The result should be some garbage
string, which I need for debugging purposes. For the sake of an
example, my UTF-8 string is "помоник" (Russian for "helper"). After
looking at the documentation, it seemed like String#force_encoding
would do what I need.

But when I go to irb, I get this:

irb(main):060:0> "помоник".encoding
=> #<Encoding:UTF-8>
irb(main):061:0> "помоник".bytes.to_a
=> [208, 191, 208, 190, 208, 188, 208, 190, 208, 189, 208, 184, 208,
186]
irb(main):062:0> "помоник".force_encoding("ISO-8859-1")
=> "помоник"
irb(main):063:0> "помоник".force_encoding("ISO-8859-1").encoding
=> #<Encoding:ISO-8859-1>
irb(main):064:0> "помоник".force_encoding("ISO-8859-1").bytes.to_a
=> [208, 191, 208, 190, 208, 188, 208, 190, 208, 189, 208, 184, 208,
186]

So apparently, it changes the encoding, leaves the bytes unchanged,
but also leaves the decoded characters unchanged? Is this a bug or
what?

Note also:

irb(main):066:0> "помоник".encode('BINARY')
Encoding::UndefinedConversionError: "\xD0\xBF" from UTF-8 to
ASCII-8BIT
from (irb):66:in `encode'
from (irb):66
from /usr/local/bin/irb:12:in `<main>'

So apparently in Ruby 1.9, binary isn't really binary?

I banged my head for a while, and then tried it in python3.
Completely easy:
b'\xd0\xbf\xd0\xbe\xd0\xbc\xd0\xbe\xd0\xbd\xd0\xb8\xd0\xba'

So can I do the same thing in Ruby 1.9? How do I deal with binary
data? How do I convert a string to a manageable byte sequence? Is
there a way to turn an array of bytes into a string of a specified
encoding?
 
Iñaki Baz Castillo

On Wednesday, 2 September 2009, Joe wrote:
So apparently, it changes the encoding, leaves the bytes unchanged,
but also leaves the decoded characters unchanged? Is this a bug or
what?

AFAIK String#force_encoding doesn't re-encode the string; it just changes
its encoding property.

#encode, on the other hand, does change the encoding, and it fails if the
conversion is not possible.
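
To make the difference concrete, here is a minimal sketch (not from the thread; the variable names are made up for illustration, and it assumes the source file is saved as UTF-8) of what the two methods do to the example string:

```ruby
# encoding: UTF-8
str = "помоник"   # 7 characters stored as 14 UTF-8 bytes

# force_encoding only relabels the existing bytes (dup keeps str untouched).
relabelled = str.dup.force_encoding("ISO-8859-1")
puts relabelled.bytes == str.bytes   # the bytes are identical
puts str.length                      # 7 characters when read as UTF-8
puts relabelled.length               # 14 characters when read as ISO-8859-1

# encode really transcodes, so it raises when the target encoding
# has no equivalent for a character (ISO-8859-1 has no Cyrillic).
encode_error = nil
begin
  str.encode("ISO-8859-1")
rescue Encoding::UndefinedConversionError => e
  encode_error = e
  puts "encode failed: #{e.class}"
end
```

So force_encoding changes how the bytes are interpreted, while encode changes the bytes themselves so that the characters stay the same.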


 
Joe

Iñaki Baz Castillo wrote:
In the other way, #encode does change the encoding, and it fails if the
conversion is not possible.

OK, so String#force_encoding just changes the encoding, but does not
alter the string. But how does it decide to print as the same
sequence of Cyrillic characters, when it thinks its encoding is
ISO-8859-1? How does ruby1.9 decide what characters to display when
printing a String? Surely it must adhere to the encoding of that
String? Is ruby storing the ISO-8859-1 encoded string as a sequence
of unicode characters, or what?

This seems crazy to me.

OK, so maybe String#force_encoding is crazy and broken or just won't
be able to do what I want. Your suggestion was that String#encode is
the method for changing the string. Of course I tried that one, and
it errors because there is no Cyrillic alphabet in ISO-8859-1.

Is there really no way to go from bytes to string? That's all I want!
 
Joe

Is there really no way to go from bytes to string?  That's all I want!


OK, I found the Array#pack method. At first glance, it seemed to be
exactly what I was looking for. I could do str.bytes.to_a to turn a
String into raw bytes, and Array#pack will turn them right back into a
String.
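
For reference, the round trip described above might look roughly like this. It is only a sketch: the variable names are invented, and it assumes the "C*" template (one unsigned byte per array element) is what is wanted here.

```ruby
# encoding: UTF-8
str   = "помоник"
bytes = str.bytes.to_a            # array of integers in 0..255

# "C*" packs every integer as one unsigned byte; the result is a binary string.
raw = bytes.pack("C*")
puts raw.encoding == Encoding::BINARY   # Encoding::BINARY is an alias of ASCII-8BIT

# Tell Ruby which encoding the bytes should be read as.
utf8   = raw.dup.force_encoding("UTF-8")
latin1 = raw.dup.force_encoding("ISO-8859-1")

puts utf8 == str       # the round trip gives back the original string
puts utf8.length       # 7
puts latin1.length     # 14
```

pack itself knows nothing about the target encoding; the encoding is attached afterwards with force_encoding.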

But go to

http://ruby-doc.org/core-1.9/classes/Array.html

The method is missing from the 1.9 documentation. Has it been
deprecated? The 1.8 documentation doesn't help much, because it seems
the function is entirely unaware of the String encoding.

I guess Ruby's m17n is brand spanking new, and it shows, huh? I'm
finding it pretty frustrating. :-(
 
Marc Heiler

I'm finding it pretty frustrating. :-(

It is, especially as Ruby 1.8 behaviour is less annoying IMHO in this
regard.
 
Phrogz

OK, I found the Array#pack method.  At first glance, it seemed to be
exactly what I was looking for.  I could do str.bytes.to_a to turn a
String into raw bytes, and Array#pack will turn them right back into a
String.

But go to

http://ruby-doc.org/core-1.9/classes/Array.html

The method is missing from the 1.9 documentation.  Has it been
deprecated?

I don't believe so. I don't know why it's not in the docs there, but
it's in my local ri:

Slim2:~ phrogz$ ri -T Array#pack
------------------------------------------------------------- Array#pack
     arr.pack ( aTemplateString ) -> aBinaryString

     From Ruby 1.9.1
------------------------------------------------------------------------
     Packs the contents of _arr_ into a binary sequence according to the
     directives in _aTemplateString_ (see the table below). Directives
     ``A,'' ``a,'' and ``Z'' may be followed by a count, which gives the
     width of the resulting field. The remaining directives also may
     take a count, indicating the number of array elements to convert.
     If the count is an asterisk (``+*+''), all remaining array elements
     will be converted. Any of the directives ``+sSiIlL+'' may be
     followed by an underscore (``+_+'') to use the underlying
     platform's native size for the specified type; otherwise, they use
     a platform-independent size. Spaces are ignored in the template
     string. See also +String#unpack+.

        a = [ "a", "b", "c" ]
        n = [ 65, 66, 67 ]
        a.pack("A3A3A3")   #=> "a  b  c  "
        a.pack("a3a3a3")   #=> "a\000\000b\000\000c\000\000"
        n.pack("ccc")      #=> "ABC"

     Directives for +pack+.

      Directive    Meaning
      ---------------------------------------------------------------
          @     |  Moves to absolute position
          A     |  arbitrary binary string (space padded, count is width)
          a     |  arbitrary binary string (null padded, count is width)
          B     |  Bit string (descending bit order)
          b     |  Bit string (ascending bit order)
          C     |  Unsigned byte (C unsigned char)
          c     |  Byte (C char)
          D, d  |  Double-precision float, native format
          E     |  Double-precision float, little-endian byte order
          e     |  Single-precision float, little-endian byte order
          F, f  |  Single-precision float, native format
          G     |  Double-precision float, network (big-endian) byte order
          g     |  Single-precision float, network (big-endian) byte order
          H     |  Hex string (high nibble first)
          h     |  Hex string (low nibble first)
          I     |  Unsigned integer
          i     |  Integer
          L     |  Unsigned long
          l     |  Long
          M     |  Quoted printable, MIME encoding (see RFC2045)
          m     |  Base64 encoded string (see RFC 2045, count is width)
                |  (if count is 0, no line feed are added, see RFC 4648)
          N     |  Long, network (big-endian) byte order
          n     |  Short, network (big-endian) byte-order
          P     |  Pointer to a structure (fixed-length string)
          p     |  Pointer to a null-terminated string
          Q, q  |  64-bit number
          S     |  Unsigned short
          s     |  Short
          U     |  UTF-8
          u     |  UU-encoded string
          V     |  Long, little-endian byte order
          v     |  Short, little-endian byte order
          w     |  BER-compressed integer
          X     |  Back up a byte
          x     |  Null byte
          Z     |  Same as ``a'', except that null is added with *
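
Of the directives in that table, "C" and "U" are probably the two that matter for this thread. A small sketch of how they differ (my own example, not part of the ri output; the variable names are arbitrary):

```ruby
# encoding: UTF-8
ascii      = [65, 66, 67].pack("C*")   # one raw byte per integer
from_cp    = [0x43F].pack("U")         # one Unicode code point, written out as UTF-8
codepoints = "п".unpack("U*")          # decode UTF-8 back into code points
raw_bytes  = "п".unpack("C*")          # look at the underlying bytes instead

puts ascii                 # ABC
puts from_cp == "п"        # true
p codepoints               # [1087]
p raw_bytes                # [208, 191]
```

"C" works on bytes and "U" works on code points, so for turning str.bytes back into a string, "C*" is the one to reach for.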
 
Patrick Okui

OK, so String#force_encoding just changes the encoding, but does not
alter the string. But how does it decide to print as the same
sequence of Cyrillic characters, when it thinks its encoding is
ISO-8859-1? How does ruby1.9 decide what characters to display when
printing a String? Surely it must adhere to the encoding of that
String? Is ruby storing the ISO-8859-1 encoded string as a sequence
of unicode characters, or what?

Brian Candler did a pretty thorough documentation of 1.9's M17N at
http://github.com/candlerb/string19. There are also multiple sources of
documentation on the subject at
http://blog.grayproductions.net/articles/what_ruby_19_gives_us
(Edward Gray) and elsewhere.

I'm also more comfortable with how 1.8 behaves but then again I'm a
newbie here.

Patrick
 
Brian Candler

Joe said:
I have this UTF-8 string that I want to turn into binary, and then
from binary into ISO-8859-1.

UTF-8 is a binary encoding of Unicode codepoints, so it's a sequence of
binary bytes by definition. And you get the same as your Python code:

irb(main):001:0> 'помоник'
=> "помоник"
irb(main):002:0> 'помоник'.bytes.each { |x| print "%02x " % x }
d0 bf d0 be d0 bc d0 be d0 bd d0 b8 d0 ba => "помоник"
irb(main):004:0> 'помоник'.force_encoding("BINARY")
=> "\xD0\xBF\xD0\xBE\xD0\xBC\xD0\xBE\xD0\xBD\xD0\xB8\xD0\xBA"

I think what's confusing you is this:

irb(main):005:0> str = 'помоник'
=> "помоник"
irb(main):006:0> str.force_encoding("ISO-8859-1")
=> "помоник"

Here, Ruby is doing something strange. The string is tagged as a
sequence of ISO-8859-1 characters, but this sequence of bytes is being
squirted as-is to a UTF-8 terminal, and so the UTF-8 terminal is
displaying them as the original characters.

You can get the behaviour you want like this, by transcoding to UTF-8:

irb(main):009:0> str.encode("UTF-8")
=> "Ð¿Ð¾Ð¼Ð¾Ð½Ð¸Ðº"
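
Outside irb, the same two steps (relabel as ISO-8859-1, then transcode to UTF-8) can be sketched like this. The variable names are illustrative only, and the expected string is spelled with \u escapes so it does not depend on how this page renders:

```ruby
# encoding: UTF-8
str = "помоник"

# Step 1: pretend the UTF-8 bytes are ISO-8859-1 (no bytes change).
# Step 2: transcode those 14 "Latin-1 characters" to real UTF-8.
garbage = str.dup.force_encoding("ISO-8859-1").encode("UTF-8")

puts garbage.length     # 14 characters
puts garbage.bytesize   # 28 bytes, two per character
puts garbage == "\u00D0\u00BF\u00D0\u00BE\u00D0\u00BC\u00D0\u00BE\u00D0\u00BD\u00D0\u00B8\u00D0\u00BA"

# The trick is reversible: transcode back to ISO-8859-1, then relabel as UTF-8.
restored = garbage.encode("ISO-8859-1").force_encoding("UTF-8")
puts restored == str    # true
```

The last two lines mirror the decode('latin_1') / encode('latin_1') round trip from the Python session.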

Given that irb is running in a UTF-8 environment, it is arguable that
STDOUT should have an external encoding of UTF-8, which means text
should be transcoded to UTF-8 automatically.

That is, you can also get the behaviour you want from this standalone
program:

# encoding: UTF-8
STDOUT.set_encoding "UTF-8" # << THE MAGIC BIT

str = 'помоник'
str.force_encoding("ISO-8859-1")
puts str

It seems inconsistent to me that STDOUT doesn't get its
external_encoding set automatically.
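
The effect of an external encoding can also be checked without relying on what the terminal shows. This is a rough sketch under a few assumptions: it uses the standard tempfile library (Tempfile.create with a block needs a Ruby newer than the 1.9.1 discussed here), and a temporary file stands in for STDOUT.

```ruby
# encoding: UTF-8
require "tempfile"

# UTF-8 bytes relabelled as ISO-8859-1: Ruby now sees 14 Latin-1 characters.
str = "помоник".dup.force_encoding("ISO-8859-1")

written = nil
Tempfile.create("encoding-demo") do |f|
  f.set_encoding("UTF-8")   # external encoding, like STDOUT.set_encoding "UTF-8"
  f.write(str)              # transcoded from ISO-8859-1 to UTF-8 on the way out
  f.flush
  written = File.binread(f.path)   # read back the raw bytes
end

puts written.bytesize                                        # 28
puts written.force_encoding("UTF-8") == str.encode("UTF-8")  # true
```

With no external encoding set, the 14 original bytes would be written unchanged, which is why a UTF-8 terminal shows the Cyrillic text again.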

So apparently in Ruby 1.9, binary isn't really binary?

Correct. In Ruby 1.9, binary is ASCII. I hate this.

I have documented a lot of the gory details at
http://github.com/candlerb/string19

Thanks for bringing another anomaly to my attention.
 
Brian Candler

I think I understand it now. The following was confusing me initially:
=> "gro�\x9F"

It appears this is just an artefact of String#inspect. String#inspect
"knows" that \x80 to \x9F are not printable characters in ISO-8859-1, so
converts them to the backslash hex form. This breaks the UTF-8 display
by splitting the character, but of course only for strings which contain
bytes in that range.

You still get the string displayed as UTF-8 using puts without inspect:
groß
=> nil

It works if you set the encoding for STDOUT inside irb, in which case
you'll get everything transcoded to your terminal's character set.
über
=> nil
 
Rüdiger Bahns

Joe said:
I'm using Ruby 1.9.1-p243 on Mac OS X 10.5.8.

I have this UTF-8 string that I want to turn into binary, and then
from binary into ISO-8859-1.

What does it mean to "turn a string from binary to ISO-8859-1"?
'Ð¿Ð¾Ð¼Ð¾Ð½Ð¸Ðº'

What Python does here is encode the string (from its internal
Unicode format) to a UTF-8 byte string and then convert it back
into its internal Unicode format (interpreting the bytes as Latin-1).
Finally it writes the result to the console, which means it converts it
once more (probably to UTF-8) for the Mac OS Terminal. This is an
important point to keep in mind.

So this is quite similar to what Ruby does, except that Ruby does no
conversion to a general internal format and no special conversion for
the terminal.

If you wrote the results to a file, you would get the same result.

Regards, R.
 
