how to convert string to binary and back in Ruby 1.9?

Discussion in 'Ruby' started by Joe, Sep 1, 2009.

  1. Joe

    Joe Guest

    I'm using Ruby 1.9.1-p243 on Mac OS X 10.5.8.

    I have this UTF-8 string that I want to turn into binary, and then
    from binary into ISO-8859-1. The result should be some garbage
    string, which I need for debugging purposes. For the sake of an
    example, my UTF-8 string is "помоник" (Russian for "helper"). After
    looking at the documentation, it seemed like String#force_encoding
    would do what I need.

    But when I go to irb, I get this:

    irb(main):060:0> "помоник".encoding
    => #<Encoding:UTF-8>
    irb(main):061:0> "помоник".bytes.to_a
    => [208, 191, 208, 190, 208, 188, 208, 190, 208, 189, 208, 184, 208,
    186]
    irb(main):062:0> "помоник".force_encoding("ISO-8859-1")
    => "помоник"
    irb(main):063:0> "помоник".force_encoding("ISO-8859-1").encoding
    => #<Encoding:ISO-8859-1>
    irb(main):064:0> "помоник".force_encoding("ISO-8859-1").bytes.to_a
    => [208, 191, 208, 190, 208, 188, 208, 190, 208, 189, 208, 184, 208,
    186]

    So apparently, it changes the encoding, leaves the bytes unchanged,
    but also leaves the decoded characters unchanged? Is this a bug or
    what?

    Note also:

    irb(main):066:0> "помоник".encode('BINARY')
    Encoding::UndefinedConversionError: "\xD0\xBF" from UTF-8 to
    ASCII-8BIT
    from (irb):66:in `encode'
    from (irb):66
    from /usr/local/bin/irb:12:in `<main>'

    So apparently in Ruby 1.9, binary isn't really binary?

    I banged my head for a while, and then tried it in Python 3.
    Completely easy:

    >>> 'помоник'
    'помоник'
    >>> 'помоник'.encode('utf_8')
    b'\xd0\xbf\xd0\xbe\xd0\xbc\xd0\xbe\xd0\xbd\xd0\xb8\xd0\xba'
    >>> 'помоник'.encode('utf_8').decode('latin_1')
    'Ð¿Ð¾Ð¼Ð¾Ð½Ð¸Ðº'
    >>> 'помоник'.encode('utf_8').decode('latin_1').encode('latin_1')
    b'\xd0\xbf\xd0\xbe\xd0\xbc\xd0\xbe\xd0\xbd\xd0\xb8\xd0\xba'

    So can I do the same thing in Ruby 1.9? How do I deal with binary
    data? How do I convert a string to a manageable byte sequence? Is
    there a way to turn an array of bytes into a string of a specified
    encoding?
     
    Joe, Sep 1, 2009
    #1

  2. On Wednesday, 2 September 2009, Joe wrote:
    > I'm using Ruby 1.9.1-p243 on Mac OS X 10.5.8.
    >
    > I have this UTF-8 string that I want to turn into binary, and then
    > from binary into ISO-8859-1. [...]
    > So can I do the same thing in Ruby 1.9? How do I deal with binary
    > data? How do I convert a string to a manageable byte sequence? Is
    > there a way to turn an array of bytes into a string of a specified
    > encoding?

    AFAIK String#force_encoding doesn't re-encode the string, but just changes
    its properties (the encoding tag).

    #encode, on the other hand, does transcode the string, and it fails if the
    conversion is not possible.
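
    To illustrate the difference, a minimal irb-style sketch (not from the
    original post), assuming a UTF-8 source string:

    s = "помоник"                          # UTF-8 source
    t = s.dup.force_encoding("ISO-8859-1")
    t.encoding                             #=> #<Encoding:ISO-8859-1>
    t.bytes.to_a == s.bytes.to_a           #=> true (same bytes, only the tag changed)

    s.encode("UTF-16BE").bytes.to_a[0, 4]  #=> [4, 63, 4, 62] (a real conversion rewrites bytes)
    s.encode("ISO-8859-1")                 # raises Encoding::UndefinedConversionError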


    --
    Iñaki Baz Castillo <>
     
    Iñaki Baz Castillo, Sep 2, 2009
    #2

  3. Joe

    Joe Guest

    On Sep 1, 4:18 pm, Iñaki Baz Castillo <> wrote:
    > On Wednesday, 2 September 2009, Joe wrote:


    > #encode, on the other hand, does transcode the string, and it fails if the
    > conversion is not possible.
    >
    > --
    > Iñaki Baz Castillo <>


    OK, so String#force_encoding just changes the encoding, but does not
    alter the string. But how does it decide to print as the same
    sequence of Cyrillic characters, when it thinks its encoding is
    ISO-8859-1? How does Ruby 1.9 decide what characters to display when
    printing a String? Surely it must adhere to the encoding of that
    String? Is Ruby storing the ISO-8859-1 encoded string as a sequence
    of unicode characters, or what?

    This seems crazy to me.

    OK, so maybe String#force_encoding is crazy and broken or just won't
    be able to do what I want. Your suggestion was that String#encode is
    the method for changing the string. Of course I tried that one, and
    it errors because there is no Cyrillic alphabet in ISO-8859-1.

    Is there really no way to go from bytes to string? That's all I want!
     
    Joe, Sep 2, 2009
    #3
  4. Joe

    Joe Guest

    On Sep 1, 4:53 pm, Joe <> wrote:

    >
    > Is there really no way to go from bytes to string?  That's all I want!



    OK, I found the Array#pack method. At first glance, it seemed to be
    exactly what I was looking for. I could do str.bytes.to_a to turn a
    String into raw bytes, and Array#pack will turn them right back into a
    String.
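
    A sketch of that round trip (not from the original post; C* is the pack
    directive for a sequence of unsigned 8-bit bytes, and force_encoding
    re-applies the intended tag):

    str   = "помоник"                      # UTF-8
    bytes = str.bytes.to_a                 #=> [208, 191, 208, 190, ...]
    back  = bytes.pack("C*")               # raw bytes again, tagged ASCII-8BIT
    back.force_encoding("UTF-8") == str    #=> true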

    But go to

    http://ruby-doc.org/core-1.9/classes/Array.html

    The method is missing from the 1.9 documentation. Has it been
    deprecated? The 1.8 documentation doesn't help much, because it seems
    the function is entirely unaware of the String encoding.

    I guess Ruby's m17n is brand spanking new, and it shows, huh? I'm
    finding it pretty frustrating. :-(
     
    Joe, Sep 2, 2009
    #4
  5. Marc Heiler

    Marc Heiler Guest

    > I'm finding it pretty frustrating. :-(

    It is, especially as Ruby 1.8 behaviour is less annoying IMHO in this
    regard.
    --
    Posted via http://www.ruby-forum.com/.
     
    Marc Heiler, Sep 2, 2009
    #5
  6. Phrogz

    Phrogz Guest

    On Sep 1, 6:38 pm, Joe <> wrote:
    > OK, I found the Array#pack method.  At first glance, it seemed to be
    > exactly what I was looking for.  I could do str.bytes.to_a to turn a
    > String into raw bytes, and Array#pack will turn them right back into a
    > String.
    >
    > But go to
    >
    > http://ruby-doc.org/core-1.9/classes/Array.html
    >
    > The method is missing from the 1.9 documentation.  Has it been
    > deprecated?


    I don't believe so. I don't know why it's not in the docs there, but
    it's in my local ri:

    Slim2:~ phrogz$ ri -T Array#pack
    -------------------------------------------------------------
    Array#pack
    arr.pack ( aTemplateString ) -> aBinaryString

    From Ruby 1.9.1
    ------------------------------------------------------------------------
    Packs the contents of _arr_ into a binary sequence according to the
    directives in _aTemplateString_ (see the table below). Directives
    ``A,'' ``a,'' and ``Z'' may be followed by a count, which gives the
    width of the resulting field. The remaining directives also may
    take a count, indicating the number of array elements to convert.
    If the count is an asterisk (``+*+''), all remaining array elements
    will be converted. Any of the directives ``+sSiIlL+'' may be
    followed by an underscore (``+_+'') to use the underlying
    platform's native size for the specified type; otherwise, they use
    a platform-independent size. Spaces are ignored in the template
    string. See also +String#unpack+.

    a = [ "a", "b", "c" ]
    n = [ 65, 66, 67 ]
    a.pack("A3A3A3")   #=> "a  b  c  "
    a.pack("a3a3a3")   #=> "a\000\000b\000\000c\000\000"
    n.pack("ccc")      #=> "ABC"

    Directives for +pack+.

    Directive | Meaning
    ----------+----------------------------------------------------------
    @         | Moves to absolute position
    A         | Arbitrary binary string (space padded, count is width)
    a         | Arbitrary binary string (null padded, count is width)
    B         | Bit string (descending bit order)
    b         | Bit string (ascending bit order)
    C         | Unsigned byte (C unsigned char)
    c         | Byte (C char)
    D, d      | Double-precision float, native format
    E         | Double-precision float, little-endian byte order
    e         | Single-precision float, little-endian byte order
    F, f      | Single-precision float, native format
    G         | Double-precision float, network (big-endian) byte order
    g         | Single-precision float, network (big-endian) byte order
    H         | Hex string (high nibble first)
    h         | Hex string (low nibble first)
    I         | Unsigned integer
    i         | Integer
    L         | Unsigned long
    l         | Long
    M         | Quoted printable, MIME encoding (see RFC 2045)
    m         | Base64 encoded string (see RFC 2045, count is width)
              | (if count is 0, no line feeds are added, see RFC 4648)
    N         | Long, network (big-endian) byte order
    n         | Short, network (big-endian) byte order
    P         | Pointer to a structure (fixed-length string)
    p         | Pointer to a null-terminated string
    Q, q      | 64-bit number
    S         | Unsigned short
    s         | Short
    U         | UTF-8
    u         | UU-encoded string
    V         | Long, little-endian byte order
    v         | Short, little-endian byte order
    w         | BER-compressed integer
    X         | Back up a byte
    x         | Null byte
    Z         | Same as ``a'', except that null is added with *
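
    A couple of the directives above in action (a small sketch, not from the
    ri documentation):

    [1087, 1086, 1084].pack("U*")   #=> "пом"  (U packs Unicode codepoints as UTF-8)
    [208, 191].pack("C*")           #=> "\xD0\xBF"  (C packs raw unsigned bytes)
    "помоник".unpack("U*")          #=> [1087, 1086, 1084, 1086, 1085, 1080, 1082]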
     
    Phrogz, Sep 2, 2009
    #6
  7. Patrick Okui

    Patrick Okui Guest

    On Sep 2, 2009, at 2:55 AM, Joe wrote:

    > OK, so String#force_encoding just changes the encoding, but does not
    > alter the string. But how does it decide to print as the same
    > sequence of Cyrillic characters, when it thinks its encoding is
    > ISO-8859-1? [...]

    Brian Candler did a pretty thorough documentation of 1.9's M17N at
    http://github.com/candlerb/string19
    There are also multiple sources of documentation on the subject, e.g.
    http://blog.grayproductions.net/articles/what_ruby_19_gives_us
    (James Edward Gray II), and elsewhere.

    I'm also more comfortable with how 1.8 behaves, but then again I'm a
    newbie here.

    Patrick
     
    Patrick Okui, Sep 2, 2009
    #7
  8. Joe wrote:
    > I have this UTF-8 string that I want to turn into binary, and then
    > from binary into ISO-8859-1.


    UTF-8 is a binary encoding of Unicode codepoints, so it's a sequence of
    binary bytes by definition. And you get the same as your Python code:

    irb(main):001:0> 'помоник'
    => "помоник"
    irb(main):002:0> 'помоник'.bytes.each { |x| print "%02x " % x }
    d0 bf d0 be d0 bc d0 be d0 bd d0 b8 d0 ba => "помоник"
    irb(main):004:0> 'помоник'.force_encoding("BINARY")
    => "\xD0\xBF\xD0\xBE\xD0\xBC\xD0\xBE\xD0\xBD\xD0\xB8\xD0\xBA"

    I think what's confusing you is this:

    irb(main):005:0> str = 'помоник'
    => "помоник"
    irb(main):006:0> str.force_encoding("ISO-8859-1")
    => "помоник"

    Here, Ruby is not doing anything strange. The string is tagged as a
    sequence of ISO-8859-1 characters, but this sequence of bytes is being
    squirted as-is to a UTF-8 terminal, and so the UTF-8 terminal is
    displaying them as the original characters.
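
    A minimal sketch (not from the original post) of how the tag changes the
    interpretation but not the bytes:

    s = "помоник"
    s.bytesize                             #=> 14
    s.length                               #=> 7   (UTF-8: seven characters)
    t = s.dup.force_encoding("ISO-8859-1")
    t.bytesize                             #=> 14  (same bytes)
    t.length                               #=> 14  (ISO-8859-1: one character per byte)
    t.valid_encoding?                      #=> true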

    You can get the behaviour you want like this, by transcoding to UTF-8:

    irb(main):009:0> str.encode("UTF-8")
    => "ÿþüþýøú"

    Given that irb is running in a UTF-8 environment, it is arguable that
    STDOUT should have an external encoding of UTF-8, which means text
    should be transcoded to UTF-8 automatically.

    That is, you can also get the behaviour you want from this standalone
    program:

    # encoding: UTF-8
    STDOUT.set_encoding "UTF-8" # << THE MAGIC BIT

    str = 'помоник'
    str.force_encoding("ISO-8859-1")
    puts str

    It seems inconsistent to me that STDOUT doesn't get its
    external_encoding set automatically.
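
    A quick irb check of that (a sketch; assumes a UTF-8 locale):

    STDOUT.external_encoding        #=> nil  (nothing set; bytes pass through as-is)
    STDOUT.set_encoding "locale"
    STDOUT.external_encoding        #=> #<Encoding:UTF-8>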

    > So apparently in Ruby 1.9, binary isn't really binary?


    Correct. In Ruby 1.9, BINARY is just an alias for ASCII-8BIT. I hate this.

    I have documented a lot of the gory details at
    http://github.com/candlerb/string19

    Thanks for bringing another anomaly to my attention.
    --
    Posted via http://www.ruby-forum.com/.
     
    Brian Candler, Sep 2, 2009
    #8
  9. I think I understand it now. The following was confusing me initially:

    >> str = "über"

    => "über"
    >> str.force_encoding("ISO-8859-1")

    => "über"
    >> str = "groß"

    => "groß"
    >> str.force_encoding("ISO-8859-1")

    => "gro�\x9F"

    It appears this is just an artefact of String#inspect. String#inspect
    "knows" that \x80 to \x9F are not printable characters in ISO-8859-1, so
    converts them to the backslash hex form. This breaks the UTF-8 display
    by splitting the character, but of course only for strings which contain
    bytes in that range.
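
    A quick way to see which UTF-8 strings are affected (a check not from the
    original post): only those containing bytes in the 0x80..0x9F range.

    "über".bytes.any? { |b| (0x80..0x9F).include?(b) }   #=> false ("ü" is C3 BC)
    "groß".bytes.any? { |b| (0x80..0x9F).include?(b) }   #=> true  ("ß" is C3 9F)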

    You still get the string displayed as UTF-8 using puts without inspect:

    >> puts str
    groß
    => nil

    It works if you set the encoding for STDOUT inside irb, in which case
    you'll get everything transcoded to your terminal's character set.

    >> STDOUT.set_encoding "locale"
    => #<IO:<STDOUT>>
    >> str = "über"
    => "über"
    >> str.force_encoding("ISO-8859-1")
    => "über"
    >> puts str
    über
    => nil
    >>

    --
    Posted via http://www.ruby-forum.com/.
     
    Brian Candler, Sep 2, 2009
    #9
  10. Joe wrote:
    > I'm using Ruby 1.9.1-p243 on Mac OS X 10.5.8.
    >
    > I have this UTF-8 string that I want to turn into binary, and then
    > from binary into ISO-8859-1.


    What does it mean to "turn a string from binary to ISO-8859-1"?

    > >>> 'помоник'.encode('utf_8').decode('latin_1')
    > 'Ð¿Ð¾Ð¼Ð¾Ð½Ð¸Ðº'


    What Python does here is encode the string (from its internal Unicode
    format) to a UTF-8 byte string, and then convert it back into its
    internal Unicode format, interpreting the bytes as Latin-1. Finally it
    writes it to the console, which means it converts it once more (probably
    to UTF-8) for the Mac OS Terminal. This is an important point to keep in
    mind.

    So this is quite similar to what Ruby does, except that Ruby makes no
    conversion to a general internal format and no special conversion for
    the terminal.

    If you wrote the results out to a file, you would get the same bytes in both cases.
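
    A Ruby 1.9 chain that corresponds to the Python session would be roughly
    (a sketch under that interpretation):

    s = "помоник"                                 # UTF-8 source
    garbled = s.dup.force_encoding("ISO-8859-1")  # ~ Python's bytes.decode('latin_1')
    garbled.encode("UTF-8")                       #=> "Ð¿Ð¾Ð¼Ð¾Ð½Ð¸Ðº" (what a UTF-8 terminal shows)
    garbled.encode("UTF-8").encode("ISO-8859-1").bytes.to_a == s.bytes.to_a   #=> true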

    Regards, R.
     
    Rüdiger Bahns, Sep 2, 2009
    #10
