Encoding problems .. ruby 1.9.2

Discussion in 'Ruby' started by Bhay Zone, Sep 26, 2010.

  1. Bhay Zone

    Bhay Zone Guest

    I am pretty new to ruby and am trying to read text data coming from a
    backend which can only be queried using proprietary Command Line
    Interface commands.

    The problem is that this text data contains non-ascii characters...I
    don't know what these characters are .. and nor do I know the encoding.

    Earlier, when we were using ruby 1.8.7 we had some code that handled
    these characters pretty well. Now after switching to ruby 1.9.2, the
    same code breaks with encoding errors like "invalid multibyte sequence"
    in gsub.

    Here is the code we were using to replace the non-ascii characters,
    which is breaking now. It breaks at the first line.

    content.gsub!( "\221", '')
    content.gsub!( "\222", '')
    content.gsub!( "\223", '')
    content.gsub!( "\224", '')
    content.gsub!( "\246", '')
    content.gsub!( "\247", '')
    content.gsub!( "\237", '')
    content.gsub!( "\377", '')
    content.gsub!( "\226", '')
    content.gsub!( "\227", '')
    content.gsub!( "\\000", "?")
    content.gsub!( "\\001", "?")
    content.gsub!( "\FB01", "")
    content.gsub!(/[\x80-\xFF]/,'')
    content.gsub!(/[\x00-\x08]/,'')
    content.gsub!(/[\x0B-\x0C]/,'')
    content.gsub!(/[\x0E-\x1F]/,'')

    I just cannot figure how to fix this problem and any help would be
    greatly appreciated.
    --
    Posted via http://www.ruby-forum.com/.
    Bhay Zone, Sep 26, 2010
    #1

  2. On 9/26/10, Bhay Zone <> wrote:
    > I am pretty new to ruby and am trying to read text data coming from a
    > backend which can only be queried using proprietary Command Line
    > Interface commands.
    >
    > The problem is that this text data contains non-ascii characters...I
    > don't know what these characters are .. and nor do I know the encoding.
    >
    > Earlier, when we were using ruby 1.8.7 we had some code that handled
    > these characters pretty well. Now after switching to ruby 1.9.2, the
    > same code breaks with encoding errors like "invalid multibyte sequence"
    > in gsub.
    >
    > Here is the code we were using to replace the non-ascii characters,
    > which is breaking now. It breaks at the first line.
    >
    > content.gsub!( "\221", '')
    > content.gsub!( "\222", '')
    > content.gsub!( "\223", '')
    > content.gsub!( "\224", '')
    > content.gsub!( "\246", '')
    > content.gsub!( "\247", '')
    > content.gsub!( "\237", '')
    > content.gsub!( "\377", '')
    > content.gsub!( "\226", '')
    > content.gsub!( "\227", '')
    > content.gsub!( "\\000", "?")
    > content.gsub!( "\\001", "?")
    > content.gsub!( "\FB01", "")
    > content.gsub!(/[\x80-\xFF]/,'')
    > content.gsub!(/[\x00-\x08]/,'')
    > content.gsub!(/[\x0B-\x0C]/,'')
    > content.gsub!(/[\x0E-\x1F]/,'')
    >
    > I just cannot figure how to fix this problem and any help would be
    > greatly appreciated.


    In 1.9, every string (and regular expression) has an encoding attached
    to it. If there are any byte sequences in your string that don't match
    the encoding, it causes errors. 1.8 was much more permissive about its
    strings, allowing arbitrary binary data in any string, which is why it
    worked better for you. You can get back the 1.8 behavior under 1.9 by
    setting the encoding of your string objects to 'binary'.

    My first suggestion would be to set the encoding of the string in the
    variable content to binary before doing any of the gsub!s:
    content.force_encoding('binary')
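
    A minimal sketch of that suggestion, assuming made-up sample data in
    place of the real tool output. The /n flag on the regexps is an addition
    here: it tags the patterns as binary as well, so that pattern and data
    agree.

```ruby
# Simulated tool output (hypothetical data): text with stray high bytes
# and a control byte, tagged UTF-8 the way ruby might guess from the locale.
content = "caf\x91 ok\xFF\x01!".dup.force_encoding('UTF-8')
puts content.valid_encoding?   # the stray high bytes are not valid UTF-8

# Retag the same bytes as binary ('binary' is an alias for ASCII-8BIT).
content.force_encoding('binary')

# With /n the regexps are binary too, so they can match raw high bytes.
content.gsub!(/[\x80-\xFF]/n, '')
content.gsub!(/[\x00-\x08]/n, '')
puts content
```

    Only the tag on the string changes; the cleanup itself is still done by
    the gsub! calls.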

    However, a better way would be to set the encoding of the IO object
    the strings are read from. That way you don't need to force_encoding
    each string as it comes in.
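
    A rough illustration of setting the encoding on the IO object itself.
    IO.pipe is used here only as a stand-in for the stream coming from the
    external command, and the bytes written are invented sample data.

```ruby
# IO.pipe stands in for the stream coming from the external command.
reader, writer = IO.pipe
writer.binmode
writer.write("ok\x91\n")
writer.close

# Tag everything read from this IO as binary, once, at the IO level.
reader.set_encoding('binary')
data = reader.read
reader.close

puts data.encoding
puts data.valid_encoding?   # any byte sequence is valid as binary
```

    Every string read from the IO then arrives already tagged, so no
    per-string force_encoding is needed.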

    Even better is to figure out what encoding this external tool is
    using and set the IO's encoding to that. Then perhaps a lot of this
    hacky string mangling could go away.

    But this is still only half the story. You also have to consider the
    encoding of the strings and regexps which get passed as the first
    argument to gsub. Those string (and regexp) literals default to the
    same encoding as the source file they're contained in. If no explicit
    encoding is declared for a source file, ruby 1.9 treats it as
    US-ASCII. The locale environment variables (LANG, LC_ALL, LC_CTYPE)
    set the default external encoding for data read through IO; they
    don't change the encoding of a source file.

    You can declare a specific encoding explicitly by putting something
    like this as the very first line in your source:
    #encoding: binary
    (or the second line if the first line is a shebang line).

    I used the binary encoding in the example line above because that's
    probably the one which will work best for you under the circumstances.
    Declaring the source encoding to be binary is a bit hackish, but
    probably the easiest way to get you where you want to go. If you
    figure out what encoding your data is in, you're probably better off
    declaring the source encoding to be the same thing, but there may be
    more work involved there.
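
    As an alternative to relying on the source encoding, the patterns can
    be tagged one by one. This is only a sketch, with invented sample data;
    it does not use the magic comment at all.

```ruby
# Hypothetical sample data, tagged binary as suggested earlier.
content = "x\x91y\x92z".dup.force_encoding('binary')

# String pattern: tag the literal as binary so it matches the data's
# encoding. "\221" is the octal escape for the byte 0x91.
pattern = "\221".dup.force_encoding('binary')
content.gsub!(pattern, '')

# Regexp pattern: the /n flag asks for a binary (ASCII-8BIT) regexp.
high_bytes = /[\x80-\xFF]/n
content.gsub!(high_bytes, '')

puts content
```

    The effect on these patterns is the same as declaring the whole file
    binary, but it leaves the other literals in the file alone.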

    PS: there is some redundancy in the sequence of gsub!s you posted. The
    first 10 (for "\221" thru "\227") are special cases of the 14th (for
    /[\x80-\xFF]/) and can safely be deleted. Also, "\FB01" is the same
    thing as "FB01" in both ruby 1.8 and 1.9 and probably not what you
    wanted. (Maybe "\xFB\x01" is what you actually meant?)
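
    The escape behaviour can be checked directly. The last two lines go
    slightly beyond the point above: "\\000" in the original code is a
    backslash followed by three zeros, not a NUL byte, which may or may not
    be what was intended.

```ruby
# An unknown escape such as \F is just the letter itself.
puts "\FB01" == "FB01"

# Two raw bytes, 0xFB and 0x01.
puts "\xFB\x01".bytes.to_a.inspect

# Four characters: backslash, 0, 0, 0.
puts "\\000".length

# One character: the NUL byte.
puts "\000".length
```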

    HTH
    Caleb Clausen, Sep 26, 2010
    #2

  3. Bhay Zone wrote:
    > I am pretty new to ruby and am trying to read text data coming from a
    > backend which can only be queried using proprietary Command Line
    > Interface commands.
    >
    > The problem is that this text data contains non-ascii characters...I
    > don't know what these characters are .. and nor do I know the encoding.


    How are you talking to this interface - a TCP socket? IO.popen?
    Backticks? Something else? If you show the code which opens the
    connection, we can show how to fix it.

    TCP sockets default to "ASCII-8BIT" encoding, but for other methods,
    unless you tell ruby what encoding to use, it will guess based on
    environment variables on your PC. That is, the same program may work
    fine on one PC but fail on another.

    To avoid these problems, there are magic incantations you can add to
    force ruby not to guess. e.g.

    IO.popen: add "b" to the mode string

    Backticks or %x: res = `foo`; res.force_encoding("ASCII-8BIT")

    Or try running ruby with -Kn flag.
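
    A small sketch of the IO.popen variant. The child command is a
    hypothetical stand-in for the real tool (a second ruby process that
    prints one high byte), and RbConfig.ruby is assumed to be available to
    locate the interpreter.

```ruby
require 'rbconfig'

# Hypothetical stand-in for the external tool: a child ruby process
# that writes "ok", one high byte and a newline.
cmd = [RbConfig.ruby, '-e', 'STDOUT.binmode; STDOUT.write("ok\x91\n")']

# Mode "rb": read, binary. Strings read from the pipe are tagged
# ASCII-8BIT regardless of the locale environment variables.
output = IO.popen(cmd, 'rb') { |io| io.read }

puts output.encoding
puts output.bytes.to_a.inspect
```

    With backticks there is no mode string to set, which is why the
    force_encoding call is needed in that case.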

    > I just cannot figure how to fix this problem and any help would be
    > greatly appreciated.


    It's probably possible to fix your code, as above. However, sticking
    with ruby 1.8.7 is also a reasonable solution if you don't want to have
    to deal with this sort of nonsense.

    I had a go at reverse-engineering the string encoding behaviour of ruby
    1.9. I gave up after documenting about 200 behaviours:
    http://github.com/candlerb/string19/blob/master/string19.rb

    I'm sticking with 1.8, because 1.9 makes my brain hurt.
    Brian Candler, Sep 27, 2010
    #3
  4. Bhay Zone

    Bhay Zone Guest

    Caleb, Brian - Thank you for your replies.

    The source of this data is a bug tracking tool known as GNATS. Now this
    tool also comes with a client which provides a command line util known
    as query-pr to query GNATS. The output of query-pr is delimited text. If
    you run query-pr from the linux shell, it prints the output on the
    screen.

    I invoke query-pr from my ruby program as follows (note the opening and
    closing backtick (`) characters).

    result=`query-pr --expr 'Status="closed"'`
    # parse the result and take appropriate action.

    I am not very sure, but my guess is that the GNATS client uses TCP
    sockets to interface with the GNATS DB.

    Thanks for pointing out the redundancy, I'll fix that in my code.

    Right now I have "# coding: utf-8" as the first line in the ruby file. I
    found that while trying to figure out this problem and hoped it would
    make magic ... but well ... :-(

    I'll also try out the "# coding: binary" to see if that works for my
    case.

    I'm not sure if going back to ruby 1.8.7 is an option .. will keep that
    as a last option.
    Bhay Zone, Sep 27, 2010
    #4
  5. Bhay Zone wrote:
    > I invoke query-pr from my ruby program as follows (note the opening and
    > closing backtick (`) characters).
    >
    > result=`query-pr --expr 'Status="closed"'`
    > # parse the result and take appropriate action.


    That's backticks. Follow that line with:

    result.force_encoding("ASCII-8BIT")

    when running with ruby 1.9, before you start doing your substitutions.

    > I am not very sure, but my guess is that the GNATS client uses TCP
    > sockets to interface with the GNATS DB.


    Maybe, but that's irrelevant here. Ruby is reading the output of
    query-pr, as a string, and has decided to give it some arbitrary guessed
    encoding.

    > Right now I have "# coding: utf-8" as the first line in the ruby file. I
    > found that while trying to figure out this problem and hoped it would
    > make magic ... but well ... :-(
    >
    > I'll also try out the "# coding: binary" to see if that works for my
    > case.


    It won't. It will only affect the coding of quoted string literals
    within your code.
    Brian Candler, Sep 27, 2010
    #5
  6. Bhay Zone

    Bhay Zone Guest

    Bhay Zone, Sep 27, 2010
    #6
  7. Bhay Zone wrote:
    > After result.force_encoding("ASCII-8BIT"), are the gsubs necessary?


    Why do you do them in the ruby 1.8.7 version? If they served a purpose
    there, then presumably they still serve a purpose.

    All the force_encoding business is doing is preventing these lines from
    crashing ruby 1.9. The bytes in the string from query-pr will still be
    the same.
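
    This can be checked with a short sketch, using an invented string in
    place of real query-pr output: the bytes are identical before and after
    force_encoding, and it is the gsub! that actually removes anything.

```ruby
# Hypothetical sample: one stray high byte in otherwise plain text,
# initially tagged UTF-8.
result = "PR 42\x92s summary".dup.force_encoding('UTF-8')

bytes_before = result.bytes.to_a
result.force_encoding('ASCII-8BIT')
bytes_after = result.bytes.to_a

# force_encoding only changes the tag, not the data.
puts bytes_before == bytes_after

# The substitution is what strips the high byte.
result.gsub!(/[\x80-\xFF]/n, '')
puts result
```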
    Brian Candler, Sep 28, 2010
    #7